Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inferior performance of inplace broadcasting over map! for multiple wide matrices #47873

Open
jishnub opened this issue Dec 12, 2022 · 5 comments
Labels
arrays [a, r, r, a, y, s] broadcast Applying a function over a collection performance Must go faster

Comments

@jishnub
Copy link
Contributor

jishnub commented Dec 12, 2022

See https://discourse.julialang.org/t/why-is-a-multi-argument-inplace-map-much-faster-in-this-case-than-a-broadcast/91525/6, the following seems broadly reproducible across a range of platforms:

julia> A = rand(10, 1000); B = copy(A); C = zero(A);

julia> using BenchmarkTools

julia> @btime map!(+, $C, $A, $B);
  8.947 μs (0 allocations: 0 bytes)

julia> @btime $C .= $A .+ $B;
  11.281 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 64 virtual cores
Environment:
  LD_LIBRARY_PATH = /home/user/lib:/lib::/home/user/.local/lib
  JULIA_DEPOT_PATH = /scratch/user/julia
  JULIA_REVISE_POLL = 1
  JULIA_NUM_PRECOMPILE_TASKS = 1
  JULIA_EDITOR = vi

This difference goes away for nearly square matrices, and is minimal for tall matrices. On some platforms, broadcasting performs better for the tall and square cases. However, map! seems to consistently do better for the wide case.

@jishnub jishnub changed the title Inferior performance of inplace broadcasting over map! for mumtiple wide matrices Inferior performance of inplace broadcasting over map! for multiple wide matrices Dec 12, 2022
@jishnub jishnub added performance Must go faster arrays [a, r, r, a, y, s] labels Dec 12, 2022
jishnub referenced this issue in JuliaLinearAlgebra/BandedMatrices.jl Dec 12, 2022
@N5N3
Copy link
Member

N5N3 commented Dec 15, 2022

I think we already have a issue to track this problem but I can't find it.
Anyway, the dim1 in this example is too small for vectorization and our broadcast only tries to vectorize the inner most loop.
map! is faster here only because the inputs have linear index. If they are cartesian-indexed, I guess map! would be slower as IIRC it's zip based.

@jishnub
Copy link
Contributor Author

jishnub commented Dec 15, 2022

Such wide matrices are frequently encountered in BandedMatrices. It would be good if broadcasting had a fast path that used linear indexing if all terms are compatible with it.

@N5N3
Copy link
Member

N5N3 commented Dec 15, 2022

We have some related (Edit: but for different purpose) trial in Base (#30973), but I can't say it would be landed in the near future.
For Pkg dev I think Fastbroadcast.jl might be the best solution for now. (Although it might not be that light weight ...)

@giordano
Copy link
Contributor

giordano commented Feb 2, 2023

I think we already have a issue to track this problem but I can't find it.

Do you refer to #28126?

@giordano giordano added the broadcast Applying a function over a collection label Feb 2, 2023
@N5N3
Copy link
Member

N5N3 commented Feb 2, 2023

No the number of broadcasted args in MWE above is too small thus it won't hit #28126.
The solution here is switching to linear-indexing if we can prove:

  1. all args have the same axes (or they have 0 dimension.)
  2. all args are IndexLinear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] broadcast Applying a function over a collection performance Must go faster
Projects
None yet
Development

No branches or pull requests

3 participants