Broadcast and linear indexing #32051

chethega · 2019-05-16T14:53:04Z

It would be nice if we could propagate IndexStyle for broadcasts. The issue is the following:

julia> using BenchmarkTools
julia> n=10_000; d=1; T=Int32; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  2.221 μs (0 allocations: 0 bytes)
  39.333 μs (0 allocations: 0 bytes)

julia> n=10_000; d=2; T=Float32; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  4.531 μs (0 allocations: 0 bytes)
  35.019 μs (0 allocations: 0 bytes)

julia> n=10_000; d=8; T=Float64; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  59.652 μs (0 allocations: 0 bytes)
  88.321 μs (0 allocations: 0 bytes)

Currently, broadcasts always use Cartesian indexing. This is slow and prevents a lot of SIMD vectorization.

In the relatively common case that dest and all args support linear indexing, and the only cases of dropped dimensions are zero-dimensional (as above), we should use linear indexing for a significant speedup (up to 20x), especially if the first dimension is small (which is a very common occurrence).
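
To make the difference concrete, here is a minimal sketch using hand-written loops rather than the actual broadcast machinery. The names `add_linear!` and `add_cartesian!` are hypothetical, and both functions assume that `dest` and `src` have the same shape and that `x` is a scalar.

```julia
# Linear indexing: one flat loop over all elements, which is easy to vectorize.
function add_linear!(dest::Array, src::Array, x)
    @inbounds @simd for i in eachindex(dest, src)
        dest[i] = src[i] + x
    end
    return dest
end

# Cartesian indexing: conceptually a nested loop over dimensions. With a short
# first dimension, the innermost loop only runs a few iterations at a time.
function add_cartesian!(dest::Array, src::Array, x)
    size(dest) == size(src) || throw(DimensionMismatch("dest and src must have the same size"))
    @inbounds for I in CartesianIndices(dest)
        dest[I] = src[I] + x
    end
    return dest
end

b = rand(Float32, 2, 10_000)
out_linear = add_linear!(similar(b), b, 3f0)
out_cartesian = add_cartesian!(similar(b), b, 3f0)
```

Both loops compute the same result; only the index space they iterate over differs.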


mbauman · 2019-05-16T18:24:53Z

Broadcast supports non-AbstractArrays, so we can't ask all arguments for their IndexStyle. We also definitely cannot do this if any dimensions (other than 0) get "extruded". But that said, I think we could conservatively and statically identify more linear-capable broadcasts that are limited to known good cases. It's "as easy" as adding more eachindex methods.

At the same time, we should also add more array-like eachindex methods so folks can explicitly request IndexLinear or IndexCartesian if they need to.
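
For reference, a short sketch of how `IndexStyle` and `eachindex` already behave for ordinary arrays. The variable names are only illustrative. The explicit-style form `eachindex(IndexLinear(), V)` is the kind of request that the comment above suggests extending to more array-like objects.

```julia
A = rand(3, 4)
V = view(A, 1:2, :)          # non-contiguous view, so no fast linear indexing

style_A  = IndexStyle(A)     # IndexLinear()
style_V  = IndexStyle(V)     # IndexCartesian()
style_AV = IndexStyle(A, V)  # combined style falls back to IndexCartesian()

idx_A = eachindex(A)         # a linear range covering 1:12
idx_V = eachindex(V)         # CartesianIndices over a 2×4 index space

# Explicitly requesting a particular style for an array:
idx_V_linear    = eachindex(IndexLinear(), V)
idx_A_cartesian = eachindex(IndexCartesian(), A)
```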

chethega · 2019-05-16T19:28:55Z

We could specialize on <:AbstractArray, <:Ref, <:Number, etc. A 20x speedup for <:AbstractArray without nonzero extruded dimensions is imho good enough to justify some code duplication.

I am half of a mind to propose IndexStyle(::Any) = IndexSequential() (or maybe IndexUnknown()), indicating that random access is bad or unsupported, and the iteration protocol should be used. Then, we could always query index styles of objects, and optimize a lot of unrelated code: Just like iterators indicate whether they have a known length and eltype, they would signal whether they like random access or need to carry more state in iteration.
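
A rough sketch of what such a trait could look like, modeled on the existing iterator traits. Everything named `IndexSequential`, `access_style`, and `total` here is hypothetical and not part of Base. A separate function is used instead of adding methods to `Base.IndexStyle`, so the sketch does not change the behavior of existing code.

```julia
# Existing iterator traits that the analogy refers to.
size_trait_range  = Base.IteratorSize(1:3)                           # known shape
size_trait_filter = Base.IteratorSize(Iterators.filter(isodd, 1:3))  # unknown size

# Hypothetical trait value: "prefer the iteration protocol".
struct IndexSequential end

# Hypothetical query function that works on any object.
access_style(x) = IndexSequential()
access_style(A::AbstractArray) = IndexStyle(A)

# Generic code can then dispatch on the trait.
total(x) = total(access_style(x), x)

function total(::IndexLinear, A)
    s = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        s += A[i]
    end
    return s
end

total(::IndexCartesian, A) = sum(A)       # leave Cartesian arrays to the generic path
total(::IndexSequential, itr) = sum(itr)  # fall back to plain iteration
```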

> It's "as easy" as adding more eachindex methods.

Ok, maybe I'm just blind, but how?

mbauman · 2019-05-16T19:59:53Z

Oh never mind, I'm wrong. This can't be done type-stably because the decision to extrude is not made in the type domain but depends on runtime values (sizes).

Thus we cannot have a type-stable IndexStyle for broadcasts that returns IndexLinear for "safe" n-dimensional broadcasts. It'd have to be a runtime switch in a particular copyto! implementation... and we are already generating a number of for loops there for different optimizations. In fact, the addition of an extra for loop was precisely what stalled #30973: it significantly regressed compile times.
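
A small illustration of why the types alone are not enough, with illustrative variable names: `B` and `C` below have the same type, but only one of them is extruded.

```julia
A = rand(3, 4)
B = rand(1, 4)   # extruded along the first dimension when combined with A
C = rand(3, 4)   # matches A exactly, so no extrusion is needed

same_type = typeof(B) === typeof(C)   # both are Matrix{Float64}

AB = A .+ B
AC = A .+ C

# Linear index 5 of a 3×4 array corresponds to Cartesian index (2, 2).
i = 5
ok_C = AC[i] == A[i] + C[i]          # linear indexing lines up for C
ok_B = AB[i] == A[2, 2] + B[1, 2]    # B needs the row index replaced by 1
```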

mbauman · 2019-05-16T20:15:33Z

Just to be clear (because obviously I forgot what restrictions we had here myself):

- Broadcasting evaluates each element with bc[I].
- If any extrusion occurs, broadcast absolutely needs to have the indices separated out into independent dimensions to remove/add/replace the extruded dimension as required.
- Extrusion is determined by run-time values. Broadcasting A .+ B where A and B are both Matrixes isn't safe to do linearly because it might need to swap out one or both of the indices in one or both of the arrays, and that computation requires cartesian indices.
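
These points can be seen on a lazy `Broadcasted` object. The sketch below uses `Base.Broadcast.broadcasted` and `Base.Broadcast.instantiate`. The helper `extrude_index` is hypothetical and only mimics the per-argument index adjustment in spirit; it assumes the argument has the same number of dimensions as the index.

```julia
using Base.Broadcast: broadcasted, instantiate

A = rand(3, 4)
B = rand(1, 4)

bc = instantiate(broadcasted(+, A, B))   # lazy representation of A .+ B
I = CartesianIndex(2, 3)
val = bc[I]                              # evaluates a single output element

# Hypothetical helper: wherever an argument has extent 1 along a dimension,
# pin that component of the index to the only valid position.
extrude_index(arg, I::CartesianIndex) =
    CartesianIndex(map((ax, i) -> length(ax) == 1 ? first(ax) : i, axes(arg), Tuple(I)))

idx_in_A = extrude_index(A, I)   # unchanged
idx_in_B = extrude_index(B, I)   # first component replaced by 1
```

The replacement acts on individual components of the index, which is why a single linear index is not sufficient once any dimension is extruded.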

There is still a case where this would be type-stable: it's where there is only one non-zero-dimensional argument and it's IndexLinear and all other arguments are zero-dimensional. In other words, it would apply to none of your original examples (since the RHS could be extruded into the LHS).
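
As a sketch of that condition, the hypothetical predicate below (not part of Base) accepts a broadcast's arguments and checks that exactly one of them has any dimensions and that it is an `IndexLinear` array. It treats anything whose `axes` is an empty tuple, such as a number or a `Ref`, as zero-dimensional.

```julia
# Hypothetical helpers, for illustration only.
has_dims(a) = length(axes(a)) > 0

function linear_safe(args...)
    arrays = filter(has_dims, args)          # arguments that contribute a shape
    length(arrays) == 1 || return false      # more than one could be extruded
    src = first(arrays)
    return src isa AbstractArray && IndexStyle(src) isa IndexLinear
end

M = rand(2, 3)
safe_scalar = linear_safe(M, Ref(3.0))            # one array plus a 0-d argument
safe_pair   = linear_safe(M, rand(2, 3))          # two arrays: not covered
safe_view   = linear_safe(view(M, 1:1, :), 1.0)   # Cartesian-style view: not covered
```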

chethega · 2019-05-16T21:33:23Z

> Extrusion is determined by run-time values. Broadcasting A .+ B where A and B are both Matrixes isn't safe to do linearly because it might need to swap out one or both of the indices in one or both of the arrays, and that computation requires cartesian indices.

> In other words, it would apply to none of your original examples (since the RHS could be extruded into the LHS).

> In fact, the addition of an extra for loop was precisely what stalled #30973: it significantly regressed compile times.

Merde, this is bad. After reading up on that PR of yours, I don't know what to do about this. Thanks for the explanation!

In 2.0, we should consider making broadcast extrusion explicit (i.e. missing dims get silently extruded such that array .+ scalar continues to work, but A .+ B of matrices would fail when sizes don't coincide).

mbauman · 2019-05-16T22:01:17Z

I mean, it's a tradeoff. It would suck to lose the ability to do things like A .- mean(A, dims=1) or A ./ sum(A, dims=1).
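
For readers less familiar with this idiom, a tiny example of the second expression with made-up data: the `1×2` result of `sum(A, dims=1)` is extruded along the first dimension, so each column is divided by its own sum.

```julia
A = [1.0 2.0;
     3.0 6.0]

col_sums = sum(A, dims=1)   # 1×2 matrix holding the column sums 4.0 and 8.0
N = A ./ col_sums           # each column of N now sums to 1
```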

It is something I've considered, but without the ability to encode singleton dimensions in the type system I think it's a non-starter. Doing this generally was something we talked about in JuliaLang/LinearAlgebra.jl#42 but dismissed as being far too complicated. The introduction of an orthogonal syntax like f.(A, ^B) or f.(A, ⟂B) could ameliorate some of the pain, but I still don't think it'd be worth the massive breakage.

maleadt · 2019-05-17T05:56:50Z

Too bad, I had hoped for something similar for GPU arrays since the run-time index calculations are pretty costly there too.

chethega · 2019-05-17T09:34:39Z

You're right, some of the use cases for the current extrusion behavior are compelling.

So I guess we could prepare a PoC branch that emits both linear and cartesian indexing, with a single hoisted runtime check. That would blow up codesize and compiler latency, but improve runtime perf in most cases. Having such a branch ready would allow us to make more informed decisions on the tradeoffs (code complexity, readability of compilation output, compiler latency, runtime performance), and may be useful for people who prioritize runtime over compile-time perf. Also, the tradeoff may change in the future (compiler latency could get better).
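
A minimal sketch of what a single hoisted check might look like for one fixed operation. This is not the proposed branch: the function name `add_hoisted!` is hypothetical, and a real implementation would live inside broadcast's `copyto!` and handle arbitrary fused expressions.

```julia
function add_hoisted!(dest::AbstractArray, A::AbstractArray, B::AbstractArray)
    # One runtime check, hoisted out of the loop: all arguments support fast
    # linear indexing and no dimension needs to be extruded.
    if IndexStyle(dest, A, B) isa IndexLinear && axes(dest) == axes(A) == axes(B)
        @inbounds @simd for i in eachindex(dest, A, B)
            dest[i] = A[i] + B[i]
        end
    else
        # General path: let broadcast handle extrusion with Cartesian indices.
        broadcast!(+, dest, A, B)
    end
    return dest
end

A = rand(3, 4)
B = rand(1, 4)   # forces the general path
C = rand(3, 4)   # allows the linear path

fast = add_hoisted!(similar(A), A, C)
slow = add_hoisted!(similar(A), A, B)
```

Both loop bodies are compiled for every call signature, which is the source of the code size and latency cost mentioned above.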

It may be useful to also offer a keyword arg to broadcast / broadcast! that makes extrusions explicit? Something like extrusions = Val(((false,false), (true, false), ())) for NxN .+ 1xN .+ scalar. Keyword args to broadcasts could also allow other fast operations that we currently cannot support, like broadcast(args...; mask = mask::Base.LogicalIndex) (such ops currently need to construct views that incur a large temporary).

chethega mentioned this issue May 16, 2019

changed broadcast! into bitarray algorithm #32048

Merged

mbauman added the broadcast label (Applying a function over a collection) May 16, 2019

charleskawczynski mentioned this issue Jul 18, 2024

Leverage linear indexing for pointwise kernels CliMA/ClimaCore.jl#1889

Closed
