Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broadcast and linear indexing #32051

Open
chethega opened this issue May 16, 2019 · 8 comments
Open

Broadcast and linear indexing #32051

chethega opened this issue May 16, 2019 · 8 comments
Labels
broadcast Applying a function over a collection

Comments

@chethega
Copy link
Contributor

It would be nice if we could propagate IndexStyle for broadcasts. The issue is the following:

julia> using BenchmarkTools
julia> n=10_000; d=1; T=Int32; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  2.221 μs (0 allocations: 0 bytes)
  39.333 μs (0 allocations: 0 bytes)

julia> n=10_000; d=2; T=Float32; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  4.531 μs (0 allocations: 0 bytes)
  35.019 μs (0 allocations: 0 bytes)

julia> n=10_000; d=8; T=Float64; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  59.652 μs (0 allocations: 0 bytes)
  88.321 μs (0 allocations: 0 bytes)

Currently, broadcasts always use cartesian indexing. This is slow and prevents a lot of simd.

In the relatively common case that dest and all args support linear indexing, and the only cases of dropped dimensions are zero-dimensional (as above), we should use linear indexing for a significant speedup (up to 20x), especially if the first dimension is small (which is a very common occurence).

@mbauman
Copy link
Member

mbauman commented May 16, 2019

Broadcast supports non-AbstractArrays, so we can't ask all arguments for their IndexStyle. We also definitely cannot do this if any dimensions (other than 0) get "extruded". But that said, I think we could conservatively and statically identify more linear-capable broadcasts that are limited to known good cases. It's "as easy" as adding more eachindex methods.

At the same time, we should also add support for more array-like eachindex method support so folks can explicitly request IndexLinear or IndexCartesian if they need.

@chethega
Copy link
Contributor Author

We could specialize on <:AbstractArray, <:Ref, <:Number, etc. A 20x speedup for <:AbstractArray without nonzero extruded dimensions is imho good enough to justify some code duplication.

I am half of a mind to propose IndexStyle(::Any) = IndexSequential() (or maybe IndexUnknown()), indicating that random access is bad or unsupported, and the iteration protocol should be used. Then, we could always query index styles of objects, and optimize a lot of unrelated code: Just like iterators indicate whether they have a known length and eltype, they would signal whether they like random access or need to carry more state in iteration.

It's "as easy" as adding more eachindex methods.

Ok, maybe I'm just blind, but how?

@mbauman
Copy link
Member

mbauman commented May 16, 2019

Oh nevermind, I'm wrong. This can't be done type-stably because the decision to extrude is not in the type domain but on runtime values (sizes).

Thus we cannot have a type stable IndexStyle for broadcasts that returns IndexLinear for "safe" n-dimensional broadcasts. It'd have to be a runtime switch in a particular copyto! implementation... and we already are generating a number of for loops there for different optimizations. In fact, the addition of an extra for loop was precisely what stalled #30973 — it significantly regressed compile times.

@mbauman
Copy link
Member

mbauman commented May 16, 2019

Just to be clear (because obviously I forgot what restrictions we had here myself):

  • Broadcasting evaluates each element with bc[I]
  • If any extrusion occurs, broadcast absolutely needs to have the indices separated out into independent dimensions to remove/add/replace the extruded dimension as required
  • Extrusion is determined by run-time values. Broadcasting A .+ B where A and B are both Matrixes isn't safe to do linearly because it might need to swap out one or both of the indices in one or both of the arrays — and that computation requires cartesian indices.

There is still a case where this would be type-stable: it's where there is only one non-zero-dimensional argument and it's IndexLinear and all other arguments are zero-dimensional. In other words, it would apply to none of your original examples (since the RHS could be extruded into the LHS).

@mbauman mbauman added the broadcast Applying a function over a collection label May 16, 2019
@chethega
Copy link
Contributor Author

Extrusion is determined by run-time values. Broadcasting A .+ B where A and B are both Matrixes isn't safe to do linearly because it might need to swap out one or both of the indices in one or both of the arrays — and that computation requires cartesian indices.

In other words, it would apply to none of your original examples (since the RHS could be extruded into the LHS).

In fact, the addition of an extra for loop was precisely what stalled #30973 — it significantly regressed compile times.

Merde, this is bad. After reading up on that PR of yours, I don't know what to do about this. Thanks for the explanation!

In 2.0, we should consider making broadcast extrusion explicit (i.e. missing dims get silently extruded such that array .+ scalar continues to work, but A .+ B of matrices would fail when sizes don't coincide).

@mbauman
Copy link
Member

mbauman commented May 16, 2019

I mean, it's a tradeoff. It would suck to lose the ability to do things like A .- mean(A, dims=1) or A ./ sum(A, dims=1).

It is something I've considered, but without the ability to encode singleton dimensions in the type system I think it's a non-starter. Doing this generally was something we talked about in JuliaLang/LinearAlgebra.jl#42 but dismissed as being far too complicated. The introduction of an orthogonal syntax like f.(A, ^B) or f.(A, ⟂B) could ameliorate some of the pain, but I still don't think it'd be worth the massive breakage.

@maleadt
Copy link
Member

maleadt commented May 17, 2019

Too bad, I had hoped for something similar for GPU arrays since the run-time index calculations are pretty costly there too.

@chethega
Copy link
Contributor Author

You're right, some of the use cases for the current extrusion behavior are compelling.

So I guess we could prepare a PoC branch that emits both linear and cartesian indexing, with a single hoisted runtime check. That would blow up codesize and compiler latency, but improve runtime perf in most cases. Having such a branch ready would allow us to make more informed decisions on the tradeoffs (code complexity, readability of compilation output, compiler latency, runtime performance), and may be useful for people who prioritize runtime over compile-time perf. Also, the tradeoff may change in the future (compiler latency could get better).

It may be useful to also offer a keyword arg to broadcast / broadcast! that makes extrusions explicit? Something like extrusions = Val(((false,false), (true, false), ())) for NxN .+ 1xN .+ scalar. Keyword args to broadcasts could also allow other fast operations that we currently cannot support, like broadcast(args...; mask = mask::Base.LogicalIndex) (such ops currently need to construct views that incur a large temporary).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection
Projects
None yet
Development

No branches or pull requests

3 participants