Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: introduce runtime representation of broadcast fusion #23692

Closed
wants to merge 2 commits into from

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Sep 13, 2017

This preserves all dot-fusion information in the type-heirarchy, permitting complex runtime-analysis and re-dispatch overrides. Since no information is obscured in a new type, experiments with variations on #22060 and #22053 can now be implemented in pure Julia just by adding new dispatch rules. Also, since no new types are created, these can now be easily serialized, and can be returned from generated functions.

fix #21094
fix #22060
fix #22053
replaces #22063

@vchuravy
Copy link
Member

Awesome! Would be great if it had some tests and examples for actually modifying broadcast behaviour for custom container types.

@vtjnash
Copy link
Member Author

vtjnash commented Sep 13, 2017

Haha. From my description above ("experiments with variations ... can now be implemented"), I was implying that someone else should do that 😛

@ararslan ararslan added the broadcast Applying a function over a collection label Sep 13, 2017
@andyferris
Copy link
Member

andyferris commented Sep 17, 2017

Cool. This is awesome that this is looked at. :) Thanks @vtjnash

Since I'm not sure of any preceding discussions - was this approach vs others considered? For example, one might consider lazy but non-parser-fusing broadcast? This problem seems at first blush possible to deal with using Julia types, dispatch, and dead-simple dot parsing where each dot-call or dot-op makes exactly one broadcast call, and broadcasting with arrays could make a BroadcastedArray lazily evaluated view. The reason I bring this up is because I'm a fan of minimizing parser magic (but then again I'm better at Julia meta programming than at lisp so maybe that's partly me?).

@stevengj
Copy link
Member

@andyferris, see e.g. #19198 for discussion of making broadcast lazy vs materializing.

@@ -2914,7 +2914,7 @@ Base.literal_pow(::typeof(^), ::PR20530, ::Val{p}) where {p} = 2
p = 2
@test x^p == 1
@test x^2 == 2
@test [x,x,x].^2 == [2,2,2]
@test_broken [x, x, x].^2 == [2, 2, 2] # literal_pow violates referential transparency
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What result does this give now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On master, I get [x,x,x].^2 == [2,2,2] but I guess this PR breaks this? Temporarily or permanently?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think literal_pow really played well with broadcast and dot syntax to begin with. It was a little unpredictable when literal_pow or regular ^ would be called...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, on this PR you get [1, 1, 1].

@timholy
Copy link
Member

timholy commented Sep 20, 2017

Probably also the best path for fixing #22932 (comment)

@stevengj
Copy link
Member

What happens in this PR to @. x^2 + 2x + sin(x)? Currently, that gets lowered to broadcast(x -> x^2 + 2x + sin(x), x), i.e. the multiple appearances of x are collapsed into a single argument. Is that still true?

@mbauman
Copy link
Member

mbauman commented Sep 22, 2017

No, they are currently passed as three arguments:

julia> expand(Main, :(@. x^2 + 2x + sin(x)))
:((Base.broadcast)(((Core.apply_type)((Core.getfield)(Base.Broadcast, :Fusion), 3, false))(((Core.getfield)(Base.Broadcast, :FusionCall))(+, (Core.tuple)(((Core.getfield)(Base.Broadcast, :FusionCall))(^, (Core.tuple)(((Core.apply_type)((Core.getfield)(Base.Broadcast, :FusionArg), 1))(), ((Core.getfield)(Base.Broadcast, :FusionConstant))(2))), ((Core.getfield)(Base.Broadcast, :FusionCall))(*, (Core.tuple)(((Core.getfield)(Base.Broadcast, :FusionConstant))(2), ((Core.apply_type)((Core.getfield)(Base.Broadcast, :FusionArg), 2))())), ((Core.getfield)(Base.Broadcast, :FusionCall))(sin, (Core.tuple)(((Core.apply_type)((Core.getfield)(Base.Broadcast, :FusionArg), 3))()))))), x, x, x))

Are there cases where that makes a significant performance difference?

@stevengj
Copy link
Member

stevengj commented Sep 22, 2017

@mbauman, I thought that it did, but now I'm trying a simple example and I'm not seeing much difference:

julia> f!(dest, x) = broadcast!(x -> x^2 + 3x - 5x^3, dest, x);
julia> g!(dest, x) = broadcast!((x,y,z) -> x^2 + 3y - 5z^3, dest, x,x,x);
julia> x = rand(10^4); y = similar(x);

julia> @btime f!($y, $x);
  7.167 μs (0 allocations: 0 bytes)

julia> @btime g!($y, $x);
  7.183 μs (0 allocations: 0 bytes)

(I wonder if the compiler got better since I last tried this?)

@stevengj
Copy link
Member

It would also be nice to see a proof-of-concept implementation of a custom array type for which fusion is disabled, just to make sure that this is feasible with a reasonable amount of code given the new design.

@mbauman
Copy link
Member

mbauman commented Sep 22, 2017

That's precisely what I'm working on right now. I understand this much better after playing with it for a little bit. Here's a simple example:

@. 2x + 1

Expands out into:

julia> d = Base.Broadcast.Fusion{1,false}(
               Base.Broadcast.FusionCall(+,
                   (Base.Broadcast.FusionCall(*,
                       (Base.Broadcast.FusionConstant(2),
                        Base.Broadcast.FusionArg{1}())),
                    Base.Broadcast.FusionConstant(1))))
(a_1) -> +(*(2, a_1), 1)

julia> x = 1:2;

julia> broadcast(d, x)
2-element Array{Int64,1}:
 3
 5

So, now I think you'll need to overload broadcast(::Fusion, ::UnitRange) and dig through the call graph. It doesn't look like it'll be easy…

It'd be really nice if we could pass the objects themselves to the FusionArg and then eagerly evaluate them in cases like this. But that doesn't work at all, because this call graph is defining the scalar function. Still puzzling this through…

Edit: note that it's easy to introspect these things by evaluating a partial expansion:

julia> d = eval(expand(Main, :(@. 4x^2 + 2x + 1)).args[2]) # Expr(call, :broadcast, *THIS FUNCTION*, args…)
(a_1, a_2) -> +(*(4, ^(a_1, 2)), *(2, a_2), 1)

@mbauman
Copy link
Member

mbauman commented Sep 22, 2017

Ok, I don't have the complete thread here yet, but it looks like we'll want to add a recursive pre-walk to make it easy to modify the call tree and its arguments. For example, we want to recursively identify that, e.g., FusionCall(+, ::FusionConstant{Int}, ::FusionArgument{1}) is a representation of unfused_broadcast(+, ::Int, ::UnitRange).

In this case, we'd then swap the FusionCall out in exchange for simply the FusionArgument{1}(), and then eagerly evaluate the first fusion argument to r.start+i:r.stop+i.

Does this plan of attack sound reasonable and achievable? I definitely will need more time to process how this becomes an extensible interface. Others are very welcome to pitch in. :)

@stevengj
Copy link
Member

stevengj commented Sep 23, 2017

@mbauman, if this analysis of the fusion call graph occurs at runtime, is there an issue with type inference? Functions that disable fusion still need to be inferrable.

@vtjnash
Copy link
Member Author

vtjnash commented Sep 25, 2017

FusionCall(+, ::FusionConstant{Int}, ::FusionArgument{1}) is a representation of unfused_broadcast(+, ::Int, ::UnitRange)

It's not – that later information is never produced in this representation.

@mbauman
Copy link
Member

mbauman commented Sep 25, 2017

Right, but that's exactly what we'd need to address something like #22932 (comment). I was simply trying to identify if/how it'd be possible to get there from the current implementation in a way that's easy to extend.

This is very clever and definitely an improvement in many cases… but it blows up fast due to all the nested types. For example, this definition is no longer inferable:

julia> f(x) = @. x^5 - 3x^4 + 2x^3 + 10x^2 - 4x + 64

julia> @code_warntype f(1:10)
Variables:
  #self# <optimized out>
  x::UnitRange{Int64}
  T::Type{#s50} where #s50
…
      return #temp#@_53::Any
  end::Any

# Conflicts:
#	src/julia-syntax.scm
#	test/broadcast.jl
#	test/ranges.jl
@timholy
Copy link
Member

timholy commented Dec 17, 2017

I merged this into master. Here's a fun demo.

code

function Base.broadcast(f, r::UnitRange)
    if all_ops_ursafe(f)
        UnitRange(f(first(r)), f(last(r)))
    elseif all_ops_rsafe(f)
        range(f(first(r)), f(step(r)), length(r))
    else
        broadcast(f, Broadcast.DefaultArrayStyle{1}(), nothing, nothing, r)
    end
end

all_ops_ursafe(f::Broadcast.Fusion) = all_ops_ursafe(f.f)
all_ops_ursafe(f::Broadcast.FusionCall) = all_ops_ursafe(f.f) && all(all_ops_ursafe, f.args)
all_ops_ursafe(::typeof(+)) = true
all_ops_ursafe(::typeof(-)) = true
all_ops_ursafe(f::Broadcast.FusionConstant{<:Real}) = true
all_ops_ursafe(f::Broadcast.FusionArg) = true
all_ops_ursafe(arg) = false

all_ops_rsafe(f::Broadcast.Fusion) = all_ops_rsafe(f.f)
all_ops_rsafe(f::Broadcast.FusionCall) = all_ops_rsafe(f.f) && all(all_ops_rsafe, f.args)
all_ops_rsafe(::typeof(+)) = true
all_ops_rsafe(::typeof(-)) = true
all_ops_rsafe(::typeof(*)) = true
all_ops_rsafe(::typeof(/)) = true
all_ops_rsafe(f::Broadcast.FusionConstant) = true
all_ops_rsafe(f::Broadcast.FusionArg) = true
all_ops_rsafe(arg) = false

demo

julia> using Test

julia> x = 1:3
1:3

julia> @inferred(2 .+ x .+ 1)
4:6

julia> @inferred(2 .* x .+ 1)
3:2:7

julia> @inferred(sin.(x) .+ 1)
3-element Array{Float64,1}:
 1.8414709848078965
 1.9092974268256817
 1.1411200080598671

Overall I'm pretty optimistic about this and think we probably want this for 1.0. My two concerns:

  • this would need updating for the change in keyword arguments, and I'm a bit skeptical that I have time to tackle that within the next 48 hours (I'm in the middle of a move...)
  • the quality of the generated code doesn't seem great, though I didn't have time to check in detail

@vtjnash
Copy link
Member Author

vtjnash commented Dec 18, 2017

I suppose this is OK, but I don't really like this PR anymore, since it seemed like it only addressed part of the problem. I started a new project at master...vtjnash:jn/lazydotfuse to experiment with being fully lazy and it seemed better. It needs updating for the changes to the broadcast API.

@Sacha0
Copy link
Member

Sacha0 commented Dec 18, 2017

Here's a fun demo.

Neat demo @timholy! :)

@timholy
Copy link
Member

timholy commented Dec 18, 2017

I didn't spend a lot of time looking it over, but IIUC jn/lazydotfuse seems to generalize the approach of MappedArrays.jl. As someone whose computing life routinely features 1TB 4d arrays, I love lazy operations. But, the problem is, when do you evaluate? Will users get their hands on a Broadcast object, or is this something that gets evaled at the "end" of broadcasting? (My Scheme is nonexistent, so I can't easily tell from just looking at the code.) If so, won't the fact that a Broadcast is neither <:AbstractArray nor <:Tuple cause downstream problems?

Conversely, this PR doesn't seem to suffer from that problem at all. What specifically seemed missing to you?

@vtjnash
Copy link
Member Author

vtjnash commented Dec 18, 2017

Will users get their hands on a Broadcast object, or is this something that gets evaled at the "end" of broadcasting?

The operations are realized at the "end" of the broadcasting, so the effect of it is transparent to user code.

This PR still does an awkward amount of work in scheme. The other branch eliminates that and moves everything into Julia. The net result is that the default is for Julia to build effectively the same representation as this PR. But allowing libraries to implement more accurate control over the transform.

@StefanKarpinski
Copy link
Member

So it seems like what @vtjnash has done on that branch is exactly what I had envisioned when I proposed doing this with lazy operations instead of as a purely syntactic transformation. The key insight is that we were already effectively deciding where the materialization boundary is in the syntactic approach: the materialization boundary is exactly where dot fusion stops.

@bramtayl
Copy link
Contributor

It might be worth exporting the linked lists as a standard library package?

@KristofferC
Copy link
Member

KristofferC commented Dec 27, 2017

It seems the questions here are diverging further and further from the core of this PR. Please open new, focused, issues instead, and cross reference back to this one. Issues are cheap.

@timholy
Copy link
Member

timholy commented Dec 28, 2017

What do folks think I should do about this? I'm currently blocked on #25267. I could push a branch that has the requisite changes in it, but it is going to fail nanosoldier. Or if it doesn't, it's only because we don't have sufficiently comprehensive broadcasting tests.

I guess the real question is whether we want to merge something that has known performance regressions for the sake of the API change, and assume that the performance fix will come.

@vchuravy
Copy link
Member

I would say do the API changes now and performance improvements during the alpha.

@stevengj
Copy link
Member

stevengj commented Dec 29, 2017

Can't this be done for 1.1? Right now, there is no documented API for makingbroadcast lazy etcetera; the only documented modification is overloading broadcast itself ala #24992. So, adding an API for this later would not be breaking. I'd rather not merge something at the last minute that massively slows things down for the sake of a new API that we've barely tested.

@StefanKarpinski
Copy link
Member

I agree, @stevengj – hooking into broadcast should be explicitly not a stable API in 1.0.

@timholy
Copy link
Member

timholy commented Dec 30, 2017

Agreed this is a tough call. You outline the dangers well. The flip side is that it's a bit unsatisfying to say "Julia 1.0 is here! It's got a stable API except for broadcasting!" I suspect the coming PR is the last remaining component needed to avoid having that kind of caveat, and to me that's sufficiently attractive that I'm going to keep plugging away at this, get the PR submitted, and y'all can decide what to do with it.

If the goal were only to allow broadcasting to be lazy, then I'd say that's not important enough to worry about. But as a refresher, the real problems to be fixed are:

  • it's pretty awful that now we return an Array for (1:5) .+ 1. We could put a bandaid on that by specializing broadcast(::typeof(+), ::UnitRange, ::Real), but due to the lisp-fusion things like 2.*(1:5) .+ 1 are beyond repair, and even the simple fix would behave differently for variables than for literals.
  • the lisp-fusion makes life essentially impossible for, say, the GPU folks or anyone else who calls a C library, who can't make use of a fused operation and can't easily unfuse it.

If we do this later, then my crystal ball suggests it might look something like this:

  • we'll have to add a broadcast_similar method that throws an error if the API documented here is used---without the lisp-fusion it's not clear what to supply for f, so I don't think we can issue a traditional deprecation warning (in the sense that code will continue to function).
  • There seems also to be concern about the impact on literal_pow above, though I'm not far enough along to have anything useful to say about that.
  • The return type of many broadcasting operations that involve ranges will change. To me that's a breaking change that goes beyond "the API for hooking into broadcasting has changed."

If none of these sound serious to others, don't let this stand in the way of the 0.7-alpha release.

@stevengj
Copy link
Member

@timholy, the minimal change to address some of those issues would be something like #22063: a way to disable fusion for certain argument types, without any more complicated stuff like laziness or this PR.

If we ever change broadcast to be fully lazy or whatever, we can still implement isfusing on top of that so that container types still have an easy way to disable fusion.

(Honestly, I don't see the big deal about ranges. Just don't use .* if you want a range result; use * instead.)

@StefanKarpinski
Copy link
Member

The flip side is that it's a bit unsatisfying to say "Julia 1.0 is here! It's got a stable API except for broadcasting!"

It would certainly be better to be able to say "everything is perfect and we won't change it", but we're really out of time to keep working on this and it doesn't seem ready yet. Since broadcasting and dot syntax are such new and novel features, I don't think it's the end of the world to say that:

  • Usage of broadcasting will continue to work from user's perspective the way it does now;
  • Implementation of broadcasting may change and that may break libraries that hook into it deeply.

For the vast majority of Julia users, the "broadcasting API" is how broadcasting behaves and how they use it, not how it's implemented and extended. If/when we do change the internals of broadcasting, we can work closely with packages that hook into it to ensure a smooth transition so that non-library-author Julia users are none the wiser that anything has changed.

@vtjnash
Copy link
Member Author

vtjnash commented Dec 31, 2017

There seems also to be concern about the impact on literal_pow above, though I'm not far enough along to have anything useful to say about that.

It's possible to handle, it just requires some code duplication.

@andreasnoack
Copy link
Member

Do we still need all the literal_pow stuff now that we have constant propagation?

@StefanKarpinski
Copy link
Member

@andreasnoack, please don't derail this conversation to rehash that. I've explained why that's needed elsewhere and will be happy to do so again but not here.

@timholy
Copy link
Member

timholy commented Jan 4, 2018

New version of this PR in #25377.

@stevengj

(Honestly, I don't see the big deal about ranges. Just don't use .* if you want a range result; use * instead.)

We're deprecating a + 1 for other AbstractVectors, so telling people to use + for ranges and .+ for all other AbstractVectors is not a satisfying solution. We seem to have missed deprecating those methods for ranges, but I do that as a separate commit in #25377.

That fact then makes the alternative less attractive:

the minimal change to address some of those issues would be something like #22063: a way to disable fusion for certain argument types, without any more complicated stuff like laziness or this PR.

So we'd have to disable fusion for all ranges, and that presumably means ending fusion for quite a few operations. Totally doable, of course, just unattractive.

@stevengj
Copy link
Member

stevengj commented Jan 4, 2018

@timholy, as you mentioned above, we could define broadcast(::typeof(+), ::Range, ::Real) so that people could get a range from (1:5) .+ 1 etcetera. If they combine this with multiplication, they just need to use *, e.g. 2 * (1:5) .+ 1 and not 2 .* (1:5) .+ 1 if they want a range result. Anything more complicated with a range would still produce an an array. This doesn't seem so bad to me — how common is it to want to do complex nested broadcast arithmetic with ranges to get another range?

@timholy
Copy link
Member

timholy commented Jan 4, 2018

One problem with broadcast(::typeof(+), ::Range, ::Real) is it doesn't work for r::Range .+ n::Number when n is literal, because literals get fused into f. On Julia 0.6:

julia> Base.broadcast(::typeof(+), ::UnitRange, ::Int) = "crazy"

julia> (1:5) .+ 1
5-element Array{Int64,1}:
 2
 3
 4
 5
 6

julia> n = 1
1

julia> (1:5) .+ n
"crazy"

But I agree that deep nesting is rare, and I could live without preserving the range in such cases.

@stevengj
Copy link
Member

stevengj commented Jan 4, 2018

Okay, then let's just define + number for ranges. Ranges are a special type unto themselves, not a linear-algebra object, so I don't see a problem with defining special operations on them.

@timholy
Copy link
Member

timholy commented Jan 4, 2018

We have that currently, so that wouldn't be a change. But that does take us back to

We're deprecating a + 1 for other AbstractVectors, so telling people to use + for ranges and .+ for all other AbstractVectors is not a satisfying solution.

That said, not everything in 1.0 is going to be satisfying. Still, this problem and others are fixed in #25377. It's just a question of whether people are scared by it, and of course to finish up the loose ends.

As an alternative, what about removing the literal-slurping from the lisp fusion code? Are there big negatives? That would at least make broadcasting work in simple cases.

@stevengj
Copy link
Member

stevengj commented Jan 4, 2018

Still, this problem and others are fixed in #25377.

At a huge performance cost, I thought you said?

@stevengj
Copy link
Member

stevengj commented Jan 4, 2018

As an alternative, what about removing the literal-slurping from the lisp fusion code? Are there big negatives? That would at least make broadcasting work in simple cases.

Last I checked, there was a significant performance penalty to simple expressions if you remove the literal-slurping. But this doesn't seem to be the case any more?

julia> f!(x) = broadcast!(x -> x^2, x, x);
julia> g!(x) = broadcast!((x,p) -> x^p, x, x,2);
julia> x = zeros(10^3);
julia> @btime f!($x); @btime g!($x);
  82.644 ns (0 allocations: 0 bytes)
  85.150 ns (0 allocations: 0 bytes)

julia> f!(x) = broadcast!(x -> 3x^2 + 4x + 5, x, x);
julia> g!(x) = broadcast!((x,c1,c2,c3,c4) -> c1*x^c2 + c3*x + c4,x, x,3,2,4,5);
julia> @btime f!($x); @btime g!($x);
  183.832 ns (0 allocations: 0 bytes)
  183.849 ns (0 allocations: 0 bytes)

I'm not sure what changed?

Another problem with removing literal-slurping is that it eliminates the type-stability of Unitful expressions involving literal powers. One possibility would be to only do literal-slurping for .^ with integer literal powers.

@timholy
Copy link
Member

timholy commented Jan 4, 2018

At a huge performance cost, I thought you said?

Not as bad now. For a = rand(2,2), master gives

julia> @btime broadcast!(+, $b, $a, 1);
  12.715 ns (0 allocations: 0 bytes)

and #25377 gives

julia> @btime broadcast!(+, $b, $a, 1);
  17.991 ns (0 allocations: 0 bytes)

@timholy
Copy link
Member

timholy commented Jan 4, 2018

One possibility would be to only do literal-slurping for literal powers.

If we don't do #25377, that would be an improvement. Then, #22063 might be enough to prevent the GPU people from hating us. You could conceivably rework it around BroadcastStyle since that already has promote_type-like behavior (see the docs about binary rules and combine_styles).

Still less flexible than #25377, but it's a much smaller change, and it automatically fixes the literal_pow problem.

@inline (f::FusionCall)(args...) = f.f(map(a -> a(args...), f.args)...)
# TODO: calling _apply on map _apply is not handled by inference
# for now, we unroll some cases and generate others, to help it out
#@inline (f::FusionApply)(args...) = Core._apply(f.f, map(a -> a(args...), f.args)...)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equivalently, I think this is also f.f(flatten(map(a -> a(args...), f.args)...)...)
where flatten(args...) = Core._apply(tuple, args...)

mbauman added a commit that referenced this pull request Apr 23, 2018
This patch represents the combined efforts of four individuals, over 60
commits, and an iterated design over (at least) three pull requests that
spanned nearly an entire year (closes #22063, #23692, #25377 by superceding
them).

This introduces a pure Julia data structure that represents a fused broadcast
expression.  For example, the expression `2 .* (x .+ 1)` lowers to:

```julia
julia> Meta.@lower 2 .* (x .+ 1)
:($(Expr(:thunk, CodeInfo(:(begin
      Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize)
      Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make)
      Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make)
      Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1)
      Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3))
      Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4))
      return Core.SSAValue(5)
  end)))))
```

Or, slightly more readably as:

```julia
using .Broadcast: materialize, make
materialize(make(*, 2, make(+, x, 1)))
```

The `Broadcast.make` function serves two purposes. Its primary purpose is to
construct the `Broadcast.Broadcasted` objects that hold onto the function, the
tuple of arguments (potentially including nested `Broadcasted` arguments), and
sometimes a set of `axes` to include knowledge of the outer shape. The
secondary purpose, however, is to allow an "out" for objects that _don't_ want
to participate in fusion. For example, if `x` is a range in the above `2 .* (x
.+ 1)` expression, it needn't allocate an array and operate elementwise — it
can just compute and return a new range. Thus custom structures are able to
specialize `Broadcast.make(f, args...)` just as they'd specialize on `f`
normally to return an immediate result.

`Broadcast.materialize` is identity for everything _except_ `Broadcasted`
objects for which it allocates an appropriate result and computes the
broadcast. It does two things: it `initialize`s the outermost `Broadcasted`
object to compute its axes and then `copy`s it.

Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact
same expression tree to compute the right-hand side of the expression as above,
and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the
`Broadcasted` expression tree and then `copyto!` it into the given destination.

All-together, this forms a complete API for custom types to extend and
customize the behavior of broadcast (fixes #22060). It uses the existing
`BroadcastStyle`s throughout to simplify dispatch on many arguments:

* Custom types can opt-out of broadcast fusion by specializing
  `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`.

* The `Broadcasted` object computes and stores the type of the combined
  `BroadcastStyle` of its arguments as its first type parameter, allowing for
  easy dispatch and specialization.

* Custom Broadcast storage is still allocated via `broadcast_similar`, however
  instead of passing just a function as a first argument, the entire
  `Broadcasted` object is passed as a final argument. This potentially allows
  for much more runtime specialization dependent upon the exact expression
  given.

* Custom broadcast implmentations for a `CustomStyle` are defined by
  specializing `copy(bc::Broadcasted{CustomStyle})` or
  `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`.

* Fallback broadcast specializations for a given output object of type `Dest`
  (for the `DefaultArrayStyle` or another such style that hasn't implemented
  assignments into such an object) are defined by specializing
  `copyto(dest::Dest, bc::Broadcasted{Nothing})`.

As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to
`.+`, just as had been done for all `AbstractArray`s in general.

As a first-mover proof of concept, LinearAlgebra uses this new system to
improve broadcasting over structured arrays. Before, broadcasting over a
structured matrix would result in a sparse array. Now, broadcasting over a
structured matrix will _either_ return an appropriately structured matrix _or_
a dense array. This does incur a type instability (in the form of a
discriminated union) in some situations, but thanks to type-based introspection
of the `Broadcasted` wrapper commonly used functions can be special cased to be
type stable.  For example:

```julia
julia> f(d) = round.(Int, d)
f (generic function with 1 method)

julia> @inferred f(Diagonal(rand(3)))
3×3 Diagonal{Int64,Array{Int64,1}}:
 0  ⋅  ⋅
 ⋅  0  ⋅
 ⋅  ⋅  1

julia> @inferred Diagonal(rand(3)) .* 3
ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}}
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope

julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3)
4×4 Tridiagonal{Float64,Array{Float64,1}}:
 1.30771  0.838589   ⋅          ⋅
 0.0      3.89109   0.0459757   ⋅
  ⋅       0.0       4.48033    2.51508
  ⋅        ⋅        0.0        6.23739
```

In addition to the issues referenced above, it fixes:

* Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated
  specially in a fused broadcast; they're just arguments in a `Broadcasted`
  object like everything else.

* Fixes #21094: Since broadcasting is now represented by a pure Julia
  datastructure it can be created within `@generated` functions and serialized.

* Fixes #26097: The fallback destination-array specialization method of
  `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not
  be confused by `nothing` arguments.

* Fixes the broadcast-specific element of #25499: The default base broadcast
  implementation no longer depends upon `Base._return_type` to allocate its
  array (except in the empty or concretely-type cases). Note that the sparse
  implementation (#19595) is still dependent upon inference and is _not_ fixed.

* Fixes #25340: Functions are treated like normal values just like arguments
  and only evaluated once.

* Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was
  fixed on master already, but this fixes it now, too.

* Fixes #25521.

* The performance of this patch has been thoroughly tested through its
  iterative development process in #25377. There remain [two classes of
  performance regressions](#25377) that Nanosoldier flagged.

* #25691: Propagation of constant literals sill lose their constant-ness upon
  going through the broadcast machinery. I believe quite a large number of
  functions would need to be marked as `@pure` to support this -- including
  functions that are intended to be specialized.

(For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](#25377)
branch as of a1d4e7e. Squashed and separated
out to make it easier to review and commit)

Co-authored-by: Tim Holy <[email protected]>
Co-authored-by: Jameson Nash <[email protected]>
Co-authored-by: Andrew Keller <[email protected]>
Keno pushed a commit that referenced this pull request Apr 27, 2018
This patch represents the combined efforts of four individuals, over 60
commits, and an iterated design over (at least) three pull requests that
spanned nearly an entire year (closes #22063, #23692, #25377 by superceding
them).

This introduces a pure Julia data structure that represents a fused broadcast
expression.  For example, the expression `2 .* (x .+ 1)` lowers to:

```julia
julia> Meta.@lower 2 .* (x .+ 1)
:($(Expr(:thunk, CodeInfo(:(begin
      Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize)
      Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make)
      Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make)
      Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1)
      Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3))
      Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4))
      return Core.SSAValue(5)
  end)))))
```

Or, slightly more readably as:

```julia
using .Broadcast: materialize, make
materialize(make(*, 2, make(+, x, 1)))
```

The `Broadcast.make` function serves two purposes. Its primary purpose is to
construct the `Broadcast.Broadcasted` objects that hold onto the function, the
tuple of arguments (potentially including nested `Broadcasted` arguments), and
sometimes a set of `axes` to include knowledge of the outer shape. The
secondary purpose, however, is to allow an "out" for objects that _don't_ want
to participate in fusion. For example, if `x` is a range in the above `2 .* (x
.+ 1)` expression, it needn't allocate an array and operate elementwise — it
can just compute and return a new range. Thus custom structures are able to
specialize `Broadcast.make(f, args...)` just as they'd specialize on `f`
normally to return an immediate result.

`Broadcast.materialize` is identity for everything _except_ `Broadcasted`
objects for which it allocates an appropriate result and computes the
broadcast. It does two things: it `initialize`s the outermost `Broadcasted`
object to compute its axes and then `copy`s it.

Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact
same expression tree to compute the right-hand side of the expression as above,
and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the
`Broadcasted` expression tree and then `copyto!` it into the given destination.

All-together, this forms a complete API for custom types to extend and
customize the behavior of broadcast (fixes #22060). It uses the existing
`BroadcastStyle`s throughout to simplify dispatch on many arguments:

* Custom types can opt-out of broadcast fusion by specializing
  `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`.

* The `Broadcasted` object computes and stores the type of the combined
  `BroadcastStyle` of its arguments as its first type parameter, allowing for
  easy dispatch and specialization.

* Custom Broadcast storage is still allocated via `broadcast_similar`, however
  instead of passing just a function as a first argument, the entire
  `Broadcasted` object is passed as a final argument. This potentially allows
  for much more runtime specialization dependent upon the exact expression
  given.

* Custom broadcast implmentations for a `CustomStyle` are defined by
  specializing `copy(bc::Broadcasted{CustomStyle})` or
  `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`.

* Fallback broadcast specializations for a given output object of type `Dest`
  (for the `DefaultArrayStyle` or another such style that hasn't implemented
  assignments into such an object) are defined by specializing
  `copyto(dest::Dest, bc::Broadcasted{Nothing})`.

As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to
`.+`, just as had been done for all `AbstractArray`s in general.

As a first-mover proof of concept, LinearAlgebra uses this new system to
improve broadcasting over structured arrays. Before, broadcasting over a
structured matrix would result in a sparse array. Now, broadcasting over a
structured matrix will _either_ return an appropriately structured matrix _or_
a dense array. This does incur a type instability (in the form of a
discriminated union) in some situations, but thanks to type-based introspection
of the `Broadcasted` wrapper commonly used functions can be special cased to be
type stable.  For example:

```julia
julia> f(d) = round.(Int, d)
f (generic function with 1 method)

julia> @inferred f(Diagonal(rand(3)))
3×3 Diagonal{Int64,Array{Int64,1}}:
 0  ⋅  ⋅
 ⋅  0  ⋅
 ⋅  ⋅  1

julia> @inferred Diagonal(rand(3)) .* 3
ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}}
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope

julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3)
4×4 Tridiagonal{Float64,Array{Float64,1}}:
 1.30771  0.838589   ⋅          ⋅
 0.0      3.89109   0.0459757   ⋅
  ⋅       0.0       4.48033    2.51508
  ⋅        ⋅        0.0        6.23739
```

In addition to the issues referenced above, it fixes:

* Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated
  specially in a fused broadcast; they're just arguments in a `Broadcasted`
  object like everything else.

* Fixes #21094: Since broadcasting is now represented by a pure Julia
  datastructure it can be created within `@generated` functions and serialized.

* Fixes #26097: The fallback destination-array specialization method of
  `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not
  be confused by `nothing` arguments.

* Fixes the broadcast-specific element of #25499: The default base broadcast
  implementation no longer depends upon `Base._return_type` to allocate its
  array (except in the empty or concretely-type cases). Note that the sparse
  implementation (#19595) is still dependent upon inference and is _not_ fixed.

* Fixes #25340: Functions are treated like normal values just like arguments
  and only evaluated once.

* Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was
  fixed on master already, but this fixes it now, too.

* Fixes #25521.

* The performance of this patch has been thoroughly tested through its
  iterative development process in #25377. There remain [two classes of
  performance regressions](#25377) that Nanosoldier flagged.

* #25691: Propagation of constant literals sill lose their constant-ness upon
  going through the broadcast machinery. I believe quite a large number of
  functions would need to be marked as `@pure` to support this -- including
  functions that are intended to be specialized.

(For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](#25377)
branch as of a1d4e7e. Squashed and separated
out to make it easier to review and commit)

Co-authored-by: Tim Holy <[email protected]>
Co-authored-by: Jameson Nash <[email protected]>
Co-authored-by: Andrew Keller <[email protected]>
@DilumAluthge DilumAluthge deleted the jn/Fusion branch March 25, 2021 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection
Projects
None yet