Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove inline annotations from broadcast kernels #35675

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mbauman
Copy link
Member

@mbauman mbauman commented Apr 30, 2020

In my cursory spot-tests, it appears as though we no longer need to force inlining the whole way through to the innermost loops of broadcast. This is huge, and in my naive understanding I think it'll greatly improve codegen times and sizes. It'll also allow for embedding more alternative loop designs as they no longer need to be inlined into the same function body — solving my reservations in #30973.

I've kept the preparatory @inlines for now, just removing them on the actual implementation.

Making this an early PR just to allow a Nanosoldier run.

@mbauman mbauman added the broadcast Applying a function over a collection label Apr 30, 2020
@mbauman
Copy link
Member Author

mbauman commented Apr 30, 2020

@nanosoldier runbenchmarks(ALL, vs = ":master")

@mbauman
Copy link
Member Author

mbauman commented Apr 30, 2020

On master:

2 $ time make -C test broadcast
    JULIA test/broadcast
Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
broadcast  (1) |        started at 2020-04-30T16:43:19.276
broadcast  (1) |    47.61 |   1.28 |  2.7 |    5773.20 |   478.94

Test Summary: | Pass  Total
  Overall     |  439    439
    SUCCESS

real	0m51.426s
user	0m50.828s
sys	0m0.784s

On this PR:

$ time make -C test broadcast
    JULIA test/broadcast
Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
broadcast  (1) |        started at 2020-04-30T16:43:02.881
broadcast  (1) |    37.41 |   1.06 |  2.8 |    4296.83 |   386.69

Test Summary: | Pass  Total
  Overall     |  439    439
    SUCCESS

real	0m41.086s
user	0m40.656s
sys	0m0.649s

@@ -1102,15 +1102,15 @@ struct BitMaskedBitArray{N,M}
mask::BitArray{M}
BitMaskedBitArray{N,M}(parent, mask) where {N,M} = new(parent, mask)
end
@inline function BitMaskedBitArray(parent::BitArray{N}, mask::BitArray{M}) where {N,M}
function BitMaskedBitArray(parent::BitArray{N}, mask::BitArray{M}) where {N,M}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function BitMaskedBitArray(parent::BitArray{N}, mask::BitArray{M}) where {N,M}
@inline function BitMaskedBitArray(parent::BitArray{N}, mask::BitArray{M}) where {N,M}

This one was overzealous — we should keep the force @inline here for the@boundscheck. It currently still inlines, but I think it's good practice to keep this here in case something else in here gets big enough to lose the auto-inline.

@StefanKarpinski
Copy link
Member

Beautiful!

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@timholy
Copy link
Member

timholy commented May 1, 2020

Very impressive gain!

@StefanKarpinski
Copy link
Member

Making the inlining cost model more accurate by making reality match the model FTW.

@mbauman
Copy link
Member Author

mbauman commented May 1, 2020

Well that turned out exceptionally well! I wasn't expecting any run-time performance gains but was fearful of losses. Looks like there are just a handful of extra allocations in a few cases... but it's in exchange for a 20%+ compile time speedup in the unit tests. I'll see if anything can be done about those allocations, but I think the fact that this will allow for future runtime improvements (like #30973) will make the broadcast superusers more than pleased.

@mbauman
Copy link
Member Author

mbauman commented May 5, 2020

Ok, the regressions are limited to broadcast expressions that use literal pows (like x.^2) or types (like round.(Int, x)). These end up using Refs inside tuples — like (Ref(^), x, Ref(Val(2))) and (Ref{Type{Int}}(Int), x). The problem is that these Refs are properly mutable and thus must be allocated because an inner (non-inlined) function could pull them out and store them somewhere. In short, this exasperates the need for a true immutable scalar type for broadcast; #18379.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented May 5, 2020

Maybe ImmutableRef would be a good name for it? The need for immutability here also informs the whole &x syntax: for the ccall usage we specifically want that to create a mutable box; for the broadcast usage we want it to create an immutable box. Good thing we didn't pick yet!

@vtjnash
Copy link
Member

vtjnash commented May 5, 2020

The main confusion with that might be that it's not really a Ref unless it's mutable (as that's the Julia keyword for pinning something to the heap to make it reference-able). It might more accurately be Scalarize, but I think Some might fit here too?

@mbauman
Copy link
Member Author

mbauman commented May 5, 2020

Oh interesting. Some is a new idea. It's exactly the struct we need and a half-decent (short!) word, but it doesn't have the getindex/length/size methods we need. Before settling on Ref I had once called this a struct ZeroDimArray{T} <: AbstractArray{T,0} in the broadcast internals. Scalar (or some variation thereof) sure seems to be the crowd favorite word here, but gosh do I find it confusing. Imagine adding this Scalar into the mix in this FAQ, for example. Singleton is another option that was recently proposed.

I reviewed all the issues/discourse threads I could find to look for other alternatives, but the above is just about the universe of what I found (Nullable anyone? hah). All the typical verbiage tends have mutable associations (e.g., Box).

@mbauman
Copy link
Member Author

mbauman commented May 5, 2020

What about using a lazy Fill a la FillArrays? Inspired by #33206, which gives:

julia> show(reshape([1], ()))
fill(1)

It actually fits linguistically, too — you can imagine the Fill()'ed argument as filling in across the broadcasted shape. The downside is that it's not as simple anymore — it covers far more than just a zero-dimensional array — and coordinating with FillArrays makes this more complicated yet.

@mbauman
Copy link
Member Author

mbauman commented May 5, 2020

Back to the issue at hand, I suppose this points towards another form of performance regression that this would introduce — it'd block constant propagation in the same vein that Ref blocked the constant propagation in the motivating example in #35591.

@vtjnash
Copy link
Member

vtjnash commented May 5, 2020

but it doesn't have the methods we need

Mostly, neither did Ref until it got co-opted for this use. And Some is the new name for the Nullable container. The Singleton doesn't seem quite right (it's not the only item in the set)

@tkf
Copy link
Member

tkf commented May 6, 2020

Having x[] for Some would be nice also if we have more Union{Nothing,Some}-based API like #34821 and #35699 (comment).

x = get(dict, key)
if x !== nothing
    f(x[])  # nicer than `f(something(x))`?
else
    ...
end

@rfourquet
Copy link
Member

I just proposed somewhere else

what about Only? only is already used in a container context with only one element, and there would be the property only(Only(x)) === x.

But Some and an hypothetical Only seem redundant. Anyway, only dropping an idea here.

@rfourquet rfourquet changed the title Remove inline annations from broadcast kernels Remove inline annotations from broadcast kernels May 6, 2020
@mbauman mbauman force-pushed the mb/outline-broadcast branch from 18b6944 to b688db5 Compare January 27, 2021 19:44
@vtjnash
Copy link
Member

vtjnash commented Apr 12, 2021

bump?

@mbauman mbauman force-pushed the mb/outline-broadcast branch from b688db5 to 5b8bba6 Compare April 12, 2021 18:54
@mbauman
Copy link
Member Author

mbauman commented Apr 12, 2021

I've rebased, but in doing so I was reminded that the blocker here was performance of Ref-ified scalars. The path forward here is through #39184

vtjnash added a commit that referenced this pull request Nov 23, 2021
N5N3 pushed a commit to N5N3/julia that referenced this pull request Nov 30, 2021
This removes the dependence on inlining for performance, so we also
remove `@inline`, since it can harm performance.

make Some type a zero-dim broadcast container (e.g. a scalar)

Replaces JuliaLang#35778
Replaces JuliaLang#39184
Fixes JuliaLang#39151
Refs JuliaLang#35675
Refs JuliaLang#43200
N5N3 pushed a commit to N5N3/julia that referenced this pull request Dec 1, 2021
This removes the dependence on inlining for performance, so we also
remove `@inline`, since it can harm performance.

make Some type a zero-dim broadcast container (e.g. a scalar)

Replaces JuliaLang#35778
Replaces JuliaLang#39184
Fixes JuliaLang#39151
Refs JuliaLang#35675
Refs JuliaLang#43200
N5N3 pushed a commit to N5N3/julia that referenced this pull request Dec 1, 2021
This removes the dependence on inlining for performance, so we also
remove `@inline`, since it can harm performance.

make Some type a zero-dim broadcast container (e.g. a scalar)

Replaces JuliaLang#35778
Replaces JuliaLang#39184
Fixes JuliaLang#39151
Refs JuliaLang#35675
Refs JuliaLang#43200
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants