Remove inline annotations from broadcast kernels #35675
base: master
Conversation
@nanosoldier

On master:

On this PR:
```diff
@@ -1102,15 +1102,15 @@ struct BitMaskedBitArray{N,M}
     mask::BitArray{M}
     BitMaskedBitArray{N,M}(parent, mask) where {N,M} = new(parent, mask)
 end
-@inline function BitMaskedBitArray(parent::BitArray{N}, mask::BitArray{M}) where {N,M}
+function BitMaskedBitArray(parent::BitArray{N}, mask::BitArray{M}) where {N,M}
```
```diff
-function BitMaskedBitArray(parent::BitArray{N}, mask::BitArray{M}) where {N,M}
+@inline function BitMaskedBitArray(parent::BitArray{N}, mask::BitArray{M}) where {N,M}
```

This one was overzealous — we should keep the force `@inline` here for the `@boundscheck`. It currently still inlines, but I think it's good practice to keep this here in case something else in here gets big enough to lose the auto-inline.
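As context for why the force `@inline` matters for `@boundscheck`: in Julia, a callee's `@boundscheck` block can only be elided by a caller's `@inbounds` when the callee is inlined into that caller. A minimal sketch of that behavior (the names `getval` and `sum_all` are hypothetical, not from this PR):

```julia
# Sketch: `@boundscheck` elision depends on inlining. The check inside
# `getval` is removed by the caller's `@inbounds` only because `getval`
# is force-inlined into `sum_all`.
@inline function getval(v::Vector{Int}, i::Int)
    @boundscheck checkbounds(v, i)
    return @inbounds v[i]
end

function sum_all(v::Vector{Int})
    s = 0
    for i in eachindex(v)
        @inbounds s += getval(v, i)  # `checkbounds` elided via inlining
    end
    return s
end

sum_all([1, 2, 3])  # 6
```

If `getval` ever grew past the auto-inlining threshold and lost its `@inline`, the `@boundscheck` body would run on every iteration despite the `@inbounds`.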
Beautiful!

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

Very impressive gain!

Making the inlining cost model more accurate by making reality match the model FTW.

Well that turned out exceptionally well! I wasn't expecting any run-time performance gains but was fearful of losses. Looks like there are just a handful of extra allocations in a few cases... but it's in exchange for a 20%+ compile time speedup in the unit tests. I'll see if anything can be done about those allocations, but I think the fact that this will allow for future runtime improvements (like #30973) will make the broadcast superusers more than pleased.
Ok, the regressions are limited to broadcast expressions that use literal pows (like …).
Maybe …
The main confusion with that might be that it's not really a …
Oh interesting. I reviewed all the issues/discourse threads I could find to look for other alternatives, but the above is just about the universe of what I found (…).
What about using a lazy …

```julia
julia> show(reshape([1], ()))
fill(1)
```

It actually fits linguistically, too — you can imagine the …
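For reference, a zero-dimensional array already behaves like a scalar under broadcasting while still being a container, which is why the `fill(1)` spelling reads well here (a sketch of stock Julia behavior, not of this PR's changes):

```julia
x = fill(2)        # 0-dimensional Array{Int,0}
size(x)            # ()
x[]                # 2: indexed with zero indices
[1, 2, 3] .+ x     # [3, 4, 5]: participates in broadcast like a scalar
```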
Back to the issue at hand, I suppose this points towards another form of performance regression that this would introduce — it'd block constant propagation in the same vein that …
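To illustrate the literal-pow concern mentioned earlier (again, a sketch of stock Julia behavior, not of this PR's changes): `x .^ 2` lowers to a `Base.literal_pow` broadcast with the exponent encoded as `Val(2)`, so any wrapper that hides the exponent from the compiler loses that compile-time constant and falls back to the generic power path.

```julia
x = [1.0, 2.0, 3.0]
x .^ 2                            # [1.0, 4.0, 9.0] via the literal-pow fast path
# The line above lowers to (approximately):
Base.literal_pow.(^, x, Val(2))   # [1.0, 4.0, 9.0]
# A runtime exponent takes the generic `^` path instead:
p = 2
x .^ p
```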
Mostly, neither did …
Having …

```julia
x = get(dict, key)
if x !== nothing
    f(x[]) # nicer than `f(something(x))`?
else
    ...
end
```
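The `get(dict, key)` variant sketched above isn't in Base (two-argument `get` on a `Dict` is an error); a hypothetical version of it can be written with the existing `Some`/`nothing` machinery. The name `maybe_get` is made up here for illustration:

```julia
# Hypothetical helper: return `Some(value)` if the key exists, else `nothing`,
# so a stored value of `nothing` is distinguishable from a missing key.
maybe_get(d::AbstractDict, k) = haskey(d, k) ? Some(d[k]) : nothing

d = Dict("a" => nothing, "b" => 1)
maybe_get(d, "a")             # Some(nothing): key present, value is `nothing`
maybe_get(d, "c")             # nothing: key absent
something(maybe_get(d, "b"))  # 1: unwrap with `something`
```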
I just proposed somewhere else
But …
Force-pushed from 18b6944 to b688db5.
bump?
Force-pushed from b688db5 to 5b8bba6.
I've rebased, but in doing so I was reminded that the blocker here was performance of …
This removes the dependence on inlining for performance, so we also remove `@inline`, since it can harm performance.

make Some type a zero-dim broadcast container (e.g. a scalar)

Replaces JuliaLang#35778
Replaces JuliaLang#39184
Fixes JuliaLang#39151
Refs JuliaLang#35675
Refs JuliaLang#43200
In my cursory spot-tests, it appears as though we no longer need to force inlining the whole way through to the innermost loops of broadcast. This is huge, and in my naive understanding I think it'll greatly improve codegen times and sizes. It'll also allow for embedding more alternative loop designs as they no longer need to be inlined into the same function body — solving my reservations in #30973.
I've kept the preparatory `@inline`s for now, just removing them on the actual implementation.

Making this an early PR just to allow a Nanosoldier run.