Workaround #28126, support SIMDing broadcast in more cases #30973
Conversation
Ok, let's try this again. @nanosoldier
```julia
# any array sizes within the inner loop. Ideally this really should be something
# that Julia and/or LLVM could figure out and eliminate... and indeed they can
# for limited numbers of arguments.
if _is_static_broadcast_28126(dest, bc)
```
I just want to emphasize how awful this is: we now are generating two independent inner loops for every broadcast expression — one that might be slightly more likely to vectorize (depending on array sizes), and one normal one. They've both gotta inline to avoid allocating our broadcast expression tree.
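To spell out the structure being criticized, here is a minimal sketch (invented names, simplified to one fixed two-argument kernel; the real check in this PR is `_is_static_broadcast_28126`):

```julia
# Sketch only: two inner loops for the same computation. The fast path runs
# when shapes are known to match, indexes every argument directly, and is
# more likely to SIMD; the other is the ordinary fallback. Both bodies must
# inline the kernel to avoid materializing the broadcast expression tree.
function copyto_sketch!(dest::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    if axes(dest) == axes(a) == axes(b)    # stands in for the static check
        @inbounds @simd for i in eachindex(dest)
            dest[i] = a[i] + b[i]          # direct indexing, vectorizes
        end
    else
        broadcast!(+, dest, a, b)          # the normal broadcast machinery
    end
    return dest
end
```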
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Is it possible to remove all the allocation in here?

```julia
tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
    k26, k27, k28, k29, k30, k31, k32, k33, k34 = (ones(1000) for i in 1:36)

function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
    dt = 0.5
    @. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
    nothing
end

foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
@allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
               k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
               k26, k27, k28, k29, k30, k31, k32, k33, k34)
```

```julia
julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                      k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                      k26, k27, k28, k29, k30, k31, k32, k33, k34)
7072
```

This isn't some arbitrary example. It is a line in the
Perhaps some compilation time comparisons before and after this PR would be useful to see if there is a trade-off.
* Inline the whole broadcast expression to avoid allocation

  `map` has an inline limit of 16. To make sure that the whole broadcast tree gets inlined properly, I added the `_inlined_map` function. I am not sure if it is a good idea, but worth trying. This PR solves the issue which I have mentioned in 2693778#issuecomment-461248258:

  ```julia
  julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
  0
  ```

* Fix CI failure
* Stricter test (vectorization & no allocation for a 9-array bc)
* rm `_inlined_map`
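For reference, a minimal sketch of what an `_inlined_map`-style helper can look like (an illustration under my own naming, not the PR's actual definition): recursing on the tuple with an explicit `@inline` at each step sidesteps the length-based inlining cutoff that plain `map` over long tuples runs into.

```julia
# Illustration only (hypothetical name). Each recursive step is tiny and
# explicitly @inline, so the compiler unrolls the whole tuple traversal
# regardless of its length, whereas `map` over a long tuple can hit the
# inlining heuristic and allocate.
@inline inlined_map_sketch(f, ::Tuple{}) = ()
@inline inlined_map_sketch(f, t::Tuple) =
    (f(first(t)), inlined_map_sketch(f, Base.tail(t))...)
```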
So I tried a bunch of compile time tests (Julia build time, `using` JuliaDB/DataFrames/Plots), comparing master (9f69b2e) against mb/hack28126 (8aa5907). That's about a 20% regression. :-\
Not sure what to do here; is it worth trying to revive this? Does the problem still exist?
The PR to revive (first) is #35675. Then the concern in #30973 (comment) vanishes.
Yes.

```julia
using FastBroadcast, BenchmarkTools

const tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
      k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
      k26, k27, k28, k29, k30, k31, k32, k33, k34 = (ones(1000) for i in 1:36)

function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
    dt = 0.5
    @. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
    nothing
end

function foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                k26, k27, k28, k29, k30, k31, k32, k33, k34)
    dt = 0.5
    @.. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
    nothing
end

@benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
@benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
```

I get:

```julia
julia> @benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  60.687 μs … 2.011 ms  ┊ GC (min … max): 0.00% … 93.25%
 Time  (median):     61.530 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   63.451 μs ± 41.939 μs ┊ GC (mean ± σ):  1.56% ± 2.29%

  ▁▅▆██▇▆▅▃▂▂▂▂▃▃▂▂▁ ▁ ▁▁▂▁▁▁▁▁ ▁                            ▂
  ██████████████████████▆▇▆▆▆▇▆▇▇▇██████████████████▇▇▇▇▆▅▆▄▆ █
  60.7 μs      Histogram: log(frequency) by time      70.8 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 625.

julia> @benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  25.045 μs … 2.118 ms  ┊ GC (min … max): 0.00% … 96.72%
 Time  (median):     25.929 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.123 μs ± 50.510 μs ┊ GC (mean ± σ):  4.66% ± 2.56%

  ▁▅▇██▇▆▅▄▃▂▁ ▁▁▁▁ ▁ ▁ ▁ ▁▁▁▁▁▁                             ▂
  ██████████████▇██████████▇▆▇▆▆▇███▇█████████████▇██▇▇▇▆▆▆▆▆ █
  25 μs       Histogram: log(frequency) by time       35.8 μs <

 Memory estimate: 31.94 KiB, allocs estimate: 663.
```

Adding parentheses removes the allocations:

```julia
julia> function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
           dt = 0.5
           @. tmp = uprev + dt*(0.00*k1 + (0.01*k2 + (0.02*k3 + (0.04*k5 + (0.06*k7 + (0.07*k8 + (0.09*k10 + (0.10*k11 + (0.11*k12 + (0.12*k13 + (0.13*k14 + (0.14*k15 + (0.15*k16 + (0.16*k17 + (0.17*k18 + (0.18*k19 + (0.19*k20 + (0.20*k21 + (0.21*k22 + (0.22*k23 + (0.23*k24 + (0.24*k25 + (0.25*k26 + (0.26*k27 + (0.27*k28 + (0.28*k29 + (0.29*k30 + (0.30*k31 + (0.31*k32 + (0.32*k33 + (0.33*k34)))))))))))))))))))))))))))))))
           nothing
       end
foo (generic function with 1 method)

julia> function foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                       k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                       k26, k27, k28, k29, k30, k31, k32, k33, k34)
           dt = 0.5
           @.. tmp = uprev + dt*(0.00*k1 + (0.01*k2 + (0.02*k3 + (0.04*k5 + (0.06*k7 + (0.07*k8 + (0.09*k10 + (0.10*k11 + (0.11*k12 + (0.12*k13 + (0.13*k14 + (0.14*k15 + (0.15*k16 + (0.16*k17 + (0.17*k18 + (0.18*k19 + (0.19*k20 + (0.20*k21 + (0.21*k22 + (0.22*k23 + (0.23*k24 + (0.24*k25 + (0.25*k26 + (0.26*k27 + (0.27*k28 + (0.28*k29 + (0.29*k30 + (0.30*k31 + (0.31*k32 + (0.32*k33 + (0.33*k34)))))))))))))))))))))))))))))))
           nothing
       end
foo_fb (generic function with 1 method)

julia> @benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.152 μs … 88.212 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.210 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.274 μs ± 620.213 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆██▅           ▁▂▁                                          ▂
  ▇█████▄▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▇█████▇▇ █
  45.2 μs      Histogram: log(frequency) by time      46.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.546 μs … 6.567 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.560 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.565 μs ± 63.992 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃█▇▂                                                 ▃
  ▂▂▄▇████▆▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂ ▃
  2.55 μs      Histogram: frequency by time       2.68 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
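One plausible reading of why the parentheses matter (my gloss, not stated in the thread): chained `.+` parses as a single n-ary call, so the flat sum lowers to one broadcast node with 30+ arguments, and the tuple operations over that argument list blow past the inlining limits discussed above; the parenthesized version nests two-argument nodes that each inline cleanly. The parse difference is easy to inspect:

```julia
# Chained dotted `+` parses as one wide n-ary call; explicit parentheses
# trade that width for depth, keeping every node at two operands.
ex_flat = Meta.parse("k1 .+ k2 .+ k3 .+ k4")
length(ex_flat.args)    # 5: the `.+` operator plus four operands in one node

ex_nested = Meta.parse("k1 .+ (k2 .+ (k3 .+ k4))")
length(ex_nested.args)  # 3: the `.+` operator plus two operands
```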
I think there's no need to limit this optimization to inputs with the same axes.
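A sketch of what relaxing that limitation could mean (my reading; the helper is hypothetical): the no-extrusion fast path can be decided per argument, since an argument whose axes already match the destination's never needs extruding even when other arguments are being broadcast.

```julia
# Hypothetical per-argument check: only arguments whose axes differ from the
# destination's need the extrusion bookkeeping; the rest could be indexed
# directly even inside an otherwise shape-mixing broadcast.
needs_extrusion(dest, x::AbstractArray) = axes(x) != axes(dest)
needs_extrusion(dest, x) = false   # non-arrays carry no per-dimension indexing
```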
Seems no longer needed.
This is an ugly performance hack around issue #28126 in some limited (but common) cases. The problem in short: when given many arrays of the same size, LLVM has difficulty hoisting the decision of whether a given dimension should be "extruded" out of the loop. This extra indirection in the index computation seems to foil the array bounds aliasing checks, which stymies SIMDification. The solution: check to see if Julia can statically decide whether or not to extrude any dimensions in a given broadcast expression -- and if so, use a special array wrapper that flags that none of the dimensions in that array need to be extruded out in order to perform the broadcast.
No idea if tests will pass or if this actually is a good idea in general.
@nanosoldier `runbenchmarks(ALL, vs = ":master")`
(edit: whoops, sorry Nanosoldier — should have waited for CI before sending you down a dead-end alley)
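To make the "extrusion" the description talks about concrete, here is a rough sketch of the indirection involved (simplified; the names only approximate Base.Broadcast's `Extruded`/`newindex` machinery rather than reproduce it):

```julia
# Per dimension, broadcasting records whether an argument actually spans that
# dimension (`keep`) or is virtually repeated along it; each element access
# then maps the loop index through an ifelse. The indirection is branchless
# and cheap, but with many arguments LLVM fails to hoist it out of the inner
# loop, which defeats the aliasing checks and blocks SIMD.
newindex_sketch(i::Int, keep::Bool, default::Int) = ifelse(keep, i, default)

# The workaround's fast path: when Julia can prove statically that every
# dimension is kept (all axes match), the wrapper promises `newindex` is the
# identity, and the inner loop indexes its arguments directly.
```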