Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround #28126, support SIMDing broadcast in more cases #30973

Closed
wants to merge 5 commits into from

Conversation

mbauman
Copy link
Member

@mbauman mbauman commented Feb 6, 2019

This is an ugly performance hack around issue #28126 in some limited (but common) cases. The problem in short: when given many arrays of the same size, LLVM has difficulty hoisting the decision of whether a given dimension should be "extruded" out of the loop. This extra indirection in the index computation seems to foil the array bounds aliasing checks, which stymies SIMDification. The solution: check to see if Julia can statically decide whether or not to extrude any dimensions in a given broadcast expression -- and if so, use a special array wrapper that flags that none of the dimensions in that array need to be extruded out in order to perform the broadcast.

No idea if tests will pass or if this actually is a good idea in general.

@nanosoldier runbenchmarks(ALL, vs = ":master") (edit: whoops, sorry Nanosoldier — should have waited for CI before sending you down a dead-end alley)

@mbauman mbauman added broadcast Applying a function over a collection compiler:simd instruction-level vectorization labels Feb 6, 2019
@nanosoldier

This comment has been minimized.

base/broadcast.jl Outdated Show resolved Hide resolved
@mbauman
Copy link
Member Author

mbauman commented Feb 6, 2019

Ok, let's try this again. @nanosoldier runbenchmarks(ALL, vs = ":master")

# any array sizes within the inner loop. Ideally this really should be something
# that Julia and/or LLVM could figure out and eliminate... and indeed they can
# for limited numbers of arguments.
if _is_static_broadcast_28126(dest, bc)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to emphasize how awful this is: we now are generating two independent inner loops for every broadcast expression — one that might be slightly more likely to vectorize (depending on array sizes), and one normal one. They've both gotta inline to avoid allocating our broadcast expression tree.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@YingboMa
Copy link
Contributor

YingboMa commented Feb 7, 2019

Is it possible to remove all the allocation in here?

tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
     k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
     k26, k27, k28, k29, k30, k31, k32, k33, k34 = (ones(1000) for i in 1:36)

function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
  dt = 0.5
  @. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
  nothing
end

foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
@allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
7072

This isn't some arbitrary example. It is a line in the Feagin14 method. https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/perform_step/feagin_rk_perform_step.jl#L737

@KristofferC KristofferC added the triage This should be discussed on a triage call label Feb 7, 2019
@KristofferC
Copy link
Member

Perhaps some compilation time comparisons before after this PR would be useful to see if there is a trade off.

mbauman and others added 5 commits March 14, 2019 16:32
This is an ugly performance hack around issue #28126 in some limited (but common) cases. The problem in short: when given many arrays of the same size, LLVM has difficulty hoisting the decision of whether a given dimension should be "extruded" out of the loop. This extra indirection in the index computation seems to foil the array bounds aliasing checks, which stymies SIMDification. The solution: check to see if _Julia_ can statically decide whether or not to extrude any dimensions in a given broadcast expression -- and if so, use a special array wrapper that flags that none of the dimensions in that array need to be extruded out in order to perform the broadcast.
* Inline the whole broadcast expression to avoid allocation

`map` has an inline limit of 16. To make sure that the whole broadcast
tree gets inlined properly, I added the `_inlined_map` function. I am not
sure if it is a good idea, but worth trying.

This PR solves the issue which I have mentioned in
2693778#issuecomment-461248258

```julia
julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
0
```

* Fix CI failure

* Stricter test (vectorization & no allocation for a 9-array bc)

* rm `_inlined_map`
@mbauman
Copy link
Member Author

mbauman commented Mar 14, 2019

So I tried a bunch of compile time tests (Julia build time, using JuliaDB/DataFrames/Plots, plot(rand(10)), various Julia tests), but the only one that showed any change was test/broadcast:

master (9f69b2e)

$ time make -C test broadcast
make: Entering directory '/home/mbauman/julia/test'
    JULIA test/broadcast
Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
running testset broadcast...
broadcast  (1) |    44.72 |   2.39 |  5.3 |    5557.00 |   344.13

Test Summary: | Pass  Total
  Overall     |  423    423
    SUCCESS
make: Leaving directory '/home/mbauman/julia/test'

real	0m49.138s
user	0m48.738s
sys	0m1.190s

mb/hack28126 (8aa5907)

$ time make -C test broadcast
make: Entering directory '/home/mbauman/julia/test'
    JULIA test/broadcast
Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
running testset broadcast...
broadcast  (1) |    53.04 |   2.99 |  5.6 |    7046.15 |   370.40

Test Summary: | Pass  Total
  Overall     |  424    424
    SUCCESS
make: Leaving directory '/home/mbauman/julia/test'

real	0m57.454s
user	0m56.996s
sys	0m1.190s

That's about a 20% regression. :-\

@KristofferC KristofferC mentioned this pull request Feb 20, 2020
56 tasks
@DilumAluthge DilumAluthge added backport 1.6 Change should be backported to release-1.6 and removed backport 1.0 labels Jul 15, 2021
@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Jul 15, 2021
@JeffBezanson
Copy link
Member

Not sure what to do here; is it worth trying to revive this? Does the problem still exist?

@mbauman
Copy link
Member Author

mbauman commented Jul 15, 2021

The PR to revive (first) is #35675. Then the concern in #30973 (comment) vanishes.

@chriselrod
Copy link
Contributor

chriselrod commented Jul 15, 2021

Not sure what to do here; is it worth trying to revive this? Does the problem still exist?

Yes.

using FastBroadcast, BenchmarkTools
const tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
     k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
     k26, k27, k28, k29, k30, k31, k32, k33, k34 = (ones(1000) for i in 1:36)

function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
  dt = 0.5
  @. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
  nothing
end
function foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
  dt = 0.5
  @.. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
  nothing
end

@benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
@benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)

I get:

julia> @benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min  max):  60.687 μs   2.011 ms  ┊ GC (min  max): 0.00%  93.25%
 Time  (median):     61.530 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   63.451 μs ± 41.939 μs  ┊ GC (mean ± σ):  1.56% ±  2.29%

  ▁▅▆██▇▆▅▃▂▂▂▂▃▃▂▂▁                ▁ ▁▁▂▁▁▁▁▁   ▁            ▂
  ██████████████████████▆▇▆▆▆▇▆▇▇▇██████████████████▇▇▇▇▆▅▆▄▆ █
  60.7 μs      Histogram: log(frequency) by time      70.8 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 625.

julia> @benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min  max):  25.045 μs   2.118 ms  ┊ GC (min  max): 0.00%  96.72%
 Time  (median):     25.929 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.123 μs ± 50.510 μs  ┊ GC (mean ± σ):  4.66% ±  2.56%

  ▁▅▇██▇▆▅▄▃▂▁     ▁▁▁▁ ▁          ▁ ▁ ▁▁▁▁▁▁                 ▂
  ██████████████▇██████████▇▆▇▆▆▇███▇█████████████▇██▇▇▇▆▆▆▆▆ █
  25 μs        Histogram: log(frequency) by time      35.8 μs <

 Memory estimate: 31.94 KiB, allocs estimate: 663.

Adding parenthesis removes the allocations:

julia> function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
         dt = 0.5
         @. tmp = uprev + dt*(0.00*k1 + (0.01*k2 + (0.02*k3 + (0.04*k5 + (0.06*k7 + (0.07*k8 + (0.09*k10 + (0.10*k11 + (0.11*k12 + (0.12*k13 + (0.13*k14 + (0.14*k15 + (0.15*k16 + (0.16*k17 + (0.17*k18 + (0.18*k19 + (0.19*k20 + (0.20*k21 + (0.21*k22 + (0.22*k23 + (0.23*k24 + (0.24*k25 + (0.25*k26 + (0.26*k27 + (0.27*k28 + (0.28*k29 + (0.29*k30 + (0.30*k31 + (0.31*k32 + (0.32*k33 + (0.33*k34)))))))))))))))))))))))))))))))
         nothing
       end
foo (generic function with 1 method)

julia> function foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
         dt = 0.5
         @.. tmp = uprev + dt*(0.00*k1 + (0.01*k2 + (0.02*k3 + (0.04*k5 + (0.06*k7 + (0.07*k8 + (0.09*k10 + (0.10*k11 + (0.11*k12 + (0.12*k13 + (0.13*k14 + (0.14*k15 + (0.15*k16 + (0.16*k17 + (0.17*k18 + (0.18*k19 + (0.19*k20 + (0.20*k21 + (0.21*k22 + (0.22*k23 + (0.23*k24 + (0.24*k25 + (0.25*k26 + (0.26*k27 + (0.27*k28 + (0.28*k29 + (0.29*k30 + (0.30*k31 + (0.31*k32 + (0.32*k33 + (0.33*k34)))))))))))))))))))))))))))))))
         nothing
       end
foo_fb (generic function with 1 method)

julia> @benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min  max):  45.152 μs   88.212 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     45.210 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.274 μs ± 620.213 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆██▅                                                 ▁▂▁    ▂
  ▇█████▄▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▇█████▇▇ █
  45.2 μs       Histogram: log(frequency) by time      46.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BechmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.546 μs   6.567 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.560 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.565 μs ± 63.992 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃█▇▂
  ▂▂▄▇████▆▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂ ▃
  2.55 μs        Histogram: frequency by time        2.68 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@N5N3
Copy link
Member

N5N3 commented Nov 30, 2021

I think there's no need to limit this optimization to inputs with the same axes.
Something like ExtrudedWithoutDim1 should be enough, as @simd only tries to vectorlize the innermost loop.

@vtjnash
Copy link
Member

vtjnash commented Aug 24, 2023

Seems no longer needed

@vtjnash vtjnash deleted the mb/hack28126 branch August 24, 2023 19:24
@KristofferC KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Oct 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection compiler:simd instruction-level vectorization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants