Workaround #28126, support SIMDing broadcast in more cases #30973

mbauman · 2019-02-06T07:23:12Z

This is an ugly performance hack around issue #28126 in some limited (but common) cases. The problem in short: when given many arrays of the same size, LLVM has difficulty hoisting the decision of whether a given dimension should be "extruded" out of the loop. This extra indirection in the index computation seems to foil the array bounds aliasing checks, which stymies SIMDification. The solution: check to see if Julia can statically decide whether or not to extrude any dimensions in a given broadcast expression -- and if so, use a special array wrapper that flags that none of the dimensions in that array need to be extruded out in order to perform the broadcast.

No idea if tests will pass or if this actually is a good idea in general.

@nanosoldier runbenchmarks(ALL, vs = ":master") (edit: whoops, sorry Nanosoldier — should have waited for CI before sending you down a dead-end alley)

base/broadcast.jl

mbauman · 2019-02-06T16:05:51Z

Ok, let's try this again. @nanosoldier runbenchmarks(ALL, vs = ":master")

mbauman · 2019-02-06T16:40:17Z

base/broadcast.jl

+    # any array sizes within the inner loop. Ideally this really should be something
+    # that Julia and/or LLVM could figure out and eliminate... and indeed they can
+    # for limited numbers of arguments.
+    if _is_static_broadcast_28126(dest, bc)


I just want to emphasize how awful this is: we now are generating two independent inner loops for every broadcast expression — one that might be slightly more likely to vectorize (depending on array sizes), and one normal one. They've both gotta inline to avoid allocating our broadcast expression tree.

nanosoldier · 2019-02-06T23:04:29Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

YingboMa · 2019-02-07T00:50:38Z

Is it possible to remove all the allocation in here?

tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
     k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
     k26, k27, k28, k29, k30, k31, k32, k33, k34 = (ones(1000) for i in 1:36)

function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
  dt = 0.5
  @. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
  nothing
end

foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
@allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)

julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
7072

This isn't some arbitrary example. It is a line in the Feagin14 method. https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/perform_step/feagin_rk_perform_step.jl#L737

KristofferC · 2019-02-07T15:52:46Z

Perhaps some compilation time comparisons before after this PR would be useful to see if there is a trade off.

This is an ugly performance hack around issue #28126 in some limited (but common) cases. The problem in short: when given many arrays of the same size, LLVM has difficulty hoisting the decision of whether a given dimension should be "extruded" out of the loop. This extra indirection in the index computation seems to foil the array bounds aliasing checks, which stymies SIMDification. The solution: check to see if _Julia_ can statically decide whether or not to extrude any dimensions in a given broadcast expression -- and if so, use a special array wrapper that flags that none of the dimensions in that array need to be extruded out in order to perform the broadcast.

@allocated

* Inline the whole broadcast expression to avoid allocation `map` has an inline limit of 16. To make sure that the whole broadcast tree gets inlined properly, I added the `_inlined_map` function. I am not sure if it is a good idea, but worth trying. This PR solves the issue which I have mentioned in 2693778#issuecomment-461248258 ```julia julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34) 0 ``` * Fix CI failure * Stricter test (vectorization & no allocation for a 9-array bc) * rm `_inlined_map`

mbauman · 2019-03-14T20:56:33Z

So I tried a bunch of compile time tests (Julia build time, using JuliaDB/DataFrames/Plots, plot(rand(10)), various Julia tests), but the only one that showed any change was test/broadcast:

master (`9f69b2e`)

$ time make -C test broadcast
make: Entering directory '/home/mbauman/julia/test'
    JULIA test/broadcast
Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
running testset broadcast...
broadcast  (1) |    44.72 |   2.39 |  5.3 |    5557.00 |   344.13

Test Summary: | Pass  Total
  Overall     |  423    423
    SUCCESS
make: Leaving directory '/home/mbauman/julia/test'

real	0m49.138s
user	0m48.738s
sys	0m1.190s

mb/hack28126 (`8aa5907`)

$ time make -C test broadcast
make: Entering directory '/home/mbauman/julia/test'
    JULIA test/broadcast
Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
running testset broadcast...
broadcast  (1) |    53.04 |   2.99 |  5.6 |    7046.15 |   370.40

Test Summary: | Pass  Total
  Overall     |  424    424
    SUCCESS
make: Leaving directory '/home/mbauman/julia/test'

real	0m57.454s
user	0m56.996s
sys	0m1.190s

That's about a 20% regression. :-\

JeffBezanson · 2021-07-15T19:00:07Z

Not sure what to do here; is it worth trying to revive this? Does the problem still exist?

mbauman · 2021-07-15T19:04:39Z

The PR to revive (first) is #35675. Then the concern in #30973 (comment) vanishes.

chriselrod · 2021-07-15T19:08:10Z

Not sure what to do here; is it worth trying to revive this? Does the problem still exist?

Yes.

using FastBroadcast, BenchmarkTools
const tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
     k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
     k26, k27, k28, k29, k30, k31, k32, k33, k34 = (ones(1000) for i in 1:36)

function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
  dt = 0.5
  @. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
  nothing
end
function foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
             k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
             k26, k27, k28, k29, k30, k31, k32, k33, k34)
  dt = 0.5
  @.. tmp = uprev + dt*(0.00*k1 + 0.01*k2 + 0.02*k3 + 0.04*k5 + 0.06*k7 + 0.07*k8 + 0.09*k10 + 0.10*k11 + 0.11*k12 + 0.12*k13 + 0.13*k14 + 0.14*k15 + 0.15*k16 + 0.16*k17 + 0.17*k18 + 0.18*k19 + 0.19*k20 + 0.20*k21 + 0.21*k22 + 0.22*k23 + 0.23*k24 + 0.24*k25 + 0.25*k26 + 0.26*k27 + 0.27*k28 + 0.28*k29 + 0.29*k30 + 0.30*k31 + 0.31*k32 + 0.32*k33 + 0.33*k34)
  nothing
end

@benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
@benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)

I get:

julia> @benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  60.687 μs …  2.011 ms  ┊ GC (min … max): 0.00% … 93.25%
 Time  (median):     61.530 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   63.451 μs ± 41.939 μs  ┊ GC (mean ± σ):  1.56% ±  2.29%

  ▁▅▆██▇▆▅▃▂▂▂▂▃▃▂▂▁                ▁ ▁▁▂▁▁▁▁▁   ▁            ▂
  ██████████████████████▆▇▆▆▆▇▆▇▇▇██████████████████▇▇▇▇▆▅▆▄▆ █
  60.7 μs      Histogram: log(frequency) by time      70.8 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 625.

julia> @benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  25.045 μs …  2.118 ms  ┊ GC (min … max): 0.00% … 96.72%
 Time  (median):     25.929 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.123 μs ± 50.510 μs  ┊ GC (mean ± σ):  4.66% ±  2.56%

  ▁▅▇██▇▆▅▄▃▂▁     ▁▁▁▁ ▁          ▁ ▁ ▁▁▁▁▁▁                 ▂
  ██████████████▇██████████▇▆▇▆▆▇███▇█████████████▇██▇▇▇▆▆▆▆▆ █
  25 μs        Histogram: log(frequency) by time      35.8 μs <

 Memory estimate: 31.94 KiB, allocs estimate: 663.

Adding parenthesis removes the allocations:

julia> function foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
         dt = 0.5
         @. tmp = uprev + dt*(0.00*k1 + (0.01*k2 + (0.02*k3 + (0.04*k5 + (0.06*k7 + (0.07*k8 + (0.09*k10 + (0.10*k11 + (0.11*k12 + (0.12*k13 + (0.13*k14 + (0.14*k15 + (0.15*k16 + (0.16*k17 + (0.17*k18 + (0.18*k19 + (0.19*k20 + (0.20*k21 + (0.21*k22 + (0.22*k23 + (0.23*k24 + (0.24*k25 + (0.25*k26 + (0.26*k27 + (0.27*k28 + (0.28*k29 + (0.29*k30 + (0.30*k31 + (0.31*k32 + (0.32*k33 + (0.33*k34)))))))))))))))))))))))))))))))
         nothing
       end
foo (generic function with 1 method)

julia> function foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
         dt = 0.5
         @.. tmp = uprev + dt*(0.00*k1 + (0.01*k2 + (0.02*k3 + (0.04*k5 + (0.06*k7 + (0.07*k8 + (0.09*k10 + (0.10*k11 + (0.11*k12 + (0.12*k13 + (0.13*k14 + (0.14*k15 + (0.15*k16 + (0.16*k17 + (0.17*k18 + (0.18*k19 + (0.19*k20 + (0.20*k21 + (0.21*k22 + (0.22*k23 + (0.23*k24 + (0.24*k25 + (0.25*k26 + (0.26*k27 + (0.27*k28 + (0.28*k29 + (0.29*k30 + (0.30*k31 + (0.31*k32 + (0.32*k33 + (0.33*k34)))))))))))))))))))))))))))))))
         nothing
       end
foo_fb (generic function with 1 method)

julia> @benchmark foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  45.152 μs …  88.212 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.210 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.274 μs ± 620.213 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆██▅                                                 ▁▂▁    ▂
  ▇█████▄▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▇█████▇▇ █
  45.2 μs       Histogram: log(frequency) by time      46.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo_fb(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25, k26, k27, k28, k29, k30, k31, k32, k33, k34)
BechmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.546 μs …  6.567 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.560 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.565 μs ± 63.992 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃█▇▂
  ▂▂▄▇████▆▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂ ▃
  2.55 μs        Histogram: frequency by time        2.68 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

N5N3 · 2021-11-30T15:04:45Z

I think there's no need to limit this optimization to inputs with the same axes.
Something like ExtrudedWithoutDim1 should be enough, as @simd only tries to vectorlize the innermost loop.

vtjnash · 2023-08-24T19:24:56Z

Seems no longer needed

mbauman added broadcast Applying a function over a collection compiler:simd instruction-level vectorization labels Feb 6, 2019

mbauman force-pushed the mb/hack28126 branch from 1e6ec76 to ec97968 Compare February 6, 2019 07:25

This comment has been minimized.

Sign in to view

YingboMa reviewed Feb 6, 2019

View reviewed changes

base/broadcast.jl Outdated Show resolved Hide resolved

YingboMa mentioned this pull request Feb 6, 2019

Not fully broadcasting methods SciML/OrdinaryDiffEq.jl#106

Closed

39 tasks

mbauman commented Feb 6, 2019

View reviewed changes

mcabbott mentioned this pull request Feb 6, 2019

Speed of sum(BroadcastArray) JuliaArrays/LazyArrays.jl#16

Open

YingboMa mentioned this pull request Feb 7, 2019

Inline the whole broadcast expression to avoid allocation #30986

Merged

ViralBShah added the backport 1.1 label Feb 7, 2019

KristofferC added the triage This should be discussed on a triage call label Feb 7, 2019

mbauman and others added 5 commits March 14, 2019 16:32

Fixup NonExtruded indexing

3120724

inline and add test case

4261ac8

Move vectorization test to boundscheck

8aa5907

mbauman force-pushed the mb/hack28126 branch from 2543c3d to 8aa5907 Compare March 14, 2019 20:56

mbauman mentioned this pull request May 16, 2019

Broadcast and linear indexing #32051

Open

JeffBezanson added backport 1.0 and removed backport 1.1 labels Dec 30, 2019

KristofferC mentioned this pull request Feb 20, 2020

WIP: Backports release 1.0.6 #34011

Closed

56 tasks

mbauman mentioned this pull request Apr 30, 2020

Remove inline annotations from broadcast kernels #35675

Open

DilumAluthge added backport 1.6 Change should be backported to release-1.6 and removed backport 1.0 labels Jul 15, 2021

JeffBezanson removed the triage This should be discussed on a triage call label Jul 15, 2021

N5N3 mentioned this pull request Dec 1, 2021

Add Expr(:ivdepscope) to support not marking the entire loop body as ivdep #43261

Closed

YingboMa mentioned this pull request Mar 13, 2022

Broadcasting is much slower than a for loop #28126

Open

N5N3 mentioned this pull request Dec 15, 2022

Inferior performance of inplace broadcasting over map! for multiple wide matrices #47873

Open

vtjnash closed this Aug 24, 2023

vtjnash deleted the mb/hack28126 branch August 24, 2023 19:24

KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Oct 11, 2023

charleskawczynski mentioned this pull request Oct 18, 2024

Improve CUDA performance for multi-dimensional arrays CliMA/MultiBroadcastFusion.jl#48

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workaround #28126, support SIMDing broadcast in more cases #30973

Workaround #28126, support SIMDing broadcast in more cases #30973

mbauman commented Feb 6, 2019 •

edited

Loading

This comment has been minimized.

mbauman commented Feb 6, 2019

mbauman Feb 6, 2019

nanosoldier commented Feb 6, 2019

YingboMa commented Feb 7, 2019

KristofferC commented Feb 7, 2019

mbauman commented Mar 14, 2019 •

edited

Loading

JeffBezanson commented Jul 15, 2021

mbauman commented Jul 15, 2021

chriselrod commented Jul 15, 2021 •

edited

Loading

N5N3 commented Nov 30, 2021

vtjnash commented Aug 24, 2023

Workaround #28126, support SIMDing broadcast in more cases #30973

Workaround #28126, support SIMDing broadcast in more cases #30973

Conversation

mbauman commented Feb 6, 2019 • edited Loading

This comment has been minimized.

mbauman commented Feb 6, 2019

mbauman Feb 6, 2019

Choose a reason for hiding this comment

nanosoldier commented Feb 6, 2019

YingboMa commented Feb 7, 2019

KristofferC commented Feb 7, 2019

mbauman commented Mar 14, 2019 • edited Loading

master (9f69b2e)

mb/hack28126 (8aa5907)

JeffBezanson commented Jul 15, 2021

mbauman commented Jul 15, 2021

chriselrod commented Jul 15, 2021 • edited Loading

N5N3 commented Nov 30, 2021

vtjnash commented Aug 24, 2023

mbauman commented Feb 6, 2019 •

edited

Loading

mbauman commented Mar 14, 2019 •

edited

Loading

master (`9f69b2e`)

mb/hack28126 (`8aa5907`)

chriselrod commented Jul 15, 2021 •

edited

Loading