Add `Expr(:ivdepscope)` to support not marking the entire loop body as `ivdep` #43261

N5N3 · 2021-11-30T05:14:40Z

Motivation:
Currently, @simd ivdep assumes the entire loop is free of "no loop-carried memory dependencies", which limits its usage in our broadcast system.
This PR tries to split julia.ivdep into 2 meta: julia.ivdep.begin and julia.ivdep.end, and makes the simd-loop pass only marks the access within a begin/end block as MD_mem_parallel_loop_access.
With this PR, if we find that all the args in a flat bc::Broadcasted are safe to parallelly loaded, and the dest::AbstractArray is safe to parallelly strored.
Then we can implement the copyto! kernal as:

@eval function copyto!(dest::AbstractArray, bc::Broadcasted)
    @inbounds @simd for I in eachindex(bc)
        $(Expr(:loopinfo, Symbol("julia.ivdep.begin")))    # args are safe to load
        args = _getindex(bc.args, I)
        $(Expr(:loopinfo, Symbol("julia.ivdep.end")))
        temp = bc.f(args...)                                   # bc.f might have side effect
        $(Expr(:loopinfo, Symbol("julia.ivdep.begin")))    # result is safe to store
        dest[I] = temp
        $(Expr(:loopinfo, Symbol("julia.ivdep.end")))
    end
end

If bc.f is free of memory access, then LLVM should FIND this loop vectorlizable and add no runtime check. (and we can make a .+= 1 vectorlized more easily)
If not, then let LLVM checks whether bc.f might have side effect.

Changes in this PR:
This PR only changes the pass inplementation.
And makes @simd ivdep generates a julia.ivdep.begin/end block instead of a single julia.ivdep
The usage and effect of @simd and @simd ivdep are not changed.

I'm not familiar with LLVM and I'm not sure this change is the correct way to make self-inplace broadcast vectorlizable.
All suggestions and comments are welcome.

N5N3 · 2021-12-01T08:27:10Z

After some further trial with:

06a16b8 : removes @inline, and replace Ref with Some, thanks @vtjnash's code
b5d4bc0 : implements the broadcast kernal with local ivdep marks and some optimization inspired by Workaround #28126, support SIMDing broadcast in more cases #30973.

I find that, to make self-inplace broadcast vectorlizable, we only need to tell LLVM "arrayref within bc's getindex and arrayset within dest's setindex! are parallelly-safe during the broadcast kernal". Combining this PR with the above 2 commits, on my desktop I have:

julia> using BenchmarkTools
julia> a = zeros(Float32, 128, 32); a_ = view(a, 1:128, 1:32);
julia> @btime $a .+= $a;
  280.000 ns (0 allocations: 0 bytes)  # on 1.7.0: 1.650 μs (0 allocations: 0 bytes)
julia> @btime $a .+= $a .+ $a .+ $a;
  511.979 ns (0 allocations: 0 bytes)  # on 1.7.0: 3.550 μs (0 allocations: 0 bytes) 
julia> @btime $a_ .+= $a_;
  289.286 ns (0 allocations: 0 bytes)  # on 1.7.0: 1.730 μs (0 allocations: 0 bytes)
julia> @btime $a_ .+= $a_ .+ $a_ .+ $a_;
  609.551 ns (0 allocations: 0 bytes)  # on 1.7.0: 6.440 μs (0 allocations: 0 bytes)

Some simple safety check:

julia> const p = Ref(0);
julia> a = zeros(Float32, 128, 32); b = similar(a);

julia> f(x) = x + (p[] += 0); # f has no side-effect
julia> @btime $b .= f.($a);
  221.526 ns (0 allocations: 0 bytes) # on 1.7.0: 1.440 μs (0 allocations: 0 bytes) 
julia> @btime $a .= f.($a);
  164.899 ns (0 allocations: 0 bytes) # on 1.7.0: 2.033 μs (0 allocations: 0 bytes)

julia> g(x) = x + (p[] = ~p[]); # g has side-effect
julia> @btime $b .= g.($a);
  2.411 μs (0 allocations: 0 bytes) # on 1.7.0: 2.522 μs (0 allocations: 0 bytes)
julia> @btime $a .= g.($a);
  2.411 μs (0 allocations: 0 bytes) # on 1.7.0: 2.511 μs (0 allocations: 0 bytes)

The above example shows that this change is safer than replacing @simd in broadcast kernal with @simd ivdep.
But I didn't look into why 1.7.0 faild to vectorlize b .= f.(a). (1.7.0 could vectorlize b .= f.(a) if we define f(x) as f(x) = x + p[])
Maybe more tests are needed.
Hope this is the correct way to "fix" #43153 safely.

vchuravy

I am not a fan of the current construct. Currently Expr(:loopinfo) has the semantics that it should always be the last instruction in a loop. For a begin/end construct I would rather mimic Base.Experimental.Const and @aliasscope

vchuravy · 2021-12-05T17:39:14Z

base/simdloop.jl

@@ -125,12 +126,12 @@ either case, your inner loop should have the following properties to allow vecto
 * No iteration ever waits on a previous iteration to make forward progress.
 """
 macro simd(forloop)
-    esc(compile(forloop, nothing))
+    esc(compile(forloop, Symbol("julia.ivdep.end")))


Can we keep this nothing? Makes little sense to have non-matching "begin"/"end" construct

IIUC, we'd better have something like Expr(:ivdepscope) and Expr(:popivdepscope) instead.
(and add a @ivdep macro ?).
Edit: Apparently this is beyond my competence. Can't understand why:

julia> @eval f(x) = $(Expr(:ivdepscope, :begin)) f (generic function with 1 method) julia> @code_lowered f(1) CodeInfo( 1 ─ $(Expr(:ivdepscope, :(Main.begin))) └── return nothing ) julia> @eval f(x) = $(Expr(:loopinfo, :begin)) f (generic function with 1 method) julia> @code_lowered f(1) CodeInfo( 1 ─ $(Expr(:loopinfo, :begin)) └── return nothing )

vchuravy

Expr(:loopinfo) has the dedicated meaning to attach metadata to the LLVM loop. The current ivdep implementation uses that since it is for the entire scope of the loop. See #31376 for another use-case.

For scoped ivdep I think the implementation should mirror more #31018

1. introduce `jl_ivdepscope_sym` 2. define `jl_ivdepscope_error` to thrown error message.

make `Expr(:ivdepscope, :begin/end)` lowered to `jl_ivdepscope_func`

1. let `loopinfo_mark` erase `julia.ivdepscope` if it has `julia.simd`. 2. erase `julia.ivdepscope` in unreachable branch even there's no `loopinfo_mark`. (make error message clearer)

N5N3 · 2021-12-07T12:41:15Z

I'm not sure whether this is the correct way to implement scoped ivdep:

A new meta head Expr(:ivdepscope) is introduced, which will be lowered as call void @julia.ivdepscope() with begin/end Metadata:
When loopinfo_marker found julia.simdloop, it will add parallelaccess meta based on a simple begin/end count.
During which the call void @julia.ivdepscope() will also be erased.
If there're remaining julia.ivdepscope() after all optimizaion, user might use @ivdep outside @simd for end. So I make @julia.ivdepscope() fallback to jl_ivdepscope_error to throw error.
User might write invalid code and thus leads to unreachable branch. To recover the original error message, the simdpass will erase call void @julia.ivdepscope() within basicblocks ending with unreachable

Some example:

julia> f(x) = @inbounds for i in eachindex(x)
           Base.@ivdep x[i] += i
       end
f (generic function with 1 method)

julia> f([1,2,3,4])
ERROR: Found ivdepscope outside @simd.
Stacktrace:
 [1] macro expansion
   @ .\simdloop.jl:151 [inlined]
 [2] f(x::Vector{Int64})
   @ Main .\REPL[1]:2
 [3] top-level scope
   @ REPL[2]:1

julia> f((1,2,3,4))
ERROR: MethodError: no method matching setindex!(::NTuple{4, Int64}, ::Int64, ::Int64)
Stacktrace:
 [1] macro expansion
   @ .\simdloop.jl:152 [inlined]
 [2] f(x::NTuple{4, Int64})
   @ Main .\REPL[1]:2
 [3] top-level scope
   @ REPL[3]:1

N5N3 · 2022-01-25T02:59:29Z

I guess we won't need this after #43852.

DilumAluthge requested a review from vchuravy November 30, 2021 07:04

vchuravy reviewed Dec 5, 2021

View reviewed changes

N5N3 closed this Dec 6, 2021

N5N3 reopened this Dec 6, 2021

N5N3 force-pushed the partial-ivdep branch 2 times, most recently from d184d0a to bbf2dbb Compare December 6, 2021 14:10

vchuravy requested changes Dec 6, 2021

View reviewed changes

N5N3 added 4 commits December 7, 2021 20:14

add Expr(:ivdepscope)

1e38f0e

1. introduce `jl_ivdepscope_sym` 2. define `jl_ivdepscope_error` to thrown error message.

modify codegen

3b915fb

make `Expr(:ivdepscope, :begin/end)` lowered to `jl_ivdepscope_func`

modify pass

f1043ed

1. let `loopinfo_mark` erase `julia.ivdepscope` if it has `julia.simd`. 2. erase `julia.ivdepscope` in unreachable branch even there's no `loopinfo_mark`. (make error message clearer)

add @ivdep

5d739dc

N5N3 force-pushed the partial-ivdep branch from bbf2dbb to 5d739dc Compare December 7, 2021 12:22

N5N3 requested a review from vchuravy December 7, 2021 12:41

add throw test

a51d031

N5N3 changed the title ~~Replace julia.ivdep with julia.ivdep.begin/end to support not marking the entire loop body as ivdep~~ Add Expr(:ivdepscope) to support not marking the entire loop body as ivdep Dec 8, 2021

N5N3 closed this Jan 25, 2022

N5N3 deleted the partial-ivdep branch February 17, 2022 06:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `Expr(:ivdepscope)` to support not marking the entire loop body as `ivdep` #43261

Add `Expr(:ivdepscope)` to support not marking the entire loop body as `ivdep` #43261

N5N3 commented Nov 30, 2021 •

edited

Loading

N5N3 commented Dec 1, 2021 •

edited

Loading

vchuravy left a comment

vchuravy Dec 5, 2021

N5N3 Dec 6, 2021 •

edited

Loading

vchuravy left a comment

N5N3 commented Dec 7, 2021 •

edited

Loading

N5N3 commented Jan 25, 2022

Add Expr(:ivdepscope) to support not marking the entire loop body as ivdep #43261

Add Expr(:ivdepscope) to support not marking the entire loop body as ivdep #43261

Conversation

N5N3 commented Nov 30, 2021 • edited Loading

N5N3 commented Dec 1, 2021 • edited Loading

vchuravy left a comment

Choose a reason for hiding this comment

vchuravy Dec 5, 2021

Choose a reason for hiding this comment

N5N3 Dec 6, 2021 • edited Loading

Choose a reason for hiding this comment

vchuravy left a comment

Choose a reason for hiding this comment

N5N3 commented Dec 7, 2021 • edited Loading

N5N3 commented Jan 25, 2022

Add `Expr(:ivdepscope)` to support not marking the entire loop body as `ivdep` #43261

Add `Expr(:ivdepscope)` to support not marking the entire loop body as `ivdep` #43261

N5N3 commented Nov 30, 2021 •

edited

Loading

N5N3 commented Dec 1, 2021 •

edited

Loading

N5N3 Dec 6, 2021 •

edited

Loading

N5N3 commented Dec 7, 2021 •

edited

Loading