Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Expr(:ivdepscope) to support not marking the entire loop body as ivdep #43261

Closed
wants to merge 5 commits into from

Conversation

N5N3
Copy link
Member

@N5N3 N5N3 commented Nov 30, 2021

  • Motivation:
    Currently, @simd ivdep assumes the entire loop is free of "no loop-carried memory dependencies", which limits its usage in our broadcast system.
    This PR tries to split julia.ivdep into 2 meta: julia.ivdep.begin and julia.ivdep.end, and makes the simd-loop pass only marks the access within a begin/end block as MD_mem_parallel_loop_access.
    With this PR, if we find that all the args in a flat bc::Broadcasted are safe to parallelly loaded, and the dest::AbstractArray is safe to parallelly strored.
    Then we can implement the copyto! kernal as:
@eval function copyto!(dest::AbstractArray, bc::Broadcasted)
    @inbounds @simd for I in eachindex(bc)
        $(Expr(:loopinfo, Symbol("julia.ivdep.begin")))    # args are safe to load
        args = _getindex(bc.args, I)
        $(Expr(:loopinfo, Symbol("julia.ivdep.end")))
        temp = bc.f(args...)                                   # bc.f might have side effect
        $(Expr(:loopinfo, Symbol("julia.ivdep.begin")))    # result is safe to store
        dest[I] = temp
        $(Expr(:loopinfo, Symbol("julia.ivdep.end")))
    end
end

If bc.f is free of memory access, then LLVM should FIND this loop vectorlizable and add no runtime check. (and we can make a .+= 1 vectorlized more easily)
If not, then let LLVM checks whether bc.f might have side effect.

  • Changes in this PR:
    This PR only changes the pass inplementation.
    And makes @simd ivdep generates a julia.ivdep.begin/end block instead of a single julia.ivdep
    The usage and effect of @simd and @simd ivdep are not changed.

I'm not familiar with LLVM and I'm not sure this change is the correct way to make self-inplace broadcast vectorlizable.
All suggestions and comments are welcome.

@N5N3
Copy link
Member Author

N5N3 commented Dec 1, 2021

After some further trial with:

  1. 06a16b8 : removes @inline, and replace Ref with Some, thanks @vtjnash's code
  2. b5d4bc0 : implements the broadcast kernal with local ivdep marks and some optimization inspired by Workaround #28126, support SIMDing broadcast in more cases #30973.

I find that, to make self-inplace broadcast vectorlizable, we only need to tell LLVM "arrayref within bc's getindex and arrayset within dest's setindex! are parallelly-safe during the broadcast kernal". Combining this PR with the above 2 commits, on my desktop I have:

julia> using BenchmarkTools
julia> a = zeros(Float32, 128, 32); a_ = view(a, 1:128, 1:32);
julia> @btime $a .+= $a;
  280.000 ns (0 allocations: 0 bytes)  # on 1.7.0: 1.650 μs (0 allocations: 0 bytes)
julia> @btime $a .+= $a .+ $a .+ $a;
  511.979 ns (0 allocations: 0 bytes)  # on 1.7.0: 3.550 μs (0 allocations: 0 bytes) 
julia> @btime $a_ .+= $a_;
  289.286 ns (0 allocations: 0 bytes)  # on 1.7.0: 1.730 μs (0 allocations: 0 bytes)
julia> @btime $a_ .+= $a_ .+ $a_ .+ $a_;
  609.551 ns (0 allocations: 0 bytes)  # on 1.7.0: 6.440 μs (0 allocations: 0 bytes)

Some simple safety check:

julia> const p = Ref(0);
julia> a = zeros(Float32, 128, 32); b = similar(a);

julia> f(x) = x + (p[] += 0); # f has no side-effect
julia> @btime $b .= f.($a);
  221.526 ns (0 allocations: 0 bytes) # on 1.7.0: 1.440 μs (0 allocations: 0 bytes) 
julia> @btime $a .= f.($a);
  164.899 ns (0 allocations: 0 bytes) # on 1.7.0: 2.033 μs (0 allocations: 0 bytes)

julia> g(x) = x + (p[] = ~p[]); # g has side-effect
julia> @btime $b .= g.($a);
  2.411 μs (0 allocations: 0 bytes) # on 1.7.0: 2.522 μs (0 allocations: 0 bytes)
julia> @btime $a .= g.($a);
  2.411 μs (0 allocations: 0 bytes) # on 1.7.0: 2.511 μs (0 allocations: 0 bytes)

The above example shows that this change is safer than replacing @simd in broadcast kernal with @simd ivdep.
But I didn't look into why 1.7.0 faild to vectorlize b .= f.(a). (1.7.0 could vectorlize b .= f.(a) if we define f(x) as f(x) = x + p[])
Maybe more tests are needed.
Hope this is the correct way to "fix" #43153 safely.

Copy link
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of the current construct. Currently Expr(:loopinfo) has the semantics that it should always be the last instruction in a loop. For a begin/end construct I would rather mimic Base.Experimental.Const and @aliasscope

base/simdloop.jl Outdated
@@ -125,12 +126,12 @@ either case, your inner loop should have the following properties to allow vecto
* No iteration ever waits on a previous iteration to make forward progress.
"""
macro simd(forloop)
esc(compile(forloop, nothing))
esc(compile(forloop, Symbol("julia.ivdep.end")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep this nothing? Makes little sense to have non-matching "begin"/"end" construct

Copy link
Member Author

@N5N3 N5N3 Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we'd better have something like Expr(:ivdepscope) and Expr(:popivdepscope) instead.
(and add a @ivdep macro ?).
Edit: Apparently this is beyond my competence. Can't understand why:

julia> @eval f(x) = $(Expr(:ivdepscope, :begin))
f (generic function with 1 method)

julia> @code_lowered f(1)
CodeInfo(
1$(Expr(:ivdepscope, :(Main.begin)))
└──     return nothing
)

julia> @eval f(x) = $(Expr(:loopinfo, :begin))
f (generic function with 1 method)

julia> @code_lowered f(1)
CodeInfo(
1$(Expr(:loopinfo, :begin))
└──     return nothing
)

@N5N3 N5N3 closed this Dec 6, 2021
@N5N3 N5N3 reopened this Dec 6, 2021
@N5N3 N5N3 force-pushed the partial-ivdep branch 2 times, most recently from d184d0a to bbf2dbb Compare December 6, 2021 14:10
Copy link
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expr(:loopinfo) has the dedicated meaning to attach metadata to the LLVM loop. The current ivdep implementation uses that since it is for the entire scope of the loop. See #31376 for another use-case.

For scoped ivdep I think the implementation should mirror more #31018

N5N3 added 4 commits December 7, 2021 20:14
1. introduce `jl_ivdepscope_sym`
2. define `jl_ivdepscope_error` to thrown error message.
make `Expr(:ivdepscope, :begin/end)` lowered to `jl_ivdepscope_func`
1. let `loopinfo_mark` erase `julia.ivdepscope` if it has `julia.simd`.
2. erase `julia.ivdepscope` in unreachable branch even there's no `loopinfo_mark`. (make error message clearer)
@N5N3
Copy link
Member Author

N5N3 commented Dec 7, 2021

I'm not sure whether this is the correct way to implement scoped ivdep:

  1. A new meta head Expr(:ivdepscope) is introduced, which will be lowered as call void @julia.ivdepscope() with begin/end Metadata:
  2. When loopinfo_marker found julia.simdloop, it will add parallelaccess meta based on a simple begin/end count.
    During which the call void @julia.ivdepscope() will also be erased.
  3. If there're remaining julia.ivdepscope() after all optimizaion, user might use @ivdep outside @simd for end. So I make @julia.ivdepscope() fallback to jl_ivdepscope_error to throw error.
  4. User might write invalid code and thus leads to unreachable branch. To recover the original error message, the simdpass will erase call void @julia.ivdepscope() within basicblocks ending with unreachable

Some example:

julia> f(x) = @inbounds for i in eachindex(x)
           Base.@ivdep x[i] += i
       end
f (generic function with 1 method)

julia> f([1,2,3,4])
ERROR: Found ivdepscope outside @simd.
Stacktrace:
 [1] macro expansion
   @ .\simdloop.jl:151 [inlined]
 [2] f(x::Vector{Int64})
   @ Main .\REPL[1]:2
 [3] top-level scope
   @ REPL[2]:1

julia> f((1,2,3,4))
ERROR: MethodError: no method matching setindex!(::NTuple{4, Int64}, ::Int64, ::Int64)
Stacktrace:
 [1] macro expansion
   @ .\simdloop.jl:152 [inlined]
 [2] f(x::NTuple{4, Int64})
   @ Main .\REPL[1]:2
 [3] top-level scope
   @ REPL[3]:1

@N5N3 N5N3 requested a review from vchuravy December 7, 2021 12:41
@N5N3 N5N3 changed the title Replace julia.ivdep with julia.ivdep.begin/end to support not marking the entire loop body as ivdep Add Expr(:ivdepscope) to support not marking the entire loop body as ivdep Dec 8, 2021
@N5N3
Copy link
Member Author

N5N3 commented Jan 25, 2022

I guess we won't need this after #43852.

@N5N3 N5N3 closed this Jan 25, 2022
@N5N3 N5N3 deleted the partial-ivdep branch February 17, 2022 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants