Clean up interpolation codes using EllipsisNotation #214
Comments
Afterwards we can apply the same change in DelayDiffEq to the history function. |
I think it is a good idea to clean up the interpolations and use EllipsisNotation.jl's
using EllipsisNotation, BenchmarkTools
function foo!(A, B, C, D, E, F)
    @. A = B + 2C + 5D + E^2 + F^3
end
function bar!(A, B, C, D, E, F)
    @. A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
end
function foobar!(A, B, C, D, E, F)
    @. @views A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
end
N = 10^2; A = rand(N,N); B = rand(N,N); C = rand(N,N); D = rand(N,N); E = rand(N,N); F = rand(N,N);
gives on my machine
julia> @benchmark foo!(A, B, C, D, E, F)
BenchmarkTools.Trial:
memory estimate: 224 bytes
allocs estimate: 4
--------------
minimum time: 38.356 μs (0.00% GC)
median time: 40.373 μs (0.00% GC)
mean time: 40.360 μs (0.00% GC)
maximum time: 58.256 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark bar!(A, B, C, D, E, F)
BenchmarkTools.Trial:
memory estimate: 391.22 KiB
allocs estimate: 21
--------------
minimum time: 83.906 μs (0.00% GC)
median time: 94.716 μs (0.00% GC)
mean time: 105.742 μs (9.49% GC)
maximum time: 860.058 μs (83.45% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark foobar!(A, B, C, D, E, F)
BenchmarkTools.Trial:
memory estimate: 240 bytes
allocs estimate: 5
--------------
minimum time: 79.951 μs (0.00% GC)
median time: 80.943 μs (0.00% GC)
mean time: 81.167 μs (0.00% GC)
maximum time: 139.661 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Just another question: Do we really need all the checks |
We should just add a fast path that avoids the Base.getindex(A,::Val{:..}) = A and see if that makes it free. (to make |
Regarding I just had a look at these interpolants again, and I remember that errors were caused by |
That sounds like it should work. |
I've tried adding
function f_loop!(A, B, C, D, E, F)
    @boundscheck begin
        size(A) == size(B) == size(C) == size(D) == size(E) == size(F)
    end
    @inbounds @simd for idx in eachindex(A)
        A[idx] = B[idx] + 2C[idx] + 5D[idx] + E[idx]^2 + F[idx]^3
    end
    nothing
end
function f_loop!(A, B, C, D, E)
    @boundscheck begin
        size(A) == size(B) == size(C) == size(D) == size(E)
    end
    @inbounds @simd for idx in eachindex(A)
        A[idx] = B[idx] + 2C[idx] + 5D[idx] + E[idx]^2
    end
    nothing
end
function f_noindex!(A, B, C, D, E, F)
    @. A = B + 2C + 5D + E^2 + F^3
    nothing
end
function f_noindex!(A, B, C, D, E)
    @. A = B + 2C + 5D + E^2
    nothing
end
function f_index!(A, B, C, D, E, F)
    @. A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
    nothing
end
function f_index!(A, B, C, D, E)
    @. A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2
    nothing
end
function f_semiindex!(A, B, C, D, E, F)
    @. A = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
    nothing
end
function f_semiindex!(A, B, C, D, E)
    @. A = B[..] + 2C[..] + 5D[..] + E[..]^2
    nothing
end
function f_view!(A, B, C, D, E, F)
    @. @views A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2 + F[..]^3
    nothing
end
function f_view!(A, B, C, D, E)
    @. @views A[..] = B[..] + 2C[..] + 5D[..] + E[..]^2
    nothing
end
N = 10^2; A=rand(N,N); B=rand(N,N); C=rand(N,N); D=rand(N,N); E=rand(N,N); F = rand(N,N);
@benchmark f_loop!(A, B, C, D, E, F)
@benchmark f_noindex!(A, B, C, D, E, F)
@benchmark f_index!(A, B, C, D, E, F)
@benchmark f_semiindex!(A, B, C, D, E, F)
@benchmark f_view!(A, B, C, D, E, F)
@benchmark f_loop!(A, B, C, D, E)
@benchmark f_noindex!(A, B, C, D, E)
@benchmark f_index!(A, B, C, D, E)
@benchmark f_semiindex!(A, B, C, D, E)
@benchmark f_view!(A, B, C, D, E)
This gives
julia> @benchmark f_loop!(A, B, C, D, E, F)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.758 μs (0.00% GC)
median time: 9.852 μs (0.00% GC)
mean time: 9.889 μs (0.00% GC)
maximum time: 41.211 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_noindex!(A, B, C, D, E, F)
BenchmarkTools.Trial:
memory estimate: 224 bytes
allocs estimate: 4
--------------
minimum time: 38.215 μs (0.00% GC)
median time: 40.363 μs (0.00% GC)
mean time: 40.367 μs (0.00% GC)
maximum time: 55.525 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_index!(A, B, C, D, E, F)
BenchmarkTools.Trial:
memory estimate: 752 bytes
allocs estimate: 30
--------------
minimum time: 50.090 μs (0.00% GC)
median time: 50.295 μs (0.00% GC)
mean time: 50.422 μs (0.00% GC)
maximum time: 90.744 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_semiindex!(A, B, C, D, E, F)
BenchmarkTools.Trial:
memory estimate: 224 bytes
allocs estimate: 4
--------------
minimum time: 38.280 μs (0.00% GC)
median time: 40.192 μs (0.00% GC)
mean time: 40.258 μs (0.00% GC)
maximum time: 67.977 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_view!(A, B, C, D, E, F)
BenchmarkTools.Trial:
memory estimate: 192 bytes
allocs estimate: 4
--------------
minimum time: 68.416 μs (0.00% GC)
median time: 69.723 μs (0.00% GC)
mean time: 70.716 μs (0.00% GC)
maximum time: 124.193 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
and
julia> @benchmark f_loop!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 7.585 μs (0.00% GC)
median time: 7.698 μs (0.00% GC)
mean time: 7.717 μs (0.00% GC)
maximum time: 15.267 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 4
julia> @benchmark f_noindex!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 176 bytes
allocs estimate: 4
--------------
minimum time: 9.125 μs (0.00% GC)
median time: 9.289 μs (0.00% GC)
mean time: 9.322 μs (0.00% GC)
maximum time: 41.743 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_index!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 704 bytes
allocs estimate: 30
--------------
minimum time: 26.109 μs (0.00% GC)
median time: 26.317 μs (0.00% GC)
mean time: 26.707 μs (0.00% GC)
maximum time: 64.859 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_semiindex!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 176 bytes
allocs estimate: 4
--------------
minimum time: 9.153 μs (0.00% GC)
median time: 9.338 μs (0.00% GC)
mean time: 9.416 μs (0.00% GC)
maximum time: 41.334 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_view!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 144 bytes
allocs estimate: 3
--------------
minimum time: 33.440 μs (0.00% GC)
median time: 33.649 μs (0.00% GC)
mean time: 33.803 μs (0.00% GC)
maximum time: 68.334 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
|
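A side note on the `f_loop!` definitions above: the comparison inside `@boundscheck` is evaluated, but its result is discarded, so mismatched sizes would not raise an error. A minimal sketch of a variant that does throw is below. The name `f_loop_checked!` and the error message are assumptions made for illustration; this is not the code that produced the timings in this thread.

```julia
# Hypothetical variant of the five-argument f_loop! from this thread.
# The name f_loop_checked! is an assumption, chosen only for illustration.
function f_loop_checked!(A, B, C, D, E)
    @boundscheck begin
        # Throw instead of silently discarding the result of the comparison.
        size(A) == size(B) == size(C) == size(D) == size(E) ||
            throw(DimensionMismatch("all arrays must have the same size"))
    end
    @inbounds @simd for idx in eachindex(A)
        A[idx] = B[idx] + 2C[idx] + 5D[idx] + E[idx]^2
    end
    nothing
end
```

Because the check sits in a `@boundscheck` block, a caller that wraps the call in `@inbounds` can still elide it when the function is inlined, so the hot path should not have to pay for it.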
I think: Base.setindex!(A,v,::Val{:..}) = (A = v)
Yeah this seems to be another case of JuliaLang/julia#22255 |
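For context on the `(A = v)` one-liner: in Julia, assigning to an argument name inside a function only rebinds the local name and leaves the caller's array untouched. The sketch below illustrates that general language behavior with two hypothetical helpers, `rebind!` and `assign!`; neither is part of EllipsisNotation.jl, and the sketch does not reproduce how `@.` lowers an indexed left-hand side.

```julia
# Hypothetical helpers for illustration; not part of EllipsisNotation.jl.

# Rebinds the local name `A` only; the caller's array is left unchanged.
rebind!(A, v) = (A = v)

# Copies the elements of the array `v` into `A` in place.
assign!(A, v) = setindex!(A, v, :)

X = zeros(3)
rebind!(X, ones(3))   # X still holds zeros
assign!(X, ones(3))   # X now holds ones
```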
I see what's going on with the |
Defining |
In the end, the copy for elements on the LHS does not matter for the interpolations, since |
I think we need @mbauman's |
I just ran the code again and got an error; I don't know why I didn't see it before. Changing the EllipsisNotation.jl code to
@inline Base.getindex(A::AbstractArray, ::Val{:..}) = A
@inline Base.setindex!(A::AbstractArray, v, ::Val{:..}) = setindex!(A, v, :)
results in the following timings
julia> @benchmark f_loop!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.284 μs (0.00% GC)
median time: 8.343 μs (0.00% GC)
mean time: 8.350 μs (0.00% GC)
maximum time: 17.075 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 3
julia> @benchmark f_noindex!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 176 bytes
allocs estimate: 4
--------------
minimum time: 9.212 μs (0.00% GC)
median time: 9.352 μs (0.00% GC)
mean time: 9.364 μs (0.00% GC)
maximum time: 40.757 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_index!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 17.467 μs (0.00% GC)
median time: 17.517 μs (0.00% GC)
mean time: 17.556 μs (0.00% GC)
maximum time: 29.877 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_semiindex!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 17.452 μs (0.00% GC)
median time: 17.506 μs (0.00% GC)
mean time: 17.540 μs (0.00% GC)
maximum time: 40.670 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark f_view!(A, B, C, D, E)
BenchmarkTools.Trial:
memory estimate: 144 bytes
allocs estimate: 3
--------------
minimum time: 33.303 μs (0.00% GC)
median time: 33.513 μs (0.00% GC)
mean time: 33.640 μs (0.00% GC)
maximum time: 114.704 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Thus, |
The timings for What kind of benchmarking should be used? Does broadcasting really take more than twice as long as the loop? Does broadcasting use SIMD? |
This is good, with
It should automatically use SIMD when it's necessary and Julia is set to
No, @devmotion 's tests from before show that's not the case, but we're probably hitting JuliaLang/julia#22255 here. The indexing might count as operations here, lowering the number that's allowed even more... |
Okay. Then I'll open a PR to EllipsisNotation.jl with |
What should we do with code such as |
Another important question: If |
This case is a little weird because you give a preallocated vector of length
For this reason and because of the broadcast limit, we may want to wait for some compiler changes in Base before changing this. |
So in the end, we shouldn't change the broadcasting and indexing with |
Yeah, that should be able to remove some branches. Then when views are stack-allocated and the broadcast limit is removed we can collapse the others. |
As @devmotion said, Is |
I think something like this has to be done. I'm not sure about the naming though. |
I ran some benchmarks on functions shaped like our interpolation functions, with different numbers of polynomial coefficients (neglecting the constant time needed to calculate those coefficients): https://gist.github.com/devmotion/0866639f85ffb305e0485f00ae7f4025 The benchmarks were run on Julia 0.7-rc2. They compare implementations with loops and with broadcasts, each with no indexing, with indexing by arrays, and with ellipsis notation (using the master branch of EllipsisNotation). I haven't checked all results in detail, but without indexing the broadcast implementation seems to be only a bit slower than the loop implementation, even for 15 coefficients. As expected, the results with ellipsis notation are similar. Indexing with views seems to add a significant overhead, roughly a factor of 2 compared to the loop implementation. |
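To make the comparison concrete, here is a minimal sketch of what a loop version and a broadcast version of an interpolant-like function could look like. The function names, the three-coefficient form `y0 + theta * (b1*k1 + b2*k2 + b3*k3)`, and the argument layout are assumptions for illustration; the actual benchmark code is in the gist linked above.

```julia
# Hypothetical interpolant shape: out = y0 + theta * (b[1]*k[1] + b[2]*k[2] + b[3]*k[3]).
# `k` is a tuple of arrays and `b` a tuple of scalar coefficients.

# Explicit loop over the elements, accumulating the weighted stages.
function interp_loop!(out, y0, k, b, theta)
    @inbounds for i in eachindex(out)
        acc = zero(eltype(out))
        for j in eachindex(b)
            acc += b[j] * k[j][i]
        end
        out[i] = y0[i] + theta * acc
    end
    return out
end

# Fused broadcast; `@.` dots the operators but leaves `b[1]`, `k[1]` as plain indexing.
function interp_broadcast!(out, y0, k, b, theta)
    @. out = y0 + theta * (b[1] * k[1] + b[2] * k[2] + b[3] * k[3])
    return out
end

y0 = [1.0, 2.0]
k = ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
b = (0.5, 0.25, 0.25)
u_loop = interp_loop!(zeros(2), y0, k, b, 0.5)
u_bc = interp_broadcast!(zeros(2), y0, k, b, 0.5)
```

Both versions should write the same values into `out`; which one is faster depends on the Julia version and the number of coefficients, as the benchmarks in this thread show.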
Right now the default for `idxs` is `nothing`. Then in all of the interpolation codes, there needs to be a switch between `A[idxs]` and `A`. But if we use EllipsisNotation.jl's `A[..]`, that is the identity so it compiles to `A`, which means we can just default to `idxs = ..` and get rid of a ton of the interpolation code.
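The idea can be sketched without depending on the package. The example below uses a hypothetical sentinel type `AllIndices` as a stand-in for `..`, and a hypothetical helper `select`; both are assumptions for illustration and are not EllipsisNotation.jl's API. It only shows how a default "select everything" index lets one method body serve both the indexed and the unindexed case.

```julia
# Hypothetical stand-in for EllipsisNotation's `..`; not the package's real type.
struct AllIndices end
const ALL = AllIndices()

# Selecting "everything" is the identity: no copy and no view.
select(A::AbstractArray, ::AllIndices) = A
# Any other index set goes through a view.
select(A::AbstractArray, idxs) = view(A, idxs)

# One code path for a linear interpolant (1 - theta) * y0 + theta * y1,
# instead of one branch for `idxs === nothing` and one for indexed access.
function interp!(out, theta, y0, y1, idxs = ALL)
    a = select(y0, idxs)
    b = select(y1, idxs)
    @. out = (1 - theta) * a + theta * b
    return out
end

y0 = [1.0, 2.0, 3.0]
y1 = [3.0, 4.0, 5.0]
full = interp!(zeros(3), 0.5, y0, y1)          # all components
part = interp!(zeros(2), 0.5, y0, y1, [1, 3])  # components 1 and 3 only
```

Since the choice between the two `select` methods is made by dispatch on the type of `idxs`, the compiler can resolve it statically, which is the property the proposal relies on.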