Try and force precompilation #8
Conversation
```julia
using OrdinaryDiffEq, SnoopCompile

function lorenz(du,u,p,t)
    du[1] = 10.0(u[2]-u[1])
    du[2] = u[1]*(28.0-u[3]) - u[2]
    du[3] = u[1]*u[2] - (8/3)*u[3]
end

u0 = [1.0;0.0;0.0]
tspan = (0.0,100.0)
prob = ODEProblem(lorenz,u0,tspan)
alg = Rodas5()
tinf = @snoopi_deep solve(prob,alg)
```

Before:

```
InferenceTimingNode: 2.285476/19.503069 on Core.Compiler.Timings.ROOT() with 54 direct children
```

After:

```
InferenceTimingNode: 2.247376/19.289887 on Core.Compiler.Timings.ROOT() with 54 direct children
```

When combo'd with Polyester's to try and force `batch`. Depressing.
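For reference, the general approach this PR is trying (forcing precompilation by running a representative workload while the module is being precompiled) can be sketched roughly as follows. This is a hypothetical illustration, not the actual diff of this PR; the module name and workload are made up.

```julia
# Hypothetical sketch, not this PR's actual code: top-level code in a module
# body runs during precompilation, so executing a small representative solve
# here lets its inference results be cached with the package image.
module ForcePrecompileSketch

using OrdinaryDiffEq

let
    lorenz!(du, u, p, t) = begin
        du[1] = 10.0 * (u[2] - u[1])
        du[2] = u[1] * (28.0 - u[3]) - u[2]
        du[3] = u[1] * u[2] - (8 / 3) * u[3]
    end
    prob = ODEProblem(lorenz!, [1.0, 0.0, 0.0], (0.0, 1.0))
    solve(prob, Rodas5())   # the call itself acts as the precompile "directive"
end

end # module
```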
Is 0.04 seconds worth it? |
Not worth merging right now, but keeping this up to refer back to. |
It's 0.2 seconds BTW. |
Codecov Report
```diff
@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
- Coverage   92.36%   92.30%   -0.07%
==========================================
  Files           1        1
  Lines         249      260      +11
==========================================
+ Hits          230      240      +10
- Misses         19       20       +1
```
Continue to review full report at Codecov.
|
This seems to substantially improve time-to-first-div. On this PR:

```julia
julia> @time using TriangularSolve, LinearAlgebra
[ Info: Precompiling TriangularSolve [d5829a12-d9aa-46ab-831f-fb7c9ab06edf]
 24.632061 seconds (8.20 M allocations: 456.805 MiB, 0.58% gc time, 0.09% compilation time)

julia> A = rand(1,1);

julia> B = rand(1, 1);

julia> res = similar(A);

julia> @t TriangularSolve.rdiv!(res, A, UpperTriangular(B))
Time = 0.901624834
1×1 Matrix{Float64}:
 0.9805494758920018

julia> @t TriangularSolve.rdiv!(res, A, UnitUpperTriangular(B))
Time = 0.08022014000000001
1×1 Matrix{Float64}:
 0.04453566915648233

julia> @t TriangularSolve.ldiv!(res, LowerTriangular(B), A)
Time = 0.195168043
1×1 Matrix{Float64}:
 0.9805494758920018

julia> @t TriangularSolve.ldiv!(res, UnitLowerTriangular(B), A)
Time = 0.094892007
1×1 Matrix{Float64}:
 0.04453566915648233
```

Main:

```julia
julia> @time using TriangularSolve, LinearAlgebra
[ Info: Precompiling TriangularSolve [d5829a12-d9aa-46ab-831f-fb7c9ab06edf]
  6.625835 seconds (7.92 M allocations: 430.333 MiB, 1.97% gc time, 0.33% compilation time)

julia> A = rand(1,1);

julia> B = rand(1, 1);

julia> res = similar(A);

julia> @t TriangularSolve.rdiv!(res, A, UpperTriangular(B))
Time = 16.214664716
1×1 Matrix{Float64}:
 55.090822966245106

julia> @t TriangularSolve.rdiv!(res, A, UnitUpperTriangular(B))
Time = 0.337600035
1×1 Matrix{Float64}:
 0.9878212675272343

julia> @t TriangularSolve.ldiv!(res, LowerTriangular(B), A)
Time = 8.620501925000001
1×1 Matrix{Float64}:
 55.090822966245106

julia> @t TriangularSolve.ldiv!(res, UnitLowerTriangular(B), A)
Time = 0.563920552
1×1 Matrix{Float64}:
 0.9878212675272343
```

At least, that's the case on Julia 1.7:

```julia
julia> versioninfo()
Julia Version 1.7.0-beta4.1
Commit c3f8752251* (2021-08-07 02:51 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
```

So this seems worth merging. |
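The `@t` timing macro used in the sessions above isn't defined anywhere in the thread; a minimal sketch of a macro with that output format (an assumption, not the author's actual definition) would be:

```julia
# Hypothetical definition of the `@t` macro used in the REPL sessions above:
# time an expression with nanosecond precision and print seconds.
macro t(expr)
    quote
        local t0 = time_ns()
        local result = $(esc(expr))
        println("Time = ", (time_ns() - t0) / 1e9)
        result
    end
end
```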
The red bars are not precompilable, but the other colors are. The right place to do this precompilation is DiffEqBase; if you click on the bar where my mouse pointer is (EDIT: weird, Ubuntu's screen capture puts it in the wrong spot; it's actually over the lowest green bar and the text is above it), in the REPL you'll see something like this:

`/tmp/pkgs/packages/DiffEqBase/uQlhE/src/linear_nonlinear.jl:93, MethodInstance for (::DefaultLinSolve)(::Vector{Float64}, ::Any, ::Vector{Float64}, ::Bool)`

That gives you an indication of what needs to be precompiled. Since the green bar is essentially the full width of the red bar, you'll get the full measure of benefit (the time spent inferring the MethodInstance at the base of the flame is dominated by this particular callee). You could use SnoopCompile's |
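For reference, the interactive flamegraph workflow being described looks roughly like this (a sketch based on SnoopCompile's documented usage, assuming ProfileView as the viewer; `prob` and `alg` are the ones from the Lorenz snippet at the top of the thread):

```julia
using SnoopCompile, ProfileView   # ProfileView provides the interactive flame graph viewer

tinf = @snoopi_deep solve(prob, alg)   # prob/alg from the Lorenz example above
fg = flamegraph(tinf)                  # convert the inference timing tree into a flame graph
ProfileView.view(fg)                   # clicking a bar prints its MethodInstance info in the REPL
```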
I didn't follow that discussion - do you have some hints or references for further information? |
@ranocha, see https://timholy.github.io/SnoopCompile.jl/stable/snoopi_deep_parcel/. Everybody rushes to pick up this hammer, but it's literally the last thing you should do when trying to reduce latency. The right steps are:
I almost actively dislike |
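For completeness, the `parcel`/`write` workflow that docs page covers looks roughly like this (a sketch based on the SnoopCompile documentation; per the comment above, treat it as the last resort rather than the first step):

```julia
using SnoopCompile

tinf = @snoopi_deep solve(prob, alg)          # same measurement as at the top of the thread
ttot, pcs = SnoopCompile.parcel(tinf)         # group inference triggers by the package that owns them
SnoopCompile.write("/tmp/precompiles", pcs)   # emit precompile_*.jl files, one per package
```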
Thank you very much for this detailed and helpful answer, @timholy |
Downstream, DiffEq ended up having to precompile this call differently, but most users will get the benefit from this PR. |
This PR caused a performance regression for me.

```julia
julia> @benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B), Val(false)) # false means single threaded
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.276 μs … 1.964 ms   ┊ GC (min … max): 0.00% … 97.10%
 Time  (median):     21.517 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.028 μs ± 19.593 μs  ┊ GC (mean ± σ):  0.87% ± 0.97%

  █▁▁▆▃ ▂ ▁
  ▃▃▃▃▃▃▃▃▃▇▄█████▇████▇▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  18.3 μs         Histogram: frequency by time        30.7 μs <

 Memory estimate: 6.39 KiB, allocs estimate: 211.
```

Without:

```julia
julia> @benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B), Val(false)) # false means single threaded
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  15.746 μs … 44.410 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.713 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.684 μs ±  1.184 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂▁ ▁▂          ▂▅▃ ▅█▆  ▁▂  ▁                          ▁
  ██▃▁▆█▆▃▁▁▁▄▁▃▄▄▄▁▁██▇▃▃▁▃▁▅▃▅███▅▃▄████▅▅▆██▇▅▅▆███▇▇▆▆▆█▇ █
  15.7 μs      Histogram: log(frequency) by time      20.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Notice the memory allocations in particular. =/

```julia
julia> versioninfo()
Julia Version 1.8.0-DEV.438
Commit 88a6376e99* (2021-08-28 11:03 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
```

Could there be an inference failure for some reason when it precompiles? This problem is annoying to debug, because as soon as the code is |
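One way to look for the suspected inference failure (a debugging suggestion on my part, not something proposed in the thread) is to inspect the call in a fresh session, before anything gets recompiled:

```julia
using TriangularSolve, LinearAlgebra

# Illustrative sizes only; use the same arrays as in the benchmark above.
C = rand(100, 100); A = rand(100, 100); B = rand(100, 100);

# Non-concrete (`Any`) entries in the output would point at the inference
# problem suspected above. Run this before any other call recompiles the code.
@code_warntype TriangularSolve.rdiv!(C, A, UpperTriangular(B), Val(false))
```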
Wat. |
Hmm, okay, I'll try playing around with type signatures. I noticed that adding a few |
The weird thing is that the functions it helps to inline are … I take this to mean that the function calling them isn't specializing? |
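For context on that hypothesis: Julia's heuristics skip specializing on `Function`, `Type`, and `Vararg` arguments that are only passed through to another call, which can leave the inner call dynamic unless inlining or a forced specialization intervenes. A generic illustration (not TriangularSolve's actual code):

```julia
# `f` is only passed through, not called here, so Julia may compile this method
# once for `::Function` instead of specializing on the concrete function type,
# leaving the inner call dynamic.
passthrough(f, x) = inner(f, x)

# Adding an explicit type parameter forces specialization on the concrete type of `f`.
passthrough_forced(f::F, x) where {F} = inner(f, x)

inner(f, x) = f(x)
```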