
Try and force precompilation #8

Merged 2 commits on Aug 12, 2021

Conversation

ChrisRackauckas (Member)
Try and force precompilation with the `solve` call and `Polyester.batch`:

```julia
using OrdinaryDiffEq, SnoopCompile

function lorenz(du, u, p, t)
    du[1] = 10.0(u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
end

u0 = [1.0; 0.0; 0.0]
tspan = (0.0, 100.0)
prob = ODEProblem(lorenz, u0, tspan)
alg = Rodas5()
tinf = @snoopi_deep solve(prob, alg)
```

Before:

```
InferenceTimingNode: 2.285476/19.503069 on Core.Compiler.Timings.ROOT() with 54 direct children
```

After:

```
InferenceTimingNode: 2.247376/19.289887 on Core.Compiler.Timings.ROOT() with 54 direct children
```

That's when combo'd with Polyester's `batch` to try and force precompilation. Depressing.
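
For context, the approach being tried is to run a small workload at module top level so it executes while the package is being precompiled; a minimal sketch of that pattern, with illustrative sizes and calls (this is not the PR's actual diff):

```julia
# Hypothetically, at the bottom of the TriangularSolve module, where
# LinearAlgebra's triangular types are already in scope: running the
# kernels once during precompilation caches their inferred code.
let
    A = rand(1, 1)
    B = rand(1, 1)
    res = similar(A)
    rdiv!(res, A, UpperTriangular(B))
    ldiv!(res, LowerTriangular(B), A)
end
```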
@chriselrod (Member)

Is 0.04 seconds worth it?
How much longer does precompilation take?

@ChrisRackauckas (Member, Author)

Not worth it to merge now, but keeping this up to refer back to.

@ChrisRackauckas (Member, Author) commented Aug 7, 2021

It's 0.2 seconds BTW.

@codecov (bot) commented Aug 7, 2021

Codecov Report

Merging #8 (e858a46) into main (8ffe7da) will decrease coverage by 0.06%.
The diff coverage is 100.00%.


```diff
@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
- Coverage   92.36%   92.30%   -0.07%
==========================================
  Files           1        1
  Lines         249      260      +11
==========================================
+ Hits          230      240      +10
- Misses         19       20       +1
```

| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| src/TriangularSolve.jl | 92.30% <100.00%> | -0.07% ⬇️ |


@chriselrod (Member) commented Aug 9, 2021

> Not worth it to merge now, but keeping this up to refer back to.

This seems to substantially improve time-to-first-div. On this PR:

```julia
julia> @time using TriangularSolve, LinearAlgebra
[ Info: Precompiling TriangularSolve [d5829a12-d9aa-46ab-831f-fb7c9ab06edf]
 24.632061 seconds (8.20 M allocations: 456.805 MiB, 0.58% gc time, 0.09% compilation time)

julia> A = rand(1,1);

julia> B = rand(1, 1);

julia> res = similar(A);

julia> @t TriangularSolve.rdiv!(res, A, UpperTriangular(B))
Time = 0.901624834
1×1 Matrix{Float64}:
 0.9805494758920018

julia> @t TriangularSolve.rdiv!(res, A, UnitUpperTriangular(B))
Time = 0.08022014000000001
1×1 Matrix{Float64}:
 0.04453566915648233

julia> @t TriangularSolve.ldiv!(res, LowerTriangular(B), A)
Time = 0.195168043
1×1 Matrix{Float64}:
 0.9805494758920018

julia> @t TriangularSolve.ldiv!(res, UnitLowerTriangular(B), A)
Time = 0.094892007
1×1 Matrix{Float64}:
 0.04453566915648233
```

Main:

```julia
julia> @time using TriangularSolve, LinearAlgebra
[ Info: Precompiling TriangularSolve [d5829a12-d9aa-46ab-831f-fb7c9ab06edf]
  6.625835 seconds (7.92 M allocations: 430.333 MiB, 1.97% gc time, 0.33% compilation time)

julia> A = rand(1,1);

julia> B = rand(1, 1);

julia> res = similar(A);

julia> @t TriangularSolve.rdiv!(res, A, UpperTriangular(B))
Time = 16.214664716
1×1 Matrix{Float64}:
 55.090822966245106

julia> @t TriangularSolve.rdiv!(res, A, UnitUpperTriangular(B))
Time = 0.337600035
1×1 Matrix{Float64}:
 0.9878212675272343

julia> @t TriangularSolve.ldiv!(res, LowerTriangular(B), A)
Time = 8.620501925000001
1×1 Matrix{Float64}:
 55.090822966245106

julia> @t TriangularSolve.ldiv!(res, UnitLowerTriangular(B), A)
Time = 0.563920552
1×1 Matrix{Float64}:
 0.9878212675272343
```

At least, that's the case on Julia 1.7:

```julia
julia> versioninfo()
Julia Version 1.7.0-beta4.1
Commit c3f8752251* (2021-08-07 02:51 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
```

So this seems worth merging.

@timholy (Contributor) commented Aug 9, 2021

*(screenshot: the `@snoopi_deep` inference flamegraph for this workload)*

The red bars are not precompilable, but the other colors are. The right place to do this precompilation is DiffEqBase; if you click on the bar where my mouse pointer is (EDIT: weird, Ubuntu's screen capture puts it in the wrong spot, it's actually over the lowest green bar and the text is above it), in the REPL you'll see something like this:

```
/tmp/pkgs/packages/DiffEqBase/uQlhE/src/linear_nonlinear.jl:93, MethodInstance for (::DefaultLinSolve)(::Vector{Float64}, ::Any, ::Vector{Float64}, ::Bool)
```

That gives you an indication of what needs to be precompiled. Since the green bar is essentially the full width of the red bar, you'll get the full measure of benefit (the time spent inferring the MethodInstance at the base of the flame is dominated by this particular callee).
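
For reference, the flamegraph above comes from feeding the `@snoopi_deep` result into a viewer; a minimal sketch of that workflow, assuming ProfileView is installed:

```julia
using OrdinaryDiffEq, SnoopCompile, ProfileView

# prob and alg as defined in the earlier snippet.
tinf = @snoopi_deep solve(prob, alg)

# Build and display the inference flamegraph; clicking a bar prints the
# corresponding MethodInstance in the REPL, as described above.
fg = flamegraph(tinf)
ProfileView.view(fg)
```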

You could use SnoopCompile's `parcel` and `write` to generate these, but as we discussed previously, these days I strongly recommend just running some code while the package is building, unless you can't avoid the undesirable side effects. Note that if some of your users are using `Vector{Float32}`, you might consider running that workload too. (It will increase package build time for everyone, unfortunately, but hopefully that's a relatively rare cost.)

@ranocha (Member) commented Aug 9, 2021

> You could use SnoopCompile's `parcel` and `write` to generate these, but as we discussed previously, these days I strongly recommend just running some code while the package is building, unless you can't avoid the undesirable side effects.

I didn't follow that discussion - do you have some hints or references for further information?

@timholy (Contributor) commented Aug 9, 2021

@ranocha, see https://timholy.github.io/SnoopCompile.jl/stable/snoopi_deep_parcel/. Everybody rushes to pick up this hammer, but it's literally the last thing you should do when trying to reduce latency. The right steps are:

  1. check that you're not getting a lot of invalidations, and fix them if you are (see the sketch after this list)
  2. look at the inference profile flamegraph (like the image above) and see if anything pops out (in this case, it does, but that's relatively rare and usually means that the package is in great shape except possibly for missing a bit of precompilation). If you only have a few flames that dominate most of the time, like this case, you can jump immediately to step 5.
  3. use `pgdsgui` to determine whether you are over-specializing, and adjust as needed
  4. solve inference problems (you might need to look at the previous step again a bit)
  5. see if you can precompile by doing "regular work" rather than issuing precompile directives (looking at the flamegraph again here is very helpful; after you fix inference problems, see the previous step, hopefully you don't have very many "flames" to check, just like in this case). See for example "attempt to precompile linsolve" (SciML/DiffEqBase.jl#698), which "does work" rather than declaring `precompile(...)`.
  6. as a last resort, use `parcel` and `write`
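
For step 1, a minimal invalidation check might look like the following sketch (`SomePkg` is a placeholder for whatever package you are loading):

```julia
# Record invalidations triggered while loading the package.
using SnoopCompileCore
invalidations = @snoopr using SomePkg

# Load the heavier analysis code afterwards, so it doesn't pollute the log.
using SnoopCompile
trees = invalidation_trees(invalidations)
# Large trees rooted in methods your package defines are the ones to fix.
```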

I almost actively dislike `parcel` and `write`, as they often hide problems that are better addressed by other means, and often give you very little advantage compared to a more thorough analysis (e.g., JuliaIO/JLD2.jl#344). As both SnoopCompile and JET have discovered, snooping on inference is a terrific way to discover things that many packages should probably change anyway.

@ranocha (Member) commented Aug 9, 2021

Thank you very much for this detailed and helpful answer, @timholy.

@ChrisRackauckas (Member, Author)

Downstream, DiffEq ended up having to precompile this call differently, but most users will get the benefit from this PR.

@chriselrod (Member) commented Aug 29, 2021

This PR caused a performance regression for me.
With precompilation:

```julia
julia> @benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B), Val(false)) # false means single threaded
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.276 μs …  1.964 ms  ┊ GC (min … max): 0.00% … 97.10%
 Time  (median):     21.517 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.028 μs ± 19.593 μs  ┊ GC (mean ± σ):  0.87% ±  0.97%

             █▁▁▆▃ ▂ ▁
  ▃▃▃▃▃▃▃▃▃▇▄█████▇████▇▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  18.3 μs         Histogram: frequency by time        30.7 μs <

 Memory estimate: 6.39 KiB, allocs estimate: 211.
```

Without:

```julia
julia> @benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B), Val(false)) # false means single threaded
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  15.746 μs … 44.410 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.713 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.684 μs ±  1.184 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▁                 ▁▂         ▂▅▃   ▅█▆    ▁▂     ▁         ▁
  ██▃▁▆█▆▃▁▁▁▄▁▃▄▄▄▁▁██▇▃▃▁▃▁▅▃▅███▅▃▄████▅▅▆██▇▅▅▆███▇▇▆▆▆█▇ █
  15.7 μs      Histogram: log(frequency) by time      20.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Notice the memory allocations in particular. =/
This is on:

```julia
julia> versioninfo()
Julia Version 1.8.0-DEV.438
Commit 88a6376e99* (2021-08-28 11:03 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
```

Could there be an inference failure for some reason when it precompiles?

This problem is annoying to debug, because as soon as the code is revised (via Revise.jl), the allocations disappear and the performance improves.
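
One way to probe the suspected inference failure would be something like the following sketch; the matrix setup is assumed, since the benchmark above doesn't show how `A`, `B`, and `C` were constructed, and a fresh-inference analysis may or may not reproduce an issue that only manifests in precompiled code:

```julia
using TriangularSolve, LinearAlgebra

# Assumed setup (sizes are illustrative).
N = 100
A = rand(N, N); B = rand(N, N); C = similar(A)

# JET's optimization analysis flags runtime dispatch in the call graph,
# which would explain allocations appearing where there were none before:
using JET
@report_opt TriangularSolve.rdiv!(C, A, UpperTriangular(B), Val(false))

# Alternatively, inspect inferred types interactively in the REPL:
# @code_warntype TriangularSolve.rdiv!(C, A, UpperTriangular(B), Val(false))
```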

ChrisRackauckas deleted the `precompile` branch on August 29, 2021.
@ChrisRackauckas (Member, Author)

Wat.

@ChrisRackauckas (Member, Author)

SciML/OrdinaryDiffEq.jl#1473

@chriselrod (Member)

Hmm, okay, I'll try playing around with type signatures.

I noticed that adding a few `@inline`s could reduce the number of allocations to 3-6.

@chriselrod (Member)

The weird thing is that the functions it helps to inline are `@generated`, which I thought were always supposed to fully specialize.

I take this to mean that the function calling them isn't specializing?
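
That reading matches Julia's specialization heuristics: by default, Julia avoids specializing on arguments of type `Function`, `Type`, or `Vararg` when they are merely passed through to another call, even if the callee is `@generated`. A generic sketch of the pattern and the usual fix (not the package's actual code):

```julia
# May compile a single method instance shared by all functions f:
# f is only passed along, so the heuristic skips specializing on typeof(f).
passthrough(f, xs) = map(f, xs)

# An explicit type parameter forces specialization on the concrete
# function type, so downstream @generated callees see concrete types.
forced(f::F, xs) where {F} = map(f, xs)
```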
