
Try and force precompilation #8

Merged 2 commits on Aug 12, 2021

Conversation

ChrisRackauckas (Member)
Try and force precompilation with the `solve` call and `Polyester.batch`:

```julia
using OrdinaryDiffEq, SnoopCompile

function lorenz(du, u, p, t)
    du[1] = 10.0(u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
end

u0 = [1.0; 0.0; 0.0]
tspan = (0.0, 100.0)
prob = ODEProblem(lorenz, u0, tspan)
alg = Rodas5()
tinf = @snoopi_deep solve(prob, alg)
```

Before:

```
InferenceTimingNode: 2.285476/19.503069 on Core.Compiler.Timings.ROOT() with 54 direct children
```

After:

```
InferenceTimingNode: 2.247376/19.289887 on Core.Compiler.Timings.ROOT() with 54 direct children
```

That's when combo'd with Polyester's `batch` to try and force precompilation. Depressing.
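
For context, the approach being tried is to run a small workload at module top level so it executes while the package is being precompiled; a minimal sketch of that pattern, with illustrative sizes and calls (this is not the PR's actual diff):

```julia
# Hypothetically, at the bottom of the TriangularSolve module, where
# LinearAlgebra's triangular types are already in scope: running the
# kernels once during precompilation caches their inferred code.
let
    A = rand(1, 1)
    B = rand(1, 1)
    res = similar(A)
    rdiv!(res, A, UpperTriangular(B))
    ldiv!(res, LowerTriangular(B), A)
end
```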
@chriselrod (Member)

Is 0.04 seconds worth it?
How much longer does precompilation take?

@ChrisRackauckas (Member, Author)

Not worth it to merge now, but keeping this up to refer back to.

@ChrisRackauckas (Member, Author) commented Aug 7, 2021

It's 0.2 seconds BTW.

@codecov (bot) commented Aug 7, 2021

Codecov Report

Merging #8 (e858a46) into main (8ffe7da) will decrease coverage by 0.06%.
The diff coverage is 100.00%.


```diff
@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
- Coverage   92.36%   92.30%   -0.07%
==========================================
  Files           1        1
  Lines         249      260      +11
==========================================
+ Hits          230      240      +10
- Misses         19       20       +1
```

| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| src/TriangularSolve.jl | 92.30% <100.00%> | -0.07% ⬇️ |


@chriselrod (Member) commented Aug 9, 2021

> Not worth it to merge now, but keeping this up to refer back to.

This seems to substantially improve time-to-first-div. On this PR:

```julia
julia> @time using TriangularSolve, LinearAlgebra
[ Info: Precompiling TriangularSolve [d5829a12-d9aa-46ab-831f-fb7c9ab06edf]
 24.632061 seconds (8.20 M allocations: 456.805 MiB, 0.58% gc time, 0.09% compilation time)

julia> A = rand(1,1);

julia> B = rand(1, 1);

julia> res = similar(A);

julia> @t TriangularSolve.rdiv!(res, A, UpperTriangular(B))
Time = 0.901624834
1×1 Matrix{Float64}:
 0.9805494758920018

julia> @t TriangularSolve.rdiv!(res, A, UnitUpperTriangular(B))
Time = 0.08022014000000001
1×1 Matrix{Float64}:
 0.04453566915648233

julia> @t TriangularSolve.ldiv!(res, LowerTriangular(B), A)
Time = 0.195168043
1×1 Matrix{Float64}:
 0.9805494758920018

julia> @t TriangularSolve.ldiv!(res, UnitLowerTriangular(B), A)
Time = 0.094892007
1×1 Matrix{Float64}:
 0.04453566915648233
```

Main:

```julia
julia> @time using TriangularSolve, LinearAlgebra
[ Info: Precompiling TriangularSolve [d5829a12-d9aa-46ab-831f-fb7c9ab06edf]
  6.625835 seconds (7.92 M allocations: 430.333 MiB, 1.97% gc time, 0.33% compilation time)

julia> A = rand(1,1);

julia> B = rand(1, 1);

julia> res = similar(A);

julia> @t TriangularSolve.rdiv!(res, A, UpperTriangular(B))
Time = 16.214664716
1×1 Matrix{Float64}:
 55.090822966245106

julia> @t TriangularSolve.rdiv!(res, A, UnitUpperTriangular(B))
Time = 0.337600035
1×1 Matrix{Float64}:
 0.9878212675272343

julia> @t TriangularSolve.ldiv!(res, LowerTriangular(B), A)
Time = 8.620501925000001
1×1 Matrix{Float64}:
 55.090822966245106

julia> @t TriangularSolve.ldiv!(res, UnitLowerTriangular(B), A)
Time = 0.563920552
1×1 Matrix{Float64}:
 0.9878212675272343
```

At least, that's the case on Julia 1.7:

```julia
julia> versioninfo()
Julia Version 1.7.0-beta4.1
Commit c3f8752251* (2021-08-07 02:51 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
```

So this seems worth merging.

@timholy (Contributor) commented Aug 9, 2021

*(screenshot: the `@snoopi_deep` inference flamegraph for this workload)*

The red bars are not precompilable, but the other colors are. The right place to do this precompilation is DiffEqBase; if you click on the bar where my mouse pointer is (EDIT: weird, Ubuntu's screen capture puts it in the wrong spot, it's actually over the lowest green bar and the text is above it), in the REPL you'll see something like this:

```
/tmp/pkgs/packages/DiffEqBase/uQlhE/src/linear_nonlinear.jl:93, MethodInstance for (::DefaultLinSolve)(::Vector{Float64}, ::Any, ::Vector{Float64}, ::Bool)
```

That gives you an indication of what needs to be precompiled. Since the green bar is essentially the full width of the red bar, you'll get the full measure of benefit (the time spent inferring the MethodInstance at the base of the flame is dominated by this particular callee).
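
For reference, the flamegraph above comes from feeding the `@snoopi_deep` result into a viewer; a minimal sketch of that workflow, assuming ProfileView is installed:

```julia
using OrdinaryDiffEq, SnoopCompile, ProfileView

# prob and alg as defined in the earlier snippet.
tinf = @snoopi_deep solve(prob, alg)

# Build and display the inference flamegraph; clicking a bar prints the
# corresponding MethodInstance in the REPL, as described above.
fg = flamegraph(tinf)
ProfileView.view(fg)
```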

You could use SnoopCompile's `parcel` and `write` to generate these, but as we discussed previously, these days I strongly recommend just running some code while the package is building, unless you can't avoid the undesirable side effects. Note that if some of your users are using `Vector{Float32}`, you might consider running that workload too. (It will increase package build time for everyone, unfortunately, but hopefully that's a relatively rare cost.)

@ranocha (Member) commented Aug 9, 2021

> You could use SnoopCompile's `parcel` and `write` to generate these, but as we discussed previously, these days I strongly recommend just running some code while the package is building, unless you can't avoid the undesirable side effects.

I didn't follow that discussion - do you have some hints or references for further information?

@timholy (Contributor) commented Aug 9, 2021

@ranocha, see https://timholy.github.io/SnoopCompile.jl/stable/snoopi_deep_parcel/. Everybody rushes to pick up this hammer, but it's literally the last thing you should do when trying to reduce latency. The right steps are:

  1. check that you're not getting a lot of invalidations, and fix them if you are (see the sketch after this list)
  2. look at the inference profile flamegraph (like the image above) and see if anything pops out (in this case, it does, but that's relatively rare and usually means that the package is in great shape except possibly for missing a bit of precompilation). If you only have a few flames that dominate most of the time, like this case, you can jump immediately to step 5.
  3. use `pgdsgui` to determine whether you are over-specializing, and adjust as needed
  4. solve inference problems (you might need to look at the previous step again a bit)
  5. see if you can precompile by doing "regular work" rather than issuing precompile directives (looking at the flamegraph again here is very helpful; after you fix inference problems, see the previous step, hopefully you don't have very many "flames" to check, just like in this case). See for example "attempt to precompile linsolve" (SciML/DiffEqBase.jl#698), which "does work" rather than declaring `precompile(...)`.
  6. as a last resort, use `parcel` and `write`
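
For step 1, a minimal invalidation check might look like the following sketch (`SomePkg` is a placeholder for whatever package you are loading):

```julia
# Record invalidations triggered while loading the package.
using SnoopCompileCore
invalidations = @snoopr using SomePkg

# Load the heavier analysis code afterwards, so it doesn't pollute the log.
using SnoopCompile
trees = invalidation_trees(invalidations)
# Large trees rooted in methods your package defines are the ones to fix.
```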

I almost actively dislike `parcel` and `write`, as they often hide problems that are better addressed by other means, and often give you very little advantage compared to a more thorough analysis (e.g., JuliaIO/JLD2.jl#344). As both SnoopCompile and JET have discovered, snooping on inference is a terrific way to discover things that many packages should probably change anyway.

@ranocha (Member) commented Aug 9, 2021

Thank you very much for this detailed and helpful answer, @timholy.

@ChrisRackauckas (Member, Author)

Downstream, DiffEq ended up having to precompile this call differently, but most users will get the benefit from this PR.

@chriselrod (Member) commented Aug 29, 2021

This PR caused a performance regression for me.
With precompilation:

```julia
julia> @benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B), Val(false)) # false means single threaded
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.276 μs …  1.964 ms  ┊ GC (min … max): 0.00% … 97.10%
 Time  (median):     21.517 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.028 μs ± 19.593 μs  ┊ GC (mean ± σ):  0.87% ±  0.97%

             █▁▁▆▃ ▂ ▁
  ▃▃▃▃▃▃▃▃▃▇▄█████▇████▇▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  18.3 μs         Histogram: frequency by time        30.7 μs <

 Memory estimate: 6.39 KiB, allocs estimate: 211.
```

Without:

```julia
julia> @benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B), Val(false)) # false means single threaded
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  15.746 μs … 44.410 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.713 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.684 μs ±  1.184 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▁                 ▁▂         ▂▅▃   ▅█▆    ▁▂     ▁         ▁
  ██▃▁▆█▆▃▁▁▁▄▁▃▄▄▄▁▁██▇▃▃▁▃▁▅▃▅███▅▃▄████▅▅▆██▇▅▅▆███▇▇▆▆▆█▇ █
  15.7 μs      Histogram: log(frequency) by time      20.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Notice the memory allocations in particular. =/
This is on:

```julia
julia> versioninfo()
Julia Version 1.8.0-DEV.438
Commit 88a6376e99* (2021-08-28 11:03 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
```

Could there be an inference failure for some reason when it precompiles?

This problem is annoying to debug, because as soon as the code is revised (via Revise.jl), the allocations disappear and the performance improves.
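
One way to probe the suspected inference failure would be something like the following sketch; the matrix setup is assumed, since the benchmark above doesn't show how `A`, `B`, and `C` were constructed, and a fresh-inference analysis may or may not reproduce an issue that only manifests in precompiled code:

```julia
using TriangularSolve, LinearAlgebra

# Assumed setup (sizes are illustrative).
N = 100
A = rand(N, N); B = rand(N, N); C = similar(A)

# JET's optimization analysis flags runtime dispatch in the call graph,
# which would explain allocations appearing where there were none before:
using JET
@report_opt TriangularSolve.rdiv!(C, A, UpperTriangular(B), Val(false))

# Alternatively, inspect inferred types interactively in the REPL:
# @code_warntype TriangularSolve.rdiv!(C, A, UpperTriangular(B), Val(false))
```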

ChrisRackauckas deleted the `precompile` branch on August 29, 2021.
@ChrisRackauckas (Member, Author)

Wat.

@ChrisRackauckas (Member, Author)

SciML/OrdinaryDiffEq.jl#1473

@chriselrod (Member)

Hmm, okay, I'll try playing around with type signatures.

I noticed that adding a few `@inline`s could reduce the number of allocations to 3-6.

@chriselrod (Member)

The weird thing is that the functions it helps to inline are `@generated`, which I thought were always supposed to fully specialize.

I take this to mean that the function calling them isn't specializing?
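
That reading matches Julia's specialization heuristics: by default, Julia avoids specializing on arguments of type `Function`, `Type`, or `Vararg` when they are merely passed through to another call, even if the callee is `@generated`. A generic sketch of the pattern and the usual fix (not the package's actual code):

```julia
# May compile a single method instance shared by all functions f:
# f is only passed along, so the heuristic skips specializing on typeof(f).
passthrough(f, xs) = map(f, xs)

# An explicit type parameter forces specialization on the concrete
# function type, so downstream @generated callees see concrete types.
forced(f::F, xs) where {F} = map(f, xs)
```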
