Using output from `--trace-compile` from running a workload can sometimes cache more than running the workload itself #48048

KristofferC · 2022-12-30T18:30:00Z

This is a summarized MWE version of the discussion in Ferrite-FEM/Tensors.jl#190.

The following code creates a small package:

mkpath("Ts/src")
write("Ts/src/Ts.jl", """
module Ts
struct T{dim} 
    data::NTuple{dim, Float64}
end

h(::Type{T{dim}}, f) where {dim} = T{dim}((1.0, 1.0))
@inline f(::Type{T{dim}}) where {dim} = h(T{dim}, () -> 1.0)

end
""")

write("Ts/Project.toml", """
name = "Ts"
uuid = "38a634ad-e948-5137-8d70-aa71f2a747f4"
version = "0.1"

[deps]
SnoopPrecompile = "66db9d55-30c0-4569-8b51-7e840670fc0c"
""")

Pkg.activate("Ts")
Pkg.instantiate()

One way to cache compiled code is by using --trace-compile=stderr and put what comes out into the package file. So we run the following code (the [2][1] stuff is to trick constant propagation):

julia --project --trace-compile=stderr --startup-file=no -e 'using Ts; test() = Ts.f(Ts.T{[2][1]}); test();'
precompile(Tuple{typeof(Main.test)})
precompile(Tuple{Type{Ts.T{2}}, Tuple{Float64, Float64}})
precompile(Tuple{typeof(Ts.h), Type{Ts.T{2}}, Function})

The typeof(Main.test) is not interesting but if we put the two others precompile signatures into the package and rerun the workload we see that the methods in Ts are cached and no longer printed

❯ julia --project --trace-compile=stderr --startup-file=no -e 'using Ts; test() = Ts.f(Ts.T{[2][1]}); test();'
precompile(Tuple{typeof(Main.test)})

An alternativ (better) way is to just run this workload in the package itself. SnoopPrecompile has a macro that makes Julia try really hard to compile the stuff in it. So the two precompile signatures are replaced with:

using SnoopPrecompile
@precompile_all_calls f(T{[2][1]})

where again the [2][1] is to try avoid any constant propagation. Running the workload again:

❯ julia --project --trace-compile=stderr --startup-file=no -e 'using Ts; test() = Ts.f(Ts.T{[2][1]}); test();'
precompile(Tuple{typeof(Main.test)})
precompile(Tuple{typeof(Ts.h), Type{Ts.T{2}}, Function})

we see that in this case, the method Ts.h has failed to cache and gets recompiled (the constructor Ts.T managed to cache though). So putting the precompile signature into the package made it cache, but running the workload that emits the precompile signature made it not cache.

There are a few things we can do to fix the discrepancy:

Remove the @inline on f.
Force specialization of f in h with where {F}.

But ideally, running a workload that outputs certain precompile signatures in --trace-compile should be equally good as putting those precompile signatures explicitly into the package.

The text was updated successfully, but these errors were encountered:

KristofferC · 2022-12-30T19:50:01Z

The example in Ferrite-FEM/Tensors.jl#190 is a little bit more complicated and in this case, specializion on f does not help:

module Ts
struct T{dim} 
    data::NTuple{dim, Float64}
end

@generated function h(::Type{T{dim}}, f::F) where {dim, F}
    [:(f($i)) for i=1:dim]
    :(T((f(0), f(0))))
end
@inline f(el::Number, ::Type{T{dim}}) where {dim} = h(T{dim}, i -> el)

using SnoopPrecompile
@precompile_all_calls f(1.0, T{2})

end

timholy · 2022-12-31T00:37:23Z

This appears to be fixed in https://github.com/timholy/SnoopCompile.jl/tree/teh/snoop_pc (the teh/snoop_pc branch of SnoopCompile). However, the issue in Tensors is not; indeed, it's quite a different set of methods that get re-inferred:

julia> using SnoopCompileCore

julia> using Tensors

julia> using SnoopPrecompile

julia> tinf = @time_precompiled @snoopi_deep begin
       @precompile_all_calls begin
       for dim in (2, 3)
               v = ones(Tensor{1, dim, Float64,})
               σ = one(SymmetricTensor{2, dim, Float64})
               F = one(Tensor{2, dim, Float64,})
               E = one(SymmetricTensor{4, dim, Float64})
               C = one(Tensor{4, dim, Float64,})
               v * 1.0
               v ⋅ v
               v ⊗ v
               σ ⋅ v
               F ⋅ v
               σ ⊗ σ
               F ⊗ F
               σ * 1.0
               F * 1.0
               σ ⊡ σ
               F ⊡ F
               E ⊡ σ
               C ⊡ F
               E * 1.0
               C * 1.0

               # TODO: AD?
           end
       end
       end;

julia> using SnoopCompile

julia> tinf
InferenceTimingNode: 0.013773/0.159047 on Core.Compiler.Timings.ROOT() with 1 direct children

julia> tinf.children[1].children
14-element Vector{SnoopCompileCore.InferenceTimingNode}:
 InferenceTimingNode: 0.000637/0.033348 on *([1.0, 1.0]::Vec{2, Float64}, 1.0::Float64) with 2 direct children
 InferenceTimingNode: 0.006718/0.008715 on Tensors.otimes([1.0, 1.0]::Vec{2, Float64}, [1.0, 1.0]::Vec{2, Float64}) with 4 direct children
 InferenceTimingNode: 0.000077/0.031955 on LinearAlgebra.dot([1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}, [1.0, 1.0]::Vec{2, Float64}) with 1 direct children
 InferenceTimingNode: 0.000865/0.003543 on Tensors.otimes([1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}, [1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}) with 6 direct children
 InferenceTimingNode: 0.001037/0.005002 on Tensors.otimes([1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}, [1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}) with 7 direct children
 InferenceTimingNode: 0.000514/0.000649 on *([1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}, 1.0::Float64) with 1 direct children
 InferenceTimingNode: 0.000372/0.000497 on *([1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}, 1.0::Float64) with 1 direct children
 InferenceTimingNode: 0.000165/0.016104 on Tensors.dcontract([1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}, [1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}) with 1 direct children
 InferenceTimingNode: 0.000068/0.004072 on Tensors.dcontract([1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}, [1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}) with 1 direct children
 InferenceTimingNode: 0.000303/0.006795 on Tensors.dcontract([1.0 0.0; 0.0 0.0;;; 0.0 0.0; 1.0 0.0;;;; 0.0 1.0; 0.0 0.0;;; 0.0 0.0; 0.0 1.0]::Tensor{4, 2, Float64, 16}, [1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}) with 4 direct children
 InferenceTimingNode: 0.000459/0.004824 on *([1.0 0.0; 0.0 0.0;;; 0.0 0.5; 0.5 0.0;;;; 0.0 0.5; 0.5 0.0;;; 0.0 0.0; 0.0 1.0]::SymmetricTensor{4, 2, Float64, 9}, 1.0::Float64) with 2 direct children
 InferenceTimingNode: 0.000560/0.018592 on *([1.0 0.0; 0.0 0.0;;; 0.0 0.0; 1.0 0.0;;;; 0.0 1.0; 0.0 0.0;;; 0.0 0.0; 0.0 1.0]::Tensor{4, 2, Float64, 16}, 1.0::Float64) with 2 direct children
 InferenceTimingNode: 0.000060/0.000081 on iterate((2, 3)::Tuple{Int64, Int64}, 2::Int64) with 1 direct children
 InferenceTimingNode: 0.000170/0.000225 on iterate((2, 3)::Tuple{Int64, Int64}, ::Int64) with 1 direct children

timholy · 2022-12-31T09:59:20Z

I added the following to SnoopPrecompile:

tim@diva:~/.julia/dev/SnoopCompile/SnoopPrecompile$ git diff src/SnoopPrecompile.jl
diff --git a/SnoopPrecompile/src/SnoopPrecompile.jl b/SnoopPrecompile/src/SnoopPrecompile.jl
index d0dcd6b..59f4c41 100644
--- a/SnoopPrecompile/src/SnoopPrecompile.jl
+++ b/SnoopPrecompile/src/SnoopPrecompile.jl
@@ -1,5 +1,7 @@
 module SnoopPrecompile

+using MethodAnalysis
+
 export @precompile_all_calls, @precompile_setup

 const verbose = Ref(false)    # if true, prints all the precompiles
@@ -11,7 +13,7 @@ function precompile_roots(roots)
     for child in roots
         mi = child.mi_info.mi
         precompile(mi.specTypes) # TODO: Julia should allow one to pass `mi` directly (would handle `invoke` properly)
-        verbose[] && println(mi)
+        verbose[] && printstyled(mi, '\n'; color=:cyan)
     end
 end

@@ -74,6 +76,11 @@ macro precompile_all_calls(ex::Expr)
     return esc(quote
         if ccall(:jl_generating_output, Cint, ()) == 1 || $SnoopPrecompile.verbose[]
             $ex
+            if $SnoopPrecompile.verbose[]
+                for mi in $methodinstances_owned_by(@__MODULE__)
+                    printstyled(mi, '\n'; color=:green)
+                end
+            end
         end
     end)
 end

and modified the precompile section of Ts.jl to be

using SnoopPrecompile
SnoopPrecompile.verbose[] = true
@precompile_all_calls f(T{[2][1]})

and got the following output (screenshot to show color):

Furthermore:

tim@diva:/tmp/pkgimage/Ts$ ~/src/juliaw/julia --project -q
julia> using PkgCacheInspector

julia> cf = info_cachefile("Ts");

julia> using MethodAnalysis

julia> methodinstances_owned_by(only(cf.modules))
3-element Vector{Core.MethodInstance}:
 MethodInstance for Ts.f(::Type{Ts.T{2}})
 MethodInstance for Ts.h(::Type{Ts.T{2}}, ::Function)
 MethodInstance for Ts.h(::Type{Ts.T{2}}, ::Ts.var"#1#2")

So it's being cached. Finally:

tim@diva:/tmp/pkgimage/Ts$ ~/src/juliaw/julia --project -q
julia> using Ts

julia> using MethodAnalysis

julia> methodinstances_owned_by(Ts)
3-element Vector{Core.MethodInstance}:
 MethodInstance for Ts.f(::Type{Ts.T{2}})
 MethodInstance for Ts.h(::Type{Ts.T{2}}, ::Function)
 MethodInstance for Ts.h(::Type{Ts.T{2}}, ::Ts.var"#1#2")

So it's also being reloaded.

The issue is the abstract method (not the concrete one) gets re-inferred despite its presence in the cache. CC @aviatesk, xref #47960.

KristofferC · 2022-12-31T11:00:46Z

My follow up question is why it seems to work when the explicit precompile invocation is put into the package? To me, that seems like it indicates that Julia has the ability to cache this properly without re-inferring the method, it is "just" that when using the direct workload á la @precompile_all_calls that it fails. So perhaps the difference in how the method gets cached when precompile is printed vs how it is cached when it is directly executed is interesting to look at?

KristofferC added the compiler:latency Compiler latency label Dec 30, 2022

KristofferC mentioned this issue Dec 30, 2022

add a precompile workload Ferrite-FEM/Tensors.jl#190

Merged

timholy mentioned this issue Dec 31, 2022

Re-inference where there shouldn't be #47960

Closed

timholy mentioned this issue Jan 23, 2023

Precompilation in the SnoopPrecompile/Julia 1.9+ world jump-dev/MathOptInterface.jl#2226

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using output from `--trace-compile` from running a workload can sometimes cache more than running the workload itself #48048

Using output from `--trace-compile` from running a workload can sometimes cache more than running the workload itself #48048

KristofferC commented Dec 30, 2022 •

edited

Loading

KristofferC commented Dec 30, 2022

timholy commented Dec 31, 2022 •

edited

Loading

timholy commented Dec 31, 2022

KristofferC commented Dec 31, 2022

Using output from --trace-compile from running a workload can sometimes cache more than running the workload itself #48048

Using output from --trace-compile from running a workload can sometimes cache more than running the workload itself #48048

Comments

KristofferC commented Dec 30, 2022 • edited Loading

KristofferC commented Dec 30, 2022

timholy commented Dec 31, 2022 • edited Loading

timholy commented Dec 31, 2022

KristofferC commented Dec 31, 2022

Using output from `--trace-compile` from running a workload can sometimes cache more than running the workload itself #48048

Using output from `--trace-compile` from running a workload can sometimes cache more than running the workload itself #48048

KristofferC commented Dec 30, 2022 •

edited

Loading

timholy commented Dec 31, 2022 •

edited

Loading