-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allocation profiler #42768
Allocation profiler #42768
Conversation
Not saying this needs to be done for this PR, but it would be nice if there were an option to output profile results directly as in-memory objects instead of serializing to JSON. I'd like to use this for Dagger.jl to track allocations in real-time, but having to emit and parse JSON (even just within memory) is going to add overhead that feels slightly unnecessary. |
Because most allocations do not go through |
Thank you @maleadt! It did feel like I was getting away with this a bit too easily 😅. I now see that generated code calls
Now I understand why the LLVM modifications in the other PR are necessary! Perhaps rebasing that PR is the right approach after all… 🤔 I tried before, and the amount of conflicts scared me away a bit, haha. |
Still not sure how best to capture the
Profile.@profile_allocs foo()
fetch()
# pass to flamegraphs, pprof, etc which would be defined as something like macro profile_allocs(ex)
return quote
try
enable_alloc_profiling()
if status < 0
error(error_codes[status])
end
$(esc(ex))
finally
disable_alloc_profiling()
end
end
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to @vilterp's comment above, I'd like to take a different approach to this PR than we took in the previous attempts, in order to avoid getting stuck as those PRs have been.
I think an effective approach to this would be to merge the allocations profiling functionality in stages. My impression is that the previous attempts at a memory profiler have gotten stuck in trying to work out the best way to capture the allocations coming from codegen (e.g. gc_pool_alloc
). This PR currently does not capture those allocations, but it does still report a meaningfully large proportion of the allocations. From our experiments (measuring against @time
) it seems to capture roughly 80% of the allocations in our heavy benchmarks, so these seems still quite useful as-is.
I'd like us to wrap up the existing functionality (profiling all allocations that go through jl_gc_alloc()
), and get the PR into a good state, and then review it and merge it, even though it doesn't capture all allocations.
We can file an issue describing the fact that the memory profiler doesn't capture all allocations, with some suggested work plans, and then follow-up with more PRs later. I think that if we add some clear documentation, and maybe even print a warning to the console that "This feature is experimental, and currently does not capture all allocations. The reported allocations are accurate, but maybe be missing a large proportion of the total allocations.", we could merge this and still provide some value, without letting this PR bit rot! :)
Does that sound right to others? @vilterp and I will work this week to wrap up the existing functionality and get things tested and cleaned up. :)
Sometimes get
with this strategy, even though I think I'm doing the same thing as Line 88 in cba3abf
Sometimes it's a different type, like |
EDIT: yeah, things are getting GC'd out from under us — if I do |
c857d4c
to
f8a7720
Compare
Weird thing is, it works sometimes, even when full GCs happen during the collection and decode phases:
|
I haven't looked at the code, but it seems like you'd want to record the data at the moment of allocation, in which case
doesn't seem like it would be a concern. But again, I haven't looked at this due to time constraints.. |
Thanks Tim. We are recording backtraces at the point of allocation. When the backtraces are solely composed of instruction pointers, everything seems to work fine. But when they also include pointers to objects on the heap, it seems those objects are sometimes garbage collected before we decode the stack traces. We think these are objects associated with the interpretation (as opposed to JIT) mode. Stack frames for interpreted functions leave pointers in the stack trace to objects like |
👍 You should be able to serialize those in some more permanent form at the time of collection, possibly under a |
Yeah, it does seem like we either need to either:
2 seems nice, but we were hoping to get away with keeping most of the decoding logic in Julia at the end, so we can use functions in |
Ah, bummer! Makes sense though! Thanks @timholy for chiming in. That does make sense now that you explain it, @vilterp and @timholy. thanks! Yeah, i'd be probably fine with either approach. For You could maybe address that by pre-allocating a global Julia vector, @vilterp, and then stuffing the things you need to track into those? That way there's still no allocations? The only weirdness is that Doing option |
Update, since calling Line 109 in a1e9fd3
Here's where the CPU profiler calls it: julia/stdlib/Profile/src/Profile.jl Line 355 in 5ffee12
If it's good enough for the CPU profiler, it's good enough for us! Also, since we'll no longer need to keep the heap-allocated interpreter frame objects rooted to avoid them being GC'd before we decode the stack trace buffer, we can move the buffer back from Julia arrays into C++, which allows the allocation profiler to not allocate anymore. This seems like a sensible thing. We'll still decode the buffers in a Julia stdlib module and return Julia structs, which can then be used by downstream libraries like PProf, or e.g. a simple JSON serializer. Filed #43796 for a more comprehensive fix for this. |
439f06b
to
1474097
Compare
Split off GC logging in to #43511 |
Support for visualizing the results from the allocations profiler in draft PR: JuliaLang/julia#42768. This was basically copy/pasted from https://github.com/vilterp/AllocProfileParser.jl.
Support for visualizing the results from the allocations profiler in draft PR: JuliaLang/julia#42768. This was basically copy/pasted from https://github.com/vilterp/AllocProfileParser.jl.
Support for visualizing the results from the allocations profiler in draft PR: JuliaLang/julia#42768. This was basically copy/pasted from https://github.com/vilterp/AllocProfileParser.jl.
Error:
|
If it's going to take some time to fix |
This reverts commit ef23d6d.
I'm not sure how accessing the current task's TID is considered a safepoint? I think it might need a |
|
We follow a "guilty until declared innocent" approach here :). |
You generally want jl_current_task->tid anyways |
Ah, thanks, it was the extra failure in the buildkite job. Sorry I didn't do a good enough job comparing that with the existing failures. Thanks for catching this, @Keno! 🙏 |
k, the fix has passed |
…er (#46) * Add support for the Allocs profiles produced by Julia's Allocs Profiler Support for visualizing the results from the allocations profiler in draft PR: JuliaLang/julia#42768. This was basically copy/pasted from https://github.com/vilterp/AllocProfileParser.jl. * Rename to PProf.Allocs.pprof() * Fix minimum julia version based on merge-date * Add tests for PProf.Allocs.pprof(); fix typo Co-authored-by: Pete Vilter <[email protected]> Co-authored-by: Pete Vilter <[email protected]> Co-authored-by: Valentin Churavy <[email protected]>
|
||
jl_raw_backtrace_t get_raw_backtrace() { | ||
// A single large buffer to record backtraces onto | ||
static jl_bt_element_t static_bt_data[JL_MAX_BT_SIZE]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not thread safe. You should be (re)using the existing buffer in the ptls to be safe here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops. Yeah, we need a buffer per thread…
Did you mean to (re)use this buffer? https://github.com/vilterp/julia/blob/c12aca890a8ae387d11db0b54351f8b61305c00b/src/julia_threads.h#L247
It seems like we should add our own (here or as a global) to avoid colliding with that… We've shied away from adding stuff to ptls to avoid breaking stuff or adding overhead, but maybe it's better than globals?
Funny our multithreaded test didn't catch this; maybe the stack traces were just garbage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as you avoid re-using this buffer in the interval between an error filling this buffer, and then allocating the exception stack, I think you should be okay here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes me a little nervous to reuse the buffer for two things, but it is space-efficient! I guess it should be fine like you say…
filed #44099 to track this; we'll write up a PR soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The backtrace would be essentially the same, just a bit longer perhaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @vtjnash for pointing this out! 😅 haha it feels pretty silly that we missed that. Thanks for the help!!
The suggestion to use a buffer in ptls makes sense. I agree with @vilterp that sharing that buffer makes me slightly nervous, but reading through your description that sounds totally fine 👍 cool, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the CPU profiler use that buffer at all? If someone is running CPU profiling and Allocs profiling at the same time, is there a chance it'll interrupt our thread while we were in the middle of using the buffer and we'll get any problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Profiling using a different buffer entirely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more question: is it okay to write over this buffer, since it seems like it's used to scan for roots?:
Lines 344 to 345 in 7ccc83e
// storing bt_size in ptls ensures roots in bt_data will be found | |
ptls->bt_size = rec_backtrace(ptls->bt_data, JL_MAX_BT_SIZE, skip + 1); |
Lines 618 to 623 in 7ccc83e
// The temporary ptls->bt_data is rooted by special purpose code in the | |
// GC. This exists only for the purpose of preserving bt_data until we | |
// set ptls->bt_size=0 below. | |
jl_push_excstack(&ct->excstack, exception, | |
ptls->bt_data, ptls->bt_size); | |
ptls->bt_size = 0; |
I think that it's fine, from what i can read, but just want to double check one more time.
Also, i've opened a PR for this, here: #44114
## Overview Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects. Alternate approach to existing alloc profiler PR: JuliaLang#33467 Complementary to garbage profiler PR: JuliaLang#42658 (maybe there's some nice way to meld them) This may be reinventing the wheel from JuliaLang#33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff. ## Usage: ```julia using Profile.Allocs res = Allocs.@Profile sample_rate=0.001 my_func() prof = Allocs.fetch() # do something with `prof` ``` See also: JuliaPerf/PProf.jl#46 for support for visualizing these. Co-authored-by: Nathan Daly <[email protected]>
## Overview Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects. Alternate approach to existing alloc profiler PR: JuliaLang#33467 Complementary to garbage profiler PR: JuliaLang#42658 (maybe there's some nice way to meld them) This may be reinventing the wheel from JuliaLang#33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff. ## Usage: ```julia using Profile.Allocs res = Allocs.@Profile sample_rate=0.001 my_func() prof = Allocs.fetch() # do something with `prof` ``` See also: JuliaPerf/PProf.jl#46 for support for visualizing these. Co-authored-by: Nathan Daly <[email protected]>
…ile.print Defines a converter for Allocs to the types implemented by Profile for printing, allowing flamegraph report generation of allocations with just the stdlib tooling. Refs: #42768
…ile.print Defines a converter for Allocs to the types implemented by Profile for printing, allowing flamegraph report generation of allocations with just the stdlib tooling. Make these methods public as well, so that users can access the documentation without getting warnings. Refs: #42768
…ile.print (#51981) Defines a converter for Allocs to the types implemented by Profile for printing, allowing flamegraph report generation of allocations with just the stdlib tooling. Refs: #42768 For old versions of Julia, you could previously get download a copy of this code that could be included as a polyfill to add this: https://gist.github.com/vtjnash/c6aa4db9dafccb0fd28a65f87d6b1adb
Overview
Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.
Alternate approach to existing alloc profiler PR: #33467
Complementary to garbage profiler PR: #42658 (maybe there's some nice way to meld them)
This may be reinventing the wheel from #33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.
Usage:
See also: JuliaPerf/PProf.jl#46 for support for visualizing these.
Here are two sample screenshots from analyzing all allocations during function compilation and evaluation:
Output:
Julia structs which can be used by downstream libraries. (See
PProf.jl
: JuliaPerf/PProf.jl#46)Performance
Recording the stack traces with LibUnwind for each allocation has non-trivial overhead, so the allocs profiler comes with a
sample_rate = 0.0001
that controls the probability of recording a sample for each allocation. Withsample_rate = 1
, this example benchmark which allocates heavily (for i in 1:100 fi = Symbol("f$i") ; @eval begin $fi() = 3 ; $fi() end
) spent around 50% of time recording the stack traces (on my MacBook Pro):Since almost all the overhead is in
jl_unw_stepn()
, there's not much more that can be improved in this PR. This is why we let users control the sample rate, based on their particular workload.TODO:
record_allocated_value
frame off the topTYPE_PTR_LOW_THRESHOLD
)Profile
package); ensure tests are runningUnknownType
in allocations results) #43688Co-Authored-By: @NHDaly