Memory profiler (continuation of #31534) #33467
Conversation
Force-pushed from ec85f1d to 884c28e.
src/gc.c (Outdated)
// so that things like Interpreter frame objects do not disappear.
jl_gc_queue_bt_buf(gc_cache, &sp, ptls2->bt_data, ptls2->bt_size);
jl_gc_queue_bt_buf(gc_cache, &sp, (uintptr_t *)jl_profile_get_data(), jl_profile_len_data());
jl_gc_queue_bt_buf(gc_cache, &sp, (uintptr_t *)jl_memprofile_get_bt_data(), jl_memprofile_len_bt_data());
This is done for every thread — should be moved out of the loop?
I see that we're marking the profile data here as well as the memprofile data. How big does the profile data buffer get? Should we be worried about a performance hit here?
Force-pushed from 5fa9c30 to 200452c.
Has this addressed review comments on #31534?
Not yet; there are some issues with the PR that I've been focusing on first (recent profiler/backtrace changes, memory corruption with realloc, GC analyzer false positives).
Sounds fine. Feel free to ping me if you want a re-review once that's done.
doc/src/manual/profile.md (Outdated)
-julia> @profile myfunc()
+julia> Time.@profile myfunc()
I'm not sure we can do this change, so we should probably have a backwards-compatible API. Something like `Profile.@memprofile` / `Profile.memprint`, maybe.
Yeah, this was just a quick fix to have all the functionality exposed. I'll have a look at a better API, maybe `@profile target=:memory ex`, and have the other functions dispatch to the correct module based on the data that is being gathered.
src/gc.c (Outdated)
@@ -938,10 +940,44 @@ void jl_gc_track_malloced_array(jl_ptls_t ptls, jl_array_t *a) JL_NOTSAFEPOINT
    ptls->heap.mallocarrays = ma;
}

-void jl_gc_count_allocd(size_t sz) JL_NOTSAFEPOINT
+void jl_gc_count_allocd(void * addr, size_t sz, uint16_t tag) JL_NOTSAFEPOINT
This is used in the hot path, should be at least a static inline function.
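For concreteness, a minimal sketch of that shape, assuming the `gc_num` counter from `gc.c` and the `jl_memprofile_*` entry points introduced in this PR (the argument order of `jl_memprofile_track_alloc` is illustrative):

```c
// Sketch only: the cheap common path gets inlined at call sites, and the
// profiler work hides behind a branch that is false when not profiling.
STATIC_INLINE void jl_gc_count_allocd(void *addr, size_t sz, uint16_t tag) JL_NOTSAFEPOINT
{
    gc_num.allocd += sz;  // always just bump the GC counter
    if (jl_memprofile_running())
        jl_memprofile_track_alloc(addr, tag, sz);  // slow path, only while profiling
}
```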
src/memprofile.c (Outdated)
    return memprof_alloc_data_size_max;
}

JL_DLLEXPORT int jl_memprofile_running(void) JL_NOTSAFEPOINT
Same for this one.
Force-pushed from 27688c9 to 179cdc8.
Reworked the C parts of the profiler based on the generalized backtrace format (left the stdlib changes out for now, while trying to share some more code and data between both profilers).
Force-pushed from 13f6341 to 827eadb.
I've had a somewhat cursory look over this. I like how you've used the backtrace buffers for serializing the allocation info; I think that makes sense. My only reservation there is whether scanning those buffers for roots will become expensive and whether there's anything we can do about it. How big do the buffers get in a typical profiling run?

Regarding the other implementation detail, it strikes me that writing the type of the allocation info is curiously non-atomic compared to writing the other data (cf. `jl_memprofile_set_typeof` vs `jl_memprofile_track_alloc`). I guess this falls out of how we currently do codegen for allocation? However, I do think it's a potential sign of "trouble" (performance or code-size impact) that this needs changes to codegen at all; it seems like the best outcome would be to have all the memory profiling stuff hidden away behind a single `if (__unlikely(jl_memprofile_is_running()))` inside some runtime GC function call(s) that are already emitted. Is that possible with some rearrangement?
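A sketch of what that could look like, under stated assumptions: `big_alloc_impl` and the tag constant are hypothetical stand-ins, while `jl_memprofile_running`/`jl_memprofile_track_alloc` follow this PR's naming.

```c
// All profiler work sits behind one predictable branch inside a runtime
// function that codegen already calls, so allocation sites need no new code.
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc(jl_ptls_t ptls, size_t sz)
{
    jl_value_t *v = big_alloc_impl(ptls, sz);  // existing allocation path (hypothetical name)
    if (__unlikely(jl_memprofile_running()))
        jl_memprofile_track_alloc(v, MEMPROF_TAG_CPU_BIG, sz);  // hypothetical tag constant
    return v;
}
```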
@@ -119,8 +167,11 @@ end
Get the backtrace of the current exception, for use within `catch` blocks.
"""
function catch_backtrace()
-    bt, bt2 = ccall(:jl_get_backtrace, Ref{SimpleVector}, ())
-    return _reformat_bt(bt::Vector{Ptr{Cvoid}}, bt2::Vector{Any})
+    bt1, bt2 = ccall(:jl_get_backtrace, Ref{SimpleVector}, ())
This looks good to me. Note that `bt2` is actually redundant here because the elements of `bt2` are rooted in the task's current catch stack until the current `catch` block exits. So we could actually change `jl_get_backtrace` to remove `bt2` entirely. (Could be a separate PR though. I'd be happy to review a PR which just did this and some of the `_reformat_bt` changes and get that done in short order if you felt like pulling those out separately.)
ret = Vector{Union{InterpreterIP,Ptr{Cvoid}}}()
i, j = 1, 1
struct AllocationInfo
    T::Union{Nothing,Type}
Could call this field `type`?
// 2.3. mark any managed objects in the backtrace buffers,
// so that things like Interpreter frame objects do not disappear.
jl_gc_queue_bt_buf(gc_cache, &sp, ptls2->bt_data, ptls2->bt_size);
jl_gc_queue_bt_buf(gc_cache, &sp, (jl_bt_element_t *)jl_profile_get_data(), jl_profile_len_data());
The profile data is global, right? This line should go outside the `jl_n_threads` loop.
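A sketch of the suggested restructuring, reusing the helpers from the diff above (the real loop body does more marking work than shown):

```c
// Per-thread backtrace buffers are queued inside the thread loop...
for (int t_i = 0; t_i < jl_n_threads; t_i++) {
    jl_ptls_t ptls2 = jl_all_tls_states[t_i];
    jl_gc_queue_bt_buf(gc_cache, &sp, ptls2->bt_data, ptls2->bt_size);
}
// ...while the global profile buffers are queued exactly once.
jl_gc_queue_bt_buf(gc_cache, &sp, (jl_bt_element_t *)jl_profile_get_data(), jl_profile_len_data());
jl_gc_queue_bt_buf(gc_cache, &sp, (jl_bt_element_t *)jl_memprofile_get_bt_data(), jl_memprofile_len_bt_data());
```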
#ifdef LIBOSXUNWIND
-size_t rec_backtrace_ctx_dwarf(jl_bt_element_t *bt_data, size_t maxsize, bt_context_t *ctx, int add_interp_frames) JL_NOTSAFEPOINT;
+size_t rec_backtrace_ctx_dwarf(jl_bt_element_t *bt_data, size_t *bt_size, size_t maxsize,
+                               bt_context_t *ctx, int add_interp_frames) JL_NOTSAFEPOINT;
Looks good. Should return an `int` as a flag by convention, rather than `size_t`, I think.
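That is, something along these lines; only the return type changes from the diff above, and the success/failure convention is a guess:

```c
// Returns a status flag (say, nonzero on success, 0 if truncated); the
// backtrace length already comes back through *bt_size.
int rec_backtrace_ctx_dwarf(jl_bt_element_t *bt_data, size_t *bt_size, size_t maxsize,
                            bt_context_t *ctx, int add_interp_frames) JL_NOTSAFEPOINT;
```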
//
// Shared infrastructure
Good to see this consolidated.
Could do with a comment about where the non-shared stuff is kept. (I assume in the signals-*.c?)
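Something as small as this would do, assuming the platform-specific samplers do live in the `signals-*.c` files as guessed above:

```c
//
// Shared infrastructure for the time and memory profilers. The
// platform-specific sampling/signal-handling code stays in signals-*.c.
```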
bt_entry[5].uintptr = allocsz;
// Used to "tag" this allocation within a particular domain (CPU, GPU, other)
// or within a particular allocator (Pool, std, malloc), or as a free instead.
bt_entry[6].uintptr = tag;
Perhaps it would be more natural to put all this into the buffer ahead of the actual backtrace? That way you know that the following backtrace refers to an allocation when reading through the buffer later. (Or maybe this is known implicitly?)

(Edit: oh, I see — `jl_memprofile_set_typeof()` needs access to `bt_entry[2].jlvalue`. That does seem kind of unfortunate, but I haven't read enough of what's here to know whether that's necessary.)
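As an illustration of "metadata ahead of the backtrace", a purely hypothetical record layout; the fields mirror the `bt_entry[2..6]` writes quoted here, but none of these names exist in the PR:

```c
// Hypothetical fixed-size header preceding each variable-length backtrace,
// so a reader scanning the buffer knows the next record is an allocation.
typedef struct {
    jl_bt_element_t type;   // jlvalue slot, filled in later by jl_memprofile_set_typeof()
    uintptr_t addr;         // allocation address, matched against later frees
    uintptr_t time;         // timestamp packed into a word (see below)
    uintptr_t allocsz;      // requested size
    uintptr_t tag;          // domain/allocator tag, or "this is a free"
    uintptr_t bt_size;      // number of jl_bt_element_t entries that follow
} memprof_record_header_t;
```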
// The location of the data in memory, used to match allocations with deallocations.
bt_entry[3].uintptr = (uintptr_t) v;
// The time at which this happened
bt_entry[4].uintptr = jl_clock_now(); // FIXME: double to a potentially 32-bit uintptr
Right, you'll need to pack it into one word (`#ifdef _P64`) or two otherwise. I think that will do for now, but it does emphasize that we're drifting toward using the "backtrace" buffer for somewhat general-purpose (but GC-aware) serialization. I don't think that's bad, but it's food for thought.
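A sketch of the packing, assuming Julia's `_P64` platform macro; `memcpy` keeps it free of type-punning issues:

```c
#include <stdint.h>
#include <string.h>

// Pack a double timestamp into profile-buffer words: one 8-byte word on
// 64-bit targets, two 4-byte words otherwise. Returns the words consumed.
static size_t pack_timestamp(uintptr_t *dst, double t)
{
    memcpy(dst, &t, sizeof(double));  // writes dst[0], and dst[1] on 32-bit
#ifdef _P64
    return 1;
#else
    return 2;
#endif
}
```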
//
// data structures
volatile jl_bt_element_t *bt_data_prof = NULL;
Given that allocation can be done in parallel we'll presumably need a lock around this profile data buffer for memory profiling. This is different from the performance profiling case which uses sampling and only touches the profile data buffer from the signal handler thread.
(Alternatively we could use separate memory profile data buffers per thread which could be merged at the end, at the cost of storage. Might be fine though if we never initialize that memory so that it can be backed lazily by the OS page table?)
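A sketch of the simpler of the two options, a single lock serializing appends; `uv_mutex_t` because the runtime already links libuv, with the variable name and body elided as hypothetical:

```c
// Hypothetical locking around the shared memory-profile buffer. Unlike the
// sampling time profiler, allocating threads append here concurrently.
static uv_mutex_t memprof_lock;  // uv_mutex_init(&memprof_lock) at profiler init

JL_DLLEXPORT void jl_memprofile_track_alloc(void *v, uint16_t tag, size_t allocsz) JL_NOTSAFEPOINT
{
    uv_mutex_lock(&memprof_lock);
    // ... write the record header and backtrace into bt_data_prof ...
    uv_mutex_unlock(&memprof_lock);
}
```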
assert(bt_size_cur < bt_size_max);
}

JL_DLLEXPORT void jl_memprofile_set_typeof(void * v, void * ty) JL_NOTSAFEPOINT
The fact that `jl_memprofile_set_typeof` is done separately from `jl_memprofile_track_alloc` makes this non-atomic with respect to any locking scheme we want to use within these functions. TBH that seems likely to lead to problems, and it would be great to figure out how to make writing the allocation trace and info more atomic.
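One speculative way out: if codegen could hand the type to the tracking call itself, the whole record could be written under a single lock acquisition and the separate setter would disappear. The merged signature below is hypothetical:

```c
// Hypothetical merged entry point: trace, size, tag, and type written together
// under one lock, removing the jl_memprofile_set_typeof() window.
JL_DLLEXPORT void jl_memprofile_track_alloc(void *v, jl_value_t *ty,
                                            uint16_t tag, size_t allocsz) JL_NOTSAFEPOINT;
```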
Speculation: the buffer is append-only during a profiling run, so I guess we shouldn't need to sweep the whole thing except during a full collection. Instead maintain an index to the first young object within the buffer, or some such thing.
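Sketched out, that speculation might look like this, reusing `bt_data_prof`/`bt_size_cur` from the snippets above; `memprof_young_start` and the `full` flag are illustrative:

```c
// Hypothetical watermark: entries below it reference objects that survived a
// previous collection and were promoted, so quick collections can skip them.
static size_t memprof_young_start = 0;

// In the mark phase (gc.c), with `full` being the full-collection flag:
size_t start = full ? 0 : memprof_young_start;
jl_gc_queue_bt_buf(gc_cache, &sp, (jl_bt_element_t *)bt_data_prof + start,
                   bt_size_cur - start);
memprof_young_start = bt_size_cur;  // everything queued so far is now "old"
```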
Also adds a `bt_overflow` flag instead of spitting out messages in the middle of profiling, to be used by client profiling code. This change allows for better checking of stack frames that could be incomplete due to insufficient backtrace buffer space. Realistically, a single truncated stack trace in the case of a sampling time profiler is unlikely to create large problems. However, when taking backtraces for things such as a memory profiler, it is critical that all backtraces be accurate, so we allow client code to be somewhat stricter here.
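For instance, a strict client might do something like the following; only the `bt_overflow` flag itself comes from this commit, and both function names are invented accessors for the sake of illustration:

```c
// Hypothetical strict client: a memory profiler cannot tolerate truncated
// backtraces, so it aborts the run instead of silently mismatching records.
if (jl_profile_bt_overflowed()) {            // hypothetical flag accessor
    jl_safe_printf("memory profile aborted: backtrace buffer overflowed\n");
    jl_memprofile_stop();                     // hypothetical
}
```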
This adds C support for a memory profiler within the GC, tracking locations of allocations, deallocations, etc. It operates in a similar manner to the time profiler, with single large buffers set up beforehand through an initialization function, reducing the need for expensive allocations while the program being measured is running.

The memory profiler instruments the GC in all locations where the GC statistics themselves are being modified (e.g. `gc_num.allocd` and `gc_num.freed`) by introducing new helper functions `jl_gc_count_{allocd,freed,reallocd}()`. Those utility functions call the `jl_memprofile_track_{de,}alloc()` method to register an address, a size, and a tag with the memory profiler. We also track type information, as this can be critically helpful when debugging; to do so without breaking API guarantees, we insert methods to set the type of a chunk of memory after allocating it where necessary.

The tagging system allows the memory profiler to disambiguate, at profile time, between e.g. pooled allocations and the "big" allocator. It also allows the memory allocator to support tracking multiple "memory domains": e.g. a GPU support package could manually call `jl_memprofile_track_alloc()` any time a chunk of memory is allocated on the GPU, so as to use the same system. By default, all values are tracked; however, one can set a `memprof_tag_filter` value to track only the values you are most interested in (e.g. only CPU-domain big allocations).
Maybe makes it possible to get rid of the `set_typeof` call in generated code.
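As an illustration of the "memory domains" idea from the commit message above, an external package's hook might look like this; only `jl_memprofile_running()` and `jl_memprofile_track_alloc()` are named by the PR, while the tag value and everything else is made up:

```c
#define MY_MEMPROF_TAG_GPU 0x20  // hypothetical domain tag

// Called by a GPU support package whenever it allocates device memory,
// feeding GPU allocations into the same profile buffer as CPU ones.
static void my_gpu_alloc_hook(void *devptr, size_t sz)
{
    if (jl_memprofile_running())
        jl_memprofile_track_alloc(devptr, MY_MEMPROF_TAG_GPU, sz);
}
```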
Force-pushed from 827eadb to c84f170.
It'd be great to have this; I'm happy to help test and push this along as much as I can if no one who originally worked on it can. But if someone could at least rebase, that'd be nice, since it seems there have been some backtrace changes since the changes proposed here.
Yeah, sorry for leaving this PR in a bad state -- I didn't like the codegen part (ideally there shouldn't be any profiler-related code being generated all the time), and the stdlib would need some very careful redesign to avoid breaking the API of the current time profiler without just duplicating all code into a memory profiler. At that point I had found my memory leak, so I ran out of reasons to look at this 🙂 I can push the WIP stdlib if you're interested in reviving this, though.
Agreed. One thought I had was that perhaps we could treat this more like code coverage: just start julia in a special mode where codegen inserts memory profiling code. To be honest, I don't think this is ideal — better to have memory profiling dynamically available — but it would make the whole thing a lot less intrusive to normal usage and allow us to merge something soon with less concern about performance impact.
+1 to this being a great feature to have. Does anyone recall if this was in a working state? I tried to `make` on my macOS machine but failed, as CI did.
## Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: #33467
Complementary to garbage profiler PR: #42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from #33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

## Usage:

```julia
using Profile.Allocs
res = Allocs.@profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`
```

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Co-authored-by: Nathan Daly <[email protected]>
Given the recent memory profiling work, should we close this?
Rebase of #31534. Putting it in a separate PR because the rebase was pretty painful and stuff might have broken (notably the profile buffer overflow handling).
I reverted a couple of renames and generally tried to minimize the diff to make it easier to rebase, but we can now continue any API rework on this branch of course.
cc @staticfloat