
Memory profiler (continuation of #31534) #33467

Closed · wants to merge 7 commits into from

Conversation

@maleadt (Member) commented Oct 4, 2019

Rebase of #31534. Putting it in a separate PR because the rebase was pretty painful and stuff might have broken (notably the profile buffer overflow handling).

I reverted a couple of renames and generally tried to minimize the diff to make it easier to rebase, but we can now continue any API rework on this branch of course.

cc @staticfloat

@maleadt force-pushed the tb/memprofiler branch 9 times, most recently from ec85f1d to 884c28e on October 4, 2019 13:43
base/error.jl Outdated
src/gc.c Outdated
// so that things like Interpreter frame objects do not disappear.
jl_gc_queue_bt_buf(gc_cache, &sp, ptls2->bt_data, ptls2->bt_size);
jl_gc_queue_bt_buf(gc_cache, &sp, (uintptr_t *)jl_profile_get_data(), jl_profile_len_data());
jl_gc_queue_bt_buf(gc_cache, &sp, (uintptr_t *)jl_memprofile_get_bt_data(), jl_memprofile_len_bt_data());
Review comment (Member):
This is done for every thread — should be moved out of the loop?

@c42f (Member) commented Oct 5, 2019:

I see that we're marking the profile data here as well as the memprofile data.

How big does the profile data buffer get? Should we be worried about a performance hit here?

@maleadt force-pushed the tb/memprofiler branch 7 times, most recently from 5fa9c30 to 200452c on October 7, 2019 15:06
@timholy (Member) commented Oct 7, 2019

Has this addressed review comments on #31534?

@maleadt (Member, Author) commented Oct 8, 2019

> Has this addressed review comments on #31534?

Not yet, there are some issues with the PR that I've been focusing on first (recent profiler/backtrace changes, memory corruption with realloc, gc analyzer false positives).

@timholy (Member) commented Oct 8, 2019

Sounds fine. Feel free to ping me if you want a re-review once that's done.


julia> @profile myfunc()
julia> Time.@profile myfunc()
@KristofferC (Member) commented Oct 8, 2019:

I'm not sure we can make this change, so we should probably have a backwards-compatible API. Something like

Profile.@memprofile
Profile.memprint

maybe.

Reply (Member, Author):

Yeah, this was just a quick fix to have all the functionality exposed. I'll have a look at a better API, maybe `@profile target=:memory ex`, and have the other functions dispatch to the correct module based on the data that is being gathered.

src/gc.c Outdated
@@ -938,10 +940,44 @@ void jl_gc_track_malloced_array(jl_ptls_t ptls, jl_array_t *a) JL_NOTSAFEPOINT
ptls->heap.mallocarrays = ma;
}

void jl_gc_count_allocd(size_t sz) JL_NOTSAFEPOINT
void jl_gc_count_allocd(void * addr, size_t sz, uint16_t tag) JL_NOTSAFEPOINT
Review comment (Contributor):
This is used in the hot path, should be at least a static inline function.
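Since this counter sits on the allocation hot path, one common shape for the suggested fix is a header-level `static inline` wrapper that does only the cheap counting in the common case and jumps to the profiler behind an unlikely branch. A minimal sketch — the function names follow the PR, but the counter layout and slow path are simplified stand-ins, not the actual implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins: the real code updates gc_num and calls into
 * src/memprofile.c; here a single counter and a flag suffice. */
static size_t gc_allocd_bytes = 0;
static int memprofile_on = 0;

static int jl_memprofile_running(void) { return memprofile_on; }

static void jl_memprofile_track_alloc(void *addr, size_t sz, uint16_t tag)
{
    (void)addr; (void)sz; (void)tag;  /* slow path lives out of line */
}

/* Header-style inline wrapper: profiler-off case is one increment
 * plus a well-predicted branch. */
static inline void jl_gc_count_allocd(void *addr, size_t sz, uint16_t tag)
{
    gc_allocd_bytes += sz;
    if (__builtin_expect(jl_memprofile_running(), 0))
        jl_memprofile_track_alloc(addr, sz, tag);
}
```

With this shape, enabling the profiler costs nothing until `memprofile_on` flips, which also addresses the later review point about keeping profiling out of generated code.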

src/memprofile.c Outdated
return memprof_alloc_data_size_max;
}

JL_DLLEXPORT int jl_memprofile_running(void) JL_NOTSAFEPOINT
Review comment (Contributor):
Same for this one.

@maleadt force-pushed the tb/memprofiler branch 3 times, most recently from 27688c9 to 179cdc8 on October 9, 2019 09:47
@maleadt (Member, Author) commented Oct 18, 2019

Reworked the C parts of the profiler based on the generalized backtrace format (I left the stdlib changes out for now, while trying to share some more code and data between both profilers).
Maybe this warrants a call for comments from @c42f, since you touched/wrote much of that code:

@c42f (Member) left a comment:

I've had a somewhat cursory look over this. I like how you've used the backtrace buffers for serializing the allocation info; I think that makes sense. My only reservation there is whether scanning those buffers for roots will become expensive and whether there's anything we can do about it. How big do the buffers get in a typical profiling run?

Regarding the other implementation detail, it strikes me that writing the type of the allocation info is curiously non-atomic compared to writing the other data (cf. jl_memprofile_set_typeof vs jl_memprofile_track_alloc). I guess this falls out of how we currently do codegen for allocation? However, I do think it's a potential sign of "trouble" (performance or code size impact) that this needs changes to codegen at all; it seems like the best outcome would be to have all the memory profiling stuff hidden away behind a single if (__unlikely(jl_memprofile_is_running())) inside some runtime GC function call(s) which are already emitted. Is that possible with some rearrangement?

@@ -119,8 +167,11 @@ end
Get the backtrace of the current exception, for use within `catch` blocks.
"""
function catch_backtrace()
bt, bt2 = ccall(:jl_get_backtrace, Ref{SimpleVector}, ())
return _reformat_bt(bt::Vector{Ptr{Cvoid}}, bt2::Vector{Any})
bt1, bt2 = ccall(:jl_get_backtrace, Ref{SimpleVector}, ())
Review comment (Member):
This looks good to me. Note that bt2 is actually redundant here because the elements of bt2 are rooted in the task's current catch stack until the current catch block exits. So we could actually change jl_get_backtrace to remove bt2 entirely. (Could be a separate PR though. I'd be happy to review a PR which just did this and some of the _reformat_bt changes and get that done in short order if you felt like pulling those out separately.)

ret = Vector{Union{InterpreterIP,Ptr{Cvoid}}}()
i, j = 1, 1
struct AllocationInfo
T::Union{Nothing,Type}
Review comment (Member):
Could call this field `type`?

// 2.3. mark any managed objects in the backtrace buffers,
// so that things like Interpreter frame objects do not disappear.
jl_gc_queue_bt_buf(gc_cache, &sp, ptls2->bt_data, ptls2->bt_size);
jl_gc_queue_bt_buf(gc_cache, &sp, (jl_bt_element_t *)jl_profile_get_data(), jl_profile_len_data());
Review comment (Member):
The profile data is global right? This line should go outside the jl_n_threads loop.
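The suggested fix is mechanical: queue each thread's `bt_data` inside the per-thread loop, but queue the global profile buffers exactly once after it. A toy sketch that just counts queue calls — the real `jl_gc_queue_bt_buf` takes the GC cache, a stack pointer, and the buffer, so this stand-in is purely illustrative:

```c
static int queue_calls = 0;

/* Stand-in for jl_gc_queue_bt_buf(gc_cache, &sp, data, len). */
static void queue_bt_buf(void) { queue_calls++; }

static void mark_backtrace_buffers(int jl_n_threads)
{
    for (int t = 0; t < jl_n_threads; t++)
        queue_bt_buf();   /* per-thread ptls2->bt_data */
    /* The profiler buffers are process-global, so queue them once,
     * outside the loop, rather than once per thread. */
    queue_bt_buf();       /* jl_profile_get_data() */
    queue_bt_buf();       /* jl_memprofile_get_bt_data() */
}
```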

#ifdef LIBOSXUNWIND
size_t rec_backtrace_ctx_dwarf(jl_bt_element_t *bt_data, size_t maxsize, bt_context_t *ctx, int add_interp_frames) JL_NOTSAFEPOINT;
size_t rec_backtrace_ctx_dwarf(jl_bt_element_t *bt_data, size_t *bt_size, size_t maxsize,
bt_context_t *ctx, int add_interp_frames) JL_NOTSAFEPOINT;
Review comment (Member):
Looks good. Should return an int as a flag by convention, rather than size_t I think.



//
// Shared infrastructure
Review comment (Member):
Good to see this consolidated.

Could do with a comment about where the non-shared stuff is kept. (I assume in the signals-*.c?)

bt_entry[5].uintptr = allocsz;
// Used to "tag" this allocation within a particular domain (CPU, GPU, other)
// or within a particular allocator (Pool, std, malloc), or as a free instead.
bt_entry[6].uintptr = tag;
Review comment (Member):
Perhaps it would be more natural to put all this into the buffer ahead of the actual backtrace? That way you know that the following backtrace refers to an allocation when reading through the buffer later. (Or maybe this is known implicitly?)

(Edit: oh, I see — jl_memprofile_set_typeof() needs access to bt_entry[2].jlvalue. That does seem kind of unfortunate but I haven't read enough of what's here to know whether that's necessary.)

// The location of the data in memory, used to match allocations with deallocations.
bt_entry[3].uintptr = (uintptr_t) v;
// The time at which this happened
bt_entry[4].uintptr = jl_clock_now(); // FIXME: double to a potentially 32-bit uintptr
Review comment (Member):
Right, you'll need to pack it into one word (#ifdef _P64) or two otherwise. I think that will do for now, but it does emphasize that we're drifting toward using the "backtrace" buffer for somewhat general purpose (but gc-aware) serialization. I don't think that's bad but it's food for thought.
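A sketch of the packing this comment describes: bit-copy the `double` timestamp into one `uintptr_t` word on 64-bit targets, or split it across two words otherwise. These helper names are hypothetical, not code from the PR:

```c
#include <stdint.h>
#include <string.h>

#if UINTPTR_MAX == 0xffffffffffffffffu
/* 64-bit: the double's bits fit a single buffer word. */
static void store_timestamp(uintptr_t *slot, double t) {
    memcpy(slot, &t, sizeof(double));
}
static double load_timestamp(const uintptr_t *slot) {
    double t; memcpy(&t, slot, sizeof(double)); return t;
}
#else
/* 32-bit: split the 64 bits across two consecutive words. */
static void store_timestamp(uintptr_t *slot, double t) {
    uint64_t bits; memcpy(&bits, &t, sizeof(double));
    slot[0] = (uintptr_t)(bits & 0xffffffffu);  /* low word  */
    slot[1] = (uintptr_t)(bits >> 32);          /* high word */
}
static double load_timestamp(const uintptr_t *slot) {
    uint64_t bits = (uint64_t)slot[0] | ((uint64_t)slot[1] << 32);
    double t; memcpy(&t, &bits, sizeof(double)); return t;
}
#endif
```

Using `memcpy` for the bit copy avoids the undefined behavior of type-punning through casts, and compilers lower it to a plain register move.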

//

// data structures
volatile jl_bt_element_t *bt_data_prof = NULL;
Review comment (Member):
Given that allocation can be done in parallel we'll presumably need a lock around this profile data buffer for memory profiling. This is different from the performance profiling case which uses sampling and only touches the profile data buffer from the signal handler thread.

(Alternatively we could use separate memory profile data buffers per thread which could be merged at the end, at the cost of storage. Might be fine though if we never initialize that memory so that it can be backed lazily by the OS page table?)
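A minimal sketch of the lock-based option: a mutex serializing appends to the shared buffer, with overflow checked under the lock so the length is never published past the capacity. Buffer size, entry layout, and names here are illustrative, not the PR's:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PROF_BUF_LEN 1024
static uintptr_t prof_buf[PROF_BUF_LEN];
static size_t prof_len = 0;
static pthread_mutex_t prof_lock = PTHREAD_MUTEX_INITIALIZER;

/* Append one n-word entry; returns 1 on success, 0 on overflow.
 * Unlike the sampling time profiler (written only from the signal
 * handler thread), allocations race here, hence the lock. */
static int prof_append(const uintptr_t *entry, size_t n)
{
    pthread_mutex_lock(&prof_lock);
    int ok = prof_len + n <= PROF_BUF_LEN;
    if (ok) {
        for (size_t i = 0; i < n; i++)
            prof_buf[prof_len + i] = entry[i];
        prof_len += n;  /* length published only while holding the lock */
    }
    pthread_mutex_unlock(&prof_lock);
    return ok;
}
```

The per-thread-buffer alternative trades this lock for extra storage and a merge step, as the comment above notes.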

assert(bt_size_cur < bt_size_max);
}

JL_DLLEXPORT void jl_memprofile_set_typeof(void * v, void * ty) JL_NOTSAFEPOINT
Review comment (Member):
The fact that jl_memprofile_set_typeof is done separately from jl_memprofile_track_alloc makes this non-atomic with respect to any locking scheme we want to use within these functions. TBH that seems likely to lead to problems and it would be great to figure out how to make writing the allocation trace and info more atomic.

@c42f (Member) commented Nov 2, 2019

> My only reservation there is whether scanning those buffers for roots will become expensive and whether there's anything we can do about it.

Speculation: the buffer is append-only during a profiling run so I guess we shouldn't need to sweep the whole thing except during a full collection. Instead maintain an index to the first young object within the buffer, or some such thing.
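The speculated scheme reduces to keeping a watermark into the append-only buffer: a minor collection scans only entries appended since the last scan, while a full collection rescans from zero. A toy sketch of just the bookkeeping — the real version would queue each entry as a GC root; here the function only reports how many entries a collection would scan:

```c
#include <stddef.h>

static size_t first_young = 0;  /* watermark: entries below are "old" */

/* len is the current (append-only) buffer length; returns the number
 * of entries this collection has to scan. */
static size_t scan_profile_buf(size_t len, int full_sweep)
{
    size_t start = full_sweep ? 0 : first_young;
    first_young = len;  /* everything scanned is old afterwards */
    return len - start;
}
```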

staticfloat and others added 7 commits November 27, 2019 14:25
Also adds `bt_overflow` flag instead of spitting out messages in the
middle of profiling, to be used by client profiling code.

This change allows for better checking of stack frames that could be
incomplete due to insufficient backtrace buffer space.  Realistically, a
single truncated stack trace in the case of a sampling time profiler is
unlikely to create large problems.  However when taking backtraces for
things such as a memory profiler, it is critical that all backtraces be
accurate, and so we allow client code to be somewhat stricter here.
This adds C support for a memory profiler within the GC, tracking
locations of allocations, deallocations, etc.  It operates in a
similar manner to the time profiler, with single large buffers set up
beforehand through an initialization function, reducing the need for
expensive allocations while the program being measured is running.

The memory profiler instruments the GC in all locations that the GC
statistics themselves are being modified (e.g. `gc_num.allocd` and
`gc_num.freed`) by introducing new helper functions
`jl_gc_count_{allocd,freed,reallocd}()`.  Those utility functions call
the `jl_memprofile_track_{de,}alloc()` method to register an address,
a size and a tag with the memory profiler.  We also track type
information as this can be critically helpful when debugging, and to do
so without breaking API guarantees we insert methods to set the type of
a chunk of memory after allocating it where necessary.

The tagging system allows the memory profiler to disambiguate, at
profile time, between e.g. pooled allocations and the "big" allocator.
It also allows the memory allocator to support tracking multiple "memory
domains", e.g. a GPU support package could manually call
`jl_memprofile_track_alloc()` any time a chunk of memory is allocated on
the GPU so as to use the same system.  By default, all values are
tracked, however one can set a `memprof_tag_filter` value to track only
the values you are most interested in.  (E.g. only CPU domain big
allocations)
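The tag/filter scheme described above might look something like this sketch, with low bits for the memory domain and high bits for the allocator kind; the bit assignments and names here are illustrative, not the PR's actual values:

```c
#include <stdint.h>

/* Hypothetical bit layout: domain in the low byte, allocator kind in
 * the high byte of the uint16_t tag. */
#define JL_MEMPROF_TAG_DOMAIN_CPU   0x0001
#define JL_MEMPROF_TAG_DOMAIN_GPU   0x0002
#define JL_MEMPROF_TAG_ALLOC_POOL   0x0100
#define JL_MEMPROF_TAG_ALLOC_BIG    0x0200

/* Example filter: only CPU-domain "big" allocations are recorded. */
static uint16_t memprof_tag_filter =
    JL_MEMPROF_TAG_DOMAIN_CPU | JL_MEMPROF_TAG_ALLOC_BIG;

/* An event is recorded only if both its domain bit and its allocator
 * bit are enabled in the filter. */
static int tag_passes_filter(uint16_t tag)
{
    uint16_t domain = tag & 0x00ffu;
    uint16_t kind   = tag & 0xff00u;
    return (memprof_tag_filter & domain) && (memprof_tag_filter & kind);
}
```

A GPU package calling `jl_memprofile_track_alloc` with its own domain bit would then be filtered in or out by the same mechanism.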
Maybe makes it possible to get rid of set_typeof call in generated code.
@quinnj (Member) commented Jun 16, 2020

It'd be great to have this; I'm happy to help test/push this along as much as I can if no one else (who originally worked on it) can. But if someone could at least rebase, that'd be nice, since it seems there have been some backtrace changes since the changes proposed here.

@maleadt (Member, Author) commented Jun 18, 2020

Yeah sorry for leaving this PR in a bad state -- I didn't like the codegen part (ideally there shouldn't be any profiler-related code being generated all the time), and the stdlib would need some very careful redesign to not break the API of the current time profiler without just duplicating all code into a memory profiler. At that point I had found my memory leak so ran out of reasons to look at this 🙂 I can push the WIP stdlib if you're interested in reviving this though.

@c42f (Member) commented Jun 22, 2020

> I didn't like the codegen part (ideally there shouldn't be any profiler-related code being generated all the time)

Agreed. One thought I had was that perhaps we could treat this more like code coverage: just start julia in a special mode where codegen inserts memory profiling code during codegen. To be honest I don't think this is ideal — better to have memory profiling dynamically available — but it would make the whole thing a lot less intrusive to normal usage and allow us to merge something soon with less concern about performance impact.

@IanButterworth (Member) commented Mar 8, 2021

+1 to this being a great feature to have

Does anyone recall if this was in a working state? I tried to `make` on my macOS machine but it failed, as CI did.

@vilterp mentioned this pull request Sep 22, 2021
@timholy mentioned this pull request Oct 15, 2021
@vilterp mentioned this pull request Oct 22, 2021
NHDaly added a commit that referenced this pull request Jan 19, 2022
## Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: #33467
Complementary to garbage profiler PR: #42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from #33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

## Usage:

```julia
using Profile.Allocs
res = Allocs.@profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`
```

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Co-authored-by: Nathan Daly <[email protected]>
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
@ViralBShah (Member) commented:
Given the recent memory profiling work - should we close this?

@maleadt closed this Feb 25, 2022
@giordano deleted the tb/memprofiler branch February 26, 2022 02:54
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022