add time-series metrics for memory consumption by proxy data structures #6473

Closed
tdyas opened this issue Jul 12, 2021 · 3 comments

tdyas commented Jul 12, 2021

#6441 and #6066 both report linkerd-proxy being OOM-killed for using too much memory. To the extent it isn't already tracked, it would be useful to add time-series metrics that track the size of various data structures in the proxy sidecar; capturing these metrics could aid in debugging the linked issues.

I am willing to implement such metrics, but I need help figuring out which places in the proxy would be most useful to instrument; I can then put together a PR. Specifically:

  1. Which data structures contribute to ongoing proxy memory usage?
  2. Which files should I be focusing on?

tdyas commented Jul 12, 2021

cc @cpretzer


hawkw commented Jul 13, 2021

The proxy doesn't have many global data structures in the traditional sense; we don't have a single global hash map of service discovery destinations, for example. Instead, most memory allocations are either per-service or per-request.

Currently, we have metrics that track the number of services that have been built in different parts of the proxy (stack_create_total) and the number of those services that have been deallocated (stack_drop_total). Although these metrics don't directly track the memory allocated for those services, the difference between stack_create_total and stack_drop_total for a given label set is the number of stacks currently live in that part of the proxy, so they can be used to find memory leaks: if the number of live stacks for a particular part of the stack grows without bound over time, that part is leaking.
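As an illustration of the pattern (not the proxy's actual implementation), a paired create/drop counter might look like the following sketch; StackMetrics and StackHandle are hypothetical names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Illustrative pair of monotonic counters, analogous to
/// stack_create_total / stack_drop_total.
#[derive(Default)]
struct StackMetrics {
    creates: AtomicU64,
    drops: AtomicU64,
}

/// A handle that bumps the create counter when a stack is built and
/// the drop counter when it is torn down, so the difference between
/// the two counters is the number of live stacks.
struct StackHandle {
    metrics: Arc<StackMetrics>,
}

impl StackHandle {
    fn new(metrics: Arc<StackMetrics>) -> Self {
        metrics.creates.fetch_add(1, Ordering::Relaxed);
        StackHandle { metrics }
    }
}

impl Drop for StackHandle {
    fn drop(&mut self) {
        self.metrics.drops.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let metrics = Arc::new(StackMetrics::default());
    let live = StackHandle::new(metrics.clone());
    {
        let _scoped = StackHandle::new(metrics.clone());
    } // _scoped dropped here; drops becomes 1

    let created = metrics.creates.load(Ordering::Relaxed);
    let dropped = metrics.drops.load(Ordering::Relaxed);
    println!("live stacks: {}", created - dropped); // prints 1
    drop(live);
}
```

In Prometheus terms, plotting stack_create_total - stack_drop_total per label set over time is the leak check described above: a healthy stack hovers around a steady value, while a leak climbs without bound.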

Another thing we don't currently have a metric for, but that would definitely be helpful, is recording the number of asynchronous tasks spawned on the proxy's tokio runtime. If a proxy is using large amounts of memory, this is frequently due to a large number of async tasks being spawned, potentially waiting to execute. However, to accurately expose a metric for this, the best approach is to instrument the runtime directly. The Tokio project is working on exposing stats from the runtime (see tokio-rs/tokio#3845), so when that work starts to land upstream, we can expose those metrics in the proxy's Prometheus metrics as well.
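In the meantime, a rough application-level approximation is to wrap each future before spawning it, so that a shared gauge rises on spawn and falls on completion or cancellation. A minimal sketch, assuming the tokio crate (rt, macros, and time features); TaskCount and instrument are illustrative names, not proxy or Tokio APIs:

```rust
use std::future::Future;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Tracks how many instrumented tasks are currently alive.
#[derive(Clone, Default)]
struct TaskCount(Arc<AtomicUsize>);

impl TaskCount {
    /// Wraps a future so the count rises when the task is created and
    /// falls when it completes (or is dropped mid-flight).
    fn instrument<F: Future>(&self, fut: F) -> impl Future<Output = F::Output> {
        struct Guard(Arc<AtomicUsize>);
        impl Drop for Guard {
            fn drop(&mut self) {
                self.0.fetch_sub(1, Ordering::Relaxed);
            }
        }
        self.0.fetch_add(1, Ordering::Relaxed);
        let guard = Guard(self.0.clone());
        async move {
            let out = fut.await;
            drop(guard); // decrement happens here, or on cancellation
            out
        }
    }

    fn current(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

#[tokio::main]
async fn main() {
    let tasks = TaskCount::default();
    for _ in 0..3 {
        tokio::spawn(tasks.instrument(async {
            tokio::time::sleep(std::time::Duration::from_millis(50)).await;
        }));
    }
    println!("tasks in flight: {}", tasks.current()); // 3
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    println!("tasks in flight: {}", tasks.current()); // 0
}
```

This only sees tasks that are explicitly wrapped, which is why instrumenting the runtime itself, as in tokio-rs/tokio#3845, is the better long-term answer.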

Beyond that, there are a few other things that could be worth doing, but they're likely lower-impact than using the existing stack metrics and adding a count of currently active tasks.

We could instrument the proxy's buffers to track queue depth, which would tell us how many requests are waiting in a queue to be sent to a particular service. However, because the buffer queues are bounded, and requests that sit in a queue too long are timed out, these queues shouldn't cause unbounded memory growth.

We could also potentially enhance the existing stack metrics to record the actual memory use of those services rather than just counts. This could take some work, though: many of these services own pointers to heap-allocated data, so we would need to recursively traverse any such types and sum the sizes of their children as well. It turns out that determining the amount of heap memory used by an object is surprisingly difficult in a language without a large runtime that manages allocations. While there is a Rust library for doing this, it isn't actively maintained and appears not to work correctly for reference-counted pointers (Arc and Rc), which the proxy uses pretty frequently. So calculating accurate memory usage stats, rather than just counting objects, may actually be fairly challenging.
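To make the reference-counting problem concrete, here is a hypothetical sketch of such a recursive heap-size trait; HeapSize and its impls are illustrative, not an existing proxy or library API:

```rust
use std::mem;
use std::sync::Arc;

/// Hypothetical trait estimating the heap memory a value owns beyond
/// its inline size_of footprint. A real version would need an impl for
/// every field type, recursing through the whole object graph.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity()
    }
}

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        self.capacity() * mem::size_of::<T>()
            + self.iter().map(HeapSize::heap_size).sum::<usize>()
    }
}

// The hard case: a naive impl for Arc<T> charges the shared allocation
// to every clone, overstating usage; skipping it entirely understates
// usage instead.
impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        mem::size_of::<T>() + (**self).heap_size()
    }
}

fn main() {
    let shared = Arc::new(String::from("a shared buffer"));
    let (a, b) = (shared.clone(), shared.clone());
    // Three handles to one 15-byte buffer: the naive sum counts the
    // same allocation three times.
    let naive: usize = [&shared, &a, &b].iter().map(|s| s.heap_size()).sum();
    println!("naive: {} bytes, actual heap: ~{}", naive, shared.capacity());
}
```

Attributing shared memory correctly requires tracking allocation identity (e.g. a visited set keyed by pointer address), which is where most of the difficulty lies.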


tdyas commented Sep 6, 2021

Closing as I'm not going to have the time to work on this any more.

tdyas closed this as completed Sep 6, 2021
github-actions bot locked as resolved and limited conversation to collaborators Oct 7, 2021