add time-series metrics for memory consumption by proxy data structures #6473

Closed
tdyas opened this issue Jul 12, 2021 · 3 comments

tdyas commented Jul 12, 2021

#6441 and #6066 both report linkerd-proxy being OOM-killed for using too much memory. To the extent it isn't already tracked, it would be useful to add time-series metrics that track the size of various data structures in the proxy sidecar; capturing these metrics could aid in debugging the linked issues.

I am willing to implement such metrics, but I need help figuring out which places in the proxy would be most useful to instrument; I can then put together a PR. Specifically:

  1. Which data structures contribute to ongoing proxy memory usage?
  2. Which files should I be focusing on?

tdyas commented Jul 12, 2021

cc @cpretzer


hawkw commented Jul 13, 2021

The proxy doesn't have many global data structures in the traditional sense; we don't have a single global hash map of service discovery destinations, for example. Instead, most memory allocations are either per-service or per-request.

Currently, we have metrics that track the number of services that have been built in different parts of the proxy (stack_create_total) and the number of those services that have been deallocated (stack_drop_total). Although these metrics don't directly track the memory allocated for those services, the difference between stack_create_total and stack_drop_total for a given label set is the number of stacks currently live in that part of the proxy, so they can be used to find memory leaks: if the number of live stacks for a particular part of the stack grows without bound over time, that part is leaking.
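As an illustration of the pattern (not the proxy's actual implementation), a paired create/drop counter might look like the following sketch; StackMetrics and StackHandle are hypothetical names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Illustrative pair of monotonic counters, analogous to
/// stack_create_total / stack_drop_total.
#[derive(Default)]
struct StackMetrics {
    creates: AtomicU64,
    drops: AtomicU64,
}

/// A handle that bumps the create counter when a stack is built and
/// the drop counter when it is torn down, so the difference between
/// the two counters is the number of live stacks.
struct StackHandle {
    metrics: Arc<StackMetrics>,
}

impl StackHandle {
    fn new(metrics: Arc<StackMetrics>) -> Self {
        metrics.creates.fetch_add(1, Ordering::Relaxed);
        StackHandle { metrics }
    }
}

impl Drop for StackHandle {
    fn drop(&mut self) {
        self.metrics.drops.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let metrics = Arc::new(StackMetrics::default());
    let live = StackHandle::new(metrics.clone());
    {
        let _scoped = StackHandle::new(metrics.clone());
    } // _scoped dropped here; drops becomes 1

    let created = metrics.creates.load(Ordering::Relaxed);
    let dropped = metrics.drops.load(Ordering::Relaxed);
    println!("live stacks: {}", created - dropped); // prints 1
    drop(live);
}
```

In Prometheus terms, plotting stack_create_total - stack_drop_total per label set over time is the leak check described above: a healthy stack hovers around a steady value, while a leak climbs without bound.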

Another thing we don't currently have a metric for, but that would definitely be helpful, is recording the number of asynchronous tasks spawned on the proxy's tokio runtime. If a proxy is using large amounts of memory, this is frequently due to a large number of async tasks being spawned, potentially waiting to execute. However, to accurately expose a metric for this, the best approach is to instrument the runtime directly. The Tokio project is working on exposing stats from the runtime (see tokio-rs/tokio#3845), so when that work starts to land upstream, we can expose those metrics in the proxy's Prometheus metrics as well.
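In the meantime, a rough application-level approximation is to wrap each future before spawning it, so that a shared gauge rises on spawn and falls on completion or cancellation. A minimal sketch, assuming the tokio crate (rt, macros, and time features); TaskCount and instrument are illustrative names, not proxy or Tokio APIs:

```rust
use std::future::Future;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Tracks how many instrumented tasks are currently alive.
#[derive(Clone, Default)]
struct TaskCount(Arc<AtomicUsize>);

impl TaskCount {
    /// Wraps a future so the count rises when the task is created and
    /// falls when it completes (or is dropped mid-flight).
    fn instrument<F: Future>(&self, fut: F) -> impl Future<Output = F::Output> {
        struct Guard(Arc<AtomicUsize>);
        impl Drop for Guard {
            fn drop(&mut self) {
                self.0.fetch_sub(1, Ordering::Relaxed);
            }
        }
        self.0.fetch_add(1, Ordering::Relaxed);
        let guard = Guard(self.0.clone());
        async move {
            let out = fut.await;
            drop(guard); // decrement happens here, or on cancellation
            out
        }
    }

    fn current(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

#[tokio::main]
async fn main() {
    let tasks = TaskCount::default();
    for _ in 0..3 {
        tokio::spawn(tasks.instrument(async {
            tokio::time::sleep(std::time::Duration::from_millis(50)).await;
        }));
    }
    println!("tasks in flight: {}", tasks.current()); // 3
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    println!("tasks in flight: {}", tasks.current()); // 0
}
```

This only sees tasks that are explicitly wrapped, which is why instrumenting the runtime itself, as in tokio-rs/tokio#3845, is the better long-term answer.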

Beyond that, there are a few other things that could be worth doing, but they're likely lower-impact than using the existing stack metrics and adding a count of currently active tasks.

We could instrument the proxy's buffers to track queue depth, which would tell us how many requests are waiting in a queue to be sent to a particular service. However, because the buffer queues are bounded, and requests that sit in a queue too long are timed out, these queues shouldn't cause unbounded memory growth.

We could also potentially enhance the existing stack metrics to record the actual memory use of those services rather than just counts. This could take some work, though: many of these services own pointers to heap-allocated data, so we would need to recursively traverse any such types and sum the sizes of their children as well. It turns out that determining the amount of heap memory used by an object is surprisingly difficult in a language without a large runtime that manages allocations. While there is a Rust library for doing this, it isn't actively maintained and appears not to work correctly for reference-counted pointers (Arc and Rc), which the proxy uses pretty frequently. So calculating accurate memory usage stats, rather than just counting objects, may actually be fairly challenging.
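To make the reference-counting problem concrete, here is a hypothetical sketch of such a recursive heap-size trait; HeapSize and its impls are illustrative, not an existing proxy or library API:

```rust
use std::mem;
use std::sync::Arc;

/// Hypothetical trait estimating the heap memory a value owns beyond
/// its inline size_of footprint. A real version would need an impl for
/// every field type, recursing through the whole object graph.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity()
    }
}

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        self.capacity() * mem::size_of::<T>()
            + self.iter().map(HeapSize::heap_size).sum::<usize>()
    }
}

// The hard case: a naive impl for Arc<T> charges the shared allocation
// to every clone, overstating usage; skipping it entirely understates
// usage instead.
impl<T: HeapSize> HeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        mem::size_of::<T>() + (**self).heap_size()
    }
}

fn main() {
    let shared = Arc::new(String::from("a shared buffer"));
    let (a, b) = (shared.clone(), shared.clone());
    // Three handles to one 15-byte buffer: the naive sum counts the
    // same allocation three times.
    let naive: usize = [&shared, &a, &b].iter().map(|s| s.heap_size()).sum();
    println!("naive: {} bytes, actual heap: ~{}", naive, shared.capacity());
}
```

Attributing shared memory correctly requires tracking allocation identity (e.g. a visited set keyed by pointer address), which is where most of the difficulty lies.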


tdyas commented Sep 6, 2021

Closing as I'm not going to have the time to work on this any more.

tdyas closed this as completed Sep 6, 2021
github-actions bot locked as resolved and limited conversation to collaborators Oct 7, 2021