
LoadBalancer Exporter Does Not Release Memory When Using StreamIDs for Metrics #35810

Open
nicacioliveira opened this issue Oct 15, 2024 · 6 comments
Labels
bug (Something isn't working), exporter/loadbalancing, needs triage (New item requiring triage)

Comments

@nicacioliveira

Component(s)

exporter/loadbalancing

What happened?

Description

I’m facing an issue with high cardinality, and I’ve noticed that we need a max_stale mechanism similar to the one used in the delta-to-cumulative processor. This is because metrics with new streamIDs continue to accumulate over time, causing LoadBalancer instances to consume memory indefinitely.
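For reference, the deltatocumulative processor expires streams that have been idle for longer than its max_stale setting. A minimal sketch of that kind of staleness eviction, with hypothetical names (not the processor's actual code), would be:

```go
package staleness

import (
	"sync"
	"time"
)

// Tracker drops entries that have not been touched within maxStale.
// Illustrative sketch only; names and structure are hypothetical, not the
// deltatocumulative processor's actual implementation.
type Tracker struct {
	mu       sync.Mutex
	maxStale time.Duration
	lastSeen map[string]time.Time // keyed by stream ID
}

func NewTracker(maxStale time.Duration) *Tracker {
	return &Tracker{maxStale: maxStale, lastSeen: make(map[string]time.Time)}
}

// Touch records that a stream was seen just now.
func (t *Tracker) Touch(streamID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[streamID] = time.Now()
}

// Expire deletes streams that have been idle for longer than maxStale,
// releasing the memory associated with their entries.
func (t *Tracker) Expire() {
	t.mu.Lock()
	defer t.mu.Unlock()
	cutoff := time.Now().Add(-t.maxStale)
	for id, seen := range t.lastSeen {
		if seen.Before(cutoff) {
			delete(t.lastSeen, id)
		}
	}
}
```

Something along those lines is the kind of eviction I think the load-balancing exporter needs.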

Steps to Reproduce

I don’t have a specific way to reproduce this issue in a controlled environment, as it occurs in production. To manage it, I have to constantly restart the load-balancing pods to prevent memory exhaustion.

Evidence:
To mitigate the issue, I’ve set a minimum of 25 pods, but after a few hours, memory becomes exhausted due to the lack of a max_stale mechanism. After several days, I’m forced to perform a full rollout to reset all the pods.

[image: memory usage of the load-balancing pods]

Collector version

v0.110.0

Environment information

Environment

Kubernetes cluster on EKS

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@nicacioliveira added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Oct 15, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@nicacioliveira changed the title from "LoadBalancing exporter not refresh memory when using streamID for metrics" to "LoadBalancer Exporter Does Not Release Memory When Using StreamIDs for Metrics" on Oct 15, 2024
@atoulme
Contributor

atoulme commented Oct 16, 2024

Please consider taking a heap dump with pprof so we can investigate what is happening.
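If the pprof extension is enabled in the collector (it listens on localhost:1777 by default; adjust if your setup differs), the heap profile can be pulled over HTTP, either with `curl http://localhost:1777/debug/pprof/heap -o heap.pprof` or with a small helper like this sketch:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Pulls a heap profile from the collector's pprof extension (default
// endpoint localhost:1777; adjust if yours differs) and writes it to
// heap.pprof for offline analysis with `go tool pprof`.
func main() {
	resp, err := http.Get("http://localhost:1777/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote heap.pprof; inspect with: go tool pprof heap.pprof")
}
```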

@nicacioliveira
Author

nicacioliveira commented Oct 17, 2024

> Please consider taking a heap dump with pprof so we can investigate what is happening.

I'll take some time to do this, but it seems clear: with streamID-based routing, we will eventually run out of memory because streams are kept in memory forever. The load balancer needs a "refresh" mechanism like the one in the delta-to-cumulative processor.

@dehaansa
Contributor

I've been investigating several memory-related issues in the exporter, and would very much appreciate a pprof heap dump if you can collect one! Just looking through the code, there should not be streamID-based memory growth, as the objects containing stream IDs appear to live only for the duration of one ConsumeMetrics call.

@dehaansa
Contributor

Possibly affected by open-telemetry/opentelemetry-collector#11745

jpkrohling pushed a commit that referenced this issue Nov 27, 2024
…e update events (#36505)

#### Description
The load balancing exporter's k8sresolver was not handling update events
properly. The `callback` function was being executed after cleanup of
old endpoints and also after adding new endpoints. This causes exporter
churn in the case of an event in which the lists contain shared
elements. See the
[documentation](https://pkg.go.dev/k8s.io/client-go/tools/cache#ResourceEventHandler)
for examples where the state might change but the IP Addresses would
not, including the regular re-list events that might have zero changes.

#### Link to tracking issue
Fixes #35658. May be related to #35810 as well.

#### Testing
Added tests for no-change onChange call.

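The gist of that change: trigger the resolver callback only once per update event, and only when the endpoint set actually changed. A simplified sketch of the idea (illustrative names, not the resolver's real code):

```go
package main

import (
	"fmt"
	"sort"
)

// onUpdate compares old and new endpoint lists and triggers the callback
// only when the set of endpoints actually changed, avoiding exporter churn
// on no-op re-list events. Illustrative sketch; not the resolver's real code.
func onUpdate(oldEndpoints, newEndpoints []string, callback func([]string)) {
	if equalSets(oldEndpoints, newEndpoints) {
		return // nothing changed; don't rebuild exporters
	}
	callback(newEndpoints) // single callback after the endpoint list is final
}

func equalSets(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	ac, bc := append([]string(nil), a...), append([]string(nil), b...)
	sort.Strings(ac)
	sort.Strings(bc)
	for i := range ac {
		if ac[i] != bc[i] {
			return false
		}
	}
	return true
}

func main() {
	cb := func(eps []string) { fmt.Println("rebuilding ring with", eps) }
	onUpdate([]string{"10.0.0.1", "10.0.0.2"}, []string{"10.0.0.2", "10.0.0.1"}, cb) // no change, no rebuild
	onUpdate([]string{"10.0.0.1"}, []string{"10.0.0.1", "10.0.0.3"}, cb)             // rebuild once
}
```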
shivanthzen pushed a commit to shivanthzen/opentelemetry-collector-contrib that referenced this issue Dec 5, 2024
…e update events (open-telemetry#36505)

ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this issue Dec 6, 2024
…e update events (open-telemetry#36505)

@jpkrohling
Member

> This is because metrics with new streamIDs continue to accumulate over time, causing LoadBalancer instances to consume memory indefinitely.

I believe we treat stream IDs like trace IDs: they are taken as input to the hashing algorithm but not kept in memory after the Consume operation has finished. I took a quick look at the code, and I have the impression that this is indeed what's happening (splitMetricsByStreamID uses the stream ID as the key for the list of batches, and exporterAndEndpoint uses that key to make the routing decision). But perhaps I'm missing something, and I would appreciate a heap dump, as requested by other folks.
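To make the expected lifecycle concrete, here's a simplified sketch of stream-ID-based routing in which the ID serves only as a transient hash key inside a single consume call (illustrative only, not the exporter's actual splitMetricsByStreamID/exporterAndEndpoint code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeByStreamID illustrates the expected lifecycle: stream identities are
// used only as transient hash keys while one batch is being split, so nothing
// keyed by stream ID needs to outlive the call. Simplified sketch, not the
// exporter's actual code.
func routeByStreamID(streamIDs, endpoints []string) map[string][]string {
	batches := make(map[string][]string) // endpoint -> stream IDs routed there
	for _, id := range streamIDs {
		h := fnv.New32a()
		h.Write([]byte(id))
		endpoint := endpoints[int(h.Sum32()%uint32(len(endpoints)))]
		batches[endpoint] = append(batches[endpoint], id)
	}
	return batches // local map, eligible for GC once the caller is done with it
}

func main() {
	endpoints := []string{"backend-0:4317", "backend-1:4317"}
	fmt.Println(routeByStreamID([]string{"stream-a", "stream-b", "stream-c"}, endpoints))
}
```

If the real code follows this pattern, memory growth would have to come from somewhere else, which is why a heap dump would help.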
