LoadBalancer Exporter Does Not Release Memory When Using StreamIDs for Metrics #35810
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Please consider taking a heap dump with pprof so we can investigate what is happening.
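For anyone who wants to provide such a dump: one way to collect it is to enable the collector's pprof extension and pull the heap profile over HTTP. The configuration below is a minimal sketch; the endpoint and port are illustrative defaults, so adjust them to your deployment.

```yaml
extensions:
  pprof:
    endpoint: 0.0.0.0:1777   # expose the Go pprof HTTP handlers on this address

service:
  extensions: [pprof]
```

The profile can then be fetched with something like `go tool pprof http://<collector-host>:1777/debug/pprof/heap` or saved with `curl -o heap.pprof http://<collector-host>:1777/debug/pprof/heap`.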
I'll take some time to do this, but it seems clear: with streamID-based routing, we will eventually run out of memory because we keep streams in memory forever. The balancing here needs a "refresh" mechanism, similar to the one in the delta-to-cumulative processor.
I've been investigating several memory-based issues in the exporter and would very much appreciate a pprof heap dump if you can collect one! Just looking through the code, there should not be streamID-based memory growth, as the objects containing stream IDs appear to live only for the duration of one ConsumeMetrics call.
Possibly affected by open-telemetry/opentelemetry-collector#11745
…e update events (#36505)

#### Description
The load balancing exporter's k8sresolver was not handling update events properly. The `callback` function was being executed after cleanup of old endpoints and also after adding new endpoints. This causes exporter churn when the old and new endpoint lists contain shared elements. See the [documentation](https://pkg.go.dev/k8s.io/client-go/tools/cache#ResourceEventHandler) for examples where the state might change but the IP addresses would not, including the regular re-list events that might have zero changes.

#### Link to tracking issue
Fixes #35658. May be related to #35810 as well.

#### Testing
Added tests for a no-change onChange call.
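A minimal sketch of the idea behind that fix, assuming the resolver only needs to rebuild the exporter ring when the resolved endpoint set actually changes (the function names and types below are illustrative, not the actual k8sresolver code):

```go
package main

import "fmt"

// endpointsEqual reports whether two resolved endpoint lists contain the same
// addresses, ignoring order. Illustrative only; the real resolver keeps its
// own internal representation.
func endpointsEqual(oldEndpoints, newEndpoints []string) bool {
	if len(oldEndpoints) != len(newEndpoints) {
		return false
	}
	seen := make(map[string]struct{}, len(oldEndpoints))
	for _, ep := range oldEndpoints {
		seen[ep] = struct{}{}
	}
	for _, ep := range newEndpoints {
		if _, ok := seen[ep]; !ok {
			return false
		}
	}
	return true
}

// onUpdate mimics a ResourceEventHandler update hook: it skips the callback
// (and therefore the exporter rebuild) when a re-list event carries no
// address changes.
func onUpdate(oldEndpoints, newEndpoints []string, callback func([]string)) {
	if endpointsEqual(oldEndpoints, newEndpoints) {
		return // no churn on no-op updates
	}
	callback(newEndpoints)
}

func main() {
	cb := func(eps []string) { fmt.Println("rebuilding ring for:", eps) }
	onUpdate([]string{"10.0.0.1", "10.0.0.2"}, []string{"10.0.0.2", "10.0.0.1"}, cb) // no output
	onUpdate([]string{"10.0.0.1"}, []string{"10.0.0.1", "10.0.0.3"}, cb)             // triggers rebuild
}
```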
I believe we treat stream IDs like trace IDs: they are taken as input for the hashing algorithm but not kept in memory after the Consume operation has finished. I took a quick look at the code, and I have the impression that this is indeed what's happening (splitMetricsByStreamID uses the stream ID as the key for the list of batches, and exporterAndEndpoint uses that key to make the routing decision). But perhaps I'm missing something, and I would appreciate a heap dump as requested by other folks.
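For readers unfamiliar with the mechanism being described, here is a minimal sketch of routing by stream ID as described in this comment. The function and type names are illustrative, not the exporter's actual code (which uses a consistent-hashing ring); the point is that the map keyed by stream identity is local to the call, so nothing should accumulate across ConsumeMetrics invocations:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeByStreamID groups data points by a stream identity string and assigns
// each group to one of the given endpoints via a hash. The map is a local
// variable, so it is garbage-collected when the function returns; no
// per-stream state survives the call. Illustrative only.
func routeByStreamID(streamIDs []string, endpoints []string) map[string][]string {
	batches := make(map[string][]string) // endpoint -> stream IDs routed to it
	for _, id := range streamIDs {
		h := fnv.New32a()
		h.Write([]byte(id))
		endpoint := endpoints[int(h.Sum32())%len(endpoints)]
		batches[endpoint] = append(batches[endpoint], id)
	}
	return batches
}

func main() {
	streams := []string{"cpu{host=a}", "cpu{host=b}", "mem{host=a}"}
	backends := []string{"collector-0:4317", "collector-1:4317"}
	for endpoint, ids := range routeByStreamID(streams, backends) {
		fmt.Println(endpoint, "<-", ids)
	}
}
```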
Component(s)
exporter/loadbalancing
What happened?
Description
I’m facing an issue with high cardinality, and I’ve noticed that we need a max_stale mechanism, similar to the one used in the delta-to-cumulative processor. Metrics with new streamIDs keep arriving over time, so LoadBalancer instances consume more and more memory indefinitely.
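For reference, the staleness mechanism being referred to looks roughly like this in the delta-to-cumulative processor's configuration (the 5m value is only an example); the ask here is for an equivalent expiry of per-stream routing state in the load-balancing exporter, if such state indeed exists:

```yaml
processors:
  deltatocumulative:
    # streams that receive no new samples within this window are dropped,
    # which bounds the processor's memory use
    max_stale: 5m
```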
Steps to Reproduce
I don’t have a specific way to reproduce this issue in a controlled environment, as it occurs in production. To manage it, I have to constantly restart the load-balancing pods to prevent memory exhaustion.
Evidence:
To mitigate the issue, I’ve set a minimum of 25 pods, but after a few hours, memory becomes exhausted due to the lack of a max_stale mechanism. After several days, I’m forced to perform a full rollout to reset all the pods.
Collector version
v0.110.0
Environment information
Environment
Kubernetes cluster on EKS
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response