LoadBalancer Exporter Does Not Release Memory When Using StreamIDs for Metrics #35810
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Please consider taking a heap dump with pprof so we can investigate what is happening.
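For anyone who wants to provide such a dump: one way to collect it is to enable the collector's pprof extension and pull the heap profile over HTTP. The configuration below is a minimal sketch; the endpoint and port are illustrative defaults, so adjust them to your deployment.

```yaml
extensions:
  pprof:
    endpoint: 0.0.0.0:1777   # expose the Go pprof HTTP handlers on this address

service:
  extensions: [pprof]
```

The profile can then be fetched with something like `go tool pprof http://<collector-host>:1777/debug/pprof/heap` or saved with `curl -o heap.pprof http://<collector-host>:1777/debug/pprof/heap`.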
I'll take some time to do this, but it seems clear: with streamID-based routing, we will eventually run out of memory because we keep streams in memory forever. The balancing here needs a "refresh" mechanism, similar to the one in the delta-to-cumulative processor.
I've been investigating several memory-based issues in the exporter and would very much appreciate a pprof heap dump if you can collect one! Just looking through the code, there should not be streamID-based memory growth, as the objects containing stream IDs appear to live only for the duration of one ConsumeMetrics call.
Possibly affected by open-telemetry/opentelemetry-collector#11745
…e update events (#36505)

#### Description
The load balancing exporter's k8sresolver was not handling update events properly. The `callback` function was being executed after cleanup of old endpoints and also after adding new endpoints. This causes exporter churn when the old and new endpoint lists contain shared elements. See the [documentation](https://pkg.go.dev/k8s.io/client-go/tools/cache#ResourceEventHandler) for examples where the state might change but the IP addresses would not, including the regular re-list events that might have zero changes.

#### Link to tracking issue
Fixes #35658. May be related to #35810 as well.

#### Testing
Added tests for a no-change onChange call.
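A minimal sketch of the idea behind that fix, assuming the resolver only needs to rebuild the exporter ring when the resolved endpoint set actually changes (the function names and types below are illustrative, not the actual k8sresolver code):

```go
package main

import "fmt"

// endpointsEqual reports whether two resolved endpoint lists contain the same
// addresses, ignoring order. Illustrative only; the real resolver keeps its
// own internal representation.
func endpointsEqual(oldEndpoints, newEndpoints []string) bool {
	if len(oldEndpoints) != len(newEndpoints) {
		return false
	}
	seen := make(map[string]struct{}, len(oldEndpoints))
	for _, ep := range oldEndpoints {
		seen[ep] = struct{}{}
	}
	for _, ep := range newEndpoints {
		if _, ok := seen[ep]; !ok {
			return false
		}
	}
	return true
}

// onUpdate mimics a ResourceEventHandler update hook: it skips the callback
// (and therefore the exporter rebuild) when a re-list event carries no
// address changes.
func onUpdate(oldEndpoints, newEndpoints []string, callback func([]string)) {
	if endpointsEqual(oldEndpoints, newEndpoints) {
		return // no churn on no-op updates
	}
	callback(newEndpoints)
}

func main() {
	cb := func(eps []string) { fmt.Println("rebuilding ring for:", eps) }
	onUpdate([]string{"10.0.0.1", "10.0.0.2"}, []string{"10.0.0.2", "10.0.0.1"}, cb) // no output
	onUpdate([]string{"10.0.0.1"}, []string{"10.0.0.1", "10.0.0.3"}, cb)             // triggers rebuild
}
```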
I believe we treat stream IDs like trace IDs: they are taken as input for the hashing algorithm but not kept in memory after the Consume operation has finished. I took a quick look at the code, and I have the impression that this is indeed what's happening (splitMetricsByStreamID uses the stream ID as the key for the list of batches, and exporterAndEndpoint uses that key to make the routing decision). But perhaps I'm missing something, and I would appreciate a heap dump as requested by other folks.
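For readers unfamiliar with the mechanism being described, here is a minimal sketch of routing by stream ID as described in this comment. The function and type names are illustrative, not the exporter's actual code (which uses a consistent-hashing ring); the point is that the map keyed by stream identity is local to the call, so nothing should accumulate across ConsumeMetrics invocations:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeByStreamID groups data points by a stream identity string and assigns
// each group to one of the given endpoints via a hash. The map is a local
// variable, so it is garbage-collected when the function returns; no
// per-stream state survives the call. Illustrative only.
func routeByStreamID(streamIDs []string, endpoints []string) map[string][]string {
	batches := make(map[string][]string) // endpoint -> stream IDs routed to it
	for _, id := range streamIDs {
		h := fnv.New32a()
		h.Write([]byte(id))
		endpoint := endpoints[int(h.Sum32())%len(endpoints)]
		batches[endpoint] = append(batches[endpoint], id)
	}
	return batches
}

func main() {
	streams := []string{"cpu{host=a}", "cpu{host=b}", "mem{host=a}"}
	backends := []string{"collector-0:4317", "collector-1:4317"}
	for endpoint, ids := range routeByStreamID(streams, backends) {
		fmt.Println(endpoint, "<-", ids)
	}
}
```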
Component(s)
exporter/loadbalancing
What happened?
Description
I’m facing an issue with high cardinality, and I’ve noticed that we need a max_stale mechanism, similar to the one used in the delta-to-cumulative processor. Metrics with new streamIDs keep arriving over time, so LoadBalancer instances consume more and more memory indefinitely.
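For reference, the staleness mechanism being referred to looks roughly like this in the delta-to-cumulative processor's configuration (the 5m value is only an example); the ask here is for an equivalent expiry of per-stream routing state in the load-balancing exporter, if such state indeed exists:

```yaml
processors:
  deltatocumulative:
    # streams that receive no new samples within this window are dropped,
    # which bounds the processor's memory use
    max_stale: 5m
```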
Steps to Reproduce
I don’t have a specific way to reproduce this issue in a controlled environment, as it occurs in production. To manage it, I have to constantly restart the load-balancing pods to prevent memory exhaustion.
Evidence:
To mitigate the issue, I’ve set a minimum of 25 pods, but after a few hours, memory becomes exhausted due to the lack of a max_stale mechanism. After several days, I’m forced to perform a full rollout to reset all the pods.
Collector version
v0.110.0
Environment information
Environment
Kubernetes cluster on EKS
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response