[exporter/loadbalancing] Add top level sending_queue, retry_on_failure and timeout settings #36094

Fiery-Fenix · 2024-10-30T16:32:01Z

Description

Problem statement

loadbalancing exporter is actually a wrapper that's creates and manages set of actual otlp exporters
Those otlp exporters technically shares same configuration parameters that are defined on loadbalancing exporter level, including sending_queue configuration. The only difference is endpoint parameter that are substituted by loadbalancing exporter itself
This means, that sending_queue, retry_on_failure and timeout settings can be defined only on otlp sub-exporters, while top-level loadbalancing exporter is missing all those settings
This configuration approach produces several issue, that are already reported by users:

Impossibility to use Persistent Queue in loadbalancing exporter (see [loadbalancing-exporter] enabling exporterhelper persistent queue breaks loadbalancer exporting #16826). That's happens because otlp sub-exporters are sharing the same configurations, including configuration of the queue, i.e. they all are using the same storage instance at the same time which is not possible at the moment
Data loss even using sending_queue configuration (see [exporter/loadbalancing] Data loss occurring when change in downstream DNS records or pods #35378). That's happens because Queue is defined on level of otlp sub-exporters and if this exporter cannot flush data from queue (for example, endpoint is not available anymore) there is no other options that just to discard data from queue, i.e. there is no higher level queue and persistent storage where data can be returned is case of permanent failure

There might be some other potential issue that was already tracked and related to current configuration approach

Proposed solution

The easiest way to solve issues above - is to use standard approach for queue, retry and timeout configuration using exporterhelper
This will bring queue, retry and timeout functionality to the top-level of loadbalancing exporter, instead of otlp sub-exporters
Related to mentioned issues it will bring:

Single Persistent Queue, that is used by all otlp sub-exporters (not directly of course)
Queue will not be discarded/destroyed if any (or all) of endpoint that are unreachable anymore, top-level queue will keep data until new endpoints will be available
Scale-up and scale-down event for next layer of OpenTelemetry Collectors in K8s environments will be more predictable, and will not include data loss anymore (potential fix for [exporter/loadbalancing] Support consistency between scale-out events #33959). There is still a big chance of inconsistency when some data will be send to incorrect endpoint, but it's already better state that we have right now

Noticeable changes

loadbalancing exporter on top-level now uses exporterhelper with all supported functionality by it
sending_queue will be automatically disabled on otlp exporters when it already present on top-level loadbalancing exporter. This change is done to prevent data loss on otlp exporters because queue there doesn't provide expected result. Also it will prevent potential misconfiguration from user side and as result - irrelevant reported issues
exporter attribute for metrics generated from otlp sub-exporters now includes endpoint for better visibility and to segregate them from top-level loadbalancing exporter - was "exporter": "loadbalancing", now "exporter": "loadbalancing/127.0.0.1:4317"
logs, generated by otlp sub-exporters now includes additional attribute endpoint with endpoint value with the same reasons as for metrics

Link to tracking issue

Fixes #35378
Fixes #16826

Testing

Proposed changes was heavily tested on large K8s environment with set of different scale-up/scale-down event using persistent queue configuration - no data loss were detected, everything works as expected

Documentation

README.md was updated to reflect new configuration parameters available. Sample config.yaml was updated as well

linux-foundation-easycla · 2024-10-30T16:32:06Z

The committers listed above are authorized under a signed CLA.

✅ login: Fiery-Fenix / name: Vyacheslav Stepanov (be4bcab, a478d22, 2a5fea8, 9a9cac2, 64cb243, 37f9c25, eb89107, d8031ca, 3296a9c)

MovieStoreGuy

Looks like it will need make fmt to be run to fix up a couple lines.

jamesmoessis

This looks great, the approach is sound. Just a couple of talking points below. Thankyou for contributing this!

jamesmoessis · 2024-11-01T03:26:55Z

exporter/loadbalancingexporter/factory.go

+	// If top level queue is enabled - per-endpoint queue must be disable
+	// This helps us to avoid unexpected issues with mixing 2 level of exporter queues
+	if cfg.QueueSettings.Enabled {
+		oCfg.QueueConfig.Enabled = false


Looking at the ConsumeTraces(), and it would be similar for metrics and logs. This is the code for when it exports.

for exp, td := range exporterSegregatedTraces { start := time.Now() err := exp.ConsumeTraces(ctx, td) exp.consumeWG.Done() errs = multierr.Append(errs, err) ... }

Now that the exp sub-exporter has no queue, won't exp.ConsumeTraces() block until exporting is finished? This would, I assume, make the export times much slower because it has to export the routed data in-series. This would take quite a long time especially if there are hundreds of backends to export to, likely causing a timeout.

What might be necessary is an errgroup or similar which can execute these now-blocking-calls to the exporter. I would be interested to hear your thoughts or maybe @jpkrohling's or @MovieStoreGuy's thoughts on this.

Switching to an export queue also means that includes the queue workers which allow parallel processing, which allows a layer of latency hiding.

Including a new "queue" worker in this could be hazardous in how it is done, say if you create a routine for each split batch you could end up with high amount of routine scheduling which can cause a significant performance hit. Then if you do more an async send on channels you have the original risk that this solves due to the queue moving from the current queue structure, to now a buffered channel that now needs to be shared again.

For this iteration, I would suggest leaving it out of this change and keep as an issue in the backlog of "investigate and address if required"

Missed that part (
Than yes, disabling seb-exporter queue in code might be too risky for all users
I have removed those lines

@Fiery-Fenix @MovieStoreGuy revisiting this, I think we actually do need to have a way of disabling the sub exporter queue without blocking in series here.

It's not really solving the problem if the sub exporter has a queue remaining, because that's where the data loss occurred. Then, if we disable the sub-exporter queue, it's extremely slow (blocking exporting in series for each endpoint).

So I think this part of the code needs more work. Perhaps we should re open the issue or start a new issue.

jamesmoessis · 2024-11-01T03:33:53Z

.chloggen/feat_loadbalancing-exporter-queue.yaml

+# (Optional) One or more lines of additional information to render under the primary note.
+# These lines will be padded with 2 spaces and then inserted directly into the document.
+# Use pipe (|) for multiline entries.
+subtext: OTLP sub-exporter queue will be automatically disabled if loadbalancing exporter queue is enabled to avoid data loss


Now that you've consolidated all exporter queues into one, their overall size is smaller. This could lead to dropped data in some deployments, since you are going from n*queueSize to 1*queueSize - I don't think this is necessarily a breaking change but I think it should be mentioned in the changelog, so users can increase the queue size if they run into issues.

I've updated changelog to reflect this part

jamesmoessis · 2024-11-01T03:35:51Z

exporter/loadbalancingexporter/trace_exporter_test.go

@@ -349,35 +344,6 @@ func TestConsumeTracesUnexpectedExporterType(t *testing.T) {
 	assert.EqualError(t, res, fmt.Sprintf("unable to export traces, unexpected exporter type: expected exporter.Traces but got %T", newNopMockExporter()))
 }

-func TestBuildExporterConfig(t *testing.T) {


How difficult would it be to add a test that verifies data is not lost if an exporter is taken out of the ring? Could be useful since I don't see much verification of this behaviour

This sound rather E2E test than unit test, because in the current implementation I couldn't find a way how to control sub-exporters from within test routine. So I can assume that complexity for this test is too high and might require big refactoring of a whole component

Fiery-Fenix · 2024-11-04T11:02:02Z

Looks like it will need make fmt to be run to fix up a couple lines.

done

jpkrohling

I really like this solution! I think it only needs more explicit documentation and better care for current users explicitly setting the resiliency options.

exporter/loadbalancingexporter/README.md

jpkrohling · 2024-11-05T09:18:35Z

exporter/loadbalancingexporter/factory.go

@@ -32,20 +40,94 @@ func createDefaultConfig() component.Config {
 	otlpDefaultCfg.Endpoint = "placeholder:4317"

 	return &Config{
+		TimeoutSettings: exporterhelper.NewDefaultTimeoutConfig(),


This behavior should be clearly documented. I think it would even make sense to log a warning in case users tried to use that option.

That said, there ARE users relying on this feature at the moment. What should we do about them? I think we should copy their current values to the load-balancer level, which would give them roughly the same desired behavior they have today.

So, here's how it could work:

if the OTLP Exporter options do not include the resiliency options, use our new defaults

if there are, copy the ones from the OTLP section into the top-level config, and set the OTLP one to the default values, writing a log message stating that this is happening

if there are options at both levels, the ones at the load-balancing level wins, as that's the newest one and we can assume it's the latest intent from the user (but log a WARN in this case, it's serious enough)

I don't think we need a feature flag or deprecation plan for this.

Should we do this approach for all 3 resilience options (queue, retry and timeout) or it should be only to timeout settings?

@jpkrohling re-ping - I think this is the only open question left on this PR. Trying to get it moving again

Should we do this approach for all 3 resilience options

Please do all three in this PR since it can go out together for the best optimal experience.

I was thinking a lot about this "magic" and I'm sure now that we shouldn't do such settings copy because it may produce bunch of unpredictable issues.
First of all users not always looking into the logs, especially on containerized environments like K8s, and may easily miss warning record.
Also, to make this change work as expected, user needs to tune resilience options for BOTH loadbalancing exporter and otlp sub-exportes.
From our own experience, when we were testing this change: we have to lower down all 3 resilience options configured initially for sub-exportes to make data delivery reliable in our quite flexible environment. Simple copy of those settings make everything even worse than it was before
For example, retry and timeout settings were set to non-default, quite large, values. Keeping them as-is has a side effect of large delay in endpoints re-balancing, because failed sub-exporter was trying again and again to send data to already dead endpoint instead of returning data to loadbalancing exporter and re-balance it to "fresh" endpoints.
So I believe that users SHOULD review loadbalancing exporter settings by themself to achieve proper results instead of relying on unexpected change made by exporter code.
Probably we should add some additional warning to Changelog item that users should review their settings when upgrading to a new version? For me it sounds as much better option than implementing "magic" copy-paste code

I think that makes sense, and given you've tested in production that gives me more confidence.

Thank you for your response @Fiery-Fenix.

What I would suggest is the log a warning if the options are ignored at the load balancer level. Perhaps some internal metrics to track and some configuration to update based on the data being returned.

To @jpkrohling point, adding the load balancer queue and timeout helpers could cause some issues, mostly the timeout of timeouts for unaware users, so a log would help here. (Providing users more context and warnings, even if they may not get reviewed is important).

I do agree with your point that copying values might be magic and perhaps not the correct path forward. From my perspective, I think the top level load balancer would be at least double the sub exporters settings by default; mostly to allow a reshuffled into the queue and delivered to the next available endpoint.

The definite warning is when the top level exporter has a lower configuration compared to the sub exporter, that is a definite must.

I guess I'm just confused, but I'll try to clarify my points:

If this PR brings change in behavior for people using the load-balancing exporter today, this is potentially a breaking change. Breaking changes have to be phased in: at N+1 (next version), this new feature has to be added behind a feature-flag and disabled by default. with proper notices to end-users about the upcoming change in the default behavior (changelog, warn log entries); at N+3, the flag is enabled by default; at N+5, the flag is removed and only the new behavior is active.

My previous understanding was that the resiliency options from the individual exporters would be moved to the top-level config. Copying the config from the individual exporters config to the top-level would make this a non-breaking change, as the behavior for the end-user is the roughly the same.

My understanding now is that this is not changing the behavior: people can still configure the resiliency options at the individual exporters and at the top-level exporter. In this case, the readme still seems to be wrong in one place (the sending queue comment I left in place).

jpkrohling

Apart from a couple of clarifications, this looks OK to me.

jpkrohling · 2024-11-21T10:20:11Z

exporter/loadbalancingexporter/README.md

+        # except the endpoint
+        timeout: 1s
+        # doesn't take any effect because loadbalancing.sending_queue
+        # is enabled


This is still confusing to me: looking at the code in the factory, the only property it overrides is the endpoint. Why is this not effective?

func buildExporterConfig(cfg *Config, endpoint string) otlpexporter.Config { oCfg := cfg.Protocol.OTLP oCfg.Endpoint = endpoint return oCfg }

That was a leftover from previous version, removed

jpkrohling · 2024-11-21T10:37:32Z

exporter/loadbalancingexporter/factory.go

@@ -32,20 +40,94 @@ func createDefaultConfig() component.Config {
 	otlpDefaultCfg.Endpoint = "placeholder:4317"

 	return &Config{
+		TimeoutSettings: exporterhelper.NewDefaultTimeoutConfig(),


I guess I'm just confused, but I'll try to clarify my points:

If this PR brings change in behavior for people using the load-balancing exporter today, this is potentially a breaking change. Breaking changes have to be phased in: at N+1 (next version), this new feature has to be added behind a feature-flag and disabled by default. with proper notices to end-users about the upcoming change in the default behavior (changelog, warn log entries); at N+3, the flag is enabled by default; at N+5, the flag is removed and only the new behavior is active.

My previous understanding was that the resiliency options from the individual exporters would be moved to the top-level config. Copying the config from the individual exporters config to the top-level would make this a non-breaking change, as the behavior for the end-user is the roughly the same.

My understanding now is that this is not changing the behavior: people can still configure the resiliency options at the individual exporters and at the top-level exporter. In this case, the readme still seems to be wrong in one place (the sending queue comment I left in place).

jpkrohling · 2024-11-21T10:37:39Z

exporter/loadbalancingexporter/README.md

@@ -48,14 +48,17 @@ This also supports service name based exporting for traces. If you have two or m

 ## Resilience and scaling considerations

-The `loadbalancingexporter` will, irrespective of the chosen resolver (`static`, `dns`, `k8s`), create one exporter per endpoint. The exporter conforms to its published configuration regarding sending queue and retry mechanisms. Importantly, the `loadbalancingexporter` will not attempt to re-route data to a healthy endpoint on delivery failure, and data loss is therefore possible if the exporter's target remains unavailable once redelivery is exhausted. Due consideration needs to be given to the exporter queue and retry configuration when running in a highly elastic environment.
+The `loadbalancingexporter` will, irrespective of the chosen resolver (`static`, `dns`, `k8s`), create one exporter per endpoint. Each level of exporters, `loadbalancingexporter` itself and all sub-exporters (one per each endpoint), have it's own queue, timeout and retry mechanisms. Importantly, the `loadbalancingexporter` will attempt to re-route data to a healthy endpoint on delivery failure because in-memory queue, retry and timeout setting are enabled by default ([more details on queuing, retry and timeout default settings](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md)).


I understand this, but I wonder how easy this is to digest for new users. Perhaps we need a diagram somewhere? Like this: https://excalidraw.com/#json=idSDC7sKrosXFgMflKTMr,4aSVSJCVCJDInmGanZJW9Q

…e and timeout settings

As it may produce blocked exporter

Fix go.mod

Add diagram to readme for better resilience visualization

Fiery-Fenix · 2024-11-21T18:46:21Z

That's totally make sense, so I made next changes:

Resilience options for loadbalancing exporter are disabled by default now, that's really a new functionality that may break existing workflows unexpectedly. This state is also reflected in Readme
Updated Readme with diagram and few simple suggestions when and which resilience options to use

This should be a better way to introduce this functionality, because users needs to explicitly enable new resilience options, otherwise exporter behavior would be roughly the same as in previous versions

MovieStoreGuy · 2024-11-25T00:48:59Z

exporter/loadbalancingexporter/README.md


-* When using the `static` resolver and a target is unavailable, all the target's load-balanced telemetry will fail to be delivered until either the target is restored or removed from the static list. The same principle applies to the `dns` resolver.
+```


Thank you for drawing this one out, if you ever wanted to make it easier to visualise in Github you can use:

https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams

MovieStoreGuy · 2024-11-27T05:04:41Z

Leaving this with @jpkrohling as the code owner for this component :)

jpkrohling · 2024-11-27T13:31:12Z

Thank you for your contribution, I think this change greatly improves the load balancer!

…e and timeout settings (open-telemetry#36094) #### Description ##### Problem statement `loadbalancing` exporter is actually a wrapper that's creates and manages set of actual `otlp` exporters Those `otlp` exporters technically shares same configuration parameters that are defined on `loadbalancing` exporter level, including `sending_queue` configuration. The only difference is `endpoint` parameter that are substituted by `loadbalancing` exporter itself This means, that `sending_queue`, `retry_on_failure` and `timeout` settings can be defined only on `otlp` sub-exporters, while top-level `loadbalancing` exporter is missing all those settings This configuration approach produces several issue, that are already reported by users: * Impossibility to use Persistent Queue in `loadbalancing` exporter (see open-telemetry#16826). That's happens because `otlp` sub-exporters are sharing the same configurations, including configuration of the queue, i.e. they all are using the same `storage` instance at the same time which is not possible at the moment * Data loss even using `sending_queue` configuration (see open-telemetry#35378). That's happens because Queue is defined on level of `otlp` sub-exporters and if this exporter cannot flush data from queue (for example, endpoint is not available anymore) there is no other options that just to discard data from queue, i.e. there is no higher level queue and persistent storage where data can be returned is case of permanent failure There might be some other potential issue that was already tracked and related to current configuration approach ##### Proposed solution The easiest way to solve issues above - is to use standard approach for queue, retry and timeout configuration using `exporterhelper` This will bring queue, retry and timeout functionality to the top-level of `loadbalancing` exporter, instead of `otlp` sub-exporters Related to mentioned issues it will bring: * Single Persistent Queue, that is used by all `otlp` sub-exporters (not directly of course) * Queue will not be discarded/destroyed if any (or all) of endpoint that are unreachable anymore, top-level queue will keep data until new endpoints will be available * Scale-up and scale-down event for next layer of OpenTelemetry Collectors in K8s environments will be more predictable, and will not include data loss anymore (potential fix for open-telemetry#33959). There is still a big chance of inconsistency when some data will be send to incorrect endpoint, but it's already better state that we have right now ##### Noticeable changes * `loadbalancing` exporter on top-level now uses `exporterhelper` with all supported functionality by it * `sending_queue` will be automatically disabled on `otlp` exporters when it already present on top-level `loadbalancing` exporter. This change is done to prevent data loss on `otlp` exporters because queue there doesn't provide expected result. Also it will prevent potential misconfiguration from user side and as result - irrelevant reported issues * `exporter` attribute for metrics generated from `otlp` sub-exporters now includes endpoint for better visibility and to segregate them from top-level `loadbalancing` exporter - was `"exporter": "loadbalancing"`, now `"exporter": "loadbalancing/127.0.0.1:4317"` * logs, generated by `otlp` sub-exporters now includes additional attribute `endpoint` with endpoint value with the same reasons as for metrics #### Link to tracking issue Fixes open-telemetry#35378 Fixes open-telemetry#16826 #### Testing Proposed changes was heavily tested on large K8s environment with set of different scale-up/scale-down event using persistent queue configuration - no data loss were detected, everything works as expected #### Documentation `README.md` was updated to reflect new configuration parameters available. Sample `config.yaml` was updated as well

Fiery-Fenix requested review from jpkrohling and a team as code owners October 30, 2024 16:32

github-actions bot assigned evan-bradley Oct 30, 2024

github-actions bot added the exporter/loadbalancing label Oct 30, 2024

MovieStoreGuy reviewed Nov 1, 2024

View reviewed changes

jamesmoessis reviewed Nov 1, 2024

View reviewed changes

jpkrohling reviewed Nov 5, 2024

View reviewed changes

jamesmoessis approved these changes Nov 14, 2024

View reviewed changes

jpkrohling reviewed Nov 21, 2024

View reviewed changes

Fiery-Fenix added 8 commits November 21, 2024 19:24

[exporter/loadbalancing] Add top level sending_queue, retry_on_failur…

be4bcab

…e and timeout settings

Remove duplicate test

9a9cac2

Apply make fmt

2a5fea8

Remove sub-exporter queue disabling

a478d22

As it may produce blocked exporter

Update changelog item with queue size recommendation

eb89107

Fix failing test

37f9c25

Fix go.mod

Added resilience mechanism description to README

3296a9c

Disable top level resilience options by default

d8031ca

Add diagram to readme for better resilience visualization

Fiery-Fenix force-pushed the feat/loadbalancing-exporter-queue branch from 217253a to d8031ca Compare November 21, 2024 18:33

Merge branch 'main' into feat/loadbalancing-exporter-queue

64cb243

MovieStoreGuy reviewed Nov 25, 2024

View reviewed changes

MovieStoreGuy approved these changes Nov 25, 2024

View reviewed changes

jpkrohling approved these changes Nov 27, 2024

View reviewed changes

jpkrohling merged commit 8e0ea01 into open-telemetry:main Nov 27, 2024
158 checks passed

github-actions bot added this to the next release milestone Nov 27, 2024

madaraszg-tulip mentioned this pull request Dec 7, 2024

otelcol.exporter.loadbalancing should support the loadbalancer level retry options grafana/alloy#2243

Open

jamesmoessis mentioned this pull request Dec 9, 2024

[exporter/loadbalancing] Data loss occurring when change in downstream DNS records or pods #35378

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[exporter/loadbalancing] Add top level sending_queue, retry_on_failure and timeout settings #36094

[exporter/loadbalancing] Add top level sending_queue, retry_on_failure and timeout settings #36094

Fiery-Fenix commented Oct 30, 2024

linux-foundation-easycla bot commented Oct 30, 2024 •

edited

Loading

MovieStoreGuy left a comment

jamesmoessis left a comment

jamesmoessis Nov 1, 2024 •

edited

Loading

MovieStoreGuy Nov 1, 2024

Fiery-Fenix Nov 4, 2024

jamesmoessis Dec 6, 2024

jamesmoessis Nov 1, 2024

Fiery-Fenix Nov 4, 2024

jamesmoessis Nov 1, 2024

Fiery-Fenix Nov 4, 2024

Fiery-Fenix commented Nov 4, 2024

jpkrohling left a comment

jpkrohling Nov 5, 2024

Fiery-Fenix Nov 8, 2024

jamesmoessis Nov 17, 2024

MovieStoreGuy Nov 17, 2024

Fiery-Fenix Nov 18, 2024

jamesmoessis Nov 19, 2024

MovieStoreGuy Nov 21, 2024

jpkrohling Nov 21, 2024

jpkrohling left a comment

jpkrohling Nov 21, 2024

Fiery-Fenix Nov 21, 2024

jpkrohling Nov 21, 2024

jpkrohling Nov 21, 2024

Fiery-Fenix commented Nov 21, 2024

MovieStoreGuy Nov 25, 2024

MovieStoreGuy commented Nov 27, 2024

jpkrohling commented Nov 27, 2024


		* When using the `static` resolver and a target is unavailable, all the target's load-balanced telemetry will fail to be delivered until either the target is restored or removed from the static list. The same principle applies to the `dns` resolver.
		```

[exporter/loadbalancing] Add top level sending_queue, retry_on_failure and timeout settings #36094

[exporter/loadbalancing] Add top level sending_queue, retry_on_failure and timeout settings #36094

Conversation

Fiery-Fenix commented Oct 30, 2024

Description

Problem statement

Proposed solution

Noticeable changes

Link to tracking issue

Testing

Documentation

linux-foundation-easycla bot commented Oct 30, 2024 • edited Loading

MovieStoreGuy left a comment

Choose a reason for hiding this comment

jamesmoessis left a comment

Choose a reason for hiding this comment

jamesmoessis Nov 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Fiery-Fenix commented Nov 4, 2024

jpkrohling left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jpkrohling left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Fiery-Fenix commented Nov 21, 2024

Choose a reason for hiding this comment

MovieStoreGuy commented Nov 27, 2024

jpkrohling commented Nov 27, 2024

linux-foundation-easycla bot commented Oct 30, 2024 •

edited

Loading

jamesmoessis Nov 1, 2024 •

edited

Loading