Add end-to-end tests to verify no data loss in stressed situations #20552

tigrannajaryan · 2023-04-03T17:31:21Z

Problem

I have been recently looking at how the Collector behaves in stressed situations (e.g. when memory limiter kicks in) and found that we have a number of places on our codebase where the components don't behave according to the contract and that results in data losses.

While it is important to test individual components to make sure they work as expected I think it is also necessary to have end-to-end tests that verify the behavior of the Collector as a whole since it is hard to guarantee that the combination of the components works as expected just by looking at the individual component unit tests.

Proposed Solution

I would like to add a new end-to-end integration test:

Use the existing testbed.
Run Collector in-process for shorter test durations.
Component sets to verify:
- Receivers: otlp (grpc and http transports), filelog, prometheus
- Processors: memorylimiter alone AND memorylimiter->batchprocessor combination
- Exporters otlp, otlphttp.
Scenarios to verify:
- Memory limit is hit, then recovered.
- Exporter helper QueuedRetry queue limit size is hit.
- Input rate (receiving) exceeds output rate (exporting) temporarily, long enough to cause hitting the above limits.
- Destination is unavailable, then becomes available.
- Exporter is sporadically available (return errors randomly).
- QueuedRetry with in-memory AND persistent queue options.
- QueuedRetry with queuing disabled.

Because the number of possible combination to test can be very large we may want to reduce them using pairwise combinatorial testing (already used by correctness tests) instead of full matrix testing.

What to verify:

No data is lost
No data is duplicated

Test Data Senders and Receivers

This test will require generators that can send data to the Collector and data receivers that can accept the data and simulate success/failure for every protocol that we want to test.

We already such data senders and data receivers for load generation purposes. It is highly advisable into the possibility to reuse these (if necessary extend them to support the scenarios we want to test for).

Alternatives Considered

Unit testing only. I don't think this is good enough, we have way to many moving parts in different components and different ways to connect them to be confident based on just unit testing.

The text was updated successfully, but these errors were encountered:

github-actions · 2023-06-05T03:30:18Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

VihasMakwana · 2023-09-25T06:41:40Z

@tigrannajaryan From the scenarios to verify, I think the most tricky ones are:

Exporter is sporadically available (return errors randomly).

For that, I came up with a POC.
I created a mock connector that acts as an exporter for our data senders and forwards the data received to data receivers in our testbed.
We can configure this connector to return permanent/non-permanent errors in the same way you did in the contract checker test on a random basis.

For,

Destination is unavailable, then becomes available.

We can do the following:

Extend our mock backend and cause it to return an unavailable error and enable QueuedRetry sender.
Use our mock connector to return non-permanent errors and let the client (i.e. data sender) retry sending the events.
Use our mock connector to return permanent errors and don't retry and move on with other events.

dmitryax · 2023-09-29T06:23:11Z

I'm not sure I understand why a connector is needed. Potentially, we can have a test/mock exporter, but I doubt it's really needed. Maybe we can have a mock backend to play this role and push random errors for the exporter to propagate (with disabled retry and queue).

For "Destination is unavailable, then becomes available", I don't see any reason for a connector. It should be just a mock backend.

VihasMakwana · 2023-09-29T06:34:16Z

Yeah, we can do that as well. Let me try that out.

@omrozowicz-splunk

Related issue: #20552 Tweak the mock-backend to do following: - Receives data from the receiver. - Returns errors randomly to our receiver, which attempts to resend/drop the data. This is helpful when we're required to test random behaviors of the collector and ensure reliable data delivery. This is my initial PR to expand the testbed. This will help my further efforts to expand the testbed. Myself and @omrozowicz-splunk plan on adding `sending_queue` support to the testbed and expanding the testing capabilities. --------- Co-authored-by: Daniel Jaglowski <[email protected]>

@omrozowicz-splunk

Related issue: open-telemetry#20552 Tweak the mock-backend to do following: - Receives data from the receiver. - Returns errors randomly to our receiver, which attempts to resend/drop the data. This is helpful when we're required to test random behaviors of the collector and ensure reliable data delivery. This is my initial PR to expand the testbed. This will help my further efforts to expand the testbed. Myself and @omrozowicz-splunk plan on adding `sending_queue` support to the testbed and expanding the testing capabilities. --------- Co-authored-by: Daniel Jaglowski <[email protected]>

@omrozowicz-splunk

Related issue: open-telemetry#20552 Tweak the mock-backend to do following: - Receives data from the receiver. - Returns errors randomly to our receiver, which attempts to resend/drop the data. This is helpful when we're required to test random behaviors of the collector and ensure reliable data delivery. This is my initial PR to expand the testbed. This will help my further efforts to expand the testbed. Myself and @omrozowicz-splunk plan on adding `sending_queue` support to the testbed and expanding the testing capabilities. --------- Co-authored-by: Daniel Jaglowski <[email protected]>

**Description:** This PR adds two tests with an otlp receiver and exporter with sending queues. We have two scenarios here: Sending queue full 1. We generate permanent errors until `sending_queue is full` log appears in the agent's logs 2. Then we get IDs of logs meant to be retried and IDs of logs received successfully and check if all of them were retried The current testbed is unable to get the information about the errors from load generator's perspective, so I needed to put `LogsToRetry` in `mock_backend` to be able to track what logs suffered from permanent error. Sending queue not full Sanity test to check a default behavior of sending queue, but without making it full. So far only logs sending queues are covered, not sure if we should add it for every data type. Currently, one test takes about ~9s. **Link to tracking Issue:** A related issue is this one: #20552, as these tests cover `Exporter helper QueuedRetry queue limit size is hit.` scenario --------- Co-authored-by: Pablo Baeyens <[email protected]>

**Description:** This PR adds two tests with an otlp receiver and exporter with sending queues. We have two scenarios here: Sending queue full 1. We generate permanent errors until `sending_queue is full` log appears in the agent's logs 2. Then we get IDs of logs meant to be retried and IDs of logs received successfully and check if all of them were retried The current testbed is unable to get the information about the errors from load generator's perspective, so I needed to put `LogsToRetry` in `mock_backend` to be able to track what logs suffered from permanent error. Sending queue not full Sanity test to check a default behavior of sending queue, but without making it full. So far only logs sending queues are covered, not sure if we should add it for every data type. Currently, one test takes about ~9s. **Link to tracking Issue:** A related issue is this one: open-telemetry#20552, as these tests cover `Exporter helper QueuedRetry queue limit size is hit.` scenario --------- Co-authored-by: Pablo Baeyens <[email protected]>

ChrsMark · 2024-01-24T14:19:59Z

Hey folks, I'm particularly interested in the delivery guarantees of the filelog receiver. I'm putting some thoughts here for discussion.

As of today the OTel Collector can handle (at least) the following situations:

The collector must track file offsets, even after a restart and continue where it left off. Using storage extension to store the receiver's state.
We should correctly track offsets even in the case of a hard crash -> Not always: Ensure reliable data delivery in erroneous situations opentelemetry-collector#7460
The collector must advance the file offset forward only when the events produced out of the previous lines are all acknowledged. Using a persistent-queue to persist the exporter's state.
We should be able to update/downgrade/re-install without losing the offset data.

It would be nice if we had end-to-end testing to cover the above scenarios.

Would the scenarios I list here fit in the scenarios described at #20552 (comment)?

Also it seems that there is no sufficient documentation that would help users to properly setup the Collector to achieve the above.
I'm sharing the configuration I used to manually reproduce the scenarios:

receivers:
  filelog:
    include: [/var/log/busybox/simple.log]
    storage: file_storage/filelogreceiver

extensions:
  file_storage/filelogreceiver:
    directory: /home/user/otelcol/file_storage/freceiver
  file_storage/otcouput:
    directory: /home/user/otelcol/file_storage/output

service:
  extensions: [file_storage/filelogreceiver, file_storage/otcouput]
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp/backend]
      processors: []

exporters:
  otlp/backend:
    endpoint: http://0.0.0.0:8200
    sending_queue:
      storage: file_storage/otcouput
    tls:
      insecure: true
      insecure_skip_verify: true

**Description:** <Describe what has changed.>  This PR adds documentation notes on how to achieve fault tolerance on `filelog`'s receiver offset tracking. The need for this is obvious but was also explained at #20552 (comment). **Link to tracking Issue:** <Issue number if applicable> **Testing:** <Describe what testing was performed and which tests were added.> **Documentation:** <Describe the documentation added.> --------- Signed-off-by: ChrsMark <[email protected]>

**Description:** <Describe what has changed.>  This PR adds documentation notes on how to achieve fault tolerance on `filelog`'s receiver offset tracking. The need for this is obvious but was also explained at open-telemetry#20552 (comment). **Link to tracking Issue:** <Issue number if applicable> **Testing:** <Describe what testing was performed and which tests were added.> **Documentation:** <Describe the documentation added.> --------- Signed-off-by: ChrsMark <[email protected]>

tigrannajaryan added enhancement New feature or request needs triage New item requiring triage labels Apr 3, 2023

tigrannajaryan mentioned this issue Apr 3, 2023

Ensure reliable data delivery in erroneous situations open-telemetry/opentelemetry-collector#7460

Open

atoulme removed the needs triage New item requiring triage label Apr 5, 2023

github-actions bot added the Stale label Jun 5, 2023

tigrannajaryan added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Jun 5, 2023

VihasMakwana mentioned this issue Sep 28, 2023

[chore]: Expand e2e testbed #27251

Merged

omrozowicz-splunk mentioned this issue Nov 6, 2023

[chore] testbed: sending queue otlp #28904

Merged

ChrsMark mentioned this issue Jan 31, 2024

[receiver/filelog] Add docs for offset tracking #30914

Merged

OverOrion mentioned this issue Feb 6, 2024

File Log receiver loses logs even if persistence is set up #31074

Open

ChrsMark mentioned this issue Mar 27, 2024

Consider testing enhancements for filelog receiver #32001

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add end-to-end tests to verify no data loss in stressed situations #20552

Add end-to-end tests to verify no data loss in stressed situations #20552

tigrannajaryan commented Apr 3, 2023 •

edited

Loading

github-actions bot commented Jun 5, 2023

VihasMakwana commented Sep 25, 2023 •

edited

Loading

dmitryax commented Sep 29, 2023

VihasMakwana commented Sep 29, 2023

ChrsMark commented Jan 24, 2024

Add end-to-end tests to verify no data loss in stressed situations #20552

Add end-to-end tests to verify no data loss in stressed situations #20552

Comments

tigrannajaryan commented Apr 3, 2023 • edited Loading

Problem

Proposed Solution

Test Data Senders and Receivers

Alternatives Considered

github-actions bot commented Jun 5, 2023

VihasMakwana commented Sep 25, 2023 • edited Loading

dmitryax commented Sep 29, 2023

VihasMakwana commented Sep 29, 2023

ChrsMark commented Jan 24, 2024

tigrannajaryan commented Apr 3, 2023 •

edited

Loading

VihasMakwana commented Sep 25, 2023 •

edited

Loading