
Add end-to-end tests to verify no data loss in stressed situations #20552

Open
tigrannajaryan opened this issue Apr 3, 2023 · 5 comments
Labels
enhancement New feature or request never stale Issues marked with this label will be never staled and automatically removed

Comments

@tigrannajaryan
Member

tigrannajaryan commented Apr 3, 2023

Problem

I have been looking recently at how the Collector behaves in stressed situations (e.g. when the memory limiter kicks in) and found a number of places in our codebase where components don't behave according to the contract, which results in data loss.

While it is important to test individual components to make sure they work as expected, I think it is also necessary to have end-to-end tests that verify the behavior of the Collector as a whole, since it is hard to guarantee that a combination of components works as expected just by looking at the individual component unit tests.

Proposed Solution

I would like to add a new end-to-end integration test:

  • Use the existing testbed.
  • Run Collector in-process for shorter test durations.
  • Component sets to verify:
    • Receivers: otlp (grpc and http transports), filelog, prometheus
    • Processors: memorylimiter alone AND memorylimiter->batchprocessor combination
    • Exporters: otlp, otlphttp.
  • Scenarios to verify:
    • Memory limit is hit, then recovered.
    • Exporter helper QueuedRetry queue limit size is hit.
    • Input rate (receiving) exceeds output rate (exporting) temporarily, long enough to cause hitting the above limits.
    • Destination is unavailable, then becomes available.
    • Exporter is sporadically available (return errors randomly).
    • QueuedRetry with in-memory AND persistent queue options.
    • QueuedRetry with queuing disabled.

Because the number of possible combinations to test can be very large, we may want to reduce them using pairwise combinatorial testing (already used by the correctness tests) instead of full matrix testing.
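For illustration, one of the combinations to verify could be wired up roughly as below: an otlp receiver feeding the memorylimiter->batchprocessor chain into an otlp exporter. This is only a sketch; the limits are placeholder values chosen so that the memory limit is easy to hit in a test, not recommended settings.

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 100        # deliberately low so the limiter engages under load
    spike_limit_mib: 20
  batch:
    send_batch_size: 512
    timeout: 200ms

exporters:
  otlp:
    endpoint: localhost:4317   # placeholder destination (a mock backend in the testbed)
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]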

What to verify:

  • No data is lost
  • No data is duplicated

Test Data Senders and Receivers

This test will require generators that can send data to the Collector and data receivers that can accept the data and simulate success/failure for every protocol that we want to test.

We already have such data senders and data receivers for load generation purposes. It is highly advisable to look into reusing these (and, if necessary, extending them to support the scenarios we want to test).

Alternatives Considered

Unit testing only. I don't think this is good enough: we have way too many moving parts in different components, and too many different ways to connect them, to be confident based on unit testing alone.

@github-actions
Contributor

github-actions bot commented Jun 5, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Jun 5, 2023
@tigrannajaryan tigrannajaryan added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Jun 5, 2023
@VihasMakwana
Contributor

VihasMakwana commented Sep 25, 2023

@tigrannajaryan From the scenarios to verify, I think the most tricky ones are:

  • Exporter is sporadically available (return errors randomly).

For that, I came up with a POC.
I created a mock connector that acts as an exporter for our data senders and forwards the received data to the data receivers in our testbed.
We can configure this connector to return permanent/non-permanent errors on a random basis, in the same way you did in the contract checker test.

For,

  • Destination is unavailable, then becomes available.

We can do the following (the relevant exporter retry settings are sketched after this list):

  • Extend our mock backend to return an unavailable error, and enable the QueuedRetry sender.
  • Use our mock connector to return non-permanent errors and let the client (i.e. the data sender) retry sending the events.
  • Use our mock connector to return permanent errors, which are not retried, and move on to the other events.
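On the exporter side, the retry behavior these scenarios exercise is controlled by the exporter helper settings. A minimal sketch, with a placeholder endpoint and intervals (note that permanent errors are never retried regardless of this configuration):

exporters:
  otlp:
    endpoint: localhost:4317      # placeholder backend endpoint
    retry_on_failure:
      enabled: true               # retry non-permanent errors with backoff
      initial_interval: 1s
      max_interval: 5s
      max_elapsed_time: 30s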

@dmitryax
Member

I'm not sure I understand why a connector is needed. Potentially, we can have a test/mock exporter, but I doubt it's really needed. Maybe we can have a mock backend to play this role and push random errors for the exporter to propagate (with disabled retry and queue).

For "Destination is unavailable, then becomes available", I don't see any reason for a connector. It should be just a mock backend.

@VihasMakwana
Contributor

Yeah, we can do that as well. Let me try that out.

djaglowski added a commit that referenced this issue Oct 16, 2023
Related issue:
#20552

Tweak the mock-backend to do the following:

- Receives data from the receiver.
- Returns errors randomly to our receiver, which attempts to resend/drop
the data.

This is helpful when we're required to test random behaviors of the
collector and ensure reliable data delivery.


This is my initial PR to expand the testbed and will support my further
efforts to extend it.
@omrozowicz-splunk and I plan to add `sending_queue` support to
the testbed and expand its testing capabilities.

---------

Co-authored-by: Daniel Jaglowski <[email protected]>
mx-psi added a commit that referenced this issue Jan 4, 2024
**Description:** This PR adds two tests with an otlp receiver and
exporter with sending queues. We have two scenarios here:

Sending queue full
1. We generate permanent errors until the `sending_queue is full` log
appears in the agent's logs.
2. Then we get the IDs of logs meant to be retried and the IDs of logs received
successfully, and check that all of them were retried.

The current testbed is unable to get information about the errors
from the load generator's perspective, so I needed to put `LogsToRetry` in
`mock_backend` to be able to track which logs suffered from a permanent
error.

Sending queue not full
A sanity test to check the default behavior of the sending queue, but without
making it full.

So far only log sending queues are covered; I'm not sure if we should add
tests for every data type. Currently, one test takes about 9s.

**Link to tracking Issue:** A related issue is this one:
#20552,
as these tests cover `Exporter helper QueuedRetry queue limit size is
hit.` scenario

---------

Co-authored-by: Pablo Baeyens <[email protected]>
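The `sending_queue is full` condition exercised above can be reproduced by making the queue deliberately tiny; an illustrative sketch (the values are arbitrary, and the persistent variant of the same scenario would add a `storage` reference, as in the file_storage example further below):

exporters:
  otlp:
    endpoint: localhost:4317   # placeholder mock-backend endpoint
    sending_queue:
      enabled: true
      num_consumers: 1         # drain slowly so the queue fills up under load
      queue_size: 10           # deliberately small so "sending_queue is full" is logged quickly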
@ChrsMark
Member

Hey folks, I'm particularly interested in the delivery guarantees of the filelog receiver. I'm putting some thoughts here for discussion.

As of today the OTel Collector can handle (at least) the following situations:

  1. The collector must track file offsets, even after a restart, and continue where it left off, using a storage extension to store the receiver's state.
  2. We should correctly track offsets even in the case of a hard crash -> not always the case today; see Ensure reliable data delivery in erroneous situations opentelemetry-collector#7460.
  3. The collector must advance the file offset only when the events produced from the previously read lines are all acknowledged, using a persistent queue to persist the exporter's state.
  4. We should be able to update/downgrade/re-install without losing the offset data.

It would be nice if we had end-to-end testing to cover the above scenarios.

Would the scenarios I list here fit into the scenarios described at #20552 (comment)?

Also, it seems there is not enough documentation to help users properly set up the Collector to achieve the above.
I'm sharing the configuration I used to manually reproduce the scenarios:

receivers:
  filelog:
    include: [/var/log/busybox/simple.log]
    storage: file_storage/filelogreceiver

extensions:
  file_storage/filelogreceiver:
    directory: /home/user/otelcol/file_storage/freceiver
  file_storage/otcouput:
    directory: /home/user/otelcol/file_storage/output

service:
  extensions: [file_storage/filelogreceiver, file_storage/otcouput]
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp/backend]
      processors: []

exporters:
  otlp/backend:
    endpoint: http://0.0.0.0:8200
    sending_queue:
      storage: file_storage/otcouput
    tls:
      insecure: true
      insecure_skip_verify: true

djaglowski pushed a commit that referenced this issue Feb 1, 2024
**Description:**

This PR adds documentation notes on how to achieve fault tolerance in the
`filelog` receiver's offset tracking.
The need for this is obvious but was also explained at
#20552 (comment).

---------

Signed-off-by: ChrsMark <[email protected]>