-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add end-to-end tests to verify no data loss in stressed situations #20552
Comments
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping |
@tigrannajaryan From the scenarios to verify, I think the most tricky ones are:
For that, I came up with a POC. For,
We can do the following:
|
I'm not sure I understand why a connector is needed. Potentially, we can have a test/mock exporter, but I doubt it's really needed. Maybe we can have a mock backend to play this role and push random errors for the exporter to propagate (with disabled retry and queue). For "Destination is unavailable, then becomes available", I don't see any reason for a connector. It should be just a mock backend. |
Yeah, we can do that as well. Let me try that out. |
Related issue: #20552 Tweak the mock-backend to do following: - Receives data from the receiver. - Returns errors randomly to our receiver, which attempts to resend/drop the data. This is helpful when we're required to test random behaviors of the collector and ensure reliable data delivery. This is my initial PR to expand the testbed. This will help my further efforts to expand the testbed. Myself and @omrozowicz-splunk plan on adding `sending_queue` support to the testbed and expanding the testing capabilities. --------- Co-authored-by: Daniel Jaglowski <[email protected]>
Related issue: open-telemetry#20552 Tweak the mock-backend to do following: - Receives data from the receiver. - Returns errors randomly to our receiver, which attempts to resend/drop the data. This is helpful when we're required to test random behaviors of the collector and ensure reliable data delivery. This is my initial PR to expand the testbed. This will help my further efforts to expand the testbed. Myself and @omrozowicz-splunk plan on adding `sending_queue` support to the testbed and expanding the testing capabilities. --------- Co-authored-by: Daniel Jaglowski <[email protected]>
Related issue: open-telemetry#20552 Tweak the mock-backend to do following: - Receives data from the receiver. - Returns errors randomly to our receiver, which attempts to resend/drop the data. This is helpful when we're required to test random behaviors of the collector and ensure reliable data delivery. This is my initial PR to expand the testbed. This will help my further efforts to expand the testbed. Myself and @omrozowicz-splunk plan on adding `sending_queue` support to the testbed and expanding the testing capabilities. --------- Co-authored-by: Daniel Jaglowski <[email protected]>
**Description:** This PR adds two tests with an otlp receiver and exporter with sending queues. We have two scenarios here: Sending queue full 1. We generate permanent errors until `sending_queue is full` log appears in the agent's logs 2. Then we get IDs of logs meant to be retried and IDs of logs received successfully and check if all of them were retried The current testbed is unable to get the information about the errors from load generator's perspective, so I needed to put `LogsToRetry` in `mock_backend` to be able to track what logs suffered from permanent error. Sending queue not full Sanity test to check a default behavior of sending queue, but without making it full. So far only logs sending queues are covered, not sure if we should add it for every data type. Currently, one test takes about ~9s. **Link to tracking Issue:** A related issue is this one: #20552, as these tests cover `Exporter helper QueuedRetry queue limit size is hit.` scenario --------- Co-authored-by: Pablo Baeyens <[email protected]>
**Description:** This PR adds two tests with an otlp receiver and exporter with sending queues. We have two scenarios here: Sending queue full 1. We generate permanent errors until `sending_queue is full` log appears in the agent's logs 2. Then we get IDs of logs meant to be retried and IDs of logs received successfully and check if all of them were retried The current testbed is unable to get the information about the errors from load generator's perspective, so I needed to put `LogsToRetry` in `mock_backend` to be able to track what logs suffered from permanent error. Sending queue not full Sanity test to check a default behavior of sending queue, but without making it full. So far only logs sending queues are covered, not sure if we should add it for every data type. Currently, one test takes about ~9s. **Link to tracking Issue:** A related issue is this one: open-telemetry#20552, as these tests cover `Exporter helper QueuedRetry queue limit size is hit.` scenario --------- Co-authored-by: Pablo Baeyens <[email protected]>
Hey folks, I'm particularly interested in the delivery guarantees of the As of today the OTel Collector can handle (at least) the following situations:
It would be nice if we had end-to-end testing to cover the above scenarios. Would the scenarios I list here fit in the scenarios described at #20552 (comment)? Also it seems that there is no sufficient documentation that would help users to properly setup the Collector to achieve the above. receivers:
filelog:
include: [/var/log/busybox/simple.log]
storage: file_storage/filelogreceiver
extensions:
file_storage/filelogreceiver:
directory: /home/user/otelcol/file_storage/freceiver
file_storage/otcouput:
directory: /home/user/otelcol/file_storage/output
service:
extensions: [file_storage/filelogreceiver, file_storage/otcouput]
pipelines:
logs:
receivers: [filelog]
exporters: [otlp/backend]
processors: []
exporters:
otlp/backend:
endpoint: http://0.0.0.0:8200
sending_queue:
storage: file_storage/otcouput
tls:
insecure: true
insecure_skip_verify: true |
**Description:** <Describe what has changed.> <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> This PR adds documentation notes on how to achieve fault tolerance on `filelog`'s receiver offset tracking. The need for this is obvious but was also explained at #20552 (comment). **Link to tracking Issue:** <Issue number if applicable> **Testing:** <Describe what testing was performed and which tests were added.> **Documentation:** <Describe the documentation added.> --------- Signed-off-by: ChrsMark <[email protected]>
**Description:** <Describe what has changed.> <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> This PR adds documentation notes on how to achieve fault tolerance on `filelog`'s receiver offset tracking. The need for this is obvious but was also explained at open-telemetry#20552 (comment). **Link to tracking Issue:** <Issue number if applicable> **Testing:** <Describe what testing was performed and which tests were added.> **Documentation:** <Describe the documentation added.> --------- Signed-off-by: ChrsMark <[email protected]>
Problem
I have been recently looking at how the Collector behaves in stressed situations (e.g. when memory limiter kicks in) and found that we have a number of places on our codebase where the components don't behave according to the contract and that results in data losses.
While it is important to test individual components to make sure they work as expected I think it is also necessary to have end-to-end tests that verify the behavior of the Collector as a whole since it is hard to guarantee that the combination of the components works as expected just by looking at the individual component unit tests.
Proposed Solution
I would like to add a new end-to-end integration test:
otlp
(grpc and http transports),filelog
,prometheus
memorylimiter
alone ANDmemorylimiter
->batchprocessor
combinationotlp
,otlphttp
.Because the number of possible combination to test can be very large we may want to reduce them using pairwise combinatorial testing (already used by correctness tests) instead of full matrix testing.
What to verify:
Test Data Senders and Receivers
This test will require generators that can send data to the Collector and data receivers that can accept the data and simulate success/failure for every protocol that we want to test.
We already such data senders and data receivers for load generation purposes. It is highly advisable into the possibility to reuse these (if necessary extend them to support the scenarios we want to test for).
Alternatives Considered
Unit testing only. I don't think this is good enough, we have way to many moving parts in different components and different ways to connect them to be confident based on just unit testing.
The text was updated successfully, but these errors were encountered: