Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix kafka performance test write failing due to server overwhelmed #24964

Merged
merged 2 commits into from
Jan 13, 2023

Conversation

Abacn
Copy link
Contributor

@Abacn Abacn commented Jan 10, 2023

Part of #22303

Kafka performance test starts failing again after #24879 in. This is because we removed shuffle=appliance there and then read from shuffle becomes faster. However, the producer is unable to digests data within timeout (see flooding warning log of send failed : 'Expiring 148 record(s) for beam-sdf-0:130675 ms has passed since batch creation')

entered #24963 for the IO connector fix. For now remove ReShuffle and make the test green again.

  • Remove ReShuffle before write

  • Minor Optimizations on WriteCallBack

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

* Remove ReShuffle before write

* Minor Optimizations on WriteCallBack
@Abacn
Copy link
Contributor Author

Abacn commented Jan 10, 2023

R: @johnjcasey

@Abacn
Copy link
Contributor Author

Abacn commented Jan 10, 2023

Run Java KafkaIO Performance Test

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@Abacn
Copy link
Contributor Author

Abacn commented Jan 10, 2023

Performance test succeeded after removing Reshuffle (on test branch)

https://ci-beam.apache.org/job/beam_PerformanceTests_Kafka_IO/3538/

Seems like jenkins trigger broken again

@@ -42,13 +42,16 @@
*/
class KafkaWriter<K, V> extends DoFn<ProducerRecord<K, V>, Void> {

protected transient @Nullable Callback callback;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the motivation for a single callback?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SendCallback does not hold any state other than a reference to its outer class (KafkaWriter.this). It modified member variables of KafkaWriter in a synchronized block. This can be set as a transient member of KafkaWriter which has less overhead (no need new object each processElement call, and this is also what flink kafka connector doing)

@Abacn
Copy link
Contributor Author

Abacn commented Jan 10, 2023

Run RAT PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented Jan 10, 2023

Run Java_Kafka_IO_Direct PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented Jan 10, 2023

Run Java KafkaIO Performance Test

@Abacn
Copy link
Contributor Author

Abacn commented Jan 10, 2023

Run Python_PVR_Flink PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented Jan 11, 2023

hold on for now. Removing the reshuffle in testbatch write causing batch test has high probability fail (reshuffle added before to fix this issue)

We need to fix the IO connector to make sure it is robust to incoming traffic

@Abacn
Copy link
Contributor Author

Abacn commented Jan 11, 2023

Run Java KafkaIO Performance Test

Now streaming test succeeded, but batch read test stucks due to not see all 100,000,000 results. Checking the counter it has 99,994,820 result counted. Not sure why there are missing records getting read.

@Abacn
Copy link
Contributor Author

Abacn commented Jan 12, 2023

Run Java KafkaIO Performance Test

@Abacn
Copy link
Contributor Author

Abacn commented Jan 12, 2023

performance test succeeded this time. Anyway flaky is better than permared. PTAL @johnjcasey

@@ -212,14 +210,16 @@ public void testKafkaIOReadsAndWritesCorrectlyInStreaming() throws IOException {

PipelineResult writeResult = writePipeline.run();
PipelineResult.State writeState = writeResult.waitUntilFinish();
// Fail the test if pipeline failed.
assertNotEquals(PipelineResult.State.FAILED, writeState);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current failing on master is due to the write pipeline continuously hitting OOM and cannot finish the bundle within timeout; then read pipeline does not receive expected number of data, e.g. https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/3546/console

org.apache.beam.sdk.io.kafka.KafkaIOIT > testKafkaIOReadsAndWritesCorrectlyInStreaming FAILED
20:26:04     java.lang.AssertionError: actual number of records 34569527 smaller than expected: 100000000.

put this assertion right after writeResult.waitUntilFinish() to make test fail early if write pipeline fails.

The OOM issue is also a motivation of using single Callback instance (though not have decisive help)

@Abacn
Copy link
Contributor Author

Abacn commented Jan 13, 2023

GA Playground Examples CI failure irrelevant. merging for now

@Abacn Abacn merged commit 4534904 into apache:master Jan 13, 2023
@Abacn Abacn deleted the fixkafkawrite branch January 13, 2023 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants