-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix kafka performance test write failing due to server overwhelmed #24964
Conversation
* Remove ReShuffle before write * Minor Optimizations on WriteCallBack
R: @johnjcasey |
Run Java KafkaIO Performance Test |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
Performance test succeeded after removing Reshuffle (on test branch) https://ci-beam.apache.org/job/beam_PerformanceTests_Kafka_IO/3538/ Seems like jenkins trigger broken again |
@@ -42,13 +42,16 @@ | |||
*/ | |||
class KafkaWriter<K, V> extends DoFn<ProducerRecord<K, V>, Void> { | |||
|
|||
protected transient @Nullable Callback callback; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the motivation for a single callback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SendCallback does not hold any state other than a reference to its outer class (KafkaWriter.this). It modified member variables of KafkaWriter in a synchronized block. This can be set as a transient member of KafkaWriter which has less overhead (no need new object each processElement call, and this is also what flink kafka connector doing)
Run RAT PreCommit |
Run Java_Kafka_IO_Direct PreCommit |
Run Java KafkaIO Performance Test |
Run Python_PVR_Flink PreCommit |
hold on for now. Removing the reshuffle in testbatch write causing batch test has high probability fail (reshuffle added before to fix this issue) We need to fix the IO connector to make sure it is robust to incoming traffic |
Run Java KafkaIO Performance Test Now streaming test succeeded, but batch read test stucks due to not see all 100,000,000 results. Checking the counter it has 99,994,820 result counted. Not sure why there are missing records getting read. |
Run Java KafkaIO Performance Test |
performance test succeeded this time. Anyway flaky is better than permared. PTAL @johnjcasey |
@@ -212,14 +210,16 @@ public void testKafkaIOReadsAndWritesCorrectlyInStreaming() throws IOException { | |||
|
|||
PipelineResult writeResult = writePipeline.run(); | |||
PipelineResult.State writeState = writeResult.waitUntilFinish(); | |||
// Fail the test if pipeline failed. | |||
assertNotEquals(PipelineResult.State.FAILED, writeState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current failing on master is due to the write pipeline continuously hitting OOM and cannot finish the bundle within timeout; then read pipeline does not receive expected number of data, e.g. https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/3546/console
org.apache.beam.sdk.io.kafka.KafkaIOIT > testKafkaIOReadsAndWritesCorrectlyInStreaming FAILED
20:26:04 java.lang.AssertionError: actual number of records 34569527 smaller than expected: 100000000.
put this assertion right after writeResult.waitUntilFinish() to make test fail early if write pipeline fails.
The OOM issue is also a motivation of using single Callback instance (though not have decisive help)
GA Playground Examples CI failure irrelevant. merging for now |
Part of #22303
Kafka performance test starts failing again after #24879 in. This is because we removed shuffle=appliance there and then read from shuffle becomes faster. However, the producer is unable to digests data within timeout (see flooding warning log of
send failed : 'Expiring 148 record(s) for beam-sdf-0:130675 ms has passed since batch creation'
)entered #24963 for the IO connector fix. For now remove ReShuffle and make the test green again.
Remove ReShuffle before write
Minor Optimizations on WriteCallBack
Please add a meaningful description for your change here
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.