Cache total size in file based source #26746
Some sources, e.g. BigQueryTableSource, already implement this cache: Lines 71 to 73 in 7caadb2
R: @johnjcasey
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Run Java_Examples_Dataflow PreCommit
Run Java_GCP_IO_Direct PreCommit
Run Java PreCommit
@@ -91,6 +94,7 @@ protected FileBasedSource(
this.mode = Mode.FILEPATTERN;
this.emptyMatchTreatment = emptyMatchTreatment;
this.fileOrPatternSpec = fileOrPatternSpec;
this.filesSizeBytes = new AtomicReference<>();
Is this a file resource level cache?
No, it's just a Long value
@@ -73,6 +74,8 @@
private MatchResult.@Nullable Metadata singleFileMetadata;
private final Mode mode;

private final AtomicReference<@Nullable Long> filesSizeBytes;
Can we add a test case for this?
Done
Ran some tests reading from tpcds_1T.web_sales:
11562 logs searching for pattern gs://.../temp/BigQueryExtractTemp/42fa897c43704ed7998868ebf83e9198/
5781 logs for gs://.../temp/BigQueryExtractTemp/ebd9c20b02f142439a837fbf99d72e25/; each file is matched only 3 times (2 during split, 1 before delete). That is a decrease by half, i.e. half as many match requests.
3908 logs searching for pattern "gs://.../temp/BigQueryExtractTemp/8568e876f4404544a69b996c5ff47ca4/"; each file is matched only 2 times (during split, before delete).
Tested on 10 workers (n1s1); runner v1 has a throughput of 520k records/sec, while runner v2 has a throughput of 600k records/sec.
The Java PreCommit failure is a known, unrelated flake: #21333
* Cache total size in file based source
* Add test case
In some use cases, excessive match requests (up to 6 times) were observed while splitting a file-based source (e.g. in BigQueryTableSource), which makes split requests slow when there are many files.
The total size can be cached at minimal cost, avoiding the actual match API call.
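
The caching idea can be sketched in isolation as follows. This is a minimal illustration, not the code from this PR: `CachedSizeSource`, `expensiveMatch`, and the method body are hypothetical stand-ins, and only the `AtomicReference<Long> filesSizeBytes` field mirrors the diff.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a file-based source; not Beam's FileBasedSource. */
class CachedSizeSource {
  // Holds null until the total size has been computed once.
  private final AtomicReference<Long> filesSizeBytes = new AtomicReference<>();

  // Assumption: stands in for the expensive match call that sums file sizes.
  private final LongSupplier expensiveMatch;

  CachedSizeSource(LongSupplier expensiveMatch) {
    this.expensiveMatch = expensiveMatch;
  }

  long getEstimatedSizeBytes() {
    Long cached = filesSizeBytes.get();
    if (cached != null) {
      // Cache hit: skip the match call entirely.
      return cached;
    }
    long computed = expensiveMatch.getAsLong();
    // Racing threads may each compute once; the first stored value wins.
    filesSizeBytes.compareAndSet(null, computed);
    return filesSizeBytes.get();
  }
}
```

With this shape, repeated size estimates on the same source object trigger the underlying match only on the first call, which is the effect the log counts above are meant to demonstrate.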