Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache total size in file based source #26746

Merged
merged 2 commits into from
May 19, 2023
Merged

Conversation

Abacn
Copy link
Contributor

@Abacn Abacn commented May 17, 2023

In some use case it is found excessive match request (up to 6 times) during splitting file based source (e.g. in BigQueryTableSource) and causing split request taking lots of time when there are lots of files.

This value can be cached at minimum cost and avoid actual match api call.

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@github-actions github-actions bot added the java label May 17, 2023
@Abacn
Copy link
Contributor Author

Abacn commented May 17, 2023

Some sources e.g. BigQueryTableSource already implemented cache:

Long maybeNumBytes = tableSizeBytes.get();
if (maybeNumBytes != null) {
return maybeNumBytes;

@Abacn
Copy link
Contributor Author

Abacn commented May 17, 2023

R: @johnjcasey

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@Abacn
Copy link
Contributor Author

Abacn commented May 17, 2023

Run Java_Examples_Dataflow PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented May 17, 2023

Run Java_GCP_IO_Direct PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented May 17, 2023

Run Java PreCommit

@@ -91,6 +94,7 @@ protected FileBasedSource(
this.mode = Mode.FILEPATTERN;
this.emptyMatchTreatment = emptyMatchTreatment;
this.fileOrPatternSpec = fileOrPatternSpec;
this.filesSizeBytes = new AtomicReference<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a file resource level cache?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's just a Long value

@@ -73,6 +74,8 @@
private MatchResult.@Nullable Metadata singleFileMetadata;
private final Mode mode;

private final AtomicReference<@Nullable Long> filesSizeBytes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test case for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Abacn
Copy link
Contributor Author

Abacn commented May 18, 2023

Did some test, reading from tpcds_1T.web_sales:

  • on master, runner v1

11562 logs searching for pattern gs://.../temp/BigQueryExtractTemp/42fa897c43704ed7998868ebf83e9198/

  • this branch, runner v1

5781 logs for gs://.../temp/BigQueryExtractTemp/ebd9c20b02f142439a837fbf99d72e25/ basically each file is matched only 3 times (2 during split, 1 before delete)

That is a decrease by half (meaning making half of the match request)

  • on master, runner v2

3908 logs searching for pattern "gs://.../temp/BigQueryExtractTemp/8568e876f4404544a69b996c5ff47ca4/" basically each file is matched only 2 times (during split, before delete)

Tested on 10 workers (n1s1); runner v1 has 520k record/sec throughput, while runner v2 has 600k record/sec throughput.

Summary:

  • This change significantly reduced the number of List API request to gcs by half on Dataflow runner v1
  • No change on Dataflow runner v2 as the runner already avoided redundant call for estimateSize.
  • It is found that runner v2 is more efficient for this use case (BigQueryIO EXPORT read), having higher throughput under same environment.

@Abacn
Copy link
Contributor Author

Abacn commented May 19, 2023

Java PreCommit known flake unrelated: #21333

@johnjcasey johnjcasey merged commit 658e50f into apache:master May 19, 2023
@Abacn Abacn deleted the cachefilesize branch December 28, 2023 22:16
cushon pushed a commit to cushon/beam that referenced this pull request May 24, 2024
* Cache total size in file based source

* Add test case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants