Cache total size in file based source #26746

Abacn · 2023-05-17T17:28:12Z

In some use cases, splitting a file-based source (e.g. in BigQueryTableSource) issues excessive match requests (up to 6 times), which makes the split request take a long time when there are many files.

The total size can be cached at minimal cost, which avoids the actual match API calls.


Abacn · 2023-05-17T17:39:45Z

Some sources, e.g. BigQueryTableSource, already implement such a cache:

beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableSource.java

Lines 71 to 73 in 7caadb2

    
    Long maybeNumBytes = tableSizeBytes.get();
    if (maybeNumBytes != null) {
      return maybeNumBytes;
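
For readers who want the general shape of this pattern, here is a minimal, self-contained sketch. It is not the actual FileBasedSource or BigQueryTableSource code: the class name `CachedSizeSketch`, the `expensiveLookup` supplier (a stand-in for the match call), and the `main` demo are all assumptions made for illustration. Only the idea of holding a nullable `Long` in an `AtomicReference` comes from this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;

/** Hypothetical illustration of caching a total size; not the real Beam class. */
class CachedSizeSketch {
  // Holds the cached total size; stays null until the first lookup completes.
  private final AtomicReference<Long> filesSizeBytes = new AtomicReference<>();

  // Stand-in for the expensive match/list request (an assumption of this sketch).
  private final LongSupplier expensiveLookup;

  CachedSizeSketch(LongSupplier expensiveLookup) {
    this.expensiveLookup = expensiveLookup;
  }

  long getEstimatedSizeBytes() {
    Long cached = filesSizeBytes.get();
    if (cached != null) {
      return cached; // Cache hit: no lookup is performed.
    }
    long computed = expensiveLookup.getAsLong();
    // Store the value only if nothing is cached yet. Two threads racing here
    // may both run the lookup once, but only the first write is kept.
    filesSizeBytes.compareAndSet(null, computed);
    return computed;
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    CachedSizeSketch source =
        new CachedSizeSketch(
            () -> {
              calls.incrementAndGet();
              return 1024L;
            });
    // Ask for the size six times, mirroring the "up to 6 times" in the description.
    for (int i = 0; i < 6; i++) {
      source.getEstimatedSizeBytes();
    }
    System.out.println("lookups performed: " + calls.get()); // prints "lookups performed: 1"
  }
}
```

The sketch does not lock around the lookup, so it tolerates an occasional duplicate request under contention in exchange for staying cheap on the common path. Whether the real change makes the same trade-off is not stated in this thread.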

Abacn · 2023-05-17T17:57:59Z

R: @johnjcasey

github-actions · 2023-05-17T17:59:20Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

Abacn · 2023-05-17T19:40:04Z

Run Java_Examples_Dataflow PreCommit

Abacn · 2023-05-17T19:40:11Z

Run Java_GCP_IO_Direct PreCommit

Abacn · 2023-05-17T19:40:18Z

Run Java PreCommit

johnjcasey · 2023-05-17T20:10:38Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSource.java

@@ -91,6 +94,7 @@ protected FileBasedSource(
    this.mode = Mode.FILEPATTERN;
    this.emptyMatchTreatment = emptyMatchTreatment;
    this.fileOrPatternSpec = fileOrPatternSpec;
+    this.filesSizeBytes = new AtomicReference<>();


Is this a file resource level cache?

No, it's just a Long value

johnjcasey · 2023-05-17T20:10:49Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSource.java

@@ -73,6 +74,8 @@
  private MatchResult.@Nullable Metadata singleFileMetadata;
  private final Mode mode;

+  private final AtomicReference<@Nullable Long> filesSizeBytes;


Can we add a test case for this?

Abacn · 2023-05-18T18:30:18Z

Ran some tests, reading from tpcds_1T.web_sales:

on master, runner v1

11562 log entries when searching for the pattern gs://.../temp/BigQueryExtractTemp/42fa897c43704ed7998868ebf83e9198/

this branch, runner v1

5781 log entries for gs://.../temp/BigQueryExtractTemp/ebd9c20b02f142439a837fbf99d72e25/; basically each file is matched only 3 times (2 during split, 1 before delete).

That is a decrease by half (i.e. only half as many match requests are made).

on master, runner v2

3908 log entries when searching for the pattern "gs://.../temp/BigQueryExtractTemp/8568e876f4404544a69b996c5ff47ca4/"; basically each file is matched only 2 times (during split, before delete).

Tested on 10 workers (n1s1); runner v1 reaches a throughput of 520k records/sec, while runner v2 reaches 600k records/sec.

Summary:

This change halves the number of List API requests to GCS on Dataflow runner v1.
No change on Dataflow runner v2, as that runner already avoids redundant calls to estimateSize.
Runner v2 is more efficient for this use case (BigQueryIO EXPORT read), with higher throughput in the same environment.
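
As a quick sanity check on the numbers above, the arithmetic can be sketched as follows. The log counts and matches-per-file figures are taken from this comment. The class and method names are hypothetical, and the "implied file count" is only an inference that assumes every file is matched the same number of times.

```java
/** Hypothetical helper that re-derives the ratios from the reported log counts. */
class MatchLogArithmetic {
  // Fraction of match requests remaining after the change.
  static double remainingFraction(long before, long after) {
    return (double) after / before;
  }

  // File count implied by a log count, assuming uniform matches per file.
  static long impliedFiles(long logEntries, long matchesPerFile) {
    return logEntries / matchesPerFile;
  }

  public static void main(String[] args) {
    long masterV1 = 11562; // master, runner v1
    long branchV1 = 5781;  // this branch, runner v1
    long masterV2 = 3908;  // master, runner v2

    System.out.println("remaining fraction on v1: " + remainingFraction(masterV1, branchV1));
    System.out.println("implied files, branch v1 (3 matches each): " + impliedFiles(branchV1, 3));
    System.out.println("implied files, master v2 (2 matches each): " + impliedFiles(masterV2, 2));
  }
}
```

The two runs used different extract directories, so the implied file counts do not need to agree exactly.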

Abacn · 2023-05-19T14:03:07Z

The Java PreCommit failure is a known, unrelated flake: #21333

* Cache total size in file based source * Add test case

Cache total size in file based source

cfa6875

github-actions bot added the java label May 17, 2023

johnjcasey requested changes May 17, 2023

Add test case

7640e74

Abacn force-pushed the cachefilesize branch from bc9893c to 7640e74 Compare May 17, 2023 20:37

johnjcasey merged commit 658e50f into apache:master May 19, 2023

Abacn deleted the cachefilesize branch December 28, 2023 22:16

cushon pushed a commit to cushon/beam that referenced this pull request May 24, 2024

Cache total size in file based source (apache#26746)

5248996

* Cache total size in file based source * Add test case
