[Managed Iceberg] Make manifest file writes and commits more efficient #32666
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment
Added as a release blocker because these are update-incompatible changes. Streaming writes are going to be officially supported in 2.60.0, so this should go in with that release to avoid breaking pipeline updates.
assign set of reviewers
Assigning reviewers. If you would like to opt out of this review, comment R: @kennknowles for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments).
Hi, a kind ping about the status of this PR. Since this has been added to 2.60.0, could you please request an expedited review?
Thanks!
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WriteToDestinations.java
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WriteGroupedRowsToFiles.java
.../java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WriteUngroupedRowsToFiles.java
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/SerializableDataFile.java
Thanks. LGTM.
We can merge after open comments are addressed.
Will merge when tests go green
apache#32666)
* group all data files before writing a manifest file
* add to changes md
* add data file roundtrip equality test
* address comments
When writing to Iceberg, we need to write just one manifest file per snapshot.
However, we currently write one manifest file per bundle (or one per GroupIntoBatches batch for streaming writes), which is far more often than needed. In medium and large streaming jobs, this can produce thousands of extra manifest files. For an Iceberg table, the effect of this inefficiency is felt in two ways:
Solution:
Continue writing bundles/batches to data files, but stop writing manifest files at that frequency. Instead, group data files by destination, then write and commit just one manifest file per destination. The number of manifest files should then be 1:1 with snapshots/commits (currently, it is roughly 1:1 with data files).
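To illustrate the grouping step, here is a minimal sketch in plain Java. It does not use the Beam or Iceberg APIs: `ManifestGroupingSketch`, `WrittenFile`, and `groupByDestination` are hypothetical names made up for this example, and the table names and file paths are placeholders. It only shows how collecting data files per destination reduces the manifest count from roughly one per data file to one per destination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ManifestGroupingSketch {

    /** Hypothetical stand-in for one written data file and its destination table. */
    public static final class WrittenFile {
        final String destination;
        final String path;

        public WrittenFile(String destination, String path) {
            this.destination = destination;
            this.path = path;
        }
    }

    /**
     * Groups data file paths by destination, preserving arrival order.
     * In this sketch, each map entry stands for one manifest file and one commit.
     */
    public static Map<String, List<String>> groupByDestination(List<WrittenFile> files) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (WrittenFile file : files) {
            grouped.computeIfAbsent(file.destination, k -> new ArrayList<>()).add(file.path);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Placeholder data: four data files written across two destinations.
        List<WrittenFile> files = Arrays.asList(
            new WrittenFile("db.orders", "orders/file-1.parquet"),
            new WrittenFile("db.users", "users/file-1.parquet"),
            new WrittenFile("db.orders", "orders/file-2.parquet"),
            new WrittenFile("db.orders", "orders/file-3.parquet"));

        Map<String, List<String>> grouped = groupByDestination(files);

        // Roughly the old behavior: one manifest per data file.
        System.out.println("manifests if written per data file: " + files.size());
        // The proposed behavior: one manifest (and one commit) per destination.
        System.out.println("manifests if written per destination: " + grouped.size());
    }
}
```

In the actual pipeline the grouping would happen across workers rather than in a single in-memory map, so treat this only as a picture of the counting argument, not of the implementation in this PR.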