[Managed Iceberg] Make manifest file writes and commits more efficient #32666

Merged (4 commits) on Oct 8, 2024

Conversation

ahmedabu98 commented Oct 5, 2024

When writing to Iceberg, we need to write just one manifest file per snapshot.

However, we currently write one manifest file per bundle (or one per GroupIntoBatches (GIB) batch for streaming writes), which is far more often than needed. In medium and large streaming jobs, this can produce thousands of extra manifest files. For an Iceberg table, the cost shows up in two ways:

  • Writing more files than necessary
  • During query planning, having to open and read more files than necessary

Solution:
Continue writing bundles/batches to data files, but stop writing manifest files at that frequency. Instead, group data files by destination, then write and commit just one manifest file per destination. The number of manifest files should then be one-to-one with snapshots/commits (currently it is roughly one-to-one with data files).
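
To make the grouping idea concrete, here is a minimal sketch in plain Java. It is not the code from this PR and uses neither the Beam nor the Iceberg API: `WrittenFile`, `Manifest`, and `oneManifestPerDestination` are hypothetical names invented for illustration, and the table names and paths are made up. It only shows the shape of the change: collect all data files first, group them by destination, and emit one manifest per destination instead of one per data file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ManifestGroupingSketch {
  /** Hypothetical stand-in for a written data file; not a Beam or Iceberg type. */
  record WrittenFile(String destination, String path) {}

  /** Hypothetical stand-in for a manifest that lists the data files of one commit. */
  record Manifest(String destination, List<String> dataFilePaths) {}

  /** Groups data files by destination and builds exactly one manifest per destination. */
  static List<Manifest> oneManifestPerDestination(List<WrittenFile> files) {
    // LinkedHashMap keeps destinations in first-seen order, so output is deterministic.
    Map<String, List<String>> byDestination = new LinkedHashMap<>();
    for (WrittenFile file : files) {
      byDestination
          .computeIfAbsent(file.destination(), d -> new ArrayList<>())
          .add(file.path());
    }

    List<Manifest> manifests = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : byDestination.entrySet()) {
      manifests.add(new Manifest(entry.getKey(), List.copyOf(entry.getValue())));
    }
    return manifests;
  }

  public static void main(String[] args) {
    // Three data files across two destinations yield two manifests, not three.
    List<WrittenFile> files = List.of(
        new WrittenFile("db.orders", "orders/data-0.parquet"),
        new WrittenFile("db.users", "users/data-0.parquet"),
        new WrittenFile("db.orders", "orders/data-1.parquet"));
    for (Manifest manifest : oneManifestPerDestination(files)) {
      System.out.println(manifest.destination() + " -> " + manifest.dataFilePaths());
    }
  }
}
```

In a real pipeline each resulting manifest would be written and committed once per destination, which is what keeps manifests one-to-one with snapshots.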

ahmedabu98 changed the title from "[Managed Iceberg] Make file writes and commits more efficient" to "[Managed Iceberg] Make manifest file writes and commits more efficient" on Oct 5, 2024

github-actions bot commented Oct 5, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

ahmedabu98 added this to the 2.60.0 Release milestone on Oct 6, 2024
@ahmedabu98

Added as a release blocker because these changes are update-incompatible. Streaming writes will be officially supported in 2.60.0, so this should land in the same release to avoid breaking pipeline updates afterwards.

@ahmedabu98

assign set of reviewers

github-actions bot commented Oct 6, 2024

Assigning reviewers. If you would like to opt out of this review, comment "assign to next reviewer":

R: @kennknowles for label java.
R: @damccorm for label build.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn

Abacn commented Oct 8, 2024

Hi, a kind ping on the status of this PR. Since it has been added to the 2.60.0 milestone, could you please request an expedited review?

@chamikaramj chamikaramj left a comment

Thanks!

@chamikaramj chamikaramj left a comment

Thanks. LGTM.

We can merge after open comments are addressed.

@ahmedabu98

Will merge when tests go green

@ahmedabu98 ahmedabu98 merged commit c9aa996 into apache:master Oct 8, 2024
22 checks passed
reeba212 pushed a commit to reeba212/beam that referenced this pull request Dec 4, 2024
apache#32666)

* group all data files before writing a manifest file

* add to changes md

* add data file roundtrip equality test

* address comments