[Managed Iceberg] Support writing to partitioned tables #32102

Merged: 15 commits into apache:master on Aug 16, 2024

Conversation

Contributor

@ahmedabu98 ahmedabu98 commented Aug 7, 2024

Fixes #31943

Adds support for writing to partitioned Iceberg tables.

A record writer manager is introduced to open and close writers as necessary. An Iceberg data writer instance is configured to write to only one partition, so multiple writers are needed to write to multiple partitions.

The behavior remains unchanged when writing to unpartitioned tables.
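
To illustrate the idea described above (not the code in this PR), here is a minimal, self-contained sketch of a manager that opens one writer per partition on demand and closes them all at the end. Everything in it is hypothetical: `SinglePartitionWriter` stands in for an Iceberg data writer, `SketchWriterManager` is not the real `RecordWriterManager`, and partitions are represented as plain strings. The writer cap is an assumption loosely based on the commit message "reject rows when we are saturated with record writers"; how the real class handles saturation is not shown in this conversation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an Iceberg data writer: accepts rows for exactly one partition. */
class SinglePartitionWriter implements AutoCloseable {
  final String partition;
  final List<String> rows = new ArrayList<>();
  private boolean closed = false;

  SinglePartitionWriter(String partition) {
    this.partition = partition;
  }

  void write(String row) {
    if (closed) {
      throw new IllegalStateException("writer for " + partition + " is closed");
    }
    rows.add(row);
  }

  @Override
  public void close() {
    closed = true;
  }
}

/** Hypothetical sketch: lazily opens one writer per partition, up to a fixed cap. */
class SketchWriterManager implements AutoCloseable {
  private final int maxOpenWriters;
  private final Map<String, SinglePartitionWriter> openWriters = new LinkedHashMap<>();
  private final List<SinglePartitionWriter> closedWriters = new ArrayList<>();

  SketchWriterManager(int maxOpenWriters) {
    this.maxOpenWriters = maxOpenWriters;
  }

  /** Routes the row to its partition's writer; returns false if a new writer would exceed the cap. */
  boolean write(String partition, String row) {
    SinglePartitionWriter writer = openWriters.get(partition);
    if (writer == null) {
      if (openWriters.size() >= maxOpenWriters) {
        return false; // saturated: the caller decides what to do with the rejected row
      }
      writer = new SinglePartitionWriter(partition);
      openWriters.put(partition, writer);
    }
    writer.write(row);
    return true;
  }

  int openWriterCount() {
    return openWriters.size();
  }

  /** Closes every open writer and keeps it so its results can be inspected afterwards. */
  @Override
  public void close() {
    for (SinglePartitionWriter writer : openWriters.values()) {
      writer.close();
      closedWriters.add(writer);
    }
    openWriters.clear();
  }

  /** Number of rows written per partition, counting closed writers only. */
  Map<String, Integer> rowCounts() {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (SinglePartitionWriter writer : closedWriters) {
      counts.merge(writer.partition, writer.rows.size(), Integer::sum);
    }
    return counts;
  }
}

class PartitionedWriteDemo {
  public static void main(String[] args) {
    SketchWriterManager manager = new SketchWriterManager(2);
    manager.write("day=1", "r1");
    manager.write("day=2", "r2");
    manager.write("day=1", "r3");
    boolean accepted = manager.write("day=3", "r4"); // third partition, but the cap is 2
    manager.close();
    System.out.println("accepted=" + accepted + " counts=" + manager.rowCounts());
  }
}
```

In this sketch, an unpartitioned table would simply use a single constant partition key, so only one writer is ever opened.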

It also includes some small but key changes:

  • Use Iceberg APIs to determine the datafile location (as opposed to hardcoding the location to be at <warehouse>/<table>/<data-file>)
  • Write manifest files under <table>/metadata/

@github-actions github-actions bot added the build label Aug 7, 2024
Contributor

github-actions bot commented Aug 7, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

Contributor

github-actions bot commented Aug 7, 2024

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment "assign set of reviewers".

@ahmedabu98 ahmedabu98 marked this pull request as draft August 7, 2024 23:38
@ahmedabu98 ahmedabu98 added this to the 2.59.0 Release milestone Aug 13, 2024
@ahmedabu98 ahmedabu98 marked this pull request as ready for review August 13, 2024 22:20
@ahmedabu98 ahmedabu98 changed the title [Managed Iceberg] Support writing partitioned data [Managed Iceberg] Support writing to partitioned tables Aug 13, 2024
@ahmedabu98
Contributor Author

assign set of reviewers

Contributor

Assigning reviewers. If you would like to opt out of this review, comment "assign to next reviewer":

R: @m-trieu for label java.
R: @damccorm for label build.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@ahmedabu98
Contributor Author

CC: @arthurpessoa, @chamikaramj

Member

@kennknowles kennknowles left a comment

Fairly quick pass

 * <p>After closing, the resulting {@link ManifestFile}s can be retrieved using {@link
 * #getManifestFiles()}.
 */
class RecordWriterManager {

Member

Based on the documentation, this might be more accurately named PartitionedRecordWriter or something similar. That communicates better than "Manager", which could mean almost anything. And it seems it could be AutoCloseable, which would enable using it in try-with-resources blocks, no?

Contributor Author

I originally named it PartitionedRecordWriter but thought that might be misleading, since the class is also used for unpartitioned writes.

Added an AutoCloseable implementation.
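
For readers unfamiliar with the pattern discussed here, the following is a small illustration of what implementing `AutoCloseable` enables. It is only a sketch: `LoggingWriter` and `TryWithResourcesDemo` are hypothetical names, not classes from this PR, and the writer just records lifecycle events so the ordering is visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical writer that records lifecycle events so the close order is visible. */
class LoggingWriter implements AutoCloseable {
  private final List<String> events;

  LoggingWriter(List<String> events) {
    this.events = events;
    events.add("open");
  }

  void write(String row) {
    if (row.isEmpty()) {
      throw new IllegalArgumentException("empty row");
    }
    events.add("write:" + row);
  }

  @Override
  public void close() {
    events.add("close");
  }
}

class TryWithResourcesDemo {
  /** Writes all rows and returns the recorded events; close() runs even if a write throws. */
  static List<String> writeAll(List<String> rows) {
    List<String> events = new ArrayList<>();
    try (LoggingWriter writer = new LoggingWriter(events)) {
      for (String row : rows) {
        writer.write(row);
      }
    } catch (IllegalArgumentException e) {
      // The resource has already been closed by the time this handler runs.
      events.add("error:" + e.getMessage());
    }
    return events;
  }

  public static void main(String[] args) {
    System.out.println(writeAll(Arrays.asList("a", "b")));
    // prints [open, write:a, write:b, close]
    System.out.println(writeAll(Arrays.asList("a", "")));
    // prints [open, write:a, close, error:empty row]
  }
}
```

The second call shows the main benefit: the writer is closed before the exception handler runs, without an explicit `finally` block.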

Member

Fair enough. One reason I don't like the "Manager" name is that it doesn't tell me what the managing consists of, and it really doesn't communicate that this is the thing that does the writing. We have so many things called "manager" and none of them have anything in common.

Member

@kennknowles kennknowles left a comment

LGTM to get it merged for the release.

@ahmedabu98 ahmedabu98 merged commit c9ad32e into apache:master Aug 16, 2024
22 checks passed
reeba212 pushed a commit to reeba212/beam that referenced this pull request Dec 4, 2024
* support writing partitioned data

* trigger integration tests

* partitioned record writer to manage writers for different partitions

* partitioned record writer

* reject rows when we are saturated with record writers

* refactor record writer manager

* add tests

* add more tests

* make record writer manager transient

* clean up test path

* cleanup

* cleanup

* address comments

* revert readability change

* add to changes md
Development

Successfully merging this pull request may close these issues.

[Bug]: Writing partitioned data with Managed IcebergIO fails
3 participants