
[Managed Iceberg] Allow updating partition specs during pipeline runtime #32879

Merged
merged 8 commits into apache:master
Nov 5, 2024

Conversation

ahmedabu98
Contributor

Users may update the table's partition spec while a write pipeline is running (e.g. streaming). Sometimes, this update can happen after DataFiles have been serialized. The partition spec is not serializable, so we don't preserve it, but we do keep its ID. We can use that ID to fetch the correct partition spec and recreate the DataFile before appending to the table.

Fixes #32862
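The approach above can be modeled with a small self-contained sketch. This is plain Java with hypothetical stand-in types (`PartitionSpec`, `SerializedDataFile`, `lookupSpec` are illustrations, not the actual Beam/Iceberg classes): only the spec ID survives serialization, and the spec is re-fetched from the table's spec map at commit time.

```java
import java.util.HashMap;
import java.util.Map;

public class SpecLookupSketch {
    // Hypothetical stand-in for org.apache.iceberg.PartitionSpec.
    record PartitionSpec(int specId, String fields) {}

    // What survives serialization: the file path and the spec ID only.
    record SerializedDataFile(String path, int specId) {}

    // Recover the correct spec using the ID preserved through serialization.
    // (Iceberg table metadata keeps every spec that ever existed, keyed by ID.)
    static PartitionSpec lookupSpec(Map<Integer, PartitionSpec> specsById,
                                    SerializedDataFile file) {
        return specsById.get(file.specId());
    }

    public static void main(String[] args) {
        Map<Integer, PartitionSpec> specsById = new HashMap<>();
        specsById.put(0, new PartitionSpec(0, "bucket(16, id)"));
        specsById.put(1, new PartitionSpec(1, "day(ts)")); // spec updated mid-pipeline

        // A file serialized before the spec update still carries specId=0,
        // so it is recreated against the old spec rather than the new one.
        SerializedDataFile inflight = new SerializedDataFile("gs://bucket/f1.parquet", 0);
        System.out.println(lookupSpec(specsById, inflight).fields()); // prints "bucket(16, id)"
    }
}
```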

@ahmedabu98 changed the title [Managed Iceberg] Allowed updating partition specs at runtime → [Managed Iceberg] Allow updating partition specs at runtime Oct 18, 2024
@ahmedabu98
Contributor Author

R: @DanielMorales9


Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@github-actions bot added the build label Oct 18, 2024
@DanielMorales9

I've looked into the Iceberg codebase to check for any violations of snapshot consistency. Interestingly, it seems possible to commit data files produced with an old partitioning scheme (and therefore from a previous snapshot). Could this actually be a feature of partition evolution?! 🤔
Any thoughts?

@ahmedabu98
Contributor Author

Interestingly, it seems possible to commit data files produced with an old partitioning scheme

Yep, this is what I found empirically as well. The docs mention that each partition spec's metadata is kept separate. This allows Iceberg to carry out a separate query plan for each partition spec and then combine all the files afterwards.

Old data written with the old partition spec is not affected. What I'm seeing is that new data written with the old spec is also unaffected. This is convenient for streaming writes: after updating the table's partition spec, there may be some in-flight data still using the old spec. Iceberg will still accept that data when it gets committed and manage the metadata accordingly.

@DanielMorales9

DanielMorales9 commented Oct 21, 2024

Yep, makes sense! 👍
I checked once again and can confirm that manifest files are stored in the manifest list along with their partition spec ID, and that the writing process groups data files by their partition spec ID. We can merge ✅

Example Manifest list:

{
  "manifest_path": "gs://<bucket>/<schema>/<table>/metadata/c4aa6dc6-5e0b-454c-b126-1d97d2f223f8-m1.avro",
  "manifest_length": 8363467,
  "partition_spec_id": 0,
  "content": 0,
  "sequence_number": 4814,
  "min_sequence_number": 4093,
  "added_snapshot_id": 481579842725173797,
  "added_data_files_count": 0,
  "existing_data_files_count": 24356,
  "deleted_data_files_count": 0,
  "added_rows_count": 0,
  "existing_rows_count": 785677,
  "deleted_rows_count": 0,
  "partitions": {
    "array": []
  }
}
{
  "manifest_path": "gs://<bucket>/<schema>/<table>//metadata/9753c692-b48e-49bb-9c27-713c8a41842e-m1.avro",
  "manifest_length": 8372685,
  "partition_spec_id": 0,
  "content": 0,
  "sequence_number": 4093,
  "min_sequence_number": 3557,
  "added_snapshot_id": 1992827298894205835,
  "added_data_files_count": 0,
  "existing_data_files_count": 19878,
  "deleted_data_files_count": 0,
  "added_rows_count": 0,
  "existing_rows_count": 1263244,
  "deleted_rows_count": 0,
  "partitions": {
    "array": []
  }
}


@DanielMorales9 left a comment


lgtm

@ahmedabu98
Contributor Author

@DanielMorales9 sorry, but I did some experimenting and found this needed more work. What was missing:

  • Refreshing our cached tables, so that we use the updated partition spec when writing
  • Handling batches that contain files using both the old and the new spec. Performing a single commit on all data files is no longer an option because a commit is limited to one partition spec. However, performing multiple commits is not safe either (if a bundle fails between commits, we may end up with duplicated data). Instead, we can create a manifest file for each spec and append them all using one commit

I fixed this up in the previous commit baba789, can you take another look?

// To handle this, we create a manifest file for each partition spec, and group data files
// accordingly.
// Afterward, we append all manifests using a single commit operation.
private void appendManifestFiles(Table table, Iterable<FileWriteResult> fileWriteResults)
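The grouping step that the excerpt above describes can be sketched as a self-contained toy (plain Java; `DataFile` and `groupBySpec` are hypothetical stand-ins, not the actual Beam/Iceberg code): data files are bucketed by partition spec ID, one manifest would be built per bucket, and all manifests would then be appended in a single commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ManifestPerSpecSketch {
    // Hypothetical stand-in for a data file carrying its partition spec ID.
    record DataFile(String path, int specId) {}

    // Group data files by partition spec ID. Each group becomes one manifest
    // file; appending all manifests in a single commit avoids the duplication
    // risk of performing multiple commits per bundle.
    static Map<Integer, List<DataFile>> groupBySpec(List<DataFile> files) {
        Map<Integer, List<DataFile>> bySpec = new HashMap<>();
        for (DataFile f : files) {
            bySpec.computeIfAbsent(f.specId(), id -> new ArrayList<>()).add(f);
        }
        return bySpec;
    }

    public static void main(String[] args) {
        List<DataFile> batch = List.of(
            new DataFile("f1.parquet", 0),  // written before the spec update
            new DataFile("f2.parquet", 1),  // written after the spec update
            new DataFile("f3.parquet", 1));
        Map<Integer, List<DataFile>> groups = groupBySpec(batch);
        // Two specs in the batch -> two manifests, appended in one commit.
        System.out.println(groups.size()); // prints 2
    }
}
```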

@DanielMorales9 Oct 25, 2024


This is a long method 😲

@ahmedabu98
Contributor Author

@DanielMorales9 any more thoughts on this one?

@DanielMorales9

Nope, LGTM 👍

@ahmedabu98 changed the title [Managed Iceberg] Allow updating partition specs at runtime → [Managed Iceberg] Allow updating partition specs during pipeline runtime Nov 5, 2024
@ahmedabu98 merged commit 689af5b into apache:master Nov 5, 2024
22 checks passed
Development

Successfully merging this pull request may close these issues.

[Bug]: IcebergIO fails when the table's partition layout is changed at runtime
2 participants