TableMetadataBuilder #587
Conversation
ToDos:
Fixes #232
Thanks @c-thiel for this PR, I've skimmed through it and it looks great to me. However, this PR is too large to review (3k lines); would you mind splitting it into smaller ones? For example, we could add one PR for the methods involved in one …
Thanks for your feedback @liurenjie1024. This isn't really a refactoring of the builder; it's more of a complete rewrite. The old builder allowed corrupt metadata to be created in various ways. Splitting it up by … is possible, but I would currently prefer to keep it as a larger block, mainly because:
We now have a vision of what it could look like in the end. Before putting any more effort in, we should answer the following questions: …
Those points might change the overall design quite a bit and might require a re-write of …. After we have answered those questions, if we still think splitting makes sense, I can try to find time to build stacked PRs. Maybe just splitting normalization / validation in …
@liurenjie1024 I tried to cut a few things out - but not along the lines of …
After they are all merged, I'll rebase this PR for the actual builder.
Hi @c-thiel, sorry for the late reply.
I've gone through the new builder, and I think your design is the right direction.
To be honest, I don't quite understand the use case. We can ask for background on this in the dev channel, but I don't think this is a blocker for this PR; we can always add it later.
I've taken a look at the comments of these two PRs: apache/iceberg#6701 apache/iceberg#7445. I think the reason for the behavior is the …
I agree that this should be required, as I mentioned in #550.
That sounds reasonable to me. If one PR per table update is too much of a burden, could we split them by component, for example sort order, partition spec, and schema changes?
@liurenjie1024 thanks for the feedback!
The problem with changing it later is that it changes the semantics of the function. Right now we expect `source_id` to match the …. In my opinion ids are much cleaner than names (we might have dropped and re-added a column with the same name in the meantime), so I am OK with going forward. However, moving over to Java semantics will require new endpoints (i.e. …). Give me a thumbs up if that's OK for you. I'll also open a discussion in the dev channel to get some more opinions.
I don't think we should add the argument, to be honest. My reasoning is as follows: … Maybe @nastra or @Fokko could add some comments on the intention of that parameter?
I have reviewed most PRs that I am confident can be merged. The only one left is #615, for which I need more input. |
@Xuanwo, @liurenjie1024 this PR is ready for another round of review. It is now rebased on the 6 PRs we merged during the last months. The core logic is ~1100 lines of code, including quite a few comments.
@c-thiel I left a few comments and suggestions for improvement, LMKWYT. Apart from that it looks great and is good to go 👍
        return Ok(self);
    }

    // ToDo Discuss: Java builds a fresh spec here:
We just do this once; after that, the schema is evolved using the `UpdateSchema`, which is implemented by `SchemaUpdate`. There you pass in the `last-column-id`, and new fields will increment from that.
That is actually a very important point, and I think we should fix Rust. Currently in Rust we use the `SchemaBuilder` for everything. To get to it, we use an existing `Schema` and call the `into_builder(self)` method.
I believe we should change this to `into_builder(self, last_column_id: Option<i64>)` so that users are forced to think about the `last_column_id`. Most likely they will want to use the `last_column_id` of the metadata they got the `Schema` from.
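A minimal sketch of what that could look like; `Schema` and `SchemaBuilder` are stub types here, not the crate's definitions:

```rust
// Hypothetical sketch of the proposed change; `Schema` and
// `SchemaBuilder` are stubs, not the crate's real types.
struct Schema {
    highest_field_id: i64,
}

struct SchemaBuilder {
    schema: Schema,
    last_column_id: i64,
}

impl Schema {
    // Callers must now decide which id new fields increment from;
    // `None` falls back to the highest field id in this schema.
    fn into_builder(self, last_column_id: Option<i64>) -> SchemaBuilder {
        let last_column_id = last_column_id.unwrap_or(self.highest_field_id);
        SchemaBuilder { schema: self, last_column_id }
    }
}

fn main() {
    let schema = Schema { highest_field_id: 4 };
    // Typically the `last-column-id` of the metadata the schema came from.
    let builder = schema.into_builder(Some(17));
    assert_eq!(builder.last_column_id, 17);
}
```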
I'm leaning toward the other way, where the `last-field-id` is tracked by iceberg-rust itself and not exposed to the user. The `last-field-id` is a monotonically increasing ID that keeps track of which field-IDs are already being used.
If we do it completely internally, then we should probably not allow, or at least discourage, `into_builder` on `Schema`.
I am unsure how an alternative API should look. @liurenjie1024, some input would be very helpful.
My thoughts:

- Introduce `TableMetadata.update_schema`, which returns the current `SchemaBuilder`, but with a fixed `last-column-id` (set to the `TableMetadata` value). We document that `add_schema` should only be used for schemas modified in this way, to ensure that `last_column_id` is correct.
- If documentation is not enough, we could introduce a new type `SchemaUpdate` that can only be created by calling `TableMetadata.update_schema` (see the sketch below). It would hold the `TableMetadataBuilder` internally and have methods like `add_field` or `remove_field`. Upon build, it would return the `TableMetadataBuilder`.
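A rough sketch of that second option, using illustrative stubs rather than the crate's real types:

```rust
// Hypothetical sketch; all types are illustrative stubs, not the
// crate's real builder.
struct TableMetadataBuilder {
    last_column_id: i32,
}

struct SchemaUpdate {
    builder: TableMetadataBuilder,
    next_field_id: i32,
}

impl TableMetadataBuilder {
    // The only way to obtain a `SchemaUpdate`, so the field-id counter
    // is always seeded from the metadata's last-column-id.
    fn update_schema(self) -> SchemaUpdate {
        let next_field_id = self.last_column_id + 1;
        SchemaUpdate { builder: self, next_field_id }
    }
}

impl SchemaUpdate {
    fn add_field(mut self, name: &str) -> Self {
        // New fields increment from the metadata's last-column-id.
        println!("adding field `{name}` with id {}", self.next_field_id);
        self.next_field_id += 1;
        self
    }

    // Hands the builder back once the schema change is assembled.
    fn build(mut self) -> TableMetadataBuilder {
        self.builder.last_column_id = self.next_field_id - 1;
        self.builder
    }
}

fn main() {
    let builder = TableMetadataBuilder { last_column_id: 3 };
    let builder = builder
        .update_schema()
        .add_field("ts")
        .add_field("region")
        .build();
    assert_eq!(builder.last_column_id, 5);
}
```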
With a word of caution added to the method, I'm fine with keeping it public for now.
I know that this is quite late in the process here, but I think it might be complementary to the builder we're seeing here. For PyIceberg we've taken a different approach, where we apply the `MetadataUpdate`s (`AddSchema`, `SetCurrentSchema`, etc.) to the metadata, rather than having a builder where you can modify the `TableMetadata`. Code can be found here: https://github.com/apache/iceberg-python/blob/60800d88c7e44fe2ed58a25f1b0fcd5927156adf/pyiceberg/table/update/__init__.py#L215-L494
The `TableUpdate` is currently only part of the REST catalog - not of the core spec. I think we should keep them public unless we switch to the Python approach.
I added a word of warning that we do not check compatibility. Let me know if this is sufficient. We should extend it with a link to the new method that is implemented in #697.
> The TableUpdate is currently only part of the rest catalog - not of the core spec.

Just to mention that it's part of the core crate in Rust:

    pub enum TableUpdate {

> I think we should keep them public unless we go switch to the python approach.

In fact, the original design followed the Java/Python approach; see:

    pub struct Transaction<'a> {
@liurenjie1024 you are right, I thought it was part of REST.
I am still a bit hesitant to make it private. Let me try to curate my points. None of these are blockers; I just want to take them into consideration.

- I think it's quite useful in tests - good evidence is the Rust crate itself, where we use it in a few places. With Lakekeeper we are also using it for tests. It's handy to not require a `TableUpdate` wrapper around the schema that we want to add.
- There are features that we would need to expose in other ways:
  a) Applying multiple `TableUpdate`s to a `TableMetadata` in a single shot and obtaining the `TableMetadataBuildResult`, including `changes` and `expired_metadata_logs`. A `Transaction` isn't suitable for this: a `Transaction` works on a `Table`, not `TableMetadata`. `TableMetadata` is much lighter, and we as a catalog never actually initialize `Table` - and we don't want to initialize it just to mutate some metadata. `Transaction` also ends up calling the `Catalog` trait rather than applying the updates, so it doesn't expose the metadata-mutating functionality for us. This could be solved by adding `pub struct TableUpdates(Vec<TableUpdate>)` or similar and giving it an `apply` method (a sketch follows below). There are other options as well, of course; each of them would require another public struct, so we might also stick with the builder.
  b) Initializing `TableMetadata` conveniently, including all fields. Currently the only way to do this is via the builder. I don't think it would be ergonomic to have a basic `new()` method on `TableMetadata` and require users to use a completely different mechanism, such as `TableUpdate.apply`, to go further (i.e. adding a snapshot or setting the UUID). `TableCreation` doesn't offer this either in its current state.
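A sketch of what such a `TableUpdates` wrapper could look like; the variants and types are simplified stand-ins, not the crate's definitions:

```rust
// Hypothetical sketch of the `TableUpdates` idea from point 2a; stub
// types replace the crate's real `TableUpdate` / `TableMetadata`.
#[derive(Debug)]
enum TableUpdate {
    AssignUuid { uuid: u128 },
    SetLocation { location: String },
}

#[derive(Debug, Default)]
struct TableMetadata {
    uuid: u128,
    location: String,
}

struct TableUpdates(Vec<TableUpdate>);

impl TableUpdates {
    // Applies all updates in a single shot, without needing a `Table`,
    // a `Transaction`, or a `Catalog`.
    fn apply(self, mut metadata: TableMetadata) -> TableMetadata {
        for update in self.0 {
            match update {
                TableUpdate::AssignUuid { uuid } => metadata.uuid = uuid,
                TableUpdate::SetLocation { location } => metadata.location = location,
            }
        }
        metadata
    }
}

fn main() {
    let updates = TableUpdates(vec![
        TableUpdate::AssignUuid { uuid: 42 },
        TableUpdate::SetLocation { location: "s3://bucket/tbl".into() },
    ]);
    let metadata = updates.apply(TableMetadata::default());
    assert_eq!(metadata.uuid, 42);
}
```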
My baseline is that mutating `TableMetadata` on its own is valuable to external crates, such as our catalog, so we should offer a public, ergonomic interface for it. I believe the builder offers this interface - `Transaction` and `TableUpdate` on their own do not.
Let me know what you think!
I think it's reasonable to keep them public for the reasons you mentioned. `TableMetadataBuilder` doesn't need to do all validity checks for inputs; those are the responsibility of the transaction API.
Okay, I've added this to the spec a while ago: apache#7445. But I think this was a mistake, and we should not expose this in the public APIs, as it is much better to track this internally. I noticed this while reviewing apache/iceberg-rust#587. Removing this from the APIs in Java, plus the Open-API update, makes it much more resilient and doesn't require clients to compute this value. For example, when there are two conflicting schema changes, the last-column-id must be recomputed correctly when doing the retry operation.
Finished the second round of review. There are some files I haven't checked yet, and they will need another review round.
    pub fn current_schema(&self) -> &SchemaRef {
        self.metadata.current_schema()
    }

    /// Get the current last column id
    #[inline]
    pub fn last_column_id(&self) -> i32 {
        self.metadata.last_column_id
    }

    /// Get the current last updated timestamp
    #[inline]
    pub fn last_updated_ms(&self) -> i64 {
        self.metadata.last_updated_ms
    }
I think we should make all three of these methods private.
See my comment here: #587 (comment)
Removed
    /// # Errors
    /// - The ref is unknown.
    /// - Any of the preconditions of `self.add_snapshot` are not met.
    pub fn append_snapshot(self, snapshot: Snapshot, ref_name: Option<&str>) -> Result<Self> {
1. In fact, that method is a wrong design in my view. The implementation lacks some necessary checks, and it's inconsistent with other parts: `TableMetadata` is designed to be immutable, and all modifications should go through `TableMetadataBuilder`. I think we should deprecate that API and remove it in the future.
2. Yes, but it's weird when it has a `branch` argument. The reason I suggest this is that I want to keep the API consistent with the Java implementation.
3. Similar to 2, I want to keep the API name consistent with Java. You are right that the name could be either branch or tag. I'm OK with the argument name, but I prefer to keep the function name consistent with the Java API.
    /// Append a snapshot to the specified branch.
    /// If branch is not specified, the snapshot is appended to the main branch.
    /// The `ref` must already exist. Retention settings from the `ref` are re-used.
I think in Iceberg's terms a tag is a special branch, while `ref` typically means `SnapshotRef`, which includes more things like retention settings. I've updated it to "`branch` or `tag`" - what do you think?
        self.metadata.snapshot_log.clear();
    }

    if self.metadata.refs.remove(ref_name).is_some() || ref_name == MAIN_BRANCH {
I agree that if `ref_name == MAIN_BRANCH`, we have changed it. My assumption is that we always have `MAIN_BRANCH`, since it's the default branch? Please correct me if I'm wrong.
        return Ok(self);
    }

    // ToDo Discuss: Java builds a fresh spec here:
1. I agree that we should keep `add_schema` for now.
2. I agree that we should implement the issue proposed by @Fokko, which will be part of our table transaction API.
3. I don't think we should discourage the use of `add_schema`, since they have different responsibilities: `TableMetadataBuilder` does some necessary checks to avoid broken table metadata, but it doesn't guarantee valid metadata. Our transaction API is what should guarantee to produce valid table metadata.

In fact, I'm thinking about making all `TableMetadataBuilder` methods crate-private. Java makes them public since Java's access modifiers are limited and don't support complex path visibility like Rust. But I think these methods should be called via the transaction API or by applying `TableUpdate`s. What do you think?
@liurenjie1024 for …
Co-authored-by: Renjie Liu <[email protected]>
Slept on it, and …
    /// Remove snapshots by their ids from the table metadata.
    /// Does nothing if a snapshot id is not present.
    /// Keeps as changes only the snapshots that were actually removed.
    pub fn remove_snapshots(mut self, snapshot_ids: &[i64]) -> Self {
Currently, when removing snapshots, we might leave `parent_snapshot_id`s of other snapshots pointing to non-existing snapshots. This follows the behavior in Java.
Is this desirable? Or would it be more correct to set them to `None`, as there is no parent available anymore?
@Fokko
@c-thiel I think it is more correct to set it to `None`. I'll follow up on the Java side to get this fixed.
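A sketch of that fix-up, with stub types (not the crate's `remove_snapshots` implementation):

```rust
use std::collections::HashSet;

// Sketch of the behaviour discussed above: after removing snapshots,
// clear any `parent_snapshot_id` that now points at a snapshot that
// no longer exists. Stub types for illustration.
#[derive(Debug)]
struct Snapshot {
    snapshot_id: i64,
    parent_snapshot_id: Option<i64>,
}

fn clear_dangling_parents(snapshots: &mut [Snapshot]) {
    let existing: HashSet<i64> = snapshots.iter().map(|s| s.snapshot_id).collect();
    for snapshot in snapshots.iter_mut() {
        if let Some(parent) = snapshot.parent_snapshot_id {
            if !existing.contains(&parent) {
                // The parent was removed; `None` is more truthful
                // than a dangling id.
                snapshot.parent_snapshot_id = None;
            }
        }
    }
}

fn main() {
    let mut snapshots = vec![
        Snapshot { snapshot_id: 2, parent_snapshot_id: Some(1) }, // 1 was removed
        Snapshot { snapshot_id: 3, parent_snapshot_id: Some(2) },
    ];
    clear_dangling_parents(&mut snapshots);
    assert_eq!(snapshots[0].parent_snapshot_id, None);
    assert_eq!(snapshots[1].parent_snapshot_id, Some(2));
}
```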
    /// # Errors
    /// - The ref is unknown.
    /// - Any of the preconditions of `self.add_snapshot` are not met.
    pub fn append_snapshot(self, snapshot: Snapshot, ref_name: Option<&str>) -> Result<Self> {
Sorry for the unclear wording; I meant deprecating the `append_snapshot` method on `TableMetadata` and planning to remove it in a future release. I think adding the `#[deprecated]` annotation would be enough, since cargo will warn about this. Also, I think we should remove this method from `TableMetadataBuilder`?
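For reference, standard Rust deprecation as suggested; the `since` version and note text here are made up:

```rust
// Standard Rust deprecation: the compiler (and hence `cargo build`)
// warns at every call site. The struct and method body are stubs.
pub struct TableMetadata;

impl TableMetadata {
    #[deprecated(
        since = "0.4.0", // hypothetical version
        note = "use TableMetadataBuilder instead; this will be removed in a future release"
    )]
    pub fn append_snapshot(&mut self) {
        // ... original implementation would live here
    }
}

fn main() {
    let mut metadata = TableMetadata;
    // Without the `allow`, rustc emits a deprecation warning here.
    #[allow(deprecated)]
    metadata.append_snapshot();
}
```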
crates/iceberg/src/catalog/mod.rs (outdated)

    #[test]
    fn test_check_last_assigned_partition_id() {
        let metadata = metadata();

        println!("{:?}", metadata.last_partition_id);
Remove this debug statement?
Done
    impl Schema {
        pub fn accessor_by_field_id(&self, field_id: i32) -> Option<Arc<StructAccessor>> {
            self.field_id_to_accessor.get(&field_id).cloned()
        }

        /// Check if this schema is identical to another schema semantically - excluding schema id.
        pub(crate) fn is_same_schema(&self, other: &SchemaRef) -> bool {
Why not implement the `PartialEq` and `Eq` traits?
I thought about this as well, but opted for a method with documentation. This method excludes the `schema_id`, which I can describe here in the docstring. Using `Eq`, I would expect the `schema_id` to be equal too - especially because it's used in tests.
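A toy illustration of that trade-off: a derived `PartialEq` would compare `schema_id`, while a dedicated method can document and skip it (stub types, not the crate's `Schema`):

```rust
// Stub types illustrating the design choice, not the crate's code.
#[derive(PartialEq)]
struct Field {
    id: i32,
    name: String,
}

struct Schema {
    schema_id: i32,
    fields: Vec<Field>,
}

impl Schema {
    /// Semantically identical, deliberately excluding `schema_id`.
    fn is_same_schema(&self, other: &Schema) -> bool {
        self.fields == other.fields
    }
}

fn main() {
    let a = Schema { schema_id: 0, fields: vec![Field { id: 1, name: "ts".into() }] };
    let b = Schema { schema_id: 7, fields: vec![Field { id: 1, name: "ts".into() }] };
    assert!(a.is_same_schema(&b)); // equal despite differing schema ids
}
```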
        }
    }

    impl From<TableMetadataBuildResult> for TableMetadata {
Why do we need this? The `metadata` field is already exposed as a public field.
Just for convenience - should I remove it?
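For what it's worth, the convenience it buys is the `.into()` form; a stub-typed illustration:

```rust
// Stub types for illustration, not the crate's definitions.
struct TableMetadata;

struct TableMetadataBuildResult {
    metadata: TableMetadata,
}

impl From<TableMetadataBuildResult> for TableMetadata {
    fn from(result: TableMetadataBuildResult) -> Self {
        result.metadata
    }
}

fn main() {
    let result = TableMetadataBuildResult { metadata: TableMetadata };
    // Equivalent to `result.metadata`, but also works in generic
    // contexts that accept `Into<TableMetadata>`.
    let _metadata: TableMetadata = result.into();
}
```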
        partition_spec,
        sort_order.unwrap_or(SortOrder::unsorted_order()),
        location,
        FormatVersion::V1,
In Java, we create V2 tables by default.
We do now too :)
* Core, Open-API: Don't expose the `last-column-id`

  Okay, I've added this to the spec a while ago: #7445. But I think this was a mistake, and we should not expose this in the public APIs, as it is much better to track this internally. I noticed this while reviewing apache/iceberg-rust#587. Removing this from the APIs in Java, plus the Open-API update, makes it much more resilient and doesn't require clients to compute this value. For example, when there are two conflicting schema changes, the last-column-id must be recomputed correctly when doing the retry operation.

* Update the tests as well
* Add `deprecation` flag
* Wording (Co-authored-by: Eduard Tudenhoefner <[email protected]>)
* Wording (Co-authored-by: Eduard Tudenhoefner <[email protected]>)
* Wording
* Thanks Ryan!
* Remove `LOG`

Co-authored-by: Eduard Tudenhoefner <[email protected]>
Thanks @c-thiel for this great PR, LGTM!
* Squash builder
* Address comments
* Address comments
* Match on FormatVersion to fail for V3
* Fix examples
* Fix tests
* Address comments
* Address comments
* Update crates/iceberg/src/spec/table_metadata_builder.rs (Co-authored-by: Renjie Liu <[email protected]>)
* Remove ReferenceType
* Fix import
* Remove current_schema and last_updated_ms accessors
* Ensure main branch is not removed
* Address comments
* Fix tests
* Do not ensure ensure_main_branch_not_removed
* set_branch_snapshot create branch if not exists

Co-authored-by: Renjie Liu <[email protected]>
This PR is now ready for first reviews.
Some remarks:

- For `add_sort_order` and `add_partition_spec`, the Java code re-builds the added sort order against the current schema by matching column names. This implementation currently does not do this. Adding this feature would require `PartitionSpec` (bound) to store the schema it was bound against (probably a good idea anyway) and splitting `SortOrder` into bound and unbound variants, where the bound `SortOrder` also stores the schema it was bound against. Instead, this implementation assumes that provided sort orders and partition specs are valid for the current schema. Compatibility with the current schema is tested.
- The `add_schema` method does not require a `new_last_column_id` argument. In Java there is a todo to achieve the same. I put my reasoning in a comment in the code; feel free to comment on it.
- The `new()` behaviour now re-assigns field-ids to start from 0 (illustrated below); some tests started from 1 before. Re-assigning the ids, just like in Java, ensures that fresh metadata always has fresh and correct ids, even if they are created manually or re-used from another metadata.
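To illustrate the re-assignment semantics from the last point (a toy sketch, not the builder's code):

```rust
// Toy illustration: whatever ids the caller hands in, fresh metadata
// numbers its fields consecutively. Stub type, not the crate's `Field`.
struct Field {
    id: i32,
    name: String,
}

fn reassign_field_ids(mut fields: Vec<Field>) -> Vec<Field> {
    for (fresh_id, field) in fields.iter_mut().enumerate() {
        field.id = fresh_id as i32; // ids start from 0 again
    }
    fields
}

fn main() {
    // Ids re-used from another metadata...
    let fields = vec![
        Field { id: 5, name: "ts".into() },
        Field { id: 9, name: "region".into() },
    ];
    // ...are replaced by fresh, consecutive ids.
    let fields = reassign_field_ids(fields);
    assert_eq!((fields[0].id, fields[1].id), (0, 1));
}
```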