Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Reassign field ids for schema #615

Merged
merged 7 commits into from
Oct 4, 2024

Conversation

c-thiel
Copy link
Collaborator

@c-thiel c-thiel commented Sep 8, 2024

Required for TableMetadataBuilder (#587).

When new TableMetadata is created, field ids should be reassign to ensure that field numbering is normalized and started from 0.

@c-thiel c-thiel changed the title Feat: Reassign field ids for schema feat: Reassign field ids for schema Sep 8, 2024
@c-thiel c-thiel mentioned this pull request Sep 8, 2024
/// If not provided, it will start from 0.
///
/// All specified aliases and identifier fields will be updated to the new field-ids.
pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate on the circumstances under which the user wants to set this? I used to think users always need to start with 0.

///
/// All specified aliases and identifier fields will be updated to the new field-ids.
pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self {
self.reassign_field_ids_from = Some(start_from.unwrap_or(0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we always set Some(v) here, how about just using i32 for reassign_field_ids_from with default value 0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically you would pass in last-column-id from the table metadata. In the case of a CREATE TABLE this will be zero, in the case of a REPLACE TABLE it will be last-column-id where all columns get a new ID and you ensure they are not used before.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Xuanwo only doubt with the i32 is that people might input 1 as something is required - and it's shorter than Default::default().
The rust tests are quite a good example that this can happen: We have many tests where field numbering starts from 1, so I had to re-write them to be compatible with the new builder.

I am OK with the i32 as well, just wanted to bring up that a None is probably more verbose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think in this case we shoud enforce user to pass an u32 rather an Option<i32>, since if this method is called, we always reassign field ids.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think this method is currently only used by TableMetadata builder? Could we make this method pub(crate) for now to avoid exposing to user?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both good points, fixed in the latest commit.
Its dead_code for now

@Xuanwo
Copy link
Member

Xuanwo commented Sep 10, 2024

cc @liurenjie1024 would you like to take a look too?

@liurenjie1024
Copy link
Contributor

cc @liurenjie1024 would you like to take a look too?

Thanks for pinging me, I'll take a review.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @c-thiel for this pr!

///
/// All specified aliases and identifier fields will be updated to the new field-ids.
pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self {
self.reassign_field_ids_from = Some(start_from.unwrap_or(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think in this case we shoud enforce user to pass an u32 rather an Option<i32>, since if this method is called, we always reassign field ids.

///
/// All specified aliases and identifier fields will be updated to the new field-ids.
pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self {
self.reassign_field_ids_from = Some(start_from.unwrap_or(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think this method is currently only used by TableMetadata builder? Could we make this method pub(crate) for now to avoid exposing to user?

let outer_fields = fields
.into_iter()
.map(|field| {
self.old_to_new_id.insert(field.id, self.next_field_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check exisistence of field.id here since we don't trust user provide field ids, and it may duplicate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the check. It should never trigger currently, because I switched the build method order as you suggested.
Nevertheless the check is good if Reassignment is ever used outside of the builder in the future.

@@ -105,7 +116,16 @@ impl SchemaBuilder {
}

/// Builds the schema.
pub fn build(self) -> Result<Schema> {
pub fn build(mut self) -> Result<Schema> {
if let Some(start_from) = self.reassign_field_ids_from {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that maybe we should put this part in the end of this method? The build process has some check, for example identifier field type, uniqueness of field name. Though current approach is correct, the error reporting maybe confusing. For example when some field doesn't pass identifier type check, current approach reports reassigned id, rather original id.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - I didn't think about the error messages.
It's unfortunately not super straightforward to switch the order, as for example index_by_id needs a full rebuild as both keys and values changes. I think we have two options to achieve the swap:

  1. Use id_reassigner.apply_* for all other fields that have been build (i.e. id_to_field) or
  2. Re-build those fields using the same code - i.e. the build option.

I opted for 2.
While it is slightly less performant because we have to visit the structure again, it is much less error prone as we simply re-use existing and well-tested code.

Let me know what you think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also opted for 2, which is much cleaner and easier to maintain. About performance, I don't think this is a concern for now as it's not supposed to be called frequently. Current codes are working as opt 1, are you planning to refactoring it in this pr to opt 2? BTW currently opt 1 also looks good to me, so it's up to you to do it in this pr or future pr.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liurenjie1024 I refactored it in my latest commit, however we cannot get rid of the two functions:
apply_to_aliases and apply_to_identifier_fields. The builder itself is standard now.

@@ -2237,4 +2377,168 @@ table {
let schema = table_schema_simple().0;
assert_eq!(3, schema.highest_field_id());
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tests!

@c-thiel
Copy link
Collaborator Author

c-thiel commented Sep 25, 2024

@liurenjie1024 I addressed all of your comments.
I also noticed that I didn't handle a potential field_id overflow error which I also fixed.

Regarding your uniqueness of id comment: I noticed that this wasn't enforced previously, so I fixed it and added a test for it as well. This is currently making a few tests crash, where key-ids & value-ids of maps are re-used elsewhere in the schema.

I am checking how java handles this and will fix the tests accordingly.

@c-thiel
Copy link
Collaborator Author

c-thiel commented Sep 25, 2024

Ok, java does not allow this. It's easy to test with the testNestedStructSchemaToMapping java test.
I fixed the rusts test that failed due to duplicate nested IDs.

@liurenjie1024 ready for another round :)

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @c-thiel for this great pr!

@@ -86,6 +87,16 @@ impl SchemaBuilder {
self
}

/// Reassign all field-ids (nested) on build.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Reassign all field-ids (nested) on build.
/// Reassign all field-ids (including nested) on build.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit

@@ -105,7 +116,16 @@ impl SchemaBuilder {
}

/// Builds the schema.
pub fn build(self) -> Result<Schema> {
pub fn build(mut self) -> Result<Schema> {
if let Some(start_from) = self.reassign_field_ids_from {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also opted for 2, which is much cleaner and easier to maintain. About performance, I don't think this is a concern for now as it's not supposed to be called frequently. Current codes are working as opt 1, are you planning to refactoring it in this pr to opt 2? BTW currently opt 1 also looks good to me, so it's up to you to do it in this pr or future pr.

@c-thiel
Copy link
Collaborator Author

c-thiel commented Oct 1, 2024

@liurenjie1024 last two minor changes added - I think we are good to merge @liurenjie1024 @Xuanwo

@liurenjie1024 liurenjie1024 merged commit 2bc66c4 into apache:main Oct 4, 2024
16 checks passed
@liurenjie1024
Copy link
Contributor

Thanks @c-thiel for this pr!

shaeqahmed pushed a commit to matanolabs/iceberg-rust that referenced this pull request Dec 9, 2024
* Reassign field ids for schema

* Address comments

* Schema ensure unique field ids

* Fix tests with duplicate nested field ids

* Use Schema::builder() for reassigned ids

* Better docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants