Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Safer PartitionSpec & SchemalessPartitionSpec #645

Merged
merged 5 commits into from
Nov 3, 2024

Conversation

c-thiel
Copy link
Collaborator

@c-thiel c-thiel commented Sep 23, 2024

Fixes #550

This PR is a result of the issue mentioned above, a Slack Discussion and the preparation for the new MetadataBuilder #587.

Lets first establish, that PartitionSpec should have a bound schema because:

  • We need to bind the schema to to PartitonSpec to allow internal partition id handling (see @rdblue comment in slack).
  • It allows for consistent handling of schema related operations, such as determining the partition type without returning a Result.
  • A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

If we agree on this, then we have a small Problem with TableMetadata: It contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important field_id Optional.

To solve this, I introduce a SchemalessPartitionSpec. Common functions between PartitionSpec and SchemalessPartitionSpec are made available via a common trait. While the trait is public and as such allows to develop functions that take either variant as an input, I am exposing the most important attributes (i.e. fields) still directly, so that the trait typically doesn't need to be imported.

For reference, Java solves this by force-binding potentially non-compatible schemas to a PartitionSpec

Happy to hear your opinions! This really is a bit of a fiddly topic. I don't like to introduce a new type, on the other hand using the API feels more natural to me now. This can also be seen in the tests, where we had a significant number of places where we passed PartitionSpec alongside its Schema around, just to call partition_type eventually.

@c-thiel
Copy link
Collaborator Author

c-thiel commented Sep 26, 2024

Introducing SchemalessPartitionSpec might be our way to avoid apache/iceberg#4563.

@c-thiel
Copy link
Collaborator Author

c-thiel commented Oct 1, 2024

@liurenjie1024 , @Xuanwo could you maybe have a brief first look if this is going in a good direction?
Then I could start refactoring the Builder.

@Xuanwo
Copy link
Member

Xuanwo commented Oct 9, 2024

Hi, @c-thiel, thanks a lot for working on this. This issue does seem complex. I didn't join the discussion before, so I will take some time to go through all the references and then review this PR. Sorry for the delay.

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @c-thiel, thank you so much for handling this. I spent some time loading the entire context. Most changes look fine to me; the only comment I have is about the implementation details.

@@ -235,6 +261,179 @@ impl UnboundPartitionSpec {
}
}

/// Trait for common functions between [`PartitionSpec`], [`UnboundPartitionSpec`] and [`PreservedPartitionSpec`]
pub trait UnboundPartitionSpecInterface<T: PartitionFieldInterface> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, compared to having a trait, I personally feel that building an enum for this would be better. Maybe we could have:

enum PartitionSpec {
  Bound(BoundPartitionSpec),
  Unbound(UnboundPartitionSpec),
}

Users can convert between bound and unbound versions based on their needs. This makes it easier to find the differences between various partition specs.

I believe this can avoid some complexity around the trait system and narrow down our API surface. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good plan - and thanks for taking the time to review this @Xuanwo!

I will give it a shot and see how it feels.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @c-thiel for this pr! After skimming through it, I've got some questions:

  1. Why we need another struct SchemalessPartitionSpec? Why not just create an alias for UnboundPartitionSpec? I think they share same properties: a partition spec not validated to bound to a schema.
  2. I have some concerns about introducing too many traits, as @Xuanwo commented. Rust's trait system have a lot of constraits, and if our goal is to provide an unified interface to user, maybe we should use an enum? Also, I would suggest to postpone introducing this enum/interface before we see actual use case?

crates/iceberg/src/spec/partition.rs Show resolved Hide resolved
@c-thiel
Copy link
Collaborator Author

c-thiel commented Oct 23, 2024

Hey @liurenjie1024 , thanks for giving it a look!
Regarding 1) they are not identical (see also my argument from the start. FieldId is optional for unbound spec, but it is not for "previously bound" partition specs which are present in TableMetadata. This is a very important field which is why I don't like using the unbound spec in place of the schemaless spec. It would lead to unnecessary error handling in various places where the field is needed. We also can't make the unbound field required.

I think we are OK to neither expose enum nor trait now. Would you be ok with using the trait internally (pub crate) for now so that I don't have to duplicate logic?

@liurenjie1024
Copy link
Contributor

Hi, @c-thiel

Regarding 1) they are not identical (see also my argument from the start. FieldId is optional for unbound spec, but it is not for "previously bound" partition specs which are present in TableMetadata. This is a very important field which is why I don't like using the unbound spec in place of the schemaless spec. It would lead to unnecessary error handling in various places where the field is needed. We also can't make the unbound field required.

This makes sense to me. Typically I don't want to introduce data structures that don't exists in java/python library, since it would be quite confusing for java/python community members to understand and review. But this maybe a special case, and I think java implementation's use of buildUnchecked maybe unsafe either.

I think we are OK to neither expose enum nor trait now. Would you be ok with using the trait internally (pub crate) for now so that I don't have to duplicate logic?

I'm fine with limiting trait to crate only without exposing them to user, but the XXInterface doesn't sound to follow rust's naming convention. There are also other methods for removing duplication logics, such as macro. I think macro maybe a better tool here?

@Xuanwo
Copy link
Member

Xuanwo commented Oct 24, 2024

I believe using enum here is sufficient. It's acceptable for us to have some duplicated code since we know the APIs we need are limited and unlikely to expand significantly in the future.

I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level.

@liurenjie1024
Copy link
Contributor

I believe using enum here is sufficient. It's acceptable for us to have some duplicated code since we know the APIs we need are limited and unlikely to expand significantly in the future.

I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level.

enum also sounds good to me.

@c-thiel
Copy link
Collaborator Author

c-thiel commented Oct 24, 2024

Enum it is then. I'll refractor it by the end of this week.
Thanks @Xuanwo @liurenjie1024!

@c-thiel c-thiel force-pushed the feat/safe-partition-spec branch from c0e6d2b to 9562d22 Compare October 28, 2024 23:54
@c-thiel
Copy link
Collaborator Author

c-thiel commented Oct 29, 2024

@Xuanwo, @liurenjie1024 ready for another round.
Ditched both the trait and the enum. Bare with me:

As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway.

If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of clone. This is why I oped for enum<'a> PartitionSpec<'a> in 9562d22.

I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well.
We gain a lot less code and API surface. We loose quite a bit of interoperability between different schema variants. What especially hurts is the is_compatible_with method, which used to be completely cross-functional for all variants.
I implement it now only for the single combination needed by the TableMetadataBuilder now (Schemaless compatible with Unbound)

Let me know what you think!

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @c-thiel for this great pr! Generally LGTM, and I just left some minor comments.

crates/iceberg/src/spec/partition.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/partition.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/partition.rs Outdated Show resolved Hide resolved
@liurenjie1024
Copy link
Contributor

Also please merge with main branch to keep it updated.

@liurenjie1024
Copy link
Contributor

@Xuanwo, @liurenjie1024 ready for another round. Ditched both the trait and the enum. Bare with me:

As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway.

If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of clone. This is why I oped for enum<'a> PartitionSpec<'a> in 9562d22.

I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well. We gain a lot less code and API surface. We loose quite a bit of interoperability between different schema variants. What especially hurts is the is_compatible_with method, which used to be completely cross-functional for all variants. I implement it now only for the single combination needed by the TableMetadataBuilder now (Schemaless compatible with Unbound)

Let me know what you think!

Sorry I missed this comment. I agree that trait/enum just for reducing duplication of logic is somehow to antipattern, and sometimes a little duplication is acceptable.

@liurenjie1024
Copy link
Contributor

Thanks @c-thiel for this great pr!

@liurenjie1024 liurenjie1024 merged commit 6e0bcf5 into apache:main Nov 3, 2024
16 checks passed
Comment on lines +231 to 233
pub fn partition_spec_by_id(&self, spec_id: i32) -> Option<&SchemalessPartitionSpecRef> {
self.partition_specs.get(&spec_id)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that we need another type, such as SchemalessPartitionSpec. Why shouldn't we return an UnboundPartitionSpec instead? Looking at the signatures:

pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}
pub struct SchemalessPartitionSpec {
    /// Identifier for PartitionSpec
    spec_id: i32,
    /// Details of the partition spec
    fields: Vec<PartitionField>,
}

And the PartitionField and UnboundPartitionField are basically the same.

Introducing SchemalessPartitionSpec might be our way to avoid apache/iceberg#4563.

If we cannot find the field anymore, the best thing to do is to include the file in the query plan.

ZENOTME pushed a commit to risingwavelabs/iceberg-rust that referenced this pull request Nov 15, 2024
* SchemalessPartitionSpec

* Traits -> Enum

* Remove PartitionSpec enum

* Address comments
ZENOTME pushed a commit to risingwavelabs/iceberg-rust that referenced this pull request Nov 16, 2024
* SchemalessPartitionSpec

* Traits -> Enum

* Remove PartitionSpec enum

* Address comments
shaeqahmed pushed a commit to matanolabs/iceberg-rust that referenced this pull request Dec 9, 2024
* SchemalessPartitionSpec

* Traits -> Enum

* Remove PartitionSpec enum

* Address comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

refactor: make PartitionSpec more safe
4 participants