feat: Safer PartitionSpec & SchemalessPartitionSpec #645

c-thiel · 2024-09-23T20:16:42Z

Fixes #550

This PR is a result of the issue mentioned above, a Slack Discussion and the preparation for the new MetadataBuilder #587.

Lets first establish, that PartitionSpec should have a bound schema because:

We need to bind the schema to to PartitonSpec to allow internal partition id handling (see @rdblue comment in slack).
It allows for consistent handling of schema related operations, such as determining the partition type without returning a Result.
A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

If we agree on this, then we have a small Problem with TableMetadata: It contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important field_id Optional.

To solve this, I introduce a SchemalessPartitionSpec. Common functions between PartitionSpec and SchemalessPartitionSpec are made available via a common trait. While the trait is public and as such allows to develop functions that take either variant as an input, I am exposing the most important attributes (i.e. fields) still directly, so that the trait typically doesn't need to be imported.

For reference, Java solves this by force-binding potentially non-compatible schemas to a PartitionSpec

Happy to hear your opinions! This really is a bit of a fiddly topic. I don't like to introduce a new type, on the other hand using the API feels more natural to me now. This can also be seen in the tests, where we had a significant number of places where we passed PartitionSpec alongside its Schema around, just to call partition_type eventually.

c-thiel · 2024-09-26T06:15:26Z

Introducing SchemalessPartitionSpec might be our way to avoid apache/iceberg#4563.

c-thiel · 2024-10-01T07:07:55Z

@liurenjie1024 , @Xuanwo could you maybe have a brief first look if this is going in a good direction?
Then I could start refactoring the Builder.

Xuanwo · 2024-10-09T15:11:30Z

Hi, @c-thiel, thanks a lot for working on this. This issue does seem complex. I didn't join the discussion before, so I will take some time to go through all the references and then review this PR. Sorry for the delay.

Xuanwo

Hi, @c-thiel, thank you so much for handling this. I spent some time loading the entire context. Most changes look fine to me; the only comment I have is about the implementation details.

Xuanwo · 2024-10-14T08:44:02Z

crates/iceberg/src/spec/partition.rs

@@ -235,6 +261,179 @@ impl UnboundPartitionSpec {
    }
 }

+/// Trait for common functions between [`PartitionSpec`], [`UnboundPartitionSpec`] and [`PreservedPartitionSpec`]
+pub trait UnboundPartitionSpecInterface<T: PartitionFieldInterface> {


Hi, compared to having a trait, I personally feel that building an enum for this would be better. Maybe we could have:

enum PartitionSpec { Bound(BoundPartitionSpec), Unbound(UnboundPartitionSpec), }

Users can convert between bound and unbound versions based on their needs. This makes it easier to find the differences between various partition specs.

I believe this can avoid some complexity around the trait system and narrow down our API surface. What do you think?

Sounds like a good plan - and thanks for taking the time to review this @Xuanwo!

I will give it a shot and see how it feels.

liurenjie1024

Thanks @c-thiel for this pr! After skimming through it, I've got some questions:

Why we need another struct SchemalessPartitionSpec? Why not just create an alias for UnboundPartitionSpec? I think they share same properties: a partition spec not validated to bound to a schema.
I have some concerns about introducing too many traits, as @Xuanwo commented. Rust's trait system have a lot of constraits, and if our goal is to provide an unified interface to user, maybe we should use an enum? Also, I would suggest to postpone introducing this enum/interface before we see actual use case?

crates/iceberg/src/spec/partition.rs

c-thiel · 2024-10-23T12:13:04Z

Hey @liurenjie1024 , thanks for giving it a look!
Regarding 1) they are not identical (see also my argument from the start. FieldId is optional for unbound spec, but it is not for "previously bound" partition specs which are present in TableMetadata. This is a very important field which is why I don't like using the unbound spec in place of the schemaless spec. It would lead to unnecessary error handling in various places where the field is needed. We also can't make the unbound field required.

I think we are OK to neither expose enum nor trait now. Would you be ok with using the trait internally (pub crate) for now so that I don't have to duplicate logic?

liurenjie1024 · 2024-10-24T09:07:54Z

Hi, @c-thiel

Regarding 1) they are not identical (see also my argument from the start. FieldId is optional for unbound spec, but it is not for "previously bound" partition specs which are present in TableMetadata. This is a very important field which is why I don't like using the unbound spec in place of the schemaless spec. It would lead to unnecessary error handling in various places where the field is needed. We also can't make the unbound field required.

This makes sense to me. Typically I don't want to introduce data structures that don't exists in java/python library, since it would be quite confusing for java/python community members to understand and review. But this maybe a special case, and I think java implementation's use of buildUnchecked maybe unsafe either.

I think we are OK to neither expose enum nor trait now. Would you be ok with using the trait internally (pub crate) for now so that I don't have to duplicate logic?

I'm fine with limiting trait to crate only without exposing them to user, but the XXInterface doesn't sound to follow rust's naming convention. There are also other methods for removing duplication logics, such as macro. I think macro maybe a better tool here?

Xuanwo · 2024-10-24T09:21:53Z

I believe using enum here is sufficient. It's acceptable for us to have some duplicated code since we know the APIs we need are limited and unlikely to expand significantly in the future.

I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level.

liurenjie1024 · 2024-10-24T10:14:22Z

I believe using enum here is sufficient. It's acceptable for us to have some duplicated code since we know the APIs we need are limited and unlikely to expand significantly in the future.

I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level.

enum also sounds good to me.

c-thiel · 2024-10-24T12:15:02Z

Enum it is then. I'll refractor it by the end of this week.
Thanks @Xuanwo @liurenjie1024!

c-thiel · 2024-10-29T00:26:33Z

@Xuanwo, @liurenjie1024 ready for another round.
Ditched both the trait and the enum. Bare with me:

As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway.

If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of clone. This is why I oped for enum<'a> PartitionSpec<'a> in 9562d22.

I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well.
We gain a lot less code and API surface. We loose quite a bit of interoperability between different schema variants. What especially hurts is the is_compatible_with method, which used to be completely cross-functional for all variants.
I implement it now only for the single combination needed by the TableMetadataBuilder now (Schemaless compatible with Unbound)

Let me know what you think!

liurenjie1024

Thanks @c-thiel for this great pr! Generally LGTM, and I just left some minor comments.

crates/iceberg/src/spec/partition.rs

liurenjie1024 · 2024-11-01T12:13:01Z

Also please merge with main branch to keep it updated.

liurenjie1024 · 2024-11-03T04:08:46Z

@Xuanwo, @liurenjie1024 ready for another round. Ditched both the trait and the enum. Bare with me:

As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway.

If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of clone. This is why I oped for enum<'a> PartitionSpec<'a> in 9562d22.

I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well. We gain a lot less code and API surface. We loose quite a bit of interoperability between different schema variants. What especially hurts is the is_compatible_with method, which used to be completely cross-functional for all variants. I implement it now only for the single combination needed by the TableMetadataBuilder now (Schemaless compatible with Unbound)

Let me know what you think!

Sorry I missed this comment. I agree that trait/enum just for reducing duplication of logic is somehow to antipattern, and sometimes a little duplication is acceptable.

liurenjie1024 · 2024-11-03T04:10:00Z

Thanks @c-thiel for this great pr!

Fokko · 2024-11-11T12:33:53Z

crates/iceberg/src/spec/table_metadata.rs

+    pub fn partition_spec_by_id(&self, spec_id: i32) -> Option<&SchemalessPartitionSpecRef> {
        self.partition_specs.get(&spec_id)
    }


I'm not convinced that we need another type, such as SchemalessPartitionSpec. Why shouldn't we return an UnboundPartitionSpec instead? Looking at the signatures:

pub struct UnboundPartitionSpec { /// Identifier for PartitionSpec pub(crate) spec_id: Option<i32>, /// Details of the partition spec pub(crate) fields: Vec<UnboundPartitionField>, }

pub struct SchemalessPartitionSpec { /// Identifier for PartitionSpec spec_id: i32, /// Details of the partition spec fields: Vec<PartitionField>, }

And the PartitionField and UnboundPartitionField are basically the same.

Introducing SchemalessPartitionSpec might be our way to avoid apache/iceberg#4563.

If we cannot find the field anymore, the best thing to do is to include the file in the query plan.

* SchemalessPartitionSpec * Traits -> Enum * Remove PartitionSpec enum * Address comments

Xuanwo reviewed Oct 14, 2024

View reviewed changes

liurenjie1024 reviewed Oct 23, 2024

View reviewed changes

crates/iceberg/src/spec/partition.rs Show resolved Hide resolved

c-thiel added 2 commits October 28, 2024 15:47

SchemalessPartitionSpec

002a0c3

Traits -> Enum

9562d22

c-thiel force-pushed the feat/safe-partition-spec branch from c0e6d2b to 9562d22 Compare October 28, 2024 23:54

Remove PartitionSpec enum

0690e49

liurenjie1024 reviewed Nov 1, 2024

View reviewed changes

crates/iceberg/src/spec/partition.rs Outdated Show resolved Hide resolved

crates/iceberg/src/spec/partition.rs Outdated Show resolved Hide resolved

crates/iceberg/src/spec/partition.rs Outdated Show resolved Hide resolved

c-thiel added 2 commits November 1, 2024 06:44

Merge branch 'main' into feat/safe-partition-spec

f878d6a

Address comments

b54386a

liurenjie1024 approved these changes Nov 3, 2024

View reviewed changes

liurenjie1024 merged commit 6e0bcf5 into apache:main Nov 3, 2024
16 checks passed

Fokko reviewed Nov 11, 2024

View reviewed changes

Xuanwo mentioned this pull request Nov 11, 2024

Why shouldn't we return an UnboundPartitionSpec instead? #694

Closed

ZENOTME pushed a commit to risingwavelabs/iceberg-rust that referenced this pull request Nov 15, 2024

feat: Safer PartitionSpec & SchemalessPartitionSpec (apache#645)

5f845e4

* SchemalessPartitionSpec * Traits -> Enum * Remove PartitionSpec enum * Address comments

ZENOTME pushed a commit to risingwavelabs/iceberg-rust that referenced this pull request Nov 16, 2024

feat: Safer PartitionSpec & SchemalessPartitionSpec (apache#645)

674157e

* SchemalessPartitionSpec * Traits -> Enum * Remove PartitionSpec enum * Address comments

shaeqahmed pushed a commit to matanolabs/iceberg-rust that referenced this pull request Dec 9, 2024

feat: Safer PartitionSpec & SchemalessPartitionSpec (apache#645)

2e383ee

* SchemalessPartitionSpec * Traits -> Enum * Remove PartitionSpec enum * Address comments

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Safer PartitionSpec & SchemalessPartitionSpec #645

feat: Safer PartitionSpec & SchemalessPartitionSpec #645

c-thiel commented Sep 23, 2024 •

edited

Loading

c-thiel commented Sep 26, 2024

c-thiel commented Oct 1, 2024

Xuanwo commented Oct 9, 2024

Xuanwo left a comment

Xuanwo Oct 14, 2024

c-thiel Oct 14, 2024

liurenjie1024 left a comment

c-thiel commented Oct 23, 2024 •

edited

Loading

liurenjie1024 commented Oct 24, 2024

Xuanwo commented Oct 24, 2024

liurenjie1024 commented Oct 24, 2024

c-thiel commented Oct 24, 2024

c-thiel commented Oct 29, 2024

liurenjie1024 left a comment

liurenjie1024 commented Nov 1, 2024

liurenjie1024 commented Nov 3, 2024

liurenjie1024 commented Nov 3, 2024

Fokko Nov 11, 2024

feat: Safer PartitionSpec & SchemalessPartitionSpec #645

feat: Safer PartitionSpec & SchemalessPartitionSpec #645

Conversation

c-thiel commented Sep 23, 2024 • edited Loading

c-thiel commented Sep 26, 2024

c-thiel commented Oct 1, 2024

Xuanwo commented Oct 9, 2024

Xuanwo left a comment

Choose a reason for hiding this comment

Xuanwo Oct 14, 2024

Choose a reason for hiding this comment

c-thiel Oct 14, 2024

Choose a reason for hiding this comment

liurenjie1024 left a comment

Choose a reason for hiding this comment

c-thiel commented Oct 23, 2024 • edited Loading

liurenjie1024 commented Oct 24, 2024

Xuanwo commented Oct 24, 2024

liurenjie1024 commented Oct 24, 2024

c-thiel commented Oct 24, 2024

c-thiel commented Oct 29, 2024

liurenjie1024 left a comment

Choose a reason for hiding this comment

liurenjie1024 commented Nov 1, 2024

liurenjie1024 commented Nov 3, 2024

liurenjie1024 commented Nov 3, 2024

Fokko Nov 11, 2024

Choose a reason for hiding this comment

c-thiel commented Sep 23, 2024 •

edited

Loading

c-thiel commented Oct 23, 2024 •

edited

Loading