feat: Safer PartitionSpec & SchemalessPartitionSpec #645
Introducing |
@liurenjie1024, @Xuanwo could you maybe have a brief first look if this is going in a good direction?
Hi, @c-thiel, thanks a lot for working on this. This issue does seem complex. I didn't join the discussion before, so I will take some time to go through all the references and then review this PR. Sorry for the delay.
Hi, @c-thiel, thank you so much for handling this. I spent some time loading the entire context. Most changes look fine to me; the only comment I have is about the implementation details.
crates/iceberg/src/spec/partition.rs
Outdated
@@ -235,6 +261,179 @@ impl UnboundPartitionSpec {
    }
}

/// Trait for common functions between [`PartitionSpec`], [`UnboundPartitionSpec`] and [`PreservedPartitionSpec`]
pub trait UnboundPartitionSpecInterface<T: PartitionFieldInterface> {
Hi, compared to having a trait, I personally feel that building an enum for this would be better. Maybe we could have:
enum PartitionSpec {
    Bound(BoundPartitionSpec),
    Unbound(UnboundPartitionSpec),
}
Users can convert between bound and unbound versions based on their needs. This makes it easier to find the differences between various partition specs.
I believe this can avoid some complexity around the trait system and narrow down our API surface. What do you think?
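For illustration, here is a minimal sketch of what the enum approach could look like. The struct layouts, the `field_names` accessor and the `into_unbound` conversion are all hypothetical stand-ins chosen for the example; they are not the actual iceberg-rust types or API.

```rust
// Hypothetical, simplified stand-ins; not the real iceberg-rust types.
#[derive(Debug, Clone, PartialEq)]
pub struct UnboundPartitionSpec {
    pub spec_id: Option<i32>,
    pub field_names: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct BoundPartitionSpec {
    pub spec_id: i32,
    pub field_names: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum PartitionSpec {
    Bound(BoundPartitionSpec),
    Unbound(UnboundPartitionSpec),
}

impl PartitionSpec {
    /// Shared accessor, written once on the enum instead of through a trait.
    pub fn field_names(&self) -> &[String] {
        match self {
            PartitionSpec::Bound(s) => &s.field_names,
            PartitionSpec::Unbound(s) => &s.field_names,
        }
    }

    /// Downgrading to the unbound form never fails (assumed conversion).
    pub fn into_unbound(self) -> UnboundPartitionSpec {
        match self {
            PartitionSpec::Bound(s) => UnboundPartitionSpec {
                spec_id: Some(s.spec_id),
                field_names: s.field_names,
            },
            PartitionSpec::Unbound(s) => s,
        }
    }
}
```

With this shape, the differences between the variants are visible in one place, and shared logic lives in a single `match` per method.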
Sounds like a good plan - and thanks for taking the time to review this @Xuanwo!
I will give it a shot and see how it feels.
Thanks @c-thiel for this pr! After skimming through it, I've got some questions:
- Why do we need another struct, SchemalessPartitionSpec? Why not just create an alias for UnboundPartitionSpec? I think they share the same properties: a partition spec that is not validated against or bound to a schema.
- I have some concerns about introducing too many traits, as @Xuanwo commented. Rust's trait system has a lot of constraints, and if our goal is to provide a unified interface to users, maybe we should use an enum? Also, I would suggest postponing this enum/interface until we see an actual use case.
Hey @liurenjie1024, thanks for giving it a look! I think we are OK to expose neither the enum nor the trait for now. Would you be OK with using the trait internally (pub(crate)) for now so that I don't have to duplicate logic?
Hi, @c-thiel
This makes sense to me. Typically I don't want to introduce data structures that don't exist in the Java/Python libraries, since it would be quite confusing for Java/Python community members to understand and review. But this may be a special case, and I think java implementation's use of
I'm fine with limiting trait to crate only without exposing them to user, but the |
I believe using I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level. |
Enum it is then. I'll refactor it by the end of this week.
c0e6d2b to 9562d22
@Xuanwo, @liurenjie1024 ready for another round. As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway. If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well. Let me know what you think! |
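To make the hierarchy argument concrete, here is a rough sketch under simplified assumptions. All type and function names (`BoundSpec`, `SchemalessSpec`, `UnboundSpec`, `AnySpec`, `field_ids`, `field_ids_from_enum`) are hypothetical and only stand in for the real types. It shows that downward conversions are infallible, and that a function taking the enum needs a match plus an error path, while a function taking the specific variant does not.

```rust
// Hypothetical, simplified stand-ins; not the real iceberg-rust types.
#[derive(Debug, Clone, PartialEq)]
pub struct UnboundField {
    pub field_id: Option<i32>, // may be missing
    pub name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub field_id: i32, // always present
    pub name: String,
}

#[derive(Debug)]
pub struct UnboundSpec {
    pub fields: Vec<UnboundField>,
}

#[derive(Debug)]
pub struct SchemalessSpec {
    pub fields: Vec<Field>,
}

#[derive(Debug)]
pub struct BoundSpec {
    pub schema_id: i32, // stands in for the bound schema
    pub fields: Vec<Field>,
}

// Bound > Schemaless > Unbound: each step down drops a guarantee and cannot fail.
impl From<BoundSpec> for SchemalessSpec {
    fn from(spec: BoundSpec) -> Self {
        SchemalessSpec { fields: spec.fields }
    }
}

impl From<SchemalessSpec> for UnboundSpec {
    fn from(spec: SchemalessSpec) -> Self {
        let fields = spec
            .fields
            .into_iter()
            .map(|f| UnboundField { field_id: Some(f.field_id), name: f.name })
            .collect();
        UnboundSpec { fields }
    }
}

#[derive(Debug)]
pub enum AnySpec {
    Bound(BoundSpec),
    Schemaless(SchemalessSpec),
    Unbound(UnboundSpec),
}

/// Taking the enum: every caller pays for a match and an error path.
pub fn field_ids_from_enum(spec: &AnySpec) -> Result<Vec<i32>, String> {
    match spec {
        AnySpec::Bound(s) => Ok(s.fields.iter().map(|f| f.field_id).collect()),
        AnySpec::Schemaless(s) => Ok(s.fields.iter().map(|f| f.field_id).collect()),
        AnySpec::Unbound(_) => Err("unbound spec has no guaranteed field ids".to_string()),
    }
}

/// Taking the specific variant: the type already guarantees a field_id.
pub fn field_ids(spec: &SchemalessSpec) -> Vec<i32> {
    spec.fields.iter().map(|f| f.field_id).collect()
}
```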
Thanks @c-thiel for this great pr! Generally LGTM, and I just left some minor comments.
Also please merge with main branch to keep it updated.
Sorry I missed this comment. I agree that a trait/enum just for reducing duplication of logic is somewhat of an antipattern, and sometimes a little duplication is acceptable.
Thanks @c-thiel for this great pr!
pub fn partition_spec_by_id(&self, spec_id: i32) -> Option<&SchemalessPartitionSpecRef> {
    self.partition_specs.get(&spec_id)
}
I'm not convinced that we need another type, such as SchemalessPartitionSpec. Why shouldn't we return an UnboundPartitionSpec instead? Looking at the signatures:
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}
pub struct SchemalessPartitionSpec {
    /// Identifier for PartitionSpec
    spec_id: i32,
    /// Details of the partition spec
    fields: Vec<PartitionField>,
}
And the PartitionField and UnboundPartitionField are basically the same.
Introducing SchemalessPartitionSpec might be our way to avoid apache/iceberg#4563.
If we cannot find the field anymore, the best thing to do is to include the file in the query plan.
* SchemalessPartitionSpec * Traits -> Enum * Remove PartitionSpec enum * Address comments
Fixes #550
This PR is a result of the issue mentioned above, a Slack Discussion and the preparation for the new MetadataBuilder #587.
Let's first establish that PartitionSpec should have a bound schema because:
- [...] PartitionSpec to allow internal partition id handling (see @rdblue's comment in Slack).
- [...] Result.
If we agree on this, then we have a small problem with TableMetadata: it contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important field_id optional.
To solve this, I introduce a SchemalessPartitionSpec. Common functions between PartitionSpec and SchemalessPartitionSpec are made available via a common trait. While the trait is public and as such allows developing functions that take either variant as input, I am still exposing the most important attributes (i.e. fields) directly, so that the trait typically doesn't need to be imported.
For reference, Java solves this by force-binding potentially non-compatible schemas to a PartitionSpec:
- TableMetadataBuilder: https://github.com/apache/iceberg/blob/5a2c1c9c8dd8d61e08bbad713c24e6b726eee323/core/src/main/java/org/apache/iceberg/TableMetadata.java#L734
- TableMetadataParser: https://github.com/apache/iceberg/blob/5a2c1c9c8dd8d61e08bbad713c24e6b726eee323/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L393
Happy to hear your opinions! This really is a bit of a fiddly topic. I don't like introducing a new type; on the other hand, using the API feels more natural to me now. This can also be seen in the tests, where we had a significant number of places where we passed a PartitionSpec around alongside its Schema, just to call partition_type eventually.
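As a rough illustration of the idea, the sketch below uses heavily simplified, hypothetical types: `Schema` is just a map from column id to type name, `bind` is an assumed method name, and `partition_type` assumes identity transforms so that each partition field has the type of its source column. None of this mirrors the actual signatures in this PR. It shows a historic spec that fails to bind against the current schema, and a bound spec that can compute its partition type without a schema argument.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins; not the real iceberg-rust types.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    /// Source column id -> type name.
    pub columns: HashMap<i32, String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct PartitionField {
    pub source_id: i32,
    pub field_id: i32,
    pub name: String,
}

/// Has a mandatory field_id per field, but no schema attached.
#[derive(Debug, Clone, PartialEq)]
pub struct SchemalessPartitionSpec {
    pub spec_id: i32,
    pub fields: Vec<PartitionField>,
}

/// Owns the schema it was validated against.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionSpec {
    schema: Schema,
    spec_id: i32,
    fields: Vec<PartitionField>,
}

impl SchemalessPartitionSpec {
    /// Binding is fallible: every source column must exist in the schema.
    pub fn bind(self, schema: Schema) -> Result<PartitionSpec, String> {
        for f in &self.fields {
            if !schema.columns.contains_key(&f.source_id) {
                return Err(format!(
                    "source column {} of partition field '{}' is not in the schema",
                    f.source_id, f.name
                ));
            }
        }
        Ok(PartitionSpec { schema, spec_id: self.spec_id, fields: self.fields })
    }
}

impl PartitionSpec {
    pub fn spec_id(&self) -> i32 {
        self.spec_id
    }

    /// No schema argument needed. Assumes identity transforms, so each
    /// partition field simply takes the type of its source column.
    pub fn partition_type(&self) -> Vec<(String, String)> {
        self.fields
            .iter()
            .map(|f| (f.name.clone(), self.schema.columns[&f.source_id].clone()))
            .collect()
    }
}
```

In this sketch a table's metadata could keep historic specs as `SchemalessPartitionSpec` and only hold a bound `PartitionSpec` for the default spec, which is the split described above.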