Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup type coercion (#3419) #6778

Merged
merged 4 commits into from
Jun 28, 2023
Merged

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Part of #3419

Rationale for this change

This is the first of what will be a series of PRs to consolidate the type coercion logic and make it easier to follow. In particular this PR adds the ability to express type coercion of heterogeneous types, as occurs in arithmetic. This in turn feeds into the effort to move more of this logic upstream apache/arrow-rs#3999

What changes are included in this PR?

Replaces coerce_types with get_input_types that returns separate types for the left and right expressions. It also consolidates the coercion logic in a single private signature function to avoid inconsistency and reduce repetition.

Are these changes tested?

Are there any user-facing changes?

@tustvold tustvold added the api change Changes the API exposed to users of the crate label Jun 27, 2023
@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions labels Jun 27, 2023
Comment on lines +2143 to 2151
// utf8 = utf8
col("c1").eq(col("c1")),
// u8 AND u8
col("c3").and(col("c3")),
// utf8 = bool
col("c1").eq(bool_expr.clone()),
// u32 AND bool
col("c2").and(bool_expr),
col("c3").bitand(col("c3")),
// utf8 = u8
col("c1").eq(col("c3")),
// bool AND bool
bool_expr.clone().and(bool_expr),
];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure what this test was supposed to be testing, but a lot of these expression are ill-formed and now return an error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps @izveigor remembers 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with this test, but it really seems redundant.

Operator::Multiply |
Operator::Divide|
Operator::Modulo => {
// TODO: this logic would be easier to follow if the functions were inlined
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to do this a follow up PR, but wanted to avoid conflating too many things in one PR

// date - date
(Date32, Date32) => Some(Interval(DayTime)),
(Date64, Date64) => Some(Interval(MonthDayNano)),
(Date32, Date64) | (Date64, Date32) => Some(Interval(MonthDayNano)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not actually supported, so I'm not sure why it was here

Comment on lines -74 to -83
// date - timestamp, timestamp - date
(Date32, Timestamp(_, _))
| (Timestamp(_, _), Date32)
| (Date64, Timestamp(_, _))
| (Timestamp(_, _), Date64) => {
// TODO: make get_result_type must after coerce type.
// if type isn't coerced, we need get common type, and then get result type.
let common_type = temporal_coercion(lhs_type, rhs_type);
common_type.and_then(|t| mathematics_temporal_result_type(&t, &t))
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formulation using signature avoids this, this function is only fed types after coercion

}

/// Coercion rules for mathematics operators between decimal and non-decimal types.
pub fn math_decimal_coercion(
lhs_type: &DataType,
rhs_type: &DataType,
) -> (Option<DataType>, Option<DataType>) {
) -> Option<(DataType, DataType)> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this method to be more consistent with the signature of other coercion functions, i.e. returning None to indicate a coercion failure, instead of don't coerce

@@ -743,9 +714,6 @@ fn temporal_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataTyp
use arrow::datatypes::IntervalUnit::*;
use arrow::datatypes::TimeUnit::*;

if lhs_type == rhs_type {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was incorrect if the types are not temporal, rather than adding a check, I just removed it as we explicitly test the types below

.to_string();
assert_contains!(&err, "Date32 + Date64 can't be evaluated because there isn't a common type to coerce the types to");
let (lhs, rhs) =
get_input_types(&DataType::Date32, &Operator::Plus, &DataType::Date64)
Copy link
Contributor Author

@tustvold tustvold Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh I think it was a bug that this wasn't supported, there is no good reason that I can see as to why we wouldn't coerce to the larger representation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At one point I think the arrow cast kernels didn't support the coercion so perhaps that is why it was not done automatically

op,
ref right,
}) => {
// this is a workaround for https://github.com/apache/arrow-datafusion/issues/3419
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -499,6 +492,7 @@ pub fn coercion_decimal_mathematics_type(
left_decimal_type: &DataType,
right_decimal_type: &DataType,
) -> Option<DataType> {
// TODO: Move this logic into kernel implementations
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intend to this as a follow on, it is peculiar that BinaryExpr is performing a precision/scale coercion prior to calling kernels, which then in the case of divide then do their own scale coercion. The kernels should just be smart enough to perform the arithmetic correctly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @viirya

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, if the coercion rule that actually comes from Spark is considered to be general enough for all downstreams.

@tustvold tustvold marked this pull request as draft June 27, 2023 14:40
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 27, 2023
@@ -432,11 +432,15 @@ select '1 month'::interval + '1980-01-01T12:00:00'::timestamp;

# Exected error: interval (scalar) - date / timestamp (scalar)

query error DataFusion error: type_coercion\ncaused by\nError during planning: Interval\(MonthDayNano\) \- Date32 can't be evaluated because there isn't a common type to coerce the types to
query D
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this work wasn't actually intentional, but I don't see a reason why they shouldn't work

@@ -38,14 +173,6 @@ fn mathematics_temporal_result_type(
use arrow::datatypes::IntervalUnit::*;
use arrow::datatypes::TimeUnit::*;

if !is_interval(lhs_type)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is unnecessary given the match block that follows

match (lhs_type, rhs_type) {
// interval +/-
(Interval(_), Interval(_)) => Some(Interval(MonthDayNano)),
(Date64, Date32) | (Date32, Date64) => Some(Date64),
(Utf8, Date32) | (Date32, Utf8) => Some(Date32),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was split out into a separate string_temporal_coercion as it cannot be used for arithmetic

}
_ => (None, None),
// Unlike with comparison we don't coerce to a decimal in the case of floating point
// numbers, instead falling back to floating point arithmetic instead
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this behaviour was intentional or accidental, but I opted to preserve it

@tustvold tustvold marked this pull request as ready for review June 27, 2023 16:33
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👨‍🍳 👌 -- this is a very nice cleanup -- thank you very much @tustvold

I had some stylistic nit-picks but nothing that should prevent this from merging in my mind.

Thank you for cleaning this up

cc @jackwener and @liukun4515 as I think this is a topic you have previously expressed interest in as well

Comment on lines +2143 to 2151
// utf8 = utf8
col("c1").eq(col("c1")),
// u8 AND u8
col("c3").and(col("c3")),
// utf8 = bool
col("c1").eq(bool_expr.clone()),
// u32 AND bool
col("c2").and(bool_expr),
col("c3").bitand(col("c3")),
// utf8 = u8
col("c1").eq(col("c3")),
// bool AND bool
bool_expr.clone().and(bool_expr),
];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps @izveigor remembers 🤔

@@ -432,11 +432,15 @@ select '1 month'::interval + '1980-01-01T12:00:00'::timestamp;

# Exected error: interval (scalar) - date / timestamp (scalar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Exected error: interval (scalar) - date / timestamp (scalar)

op,
ref right,
}) => {
// this is a workaround for https://github.com/apache/arrow-datafusion/issues/3419
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}
Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
let lhs = left.get_type(&self.schema)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the names left_type and right_type I think would be more consistent with the rest of the code in this file and more readable. The convention in the rest of the code seems to be lhs and rhs refer to the left and right argument Expr themselves

The same comment applies to the other changes in this file too

use crate::Operator;

/// The type signature of an instantiation of binary expression
struct Signature {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can use a different name for this struture as there is already a Signature in https://docs.rs/datafusion-expr/26.0.0/datafusion_expr/struct.Signature.html

Perhaps OpSignature or BinarySignature ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is module private I personally think this is fine, I generally dislike adding additional qualifiers to types, we can always change down the line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I really dislike is overloading names of things that are different -- my feeble mind can't mentally alias / keep the two separate

}
}

/// Returns a [`Signature`] for applying `op` to arguments of type `lhs` and `rhs`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ This is perfect. Thank you

/// Returns the output type of applying numeric operations such as `=`
/// to arguments `lhs_type` and `rhs_type` if both are numeric
/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of a comparison operation
/// where one is numeric and one is `Utf8`/`LargeUtf8`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// where one is numeric and one is `Utf8`/`LargeUtf8`.
/// where one is temporal and one is `Utf8`/`LargeUtf8`.

@@ -499,6 +492,7 @@ pub fn coercion_decimal_mathematics_type(
left_decimal_type: &DataType,
right_decimal_type: &DataType,
) -> Option<DataType> {
// TODO: Move this logic into kernel implementations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @viirya

assert_eq!(e, "Float32 + Utf8 can't be evaluated because there isn't a common type to coerce the types to");
assert_eq!(
e,
"Cannot coerce arithmetic expression Float32 + Utf8 to valid types"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new messages are easier to understand 👍

.to_string();
assert_contains!(&err, "Date32 + Date64 can't be evaluated because there isn't a common type to coerce the types to");
let (lhs, rhs) =
get_input_types(&DataType::Date32, &Operator::Plus, &DataType::Date64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At one point I think the arrow cast kernels didn't support the coercion so perhaps that is why it was not done automatically

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants