-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(function): add greatest function #12474
Conversation
This match the Spark implementation for greatest: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.greatest.html
FYI #12357 (comment) |
Related to #6531 |
Hey @alamb, we talked about the test location at the beginning of the CMU talk you gave, where the tests are located? |
I think the tests I was talking about are described here: If you want to test the type of a function you can do it like select arrow_typeif(my_func(foo))) |
I see, thank you, how is your experience with debugging this kind of test? |
It is great! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
|
||
let cmp = make_comparator(lhs, rhs, SORT_OPTIONS)?; | ||
|
||
let len = lhs.len().min(rhs.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they array lengths should be equal (otherwise we would be losing data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
// If both arrays are not nested, have the same length and no nulls, we can use the faster vectorised kernel | ||
// - If both arrays are not nested: Nested types, such as lists, are not supported as the null semantics are not well-defined. | ||
// - both array does not have any nulls: cmp::gt_eq will return null if any of the input is null while we want to return false in that case | ||
if !lhs.data_type().is_nested() && lhs.null_count() == 0 && rhs.null_count() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this probably should use logical null count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
// - If both arrays are not nested: Nested types, such as lists, are not supported as the null semantics are not well-defined. | ||
// - both array does not have any nulls: cmp::gt_eq will return null if any of the input is null while we want to return false in that case | ||
if !lhs.data_type().is_nested() && lhs.null_count() == 0 && rhs.null_count() == 0 { | ||
return cmp::gt_eq(&lhs, &rhs).map_err(|e| e.into()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a test with float NaN values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
…eatest # Conflicts: # datafusion/sqllogictest/test_files/functions.slt
Please add a document function for it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @rluvaton
for array in arrays_iter { | ||
largest = keep_larger(Arc::clone(array), largest)?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep_larger
constructs an intermediate array for each call. I'd prefer to reduce the materialization to just the last one. It's okay to do this in the follow-up PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do it in a separate PR, this PR is large enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried using interleave
like we talked but than I would not be able to use the kernels functions so I did not do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to add proto-related supports for this new function. But it is not a blocker for merging this PR from my perspective. Thank you @rluvaton
What is proto related support? |
Sorry, forget it 🙈 They are wiped out in #10173 |
Thank you @waynexia see you on Monday at your CMU talk, I'll join 10 minutes earlier so we can talk about data fusion if interested 😊 |
Can we merge this? |
Thank you again for working on this! |
Thank you, I'll try to contribute again! |
Which issue does this PR close?
Closes #12472
Rationale for this change
Support more operators
What changes are included in this PR?
Implement the Spark implementation for
greatest
:https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.greatest.html
Are these changes tested?
I was not able to find where you put the functions test, the
coalesce
for example has only one test for the return type, I must missing somethingdatafusion/datafusion/functions/src/core/coalesce.rs
Lines 142 to 157 in 7bd7747
Are there any user-facing changes?
Yes, I need to add documentation, but first check if this feature is desired