Implement ScalarFunction MAKE_MAP and MAP #11361
Conversation
I noticed that Arrow's map and struct allow duplicate keys, but I think this behavior is wrong.
Typically, a struct or map should not have duplicate keys (they would cause problems when reading the data). In DuckDB, the query fails if duplicate keys exist.
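As an aside, the duplicate-key rejection described here can be sketched outside Arrow/DataFusion with a plain hash-set check (illustrative only; `check_unique_keys` is a hypothetical helper name, not DataFusion code):

```rust
// Illustrative sketch only (not DataFusion code): reject duplicate map keys
// up front, the way DuckDB's MAP constructor does. `check_unique_keys` is a
// hypothetical helper name.
use std::collections::HashSet;

fn check_unique_keys(keys: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for k in keys {
        // `insert` returns false when the key was already present.
        if !seen.insert(*k) {
            return Err(format!("map key '{k}' is duplicated"));
        }
    }
    Ok(())
}

fn main() {
    assert!(check_unique_keys(&["k1", "k2", "k3"]).is_ok());
    assert!(check_unique_keys(&["k1", "k1"]).is_err());
}
```

A real implementation would run this validation while building the MapArray, before any offsets are committed.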
I'm unsure where we should handle this issue (DataFusion or Arrow). To solve it, I prefer to fix …
Thanks @jayzhan211. Sounds great.
I did some research. For map, the Arrow spec says the keys should be unique, and this is enforced by the application (i.e., DataFusion).
So my suggestion is:
I agree with @jayzhan211 that it seems less than ideal to have two functions rather than just one. However, I agree with @goldmedal that it isn't clear this is all that much better. It looks to me like duckdb has several functions to create maps. They support this kind of literal syntax:

SELECT MAP {'key1': 10, 'key2': 20, 'key3': 30};

as well as the parallel-lists form:

SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);

They also have a function that builds a map from a list of entries:

SELECT map_from_entries([('key1', 10), ('key2', 20), ('key3', 30)]);
Thank you @goldmedal -- this is a good start. I do think there are still several issues to work out before this feature would be usable by many people.
I think we have 2 options
- merge this PR as is and file tickets to iterate/improve on main (much as we did with the array / list implementations)
- Keep working on this PR
I would personally prefer option 1, as it has worked well for us in the past and allows development to proceed incrementally. What do you think @jayzhan211?
Some initial thoughts on things left to do:
- Document map/make_map in the documentation. We can do this as a follow-on PR: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct-functions
- Support arrays (not just scalars)
- Decide if we want to have make_map follow named_struct, or if it would be better to have something more like what duckdb does
- ...and probably more
datafusion/functions/src/core/map.rs
Outdated
use datafusion_expr::{ColumnarValue, ScalarUDFImpl, Signature, Volatility};

fn make_map(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    if args.is_empty() {
these conditions should have been checked by the planner, so it would probably be ok to panic here or return an internal error. A real error is ok too, but I suspect it would be impossible to actually hit
datafusion/functions/src/core/map.rs
Outdated
} else if chunk[1].data_type() != value_type {
    return exec_err!(
        "map requires all values to have the same type {}, got {} instead at position {}",
- "map requires all values to have the same type {}, got {} instead at position {}",
+ "make_map requires all values to have the same type {}, got {} instead at position {}",
datafusion/functions/src/core/map.rs
Outdated
if chunk[0].data_type().is_null() {
    return exec_err!("map key cannot be null");
}
if !chunk[1].data_type().is_null() {
I found this code to do null checking / coercion somewhat confusing. I would have expected that the planner had done the coercion once at plan time rather than doing it on all inputs. Perhaps you could implement coerce_types for the map function once:
datafusion/datafusion/expr/src/udf.rs
Lines 540 to 542 in 585504a
fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> {
    not_impl_err!("Function {} does not implement coerce_types", self.name())
}
The main purpose of this check is to decide the type of the value array so we can create the null array in L76. If the values are [null, null, 1, 2, 3], we will know the value type from the third element, so I think the null check is necessary. However, I agree the other check isn't necessary. Maybe I don't need to implement coerce_types, since I guess return_type has checked them during planning, right?
It seems that you are trying to replace null with the non-null type here. It is possible to coerce types earlier, before invoke(). You can use Signature::user_defined(Volatility::Immutable) with your own coerce_types to coerce null to non-null, and you can check whether all the value types are the same there as well. This function is called before return_type and invoke, so we don't need to deal with the null type afterwards. Note that although the null type is converted to another type like Int32, the array value is still null, i.e. something like Int32(Null).
fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
    // Find the first non-null value type; values sit at the odd positions
    // of the alternating key/value argument list.
    let mut dt = DataType::Int32;
    for (i, t) in arg_types.iter().enumerate() {
        if i % 2 == 1 && !t.is_null() {
            dt = t.clone();
        }
    }
    // Replace any null value type with that type; keep everything else as-is.
    let mut dts = vec![];
    for (i, t) in arg_types.iter().enumerate() {
        if i % 2 == 1 && t.is_null() {
            dts.push(dt.clone())
        } else {
            dts.push(t.clone())
        }
    }
    Ok(dts)
}
It looks nice. Many thanks!!
let key_type = &arg_types[0];
let mut value_type = &arg_types[1];

for (i, chunk) in arg_types.chunks_exact(2).enumerate() {
Same comment here about type coercion
query ?
SELECT MAP(arrow_cast(make_array('POST', 'HEAD', 'PATCH'), 'LargeList(Utf8)'), arrow_cast(make_array(41, 33, 30), 'LargeList(Int64)'));
----
{POST: 41, HEAD: 33, PATCH: 30}
I think if possible we should also add tests for array values (not just scalars)
I wrote up some examples, like this:
# test that maps can be created from arrays
statement ok
create table t as values
('a', 1, ['k1', 'k2'], [10.0, 20.0]),
('b', 2, ['k3'], [30.0]),
('d', 4, ['k5', 'k6'], [50.0, 60.0]);
query error
select make_map(column1, column2) from t;
----
DataFusion error: Internal error: UDF returned a different number of rows than expected. Expected: 3, Got: 1.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
query error
select map(column3, column4) from t;
----
DataFusion error: Execution error: Expected scalar, got ListArray
[
StringArray
[
"k1",
"k2",
],
StringArray
[
"k3",
],
StringArray
[
"k5",
"k6",
],
]
Oops. I think I misunderstood how ColumnarValue works. I expected it to be passed per row, but it receives all the rows at once.
[datafusion/functions/src/core/map.rs:59:5] &key = [
StringArray
[
"a",
"b",
"d",
],
]
[datafusion/functions/src/core/map.rs:60:5] &value = [
PrimitiveArray<Int64>
[
1,
2,
4,
],
]
I'll fix them. Thanks
I had some challenges in transforming the array values to the map array layout. If the given table and SQL are
statement ok
create table t as values
('a', 1, 'k1', 10),
('b', 2, 'k3', 30),
('d', 4, 'k5', 50);
query ?
SELECT make_map(column1, column2, column3, column4) FROM t;
----
{a: 1, k1: 10}
{b: 2, k3: 30}
{d: 4, k5: 50}
We'll get the keys like
[datafusion/functions/src/core/map.rs:37:5] &key = [
Array(
StringArray
[
"a",
"b",
"d",
],
),
Array(
StringArray
[
"k1",
"k3",
"k5",
],
),
]
I think I need to aggregate them into one array like ["a", "k1", "b", "k3", "d", "k5"], and then I can supply the offsets [0, 2, 4, 6] when building the MapArray. However, I have no good idea for this array aggregation yet, so currently I handle the array case by returning not_impl_err. Maybe we can solve it in a follow-up PR.
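For what it's worth, the interleave-and-offsets idea can be sketched with plain `Vec`s instead of Arrow arrays (a minimal sketch under stated assumptions; `flatten_keys` is a hypothetical helper name, and a real implementation would operate on Arrow `StringArray`s and build an `OffsetBuffer`):

```rust
// Minimal sketch with plain Vecs (a real implementation would operate on
// Arrow StringArrays and build an OffsetBuffer): interleave the per-argument
// key columns row by row into one flat buffer, and record the row offsets
// needed for a MapArray-style layout. `flatten_keys` is a hypothetical name.
fn flatten_keys(columns: &[Vec<&str>], num_rows: usize) -> (Vec<String>, Vec<usize>) {
    let mut flat = Vec::new();
    let mut offsets = vec![0];
    for row in 0..num_rows {
        for col in columns {
            flat.push(col[row].to_string());
        }
        // Each row's entries end at the current length of the flat buffer.
        offsets.push(flat.len());
    }
    (flat, offsets)
}

fn main() {
    // The two key columns from the `make_map(column1, _, column3, _)` example.
    let cols = vec![vec!["a", "b", "d"], vec!["k1", "k3", "k5"]];
    let (flat, offsets) = flatten_keys(&cols, 3);
    assert_eq!(flat, vec!["a", "k1", "b", "k3", "d", "k5"]);
    assert_eq!(offsets, vec![0, 2, 4, 6]);
}
```

The same pass would be applied to the value columns, after which the flat arrays plus the shared offsets describe one map per input row.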
I think we can have several functions in the frontend but one single function in the functions crate; with ExprPlanner, we can arrange the args for the single MapFunc. What the MapFunc's args should look like depends on what is more efficient for Arrow arrays.
Sure.
👍
Thanks
Thanks @alamb. I prefer to file new tickets for them.
I'll address the comments for the code today, and then file the following issues for it (maybe after merging?).
@@ -131,19 +131,19 @@ SELECT MAKE_MAP([1,2], ['a', 'b'], [3,4], ['b']);
----
{[1, 2]: [a, b], [3, 4]: [b]}

query error DataFusion error: Error during planning: Execution error: User-defined coercion failed with Execution\("map requires all values to have the same type Int64, got Utf8 instead at position 1"\)(.|\n)*
This pattern works in my local environment but doesn't work in CI. I'm not sure of the reason currently, so I just simplified the assertion.
I'm not certain whether there are any other tasks we need to complete before merging this. TODOs for the next PRs:
I think we can simplify the two map functions to one first.
I think the uniqueness of map key names is a TODO ticket, too.
As we discussed about supporting a Map literal in #11268 (comment), perhaps we can do it at the same time.
I'll merge this first, thanks @goldmedal and @alamb
Thanks @jayzhan211 @alamb. If needed, I can help file the follow-up issues tonight.
Thank you @goldmedal -- that would be awesome. I started collecting Map-related tickets on an epic -- perhaps you could add the tickets you file there as well: #11429
* tmp
* opt
* modify test
* add another version
* implement make_map function
* implement map function
* format and modify the doc
* add benchmark for map function
* add empty end-line
* fix cargo check
* update lock
* fix clippy
* fmt and clippy
* support FixedSizeList and LargeList
* check type and handle null array in coerce_types
* make array value throw todo error
* simplify the error tests
Which issue does this PR close?
Closes #11268.
Rationale for this change
The benchmark result:
What changes are included in this PR?
Implement two scalar functions for creating map values:
- MAKE_MAP (note: MAKE_MAP isn't efficient; we shouldn't use it to create a large map)
- MAP
Are these changes tested?
Yes.
Are there any user-facing changes?
Add two functions.