Move MAKE_MAP to ExprPlanner #11452
Conversation
query ?
SELECT MAKE_MAP('POST', 41, 'HEAD', 'ab', 'PATCH', 30);
----
{POST: 41, HEAD: ab, PATCH: 30}
I expected the query would fail because similar behavior isn't allowed in other databases (e.g. DuckDB). However, it seems `make_array` will coerce the values to find a suitable common type for them. In this case, all of them are converted to `Utf8`.
> select make_array(1,'a',3);
+-----------------------------------------+
| make_array(Int64(1),Utf8("a"),Int64(3)) |
+-----------------------------------------+
| [1, a, 3] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
> select arrow_typeof(make_array(1,'a',3));
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Utf8("a"),Int64(3))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
I think if DataFusion allows this type of coercion for `make_array`, we can allow it for `make_map` too.
I think we need another `make_array` that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
I see. Maybe I can create a scalar function `make_array_strict` that won't implement the `coerce_types` method of `ScalarUDFImpl`, but is otherwise the same as `make_array`.
WDYT?
> I think we need another `make_array` that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
Instead, can we pass a boolean arg `should_coercion`, with a default value of false, to control such behaviour?
> I think we need another `make_array` that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
> Instead, can we pass a boolean arg `should_coercion`, with a default value of false, to control such behaviour?
The coercion logic doesn't simply work like an if-else statement. `make_array_inner` doesn't care about coercion; the coercion happens in the `type_coercion` pass in the analyzer.
Agreed. That's why I planned to implement another scalar function for it.
> I think we need another `make_array` that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
I did more tests on DuckDB's behavior and found something interesting: it also tries to coerce types when building arrays or maps. I arranged some notes on the behaviors:
How DuckDB builds a map
It seems that DuckDB also transforms the input into two lists and calls the map function, just like my first design using `make_array`.
D select map {1:102, 2:20};
┌───────────────────────────────────────────────────────────┐
│ main.map(main.list_value(1, 2), main.list_value(102, 20)) │
│ map(integer, integer) │
├───────────────────────────────────────────────────────────┤
│ {1=102, 2=20} │
└───────────────────────────────────────────────────────────┘
How DuckDB and DataFusion coerce array types
DuckDB
- Array constructed from INT32 and a numeric string: DuckDB will make it `INTEGER[]`.
D select array[1,2,'3'];
┌────────────────────┐
│ (ARRAY[1, 2, '3']) │
│ int32[] │
├────────────────────┤
│ [1, 2, 3] │
└────────────────────┘
D select typeof(array[1,2,'3']);
┌────────────────────────────┐
│ typeof((ARRAY[1, 2, '3'])) │
│ varchar │
├────────────────────────────┤
│ INTEGER[] │
└────────────────────────────┘
- Array constructed from INT32 and a non-numeric string: DuckDB can't construct the array.
D select array[1,2,'a'];
Conversion Error: Could not convert the string 'a' to INT32
LINE 1: select array[1,2,'a'];
DataFusion
- Array constructed from INT32 and a numeric string: DataFusion will make it a `Utf8` array.
> select [1,2,'1'];
+-----------------------------------------+
| make_array(Int64(1),Int64(2),Utf8("1")) |
+-----------------------------------------+
| [1, 2, 1] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select arrow_typeof([1,2,'1']);
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Int64(2),Utf8("1"))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
- Array constructed from INT32 and a non-numeric string: DataFusion will make it a `Utf8` array.
> select [1,2,'a'];
+-----------------------------------------+
| make_array(Int64(1),Int64(2),Utf8("a")) |
+-----------------------------------------+
| [1, 2, a] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select arrow_typeof([1,2,'a']);
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Int64(2),Utf8("a"))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
The type coercion behavior between INT32 and String is quite different between the two systems.
How DuckDB coerces map types
- INT32 value and numeric string value: we can see that the value `'20'` has been converted to `20`.
D select map {1:10, 2:'20'};
┌────────────────────────────────────────────────────────────┐
│ main.map(main.list_value(1, 2), main.list_value(10, '20')) │
│ map(integer, integer) │
├────────────────────────────────────────────────────────────┤
│ {1=10, 2=20} │
└────────────────────────────────────────────────────────────┘
- INT32 value and non-numeric string value: the conversion fails. (This is what I tried the first time, which is why I thought it shouldn't be allowed.)
D select map {1:10, 2:'abc'};
Conversion Error: Could not convert string 'abc' to INT32
LINE 1: select map {1:10, 2:'abc'};
^
Conclusion
Referring to these behaviors, I think we can just go back to using `make_array` to implement this. Because the type coercion behavior is different, our `make_map` can allow `map {1:10, 2:'a'}` while DuckDB can't. That makes sense to me.
@jayzhan211 WDYT?
Alright, so the behaviour actually depends on `array` itself. I think we can use `make_array` in this case.
But if we want to introduce a nice dataframe API `map(keys: Vec<Expr>, values: Vec<Expr>)`, I think we still need to pass `Vec<Expr>` instead of the result of `make_array`. However, we can introduce that in another PR.
The current API expects `map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])])`.
A slightly better API is `map(vec![lit("a"), lit("b")], vec![lit(1), lit(2)])`.
> Alright, so the behaviour actually depends on `array` itself. I think we can use `make_array` in this case.

Ok, I'll roll back to `make_array` first.
> The current API expects `map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])])`. A slightly better API is `map(vec![lit("a"), lit("b")], vec![lit(1), lit(2)])`.

I'm not very familiar with the dataframe implementation. Out of curiosity, does the dataframe API also use the `map` UDF? I think the UDF is a logical-layer function, but we don't have a corresponding logical expression for `vec!` other than `make_array`.
The dataframe API is used for building `Expr`.
`map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])])` is actually like `Expr::ScalarFunction(map_udf(), args: ...)`.
The idea is something like:
fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    // concatenate keys and values into one flat argument list
    let args: Vec<Expr> = keys.into_iter().chain(values).collect();
    // call the `map` UDF with the combined arguments
    Expr::ScalarFunction(ScalarFunction::new_udf(map_udf(), args))
}
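For example, with that signature a caller could build the map expression directly from key and value expressions, without wrapping them in `make_array` first. This is a hypothetical usage of the sketch above, assuming `datafusion_expr::lit` for the literals:

// Hypothetical usage of the proposed dataframe API sketched above.
let expr = map(
    vec![lit("a"), lit("b")], // keys
    vec![lit(1), lit(2)],     // values
);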
let keys = make_array(keys);
let values = make_array(values);
I want to invoke `make_array` to do the aggregation. That's why I put the implementation in `functions-array`.
Ideally I think this should be implemented in `functions` inside `core`.
Do we have any downside to adding `functions-array` as a dependency of `functions`?
Maybe we could move `make_array` to the `functions` core feature?
> Do we have any downside to adding `functions-array` as a dependency of `functions`?

Then you would need to import the array function crate unnecessarily if you only care about `functions`.
I guess we can reuse `make_array_inner` if we move `make_array` to the `functions` crate.
The alternative is to keep the code here in `functions-array`.
Yes, I think moving `make_array` to `functions` is a good idea. It would be beneficial for many scenarios.
Hmm, okay. After some research, I believe it's not easy to move `make_array` to `functions`. It's tied to methods in `utils.rs` and `macro.rs`, and moving all the required methods to `functions` could make the codebase chaotic. For now, I prefer to keep them in `functions-array`. We can do it in another PR.
make_map,
"Returns a map created from the given keys and values pairs. This function isn't efficient for large maps. Use the `map` function instead.",
args,
I'm not sure where we can put this doc. Maybe we can do it as part of #11435.
Agreed. We can document this function in https://datafusion.apache.org/user-guide/sql/scalar_functions.html
return exec_err!("make_map requires an even number of arguments");
}

let (keys, values): (Vec<_>, Vec<_>) = args
It is possible to avoid the clone:
let (keys, values): (Vec<_>, Vec<_>) = args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
let keys = make_array(keys.into_iter().map(|(_, expr)| expr).collect());
let values = make_array(values.into_iter().map(|(_, expr)| expr).collect());
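As a standalone illustration of that partition trick, here is a minimal sketch using plain strings instead of `Expr`s (the argument order follows `MAKE_MAP`: key, value, key, value, ...); this is not code from the PR, just an executable demonstration of the idea:

// Minimal sketch: split interleaved MAKE_MAP arguments into keys and values
// by partitioning on even/odd position.
fn split_key_value_pairs(args: Vec<&str>) -> (Vec<&str>, Vec<&str>) {
    let (keys, values): (Vec<_>, Vec<_>) = args
        .into_iter()
        .enumerate()
        .partition(|(i, _)| i % 2 == 0); // even indices are keys, odd indices are values
    (
        keys.into_iter().map(|(_, v)| v).collect(),
        values.into_iter().map(|(_, v)| v).collect(),
    )
}

fn main() {
    let (keys, values) = split_key_value_pairs(vec!["POST", "41", "HEAD", "ab", "PATCH", "30"]);
    assert_eq!(keys, vec!["POST", "HEAD", "PATCH"]);
    assert_eq!(values, vec!["41", "ab", "30"]);
}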
@@ -131,6 +138,77 @@ impl ScalarUDFImpl for MakeArray {
}
}

#[derive(Debug)]
pub struct MakeArrayStrict {
Can we just add a function that converts keys and values to a list of exprs, instead of introducing another UDF?
This function is a public function that could be used in datafusion-cli or other projects. Since we are just converting keys to an array, we only need an internal private function for this.
I think the high-level idea is that for
SELECT MAKE_MAP('POST', 41, 'PAST', 33, 'PATCH', 30)
we arrange the args into ['POST', 'PAST', 'PATCH'] and [41, 33, 30], and call
MAP(['POST', 'PAST', 'PATCH'], [41, 33, 30])
I just noticed that we can't directly pass these two arrays to MapFunc 😕
I think we could figure out how to build this with the dataframe API, `map(keys, values)`.
The current function is like:
pub fn map($($arg: datafusion_expr::Expr),*) -> datafusion_expr::Expr {
super::$FUNC().call(vec![$($arg),*])
}
Expected:
pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    ...
}
For this PR, we can just call `make_array_inner` instead of `make_array_strict`; we could deal with the rest in another PR.
I still think we should find a way to avoid `make_array_strict` 🤔
We can change `MapFunc` first, letting it take its arguments as a single `Vec<Expr>`: the first half is the keys, the other half is the values.
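A minimal sketch of that splitting scheme (plain strings stand in for `Expr`; this illustrates the proposal and is not the actual `MapFunc` code):

// Sketch: treat the first half of a flat argument list as keys and the
// second half as values.
fn split_half<T>(mut args: Vec<T>) -> (Vec<T>, Vec<T>) {
    let mid = args.len() / 2;
    let values = args.split_off(mid); // args keeps [0, mid), values gets [mid, len)
    (args, values)
}

fn main() {
    let (keys, values) = split_half(vec!["POST", "PATCH", "41", "30"]);
    assert_eq!(keys, vec!["POST", "PATCH"]);
    assert_eq!(values, vec!["41", "30"]);
}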
I played around with it to make sure the suggestion makes sense: #11526
Thanks! I will check it tonight.
I have some concerns about it. If we make `MapFunc` accept one array, it would be used like
SELECT map([1,2,3,'a','b','c'])
After planning, the input array would be ['1','2','3','a','b','c'] because of the type coercion for array elements. I think that behavior is wrong. If we change the signature of `MapFunc`, we might need another implementation to solve it.
👍
Thanks @goldmedal. I will file an issue about the …
Thanks @jayzhan211 and @dharanad for reviewing.
* move make_map to ExprPlanner
* add benchmark for make_map
* remove todo comment
* update lock
* refactor plan_make_map
* implement make_array_strict for type checking strictly
* fix planner provider
* roll back to `make_array`
* update lock
Which issue does this PR close?
Partially solves #11434.
Rationale for this change
The benchmark result:
It's much faster than the previous implementation (#11361). Although the benchmark doesn't invoke the function, it covers the bottleneck of the original scalar function: aggregating the keys and values.
Thanks to @jayzhan211 for the nice suggestion.
What changes are included in this PR?
Remove the scalar function `make_map` and plan it in `ExprPlanner` instead.
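As a rough, generic sketch of what that planning step does (illustrative only: `E` stands in for DataFusion's `Expr`, `rewrite_make_map`, `build_array`, and `build_map` are hypothetical names standing in for the planner hook, the `make_array` builder, and the `map` UDF call; this is not the code added by the PR):

// Conceptual sketch: rewrite MAKE_MAP(k1, v1, k2, v2, ...) at planning time
// into map(make_array(keys), make_array(values)).
fn rewrite_make_map<E>(
    args: Vec<E>,
    build_array: impl Fn(Vec<E>) -> E, // stands in for the make_array expression builder
    build_map: impl Fn(E, E) -> E,     // stands in for a call to the map UDF
) -> Result<E, String> {
    if args.len() % 2 != 0 {
        return Err("make_map requires an even number of arguments".to_string());
    }
    // de-interleave the arguments into keys and values
    let mut keys = Vec::with_capacity(args.len() / 2);
    let mut values = Vec::with_capacity(args.len() / 2);
    let mut iter = args.into_iter();
    while let (Some(k), Some(v)) = (iter.next(), iter.next()) {
        keys.push(k);
        values.push(v);
    }
    // plan the call as map(make_array(keys), make_array(values))
    Ok(build_map(build_array(keys), build_array(values)))
}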
Are these changes tested?
yes
Are there any user-facing changes?
no