
Move MAKE_MAP to ExprPlanner #11452

Merged
merged 11 commits on Jul 19, 2024

Conversation

goldmedal
Contributor

Which issue does this PR close?

Partially solves #11434

Rationale for this change

The benchmark result:

Gnuplot not found, using plotters backend
make_map_1000           time:   [234.07 µs 237.23 µs 240.86 µs]
                        change: [-91.706% -91.396% -91.136%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

It's much faster than the previous implementation in #11361. Although the benchmark doesn't invoke the function, it does exercise the bottleneck of the original scalar function: aggregating the keys and values into arrays.
Thanks to @jayzhan211 for the nice suggestion.

What changes are included in this PR?

Remove the scalar function make_map and plan it in ExprPlanner instead.
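A minimal sketch of what the planner-side rewrite looks like. The types and names here (Expr, plan_make_map) are stand-ins for illustration only, not the real datafusion_expr API: at planning time, the interleaved MAKE_MAP argument list is split into keys and values and rewritten into a call to the map UDF over two make_array calls.

```rust
// Stand-in expression type; the real Expr lives in datafusion_expr.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(String),
    ScalarFunction { name: String, args: Vec<Expr> },
}

// MAKE_MAP('k1', v1, 'k2', v2, ...) is rewritten at planning time into
// map(make_array(keys), make_array(values)) instead of being a UDF itself.
fn plan_make_map(args: Vec<Expr>) -> Result<Expr, String> {
    if args.is_empty() || args.len() % 2 != 0 {
        return Err("make_map requires an even number of arguments".to_string());
    }
    // Even indices are keys, odd indices are values; partition keeps order.
    let (keys, values): (Vec<_>, Vec<_>) =
        args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
    let call = |name: &str, args: Vec<(usize, Expr)>| Expr::ScalarFunction {
        name: name.to_string(),
        args: args.into_iter().map(|(_, e)| e).collect(),
    };
    Ok(Expr::ScalarFunction {
        name: "map".to_string(),
        args: vec![call("make_array", keys), call("make_array", values)],
    })
}
```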

Are these changes tested?

yes

Are there any user-facing changes?

no

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Jul 13, 2024
Comment on lines +134 to +137
query ?
SELECT MAKE_MAP('POST', 41, 'HEAD', 'ab', 'PATCH', 30);
----
{POST: 41, HEAD: ab, PATCH: 30}
Contributor Author

I expected the query to fail because similar behavior isn't allowed in other databases (e.g. DuckDB). However, it seems make_array coerces the values to find a suitable common type. In this case, all of them are converted to Utf8.


> select make_array(1,'a',3);
+-----------------------------------------+
| make_array(Int64(1),Utf8("a"),Int64(3)) |
+-----------------------------------------+
| [1, a, 3]                               |
+-----------------------------------------+
1 row(s) fetched. 
Elapsed 0.004 seconds.

> select arrow_typeof(make_array(1,'a',3));
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Utf8("a"),Int64(3)))                                                           |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

I think if DataFusion allows this type of coercion for make_array, we can allow it for make_map too.

Contributor

I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.

Contributor Author

I see. Maybe I can create a scalar function make_array_strict that doesn't implement the coerce_types method of ScalarUDFImpl but is otherwise the same as make_array.
WDYT?

Contributor

I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.

Instead, can we pass a boolean arg should_coercion with a default value of false to control such behaviour?

Contributor

I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.

Instead, can we pass a boolean arg should_coercion with a default value of false to control such behaviour?

The coercion logic doesn't simply work like an if-else statement. make_array_inner doesn't care about coercion; coercion happens in the type_coercion pass in the analyzer.

Contributor Author

Agreed. That's why I planned to implement another scalar function for it.

Contributor Author

I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.

I did more tests on DuckDB's behavior and found something interesting: it also tries to coerce types when building arrays or maps.
Here are my notes on the behaviors:

How DuckDB build a map

It seems that DuckDB also transforms the input into two lists and calls the map function, just like my first design using make_array.

D select map {1:102, 2:20};
┌───────────────────────────────────────────────────────────┐
│ main.map(main.list_value(1, 2), main.list_value(102, 20)) │
│                   map(integer, integer)                   │
├───────────────────────────────────────────────────────────┤
│ {1=102, 2=20}                                             │
└───────────────────────────────────────────────────────────┘

How DuckDB and DataFusion coerce array types

DuckDB

  • Array constructed from INT32 and numeric string: DuckDB makes it INTEGER[].
D select array[1,2,'3'];
┌────────────────────┐
│ (ARRAY[1, 2, '3']) │
│      int32[]       │
├────────────────────┤
│ [1, 2, 3]          │
└────────────────────┘
D select typeof(array[1,2,'3']);
┌────────────────────────────┐
│ typeof((ARRAY[1, 2, '3'])) │
│          varchar           │
├────────────────────────────┤
│ INTEGER[]                  │
└────────────────────────────┘
  • Array constructed from INT32 and non-numeric string: DuckDB can't construct the array.
D select array[1,2,'a'];
Conversion Error: Could not convert the string 'a' to INT32
LINE 1: select array[1,2,'a'];

DataFusion

  • Array constructed from INT32 and numeric string: DataFusion makes it a Utf8 array.
> select [1,2,'1'];
+-----------------------------------------+
| make_array(Int64(1),Int64(2),Utf8("1")) |
+-----------------------------------------+
| [1, 2, 1]                               |
+-----------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.
  
> select arrow_typeof([1,2,'1']);
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Int64(2),Utf8("1")))                                                           |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.
  • Array constructed from INT32 and non-numeric string: DataFusion makes it a Utf8 array.
> select [1,2,'a'];
+-----------------------------------------+
| make_array(Int64(1),Int64(2),Utf8("a")) |
+-----------------------------------------+
| [1, 2, a]                               |
+-----------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> select arrow_typeof([1,2,'a']);
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Int64(2),Utf8("a")))                                                           |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

The type coercion behavior between INT32 and String is quite different between the two systems.

How DuckDB coerces map types

  • INT32 value and numeric string value: we can see the value '20' has been converted to 20.
D select map {1:10, 2:'20'};
┌────────────────────────────────────────────────────────────┐
│ main.map(main.list_value(1, 2), main.list_value(10, '20')) │
│                   map(integer, integer)                    │
├────────────────────────────────────────────────────────────┤
│ {1=10, 2=20}                                               │
└────────────────────────────────────────────────────────────┘
  • INT32 value and non-numeric string value. (This is what I tried the first time; that's why I thought it shouldn't be allowed.)
D select map {1:10, 2:'abc'};
Conversion Error: Could not convert string 'abc' to INT32
LINE 1: select map {1:10, 2:'abc'};
                            ^

Conclusion

Given these behaviors, I think we can just go back to using make_array to implement this. Because the type coercion behavior differs, our make_map can allow map {1:10, 2:'a'} while DuckDB can't. That makes sense to me.
@jayzhan211 WDYT?

Contributor
@jayzhan211 jayzhan211 Jul 18, 2024

Alright, so the behaviour actually depends on the array itself.
I think we can use make_array in this case.

But, if we want to introduce nice dataframe API map(keys: Vec<Expr>, values: Vec<Expr>), I think we still need to pass Vec<Expr> instead of the result of make_array. However, we can introduce that in another PR.

current API expects map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])])
a slightly better API is map(vec![lit("a"), lit("b")], vec![lit(1), lit(2)])

Contributor Author

Alright, so the behaviour actually depends on the array itself. I think we can use make_array in this case.

Ok, I'll roll back to make_array first.

current API expects map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])]) a slightly better API is map(vec![lit("a"), lit("b")], vec![lit(1), lit(2)])

I'm not very familiar with the dataframe implementation. Out of curiosity, does the dataframe API also use the map UDF? I think the UDF is a logical-layer function, but we don't have a corresponding logical expression for vec! other than make_array.

Contributor

The dataframe API is used for building Expr.

map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])]) is actually like Expr::ScalarFunction(map_udf(), args: ...).

The idea is something like

fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    // concatenate keys then values into a single argument list
    let args: Vec<Expr> = keys.into_iter().chain(values).collect();
    Expr::ScalarFunction(map_udf(), args)
}

Comment on lines 112 to 113
let keys = make_array(keys);
let values = make_array(values);
Contributor Author

I want to invoke make_array to do the aggregation. That's why I put the implementation in functions-array.

Contributor

Ideally I think this should be implemented in functions inside core.
Is there any downside to adding functions-array as a dependency of functions?

Contributor

Maybe we could move make_array to the functions core feature?

Contributor

Is there any downside to adding functions-array as a dependency of functions?

Then you'd need to import the unnecessary array function crate if you only care about functions.

Contributor

I guess we can reuse make_array_inner if we move make_array to the functions crate.

The alternative is to keep the code here in functions-array

Contributor Author
@goldmedal goldmedal Jul 15, 2024

Yes, I think moving make_array to functions is a good idea. It would be beneficial for many scenarios.

Contributor Author
@goldmedal goldmedal Jul 15, 2024

Hmm, okay. After some research, I believe it's not easy to move make_array to functions. It's tied to methods in utils.rs and macro.rs, and moving all the required methods to functions could make the codebase chaotic. For now, I prefer to keep them in functions-array. We can do it in another PR.

Comment on lines -85 to -87
make_map,
"Returns a map created from the given keys and values pairs. This function isn't efficient for large maps. Use the `map` function instead.",
args,
Contributor Author

I'm not sure where we can put this doc. Maybe we can do it as part of #11435.

Contributor

Agreed. We can document this function in https://datafusion.apache.org/user-guide/sql/scalar_functions.html

return exec_err!("make_map requires an even number of arguments");
}

let (keys, values): (Vec<_>, Vec<_>) = args
Contributor

It is possible to avoid the clone:

Suggested change
let (keys, values): (Vec<_>, Vec<_>) = args
let (keys, values): (Vec<_>, Vec<_>) = args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
let keys = make_array(keys.into_iter().map(|(_, expr)| expr).collect());
let values = make_array(values.into_iter().map(|(_, expr)| expr).collect());
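As a quick sanity check that the suggested partition works: Iterator::partition preserves the left-to-right order of elements, so the keys and values stay aligned. Here is a self-contained sketch (plain Rust with strings standing in for Expr; split_args is an illustrative name, not DataFusion code):

```rust
// Partition the interleaved MAKE_MAP argument list into keys (even indices)
// and values (odd indices) without cloning, as in the suggestion above.
fn split_args(args: Vec<String>) -> (Vec<String>, Vec<String>) {
    let (keys, values): (Vec<_>, Vec<_>) =
        args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
    (
        keys.into_iter().map(|(_, e)| e).collect(),
        values.into_iter().map(|(_, e)| e).collect(),
    )
}
```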

@@ -131,6 +138,77 @@ impl ScalarUDFImpl for MakeArray {
}
}

#[derive(Debug)]
pub struct MakeArrayStrict {
Contributor

Can we just add a function that converts keys and values to a list of exprs instead of introducing another UDF?

Contributor

These functions are public and could be used in datafusion-cli or other projects. Since we are just converting keys to an array, we only need an internal private function for this.

Contributor

I think the high level idea is that

SELECT MAKE_MAP('POST', 41, 'PAST', 33,'PATCH', 30)
We arrange args to ['POST', 'PAST', 'PATCH'], [41, 33, 30], and call
MAP(['POST', 'PAST', 'PATCH'], [41, 33, 30])

I just noticed that we can't directly pass these two arrays to MapFunc 😕

Contributor

I think we could figure out how to build this with the dataframe API: map(keys, values).

Current function is like

pub fn map($($arg: datafusion_expr::Expr),*) -> datafusion_expr::Expr {
  super::$FUNC().call(vec![$($arg),*])
}

Expected

pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    ...
}

Contributor
@jayzhan211 jayzhan211 Jul 16, 2024

For this PR, we can just call make_array_inner instead of make_array_strict; we can deal with the rest in another PR.

Contributor

I still think we should find a way to avoid make_array_strict 🤔

Contributor
@jayzhan211 jayzhan211 Jul 18, 2024

We can change MapFunc first and let it take arguments as a Vec<Expr>: the first half is keys, the other half is values.

Contributor

I played around to make sure the suggestion makes sense: #11526

Contributor Author

Thanks! I will check it tonight.

Contributor Author

I have some concerns about it. If we make MapFunc accept a single array, it would be used like

SELECT map([1,2,3,'a','b','c'])

After planning, the input array would be ['1','2','3','a','b','c'] because of type coercion on array elements. I think that behavior is wrong. If we change the signature of MapFunc, we might need another implementation to solve it.

Contributor
@jayzhan211 jayzhan211 left a comment

👍

@jayzhan211
Contributor

Thanks @goldmedal. I will file an issue about the map API.

@jayzhan211 jayzhan211 merged commit 5f0dfbb into apache:main Jul 19, 2024
25 checks passed
@goldmedal goldmedal deleted the feature/moving-make-map-to-expr branch July 19, 2024 09:39
@goldmedal
Contributor Author

Thanks @jayzhan211 and @dharanad for reviewing

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
* move make_map to ExprPlanner

* add benchmark for make_map

* remove todo comment

* update lock

* refactor plan_make_map

* implement make_array_strict for type checking strictly

* fix planner provider

* roll back to `make_array`

* update lock
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
* move make_map to ExprPlanner

* add benchmark for make_map

* remove todo comment

* update lock

* refactor plan_make_map

* implement make_array_strict for type checking strictly

* fix planner provider

* roll back to `make_array`

* update lock