
Implement ScalarFunction MAKE_MAP and MAP #11361

Merged (21 commits) Jul 12, 2024

Conversation

@goldmedal (Contributor) commented Jul 9, 2024

Which issue does this PR close?

Closes #11268.

Rationale for this change

The benchmark result:

Gnuplot not found, using plotters backend
make_map_1000           time:   [2.6560 ms 2.7182 ms 2.8078 ms]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

map_1000                time:   [5.0431 µs 5.0783 µs 5.1168 µs]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

What changes are included in this PR?

Implement two scalar functions for creating map values:

  • MAKE_MAP
    • MAKE_MAP isn't efficient. We shouldn't use it to create a large map.
SELECT MAKE_MAP('a', 1, 'b', 2, 'c', 3)
----
{a:1, b:2, c:3}
  • MAP
SELECT MAP(['a', 'b', 'c'], [1, 2, 3])
----
{a:1, b:2, c:3}

Are these changes tested?

Yes.

Are there any user-facing changes?

Adds two functions: MAKE_MAP and MAP.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 9, 2024
@goldmedal (Contributor Author)

I noticed that Arrow's map and struct types allow duplicate keys, but I think this behavior is wrong.

> select named_struct('abc', 1, 'abc', 2);
+---------------------------------------------------------+
| named_struct(Utf8("abc"),Int64(1),Utf8("abc"),Int64(2)) |
+---------------------------------------------------------+
| {abc: 1, abc: 2}                                        |
+---------------------------------------------------------+
1 row(s) fetched. 

Typically, a struct or map should not have duplicate keys (they cause problems when retrieving data). In DuckDB, the query fails if duplicate keys exist.

D select map_from_entries([('a',null),('a',2)]);
Invalid Input Error: Map keys must be unique.

I'm unsure where we should handle this issue (DataFusion or Arrow). To solve it, I prefer to fix MapArray and StructArray on the Arrow side.
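If this check ends up living in DataFusion, it could be as simple as a set-based scan over the keys before building the map. A minimal hedged sketch (hypothetical, not the project's code; `check_unique_keys` is an invented name), mirroring DuckDB's "Map keys must be unique" error:

```rust
use std::collections::HashSet;

// Hypothetical sketch of the kind of uniqueness check a map constructor
// could run over its keys, matching DuckDB's behavior of rejecting
// duplicates instead of silently producing {abc: 1, abc: 2}.
fn check_unique_keys(keys: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for key in keys {
        // HashSet::insert returns false if the key was already present.
        if !seen.insert(*key) {
            return Err(format!("map keys must be unique, found duplicate '{key}'"));
        }
    }
    Ok(())
}

fn main() {
    assert!(check_unique_keys(&["a", "b", "c"]).is_ok());
    assert!(check_unique_keys(&["abc", "abc"]).is_err());
    println!("ok");
}
```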

@goldmedal goldmedal marked this pull request as ready for review July 9, 2024 18:51
@jayzhan211 (Contributor)

For make_map(k1, v1, k2, v2, ...), ideally we can rearrange the order of the args and call map([k1, k2, ...], [v1, v2, ...]). I think we can arrange it in ExprPlanner, so we can have only one function, map, instead of two.
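The rearrangement described here can be sketched without DataFusion's types. This is a hedged illustration only: `rearrange_make_map_args` and the generic `T` are stand-ins, not the real `ExprPlanner` API or `datafusion_expr::Expr`.

```rust
// Hedged sketch of the proposed planner-side rewrite: split the flat
// make_map(k1, v1, k2, v2, ...) argument list into the two parallel
// lists that map([k1, k2, ...], [v1, v2, ...]) expects.
fn rearrange_make_map_args<T: Clone>(args: &[T]) -> Result<(Vec<T>, Vec<T>), String> {
    if args.len() % 2 != 0 {
        return Err("make_map requires an even number of arguments".to_string());
    }
    let mut keys = Vec::with_capacity(args.len() / 2);
    let mut values = Vec::with_capacity(args.len() / 2);
    // Each chunk of two is one (key, value) pair.
    for pair in args.chunks_exact(2) {
        keys.push(pair[0].clone());
        values.push(pair[1].clone());
    }
    Ok((keys, values))
}

fn main() {
    let (keys, values) = rearrange_make_map_args(&["a", "1", "b", "2"]).unwrap();
    assert_eq!(keys, vec!["a", "b"]);
    assert_eq!(values, vec!["1", "2"]);
    assert!(rearrange_make_map_args(&["a"]).is_err());
    println!("ok");
}
```

Doing this once at planning time also avoids repeating the rearrangement for every batch at execution time.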

@goldmedal (Contributor Author)

> For make_map(k1, v1, k2, v2...), ideally we can arrange the order of args and call map([k1, k2..], [v1, v2]). I think we can arrange it in ExprPlanner, so we can have only one function map instead of two.

Thanks @jayzhan211, sounds great.
Out of curiosity, what's the benefit of moving this to ExprPlanner? I'm not sure, but I think it just moves the cost of aggregating column values into ExprPlanner, right? Or is it possible to get better performance?

@alamb (Contributor) commented Jul 10, 2024

> I noticed the map and struct of Arrow allow duplicate keys but I think this behavior is wrong.

I did some research

For StructArrays, https://arrow.apache.org/docs/format/Columnar.html#struct-layout doesn't say anything about requiring field names to be unique. It will certainly cause issues for many applications (including DataFusion) if a struct has duplicated field names.

For map, the spec says the keys should be unique, with uniqueness enforced by the application (i.e., DataFusion):

https://github.com/apache/arrow/blob/5e451d85d7269d3fb9c7eaab06caece5718c40e5/format/Schema.fbs#L117-L145

/// In this layout, the keys and values are each respectively contiguous. We do
/// not constrain the key and value types, so the application is responsible
/// for ensuring that the keys are hashable and unique. Whether the keys are sorted
/// may be set in the metadata for this field.

So my suggestion is:

  1. File a separate datafusion ticket about the fact that named_struct allows repeated field names
  2. Either file a new ticket (or update this PR) to enforce the uniqueness of map key names for the map/make_map functions

@alamb (Contributor) commented Jul 10, 2024

> For make_map(k1, v1, k2, v2...), ideally we can arrange the order of args and call map([k1, k2..], [v1, v2]). I think we can arrange it in ExprPlanner, so we can have only one function map instead of two.

> Thanks @jayzhan211. Sounds great. Curiously, what's the benefit of moving to ExprPlanner? I'm not sure, but I think it just moves the cost of aggregating column values to ExprPlanner, right? Or is it possible to get better performance?

I agree with @jayzhan211 that it seems less than ideal to have two functions rather than just one. However, I also agree with @goldmedal that it isn't clear the alternative is all that much better.

It looks to me like DuckDB has several functions to create maps as well: https://duckdb.org/docs/sql/data_types/map

They support this kind of literal syntax

SELECT MAP {'key1': 10, 'key2': 20, 'key3': 30};

As well as the parallel lists implementation

SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);

They also have a function similar to make_map like this:

SELECT map_from_entries([('key1', 10), ('key2', 20), ('key3', 30)]);

@alamb (Contributor) left a comment

Thank you @goldmedal -- this is a good start. I do think there are still several issues to work out before this feature would be usable by many people.

I think we have 2 options

  1. merge this PR as is and file tickets to iterate/improve on main (much as we did with the array / list implementations)
  2. Keep working on this PR

I would personally prefer 1, as it has worked well for us in the past and allows development to proceed incrementally. What do you think, @jayzhan211?

Some initial thoughts on things left to do:

  1. document map/make_map in the documentation. We can do this as a follow on PR: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct-functions
  2. Support arrays (not just scalars)
  3. Decide if we want to have make_map follow named_struct or if it would be better to have something more like what duckdb does

and probably more

use datafusion_expr::{ColumnarValue, ScalarUDFImpl, Signature, Volatility};

fn make_map(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    if args.is_empty() {
Contributor

These conditions should have been checked by the planner, so it would probably be OK to panic here or return an internal error. A real error is OK too, but I suspect it would be impossible to actually hit.

} else if chunk[1].data_type() != value_type {
    return exec_err!(
        "map requires all values to have the same type {}, got {} instead at position {}",
Contributor

Suggested change:
- "map requires all values to have the same type {}, got {} instead at position {}",
+ "make_map requires all values to have the same type {}, got {} instead at position {}",

if chunk[0].data_type().is_null() {
    return exec_err!("map key cannot be null");
}
if !chunk[1].data_type().is_null() {
Contributor

I found this code to do null checking / coercion somewhat confusing.

I would have expected that the planner had done the coercion once at plan time rather than doing it on all inputs.

Perhaps you could implement coerce_types for the map function once:

fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> {
    not_impl_err!("Function {} does not implement coerce_types", self.name())
}

Contributor Author

The main purpose of this check is to determine the value array's type so we can create the null array at L76. If the values are

[null, null, 1, 2, 3]

We will know the value type from the third element, so I think the null check is necessary. However, I agree the other check isn't necessary. Maybe I don't need to implement coerce_types, since I guess return_type has checked them during planning, right?

@jayzhan211 (Contributor) Jul 11, 2024

It seems that you are trying to replace null with the non-null type here. It is possible to coerce the types earlier, before invoke().

You can use Signature::user_defined(Volatility::Immutable) with your own coerce_types to coerce null to non-null, and you can check whether all the value types are the same there as well. This function is called before return_type and invoke, so we don't have to deal with the null type afterwards. Note that although the null type is converted to another type like Int32, the array value is still null; it is something like Int32(NULL).

    fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
        let mut dt = DataType::Int32;
        for (i, t) in arg_types.iter().enumerate() {
            if i % 2 == 1 && !t.is_null() {
                dt = t.clone();
            }
        }
        let mut dts = vec![];
        for (i, t) in arg_types.iter().enumerate() {
            if i % 2 == 1 && t.is_null() {
                dts.push(dt.clone())
            } else {
                dts.push(t.clone())
            }
        }
        Ok(dts)
    }

Contributor Author

It looks nice. Many thanks!!

let key_type = &arg_types[0];
let mut value_type = &arg_types[1];

for (i, chunk) in arg_types.chunks_exact(2).enumerate() {
Contributor

Same comment here about type coercion

query ?
SELECT MAP(arrow_cast(make_array('POST', 'HEAD', 'PATCH'), 'LargeList(Utf8)'), arrow_cast(make_array(41, 33, 30), 'LargeList(Int64)'));
----
{POST: 41, HEAD: 33, PATCH: 30}
Contributor

I think if possible we should also add tests for array values (not just scalars)

I wrote up some examples, like this:

# test that maps can be created from arrays
statement ok
create table t as values
('a', 1, ['k1', 'k2'], [10.0, 20.0]),
('b', 2, ['k3'], [30.0]),
('d', 4, ['k5', 'k6'], [50.0, 60.0]);

query error
select make_map(column1, column2) from t;
----
DataFusion error: Internal error: UDF returned a different number of rows than expected. Expected: 3, Got: 1.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker


query error
select map(column3, column4) from t;
----
DataFusion error: Execution error: Expected scalar, got ListArray
[
  StringArray
[
  "k1",
  "k2",
],
  StringArray
[
  "k3",
],
  StringArray
[
  "k5",
  "k6",
],
]

@goldmedal (Contributor Author) Jul 11, 2024

Oops. I think I misunderstood how ColumnarValue works. I expected it to be passed in per row, but it receives all the rows at once.

[datafusion/functions/src/core/map.rs:59:5] &key = [
    StringArray
    [
      "a",
      "b",
      "d",
    ],
]
[datafusion/functions/src/core/map.rs:60:5] &value = [
    PrimitiveArray<Int64>
    [
      1,
      2,
      4,
    ],
]

I'll fix them. Thanks

Contributor Author

I had some challenges transforming the array values into the map array layout. If the given table and SQL are

statement ok
create table t as values
('a', 1, 'k1', 10),
('b', 2, 'k3', 30),
('d', 4, 'k5', 50);

query ?
SELECT make_map(column1, column2, column3, column4) FROM t;
----
{a: 1, k1: 10}
{b: 2, k3: 30}
{d: 4, k5: 50}

We'll get the keys like

[datafusion/functions/src/core/map.rs:37:5] &key = [
    Array(
        StringArray
        [
          "a",
          "b",
          "d",
        ],
    ),
    Array(
        StringArray
        [
          "k1",
          "k3",
          "k5",
        ],
    ),
]

I think I need to aggregate them into one array like

["a", "k1", "b", "k3", "d", "k5"]

And then, I can give the offsets when building MapArray

[0,2,4,6]

However, I don't have a good approach for this array aggregation yet, so currently I handle array values by returning not_impl_err. Maybe we can solve it in a follow-up PR.
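The aggregation described above is essentially a row-major interleave of the per-column arrays plus an offsets buffer. A hedged sketch over plain `Vec`s (not the PR's code, and not using the real arrow-rs builders; `interleave_columns` is an invented name):

```rust
// Hedged sketch: per-column key arrays like ["a","b","d"] and
// ["k1","k3","k5"] become the row-major entry buffer
// ["a","k1","b","k3","d","k5"], plus the MapArray-style offsets
// [0, 2, 4, 6] marking where each row's entries start and end.
fn interleave_columns<T: Clone>(columns: &[Vec<T>]) -> (Vec<T>, Vec<i32>) {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut flat = Vec::with_capacity(num_rows * columns.len());
    let mut offsets = vec![0i32];
    for row in 0..num_rows {
        // Visit the columns in order for this row, appending each value.
        for col in columns {
            flat.push(col[row].clone());
        }
        offsets.push(flat.len() as i32);
    }
    (flat, offsets)
}

fn main() {
    let cols = vec![vec!["a", "b", "d"], vec!["k1", "k3", "k5"]];
    let (flat, offsets) = interleave_columns(&cols);
    assert_eq!(flat, vec!["a", "k1", "b", "k3", "d", "k5"]);
    assert_eq!(offsets, vec![0, 2, 4, 6]);
    println!("ok");
}
```

In the real implementation the flat buffer would feed the MapArray's entries and the offsets its OffsetBuffer; this sketch only shows the index arithmetic.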

@jayzhan211 (Contributor) commented Jul 10, 2024

> Thanks @jayzhan211. Sounds great. Curiously, what's the benefit of moving to ExprPlanner? I'm not sure, but I think it just moves the cost of aggregating column values to ExprPlanner, right? Or is it possible to get better performance?

> I agree with @jayzhan211 it seems less than ideal to have two functions rather than just one. However I agree with @goldmedal that it isn't clear that this is all that much better.
>
> It looks to me like duckdb has several functions to create maps as well. https://duckdb.org/docs/sql/data_types/map
>
> They support this kind of literal syntax
>
> SELECT MAP {'key1': 10, 'key2': 20, 'key3': 30};
>
> As well as the parallel lists implementation
>
> SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);
>
> They also have a function similar to make_map like this:
>
> SELECT map_from_entries([('key1', 10), ('key2', 20), ('key3', 30)]);

I think we can have several functions in the frontend but one single function in the functions crate. With ExprPlanner, we can arrange the args for the single MapFunc. Also, the biggest reason make_map is slow is that the way we rearrange args involves tons of clones; it is relatively easy to rearrange the args in ExprPlanner because that is where we collect them.

What MapFunc's args should look like depends on what is most efficient for Arrow arrays.
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]); seems to be the most efficient form because we don't need any further argument rearrangement.

@jayzhan211 (Contributor)

> merge this PR as is and file tickets to iterate/improve on main (much as we did with the array / list implementations)

Sure

@jayzhan211 (Contributor) left a comment

👍

@goldmedal (Contributor Author)

> I think we can have several functions frontend but one single function in functions crate, with ExprPlanner, we can arrange the args for the single MapFunc. Also, the biggest reason why make_map is slow is because the way we rearrange args involved tons of clone, it is relatively easier if we rearrange args in ExprPlanner because it is the place we collect args.
>
> What the MapFunc's args look like depend on what is more efficient more Arrow Array. SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]); seems to be the most efficient one because we don't need any other args arrangement

Thanks, that's a good reason. I can try it in the next PR (to support the SQL syntax) and do some benchmarks for them.

@goldmedal (Contributor Author)

> So my suggestion is:
>
> 1. File a separate datafusion ticket about the fact that named_struct allows repeated field names
> 2. either file a new ticket (or Update this PR) to enforce the uniqueness of map key name for the map/make_map function

Thanks @alamb. I prefer to file new tickets for them.

> Some initial thoughts on things left to do:
>
> 1. document map/make_map in the documentation. We can do this as a follow on PR: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct-functions
> 2. Support arrays (not just scalars)
> 3. Decide if we want to have make_map follow named_struct or if it would be better to have something more like what duckdb does

I'll address the review comments on the code today, and then file the follow-up issues (maybe after merging?).

@@ -131,19 +131,19 @@ SELECT MAKE_MAP([1,2], ['a', 'b'], [3,4], ['b']);
----
{[1, 2]: [a, b], [3, 4]: [b]}

query error DataFusion error: Error during planning: Execution error: User-defined coercion failed with Execution\("map requires all values to have the same type Int64, got Utf8 instead at position 1"\)(.|\n)*
Contributor Author

This pattern works in my local environment but doesn't work in CI. I'm not sure of the reason yet, so I just simplified the assertion.

@jayzhan211 (Contributor) commented Jul 12, 2024

I'm not certain if there are any other tasks we need to complete before merging this.

TODOs in the next PRs

I think we can simplify the two map functions to one first.

@goldmedal (Contributor Author)

> TODOs in the next PRs

I think enforcing the uniqueness of map keys is a TODO ticket, too.

> I think we can simplify the two map functions to one first.

As we discussed supporting a Map literal in #11268 (comment), perhaps we can do it at the same time.

@jayzhan211 jayzhan211 merged commit 5ba634a into apache:main Jul 12, 2024
25 checks passed
@jayzhan211 (Contributor)

I merged this first; thanks @goldmedal and @alamb.

@goldmedal (Contributor Author)

Thanks @jayzhan211 @alamb. If needed, I can help to file the follow-up issues tonight.

@goldmedal goldmedal deleted the feature/11268-scalar-funciton-map-v3 branch July 12, 2024 06:50
@alamb (Contributor) commented Jul 12, 2024

> Thanks @jayzhan211 @alamb. If needed, I can help to file the follow-up issues tonight.

Thank you @goldmedal -- that would be awesome. I started collecting Map-related tickets in an epic; perhaps you could add the tickets you file there as well: #11429

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 12, 2024
* tmp

* opt

* modify test

* add another version

* implement make_map function

* implement make_map function

* implement map function

* format and modify the doc

* add benchmark for map function

* add empty end-line

* fix cargo check

* update lock

* upate lock

* fix clippy

* fmt and clippy

* support FixedSizeList and LargeList

* check type and handle null array in coerce_types

* make array value throw todo error

* fix clippy

* simpify the error tests
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
Labels
sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Implement a scalar function for creating ScalarValue::Map
3 participants