
Implement ScalarFunction MAKE_MAP and MAP #11361

Merged (21 commits) Jul 12, 2024

Conversation

@goldmedal (Contributor) commented Jul 9, 2024

Which issue does this PR close?

Closes #11268.

Rationale for this change

The benchmark result:

Gnuplot not found, using plotters backend
make_map_1000           time:   [2.6560 ms 2.7182 ms 2.8078 ms]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

map_1000                time:   [5.0431 µs 5.0783 µs 5.1168 µs]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

What changes are included in this PR?

Implement two scalar functions for creating map values:

  • MAKE_MAP
    • MAKE_MAP isn't efficient. We shouldn't use it to create a large map.
SELECT MAKE_MAP('a', 1, 'b', 2, 'c', 3)
----
{a:1, b:2, c:3}
  • MAP
SELECT MAP(['a', 'b', 'c'], [1, 2, 3])
----
{a:1, b:2, c:3}

Are these changes tested?

Yes.

Are there any user-facing changes?

Adds two functions: MAKE_MAP and MAP.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 9, 2024
@goldmedal (Contributor Author)

I noticed that Arrow's map and struct types allow duplicate keys, but I think this behavior is wrong.

> select named_struct('abc', 1, 'abc', 2);
+---------------------------------------------------------+
| named_struct(Utf8("abc"),Int64(1),Utf8("abc"),Int64(2)) |
+---------------------------------------------------------+
| {abc: 1, abc: 2}                                        |
+---------------------------------------------------------+
1 row(s) fetched. 

Typically, a struct or map should not have duplicate keys (they cause problems when retrieving data). In DuckDB, the query fails if duplicate keys exist.

D select map_from_entries([('a',null),('a',2)]);
Invalid Input Error: Map keys must be unique.

I'm unsure where we should handle this issue (DataFusion or Arrow). To solve it, I prefer to fix MapArray and StructArray on the Arrow side.
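If this check ends up living in DataFusion, it could be as simple as a set-based scan over the keys before building the map. A minimal hedged sketch (hypothetical, not the project's code; `check_unique_keys` is an invented name), mirroring DuckDB's "Map keys must be unique" error:

```rust
use std::collections::HashSet;

// Hypothetical sketch of the kind of uniqueness check a map constructor
// could run over its keys, matching DuckDB's behavior of rejecting
// duplicates instead of silently producing {abc: 1, abc: 2}.
fn check_unique_keys(keys: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for key in keys {
        // HashSet::insert returns false if the key was already present.
        if !seen.insert(*key) {
            return Err(format!("map keys must be unique, found duplicate '{key}'"));
        }
    }
    Ok(())
}

fn main() {
    assert!(check_unique_keys(&["a", "b", "c"]).is_ok());
    assert!(check_unique_keys(&["abc", "abc"]).is_err());
    println!("ok");
}
```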

@goldmedal goldmedal marked this pull request as ready for review July 9, 2024 18:51
@jayzhan211 (Contributor)

For make_map(k1, v1, k2, v2, ...), ideally we can rearrange the order of the args and call map([k1, k2, ...], [v1, v2, ...]). I think we can arrange it in ExprPlanner, so we can have only one function, map, instead of two.
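The rearrangement described here can be sketched without DataFusion's types. This is a hedged illustration only: `rearrange_make_map_args` and the generic `T` are stand-ins, not the real `ExprPlanner` API or `datafusion_expr::Expr`.

```rust
// Hedged sketch of the proposed planner-side rewrite: split the flat
// make_map(k1, v1, k2, v2, ...) argument list into the two parallel
// lists that map([k1, k2, ...], [v1, v2, ...]) expects.
fn rearrange_make_map_args<T: Clone>(args: &[T]) -> Result<(Vec<T>, Vec<T>), String> {
    if args.len() % 2 != 0 {
        return Err("make_map requires an even number of arguments".to_string());
    }
    let mut keys = Vec::with_capacity(args.len() / 2);
    let mut values = Vec::with_capacity(args.len() / 2);
    // Each chunk of two is one (key, value) pair.
    for pair in args.chunks_exact(2) {
        keys.push(pair[0].clone());
        values.push(pair[1].clone());
    }
    Ok((keys, values))
}

fn main() {
    let (keys, values) = rearrange_make_map_args(&["a", "1", "b", "2"]).unwrap();
    assert_eq!(keys, vec!["a", "b"]);
    assert_eq!(values, vec!["1", "2"]);
    assert!(rearrange_make_map_args(&["a"]).is_err());
    println!("ok");
}
```

Doing this once at planning time also avoids repeating the rearrangement for every batch at execution time.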

@goldmedal (Contributor Author)

> For make_map(k1, v1, k2, v2...), ideally we can arrange the order of args and call map([k1, k2..], [v1, v2]). I think we can arrange it in ExprPlanner, so we can have only one function map instead of two.

Thanks @jayzhan211, sounds great.
Out of curiosity, what's the benefit of moving this to ExprPlanner? I'm not sure, but I think it just moves the cost of aggregating column values into ExprPlanner, right? Or is it possible to get better performance?

@alamb (Contributor) commented Jul 10, 2024

> I noticed the map and struct of Arrow allow duplicate keys but I think this behavior is wrong.

I did some research

For StructArrays, https://arrow.apache.org/docs/format/Columnar.html#struct-layout doesn't say anything about requiring field names to be unique. It will certainly cause issues for many applications (including DataFusion) if a struct has duplicated field names.

For map, the spec says the keys should be unique, with uniqueness enforced by the application (i.e., DataFusion):

https://github.com/apache/arrow/blob/5e451d85d7269d3fb9c7eaab06caece5718c40e5/format/Schema.fbs#L117-L145

/// In this layout, the keys and values are each respectively contiguous. We do
/// not constrain the key and value types, so the application is responsible
/// for ensuring that the keys are hashable and unique. Whether the keys are sorted
/// may be set in the metadata for this field.

So my suggestion is:

  1. File a separate datafusion ticket about the fact that named_struct allows repeated field names
  2. Either file a new ticket (or update this PR) to enforce the uniqueness of map key names for the map/make_map functions

@alamb (Contributor) commented Jul 10, 2024

> For make_map(k1, v1, k2, v2...), ideally we can arrange the order of args and call map([k1, k2..], [v1, v2]). I think we can arrange it in ExprPlanner, so we can have only one function map instead of two.

> Thanks @jayzhan211. Sounds great. Curiously, what's the benefit of moving to ExprPlanner? I'm not sure, but I think it just moves the cost of aggregating column values to ExprPlanner, right? Or is it possible to get better performance?

I agree with @jayzhan211 that it seems less than ideal to have two functions rather than just one. However, I also agree with @goldmedal that it isn't clear the alternative is all that much better.

It looks to me like DuckDB has several functions to create maps as well: https://duckdb.org/docs/sql/data_types/map

They support this kind of literal syntax

SELECT MAP {'key1': 10, 'key2': 20, 'key3': 30};

As well as the parallel lists implementation

SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);

They also have a function similar to make_map like this:

SELECT map_from_entries([('key1', 10), ('key2', 20), ('key3', 30)]);

@alamb (Contributor) left a comment

Thank you @goldmedal -- this is a good start. I do think there are still several issues to work out before this feature would be usable by many people.

I think we have 2 options

  1. merge this PR as is and file tickets to iterate/improve on main (much as we did with the array / list implementations)
  2. Keep working on this PR

I would personally prefer 1, as it has worked well for us in the past and allows development to proceed incrementally. What do you think, @jayzhan211?

Some initial thoughts on things left to do:

  1. document map/make_map in the documentation. We can do this as a follow on PR: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct-functions
  2. Support arrays (not just scalars)
  3. Decide if we want to have make_map follow named_struct or if it would be better to have something more like what duckdb does

and probably more

use datafusion_expr::{ColumnarValue, ScalarUDFImpl, Signature, Volatility};

fn make_map(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    if args.is_empty() {
Contributor

These conditions should have been checked by the planner, so it would probably be OK to panic here or return an internal error. A real error is OK too, but I suspect it would be impossible to actually hit.

} else if chunk[1].data_type() != value_type {
    return exec_err!(
        "map requires all values to have the same type {}, got {} instead at position {}",
Contributor

Suggested change:
- "map requires all values to have the same type {}, got {} instead at position {}",
+ "make_map requires all values to have the same type {}, got {} instead at position {}",

if chunk[0].data_type().is_null() {
    return exec_err!("map key cannot be null");
}
if !chunk[1].data_type().is_null() {
Contributor

I found this code to do null checking / coercion somewhat confusing.

I would have expected that the planner had done the coercion once at plan time rather than doing it on all inputs.

Perhaps you could implement coerce_types for the map function once:

fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> {
    not_impl_err!("Function {} does not implement coerce_types", self.name())
}

Contributor Author

The main purpose of this check is to determine the value array's type so we can create the null array at L76. If the values are

[null, null, 1, 2, 3]

We will know the value type from the third element, so I think the null check is necessary. However, I agree the other check isn't necessary. Maybe I don't need to implement coerce_types, since I guess return_type has checked them during planning, right?

@jayzhan211 (Contributor) Jul 11, 2024

It seems that you are trying to replace null with the non-null type here. It is possible to coerce the types earlier, before invoke().

You can use Signature::user_defined(Volatility::Immutable) with your own coerce_types to coerce null to non-null, and you can check whether all the value types are the same there as well. This function is called before return_type and invoke, so we don't have to deal with the null type afterwards. Note that although the null type is converted to another type like Int32, the array value is still null; it is something like Int32(NULL).

    fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
        let mut dt = DataType::Int32;
        for (i, t) in arg_types.iter().enumerate() {
            if i % 2 == 1 && !t.is_null() {
                dt = t.clone();
            }
        }
        let mut dts = vec![];
        for (i, t) in arg_types.iter().enumerate() {
            if i % 2 == 1 && t.is_null() {
                dts.push(dt.clone())
            } else {
                dts.push(t.clone())
            }
        }
        Ok(dts)
    }

Contributor Author

It looks nice. Many thanks!!

let key_type = &arg_types[0];
let mut value_type = &arg_types[1];

for (i, chunk) in arg_types.chunks_exact(2).enumerate() {
Contributor

Same comment here about type coercion

query ?
SELECT MAP(arrow_cast(make_array('POST', 'HEAD', 'PATCH'), 'LargeList(Utf8)'), arrow_cast(make_array(41, 33, 30), 'LargeList(Int64)'));
----
{POST: 41, HEAD: 33, PATCH: 30}
Contributor

I think if possible we should also add tests for array values (not just scalars)

I wrote up some examples, like this:

# test that maps can be created from arrays
statement ok
create table t as values
('a', 1, ['k1', 'k2'], [10.0, 20.0]),
('b', 2, ['k3'], [30.0]),
('d', 4, ['k5', 'k6'], [50.0, 60.0]);

query error
select make_map(column1, column2) from t;
----
DataFusion error: Internal error: UDF returned a different number of rows than expected. Expected: 3, Got: 1.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker


query error
select map(column3, column4) from t;
----
DataFusion error: Execution error: Expected scalar, got ListArray
[
  StringArray
[
  "k1",
  "k2",
],
  StringArray
[
  "k3",
],
  StringArray
[
  "k5",
  "k6",
],
]

@goldmedal (Contributor Author) Jul 11, 2024

Oops. I think I misunderstood how ColumnarValue works. I expected it to be passed in per row, but it receives all the rows at once.

[datafusion/functions/src/core/map.rs:59:5] &key = [
    StringArray
    [
      "a",
      "b",
      "d",
    ],
]
[datafusion/functions/src/core/map.rs:60:5] &value = [
    PrimitiveArray<Int64>
    [
      1,
      2,
      4,
    ],
]

I'll fix them. Thanks

Contributor Author

I had some challenges transforming the array values into the map array layout. If the given table and SQL are

statement ok
create table t as values
('a', 1, 'k1', 10),
('b', 2, 'k3', 30),
('d', 4, 'k5', 50);

query ?
SELECT make_map(column1, column2, column3, column4) FROM t;
----
{a: 1, k1: 10}
{b: 2, k3: 30}
{d: 4, k5: 50}

We'll get the keys like

[datafusion/functions/src/core/map.rs:37:5] &key = [
    Array(
        StringArray
        [
          "a",
          "b",
          "d",
        ],
    ),
    Array(
        StringArray
        [
          "k1",
          "k3",
          "k5",
        ],
    ),
]

I think I need to aggregate them into one array like

["a", "k1", "b", "k3", "d", "k5"]

And then, I can give the offsets when building MapArray

[0,2,4,6]

However, I don't have a good approach for this array aggregation yet, so currently I handle array values by returning not_impl_err. Maybe we can solve it in a follow-up PR.
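The aggregation described above is essentially a row-major interleave of the per-column arrays plus an offsets buffer. A hedged sketch over plain `Vec`s (not the PR's code, and not using the real arrow-rs builders; `interleave_columns` is an invented name):

```rust
// Hedged sketch: per-column key arrays like ["a","b","d"] and
// ["k1","k3","k5"] become the row-major entry buffer
// ["a","k1","b","k3","d","k5"], plus the MapArray-style offsets
// [0, 2, 4, 6] marking where each row's entries start and end.
fn interleave_columns<T: Clone>(columns: &[Vec<T>]) -> (Vec<T>, Vec<i32>) {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut flat = Vec::with_capacity(num_rows * columns.len());
    let mut offsets = vec![0i32];
    for row in 0..num_rows {
        // Visit the columns in order for this row, appending each value.
        for col in columns {
            flat.push(col[row].clone());
        }
        offsets.push(flat.len() as i32);
    }
    (flat, offsets)
}

fn main() {
    let cols = vec![vec!["a", "b", "d"], vec!["k1", "k3", "k5"]];
    let (flat, offsets) = interleave_columns(&cols);
    assert_eq!(flat, vec!["a", "k1", "b", "k3", "d", "k5"]);
    assert_eq!(offsets, vec![0, 2, 4, 6]);
    println!("ok");
}
```

In the real implementation the flat buffer would feed the MapArray's entries and the offsets its OffsetBuffer; this sketch only shows the index arithmetic.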

@jayzhan211 (Contributor) commented Jul 10, 2024

> Thanks @jayzhan211. Sounds great. Curiously, what's the benefit of moving to ExprPlanner? I'm not sure, but I think it just moves the cost of aggregating column values to ExprPlanner, right? Or is it possible to get better performance?

> I agree with @jayzhan211 it seems less than ideal to have two functions rather than just one. However I agree with @goldmedal that it isn't clear that this is all that much better.
>
> It looks to me like duckdb has several functions to create maps as well. https://duckdb.org/docs/sql/data_types/map
>
> They support this kind of literal syntax
>
> SELECT MAP {'key1': 10, 'key2': 20, 'key3': 30};
>
> As well as the parallel lists implementation
>
> SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);
>
> They also have a function similar to make_map like this:
>
> SELECT map_from_entries([('key1', 10), ('key2', 20), ('key3', 30)]);

I think we can have several functions in the frontend but one single function in the functions crate. With ExprPlanner, we can arrange the args for the single MapFunc. Also, the biggest reason make_map is slow is that the way we rearrange args involves tons of clones; it is relatively easy to rearrange the args in ExprPlanner because that is where we collect them.

What MapFunc's args should look like depends on what is most efficient for Arrow arrays.
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]); seems to be the most efficient form because we don't need any further argument rearrangement.

@jayzhan211 (Contributor)

> merge this PR as is and file tickets to iterate/improve on main (much as we did with the array / list implementations)

Sure

@jayzhan211 (Contributor) left a comment

👍

@goldmedal (Contributor Author)

> I think we can have several functions frontend but one single function in functions crate, with ExprPlanner, we can arrange the args for the single MapFunc. Also, the biggest reason why make_map is slow is because the way we rearrange args involved tons of clone, it is relatively easier if we rearrange args in ExprPlanner because it is the place we collect args.
>
> What the MapFunc's args look like depend on what is more efficient more Arrow Array. SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]); seems to be the most efficient one because we don't need any other args arrangement

Thanks, that's a good reason. I can try it in the next PR (to support the SQL syntax) and do some benchmarks for them.

@goldmedal (Contributor Author)

> So my suggestion is:
>
> 1. File a separate datafusion ticket about the fact that named_struct allows repeated field names
> 2. either file a new ticket (or Update this PR) to enforce the uniqueness of map key name for the map/make_map function

Thanks @alamb. I prefer to file new tickets for them.

> Some initial thoughts on things left to do:
>
> 1. document map/make_map in the documentation. We can do this as a follow on PR: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct-functions
> 2. Support arrays (not just scalars)
> 3. Decide if we want to have make_map follow named_struct or if it would be better to have something more like what duckdb does

I'll address the review comments on the code today, and then file the follow-up issues (maybe after merging?).

@@ -131,19 +131,19 @@ SELECT MAKE_MAP([1,2], ['a', 'b'], [3,4], ['b']);
----
{[1, 2]: [a, b], [3, 4]: [b]}

query error DataFusion error: Error during planning: Execution error: User-defined coercion failed with Execution\("map requires all values to have the same type Int64, got Utf8 instead at position 1"\)(.|\n)*
Contributor Author

This pattern works in my local environment but doesn't work in CI. I'm not sure of the reason yet, so I just simplified the assertion.

@jayzhan211 (Contributor) commented Jul 12, 2024

I'm not certain if there are any other tasks we need to complete before merging this.

TODOs in the next PRs

I think we can simplify the two map functions to one first.

@goldmedal (Contributor Author)

> TODOs in the next PRs

I think enforcing the uniqueness of map keys is a TODO ticket, too.

> I think we can simplify the two map functions to one first.

As we discussed supporting a Map literal in #11268 (comment), perhaps we can do it at the same time.

@jayzhan211 jayzhan211 merged commit 5ba634a into apache:main Jul 12, 2024
25 checks passed
@jayzhan211 (Contributor)

I merged this first; thanks @goldmedal and @alamb.

@goldmedal (Contributor Author)

Thanks @jayzhan211 @alamb. If needed, I can help to file the follow-up issues tonight.

@goldmedal goldmedal deleted the feature/11268-scalar-funciton-map-v3 branch July 12, 2024 06:50
@alamb (Contributor) commented Jul 12, 2024

> Thanks @jayzhan211 @alamb. If needed, I can help to file the follow-up issues tonight.

Thank you @goldmedal -- that would be awesome. I started collecting Map-related tickets in an epic; perhaps you could add the tickets you file there as well: #11429

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 12, 2024
* tmp

* opt

* modify test

* add another version

* implement make_map function

* implement make_map function

* implement map function

* format and modify the doc

* add benchmark for map function

* add empty end-line

* fix cargo check

* update lock

* upate lock

* fix clippy

* fmt and clippy

* support FixedSizeList and LargeList

* check type and handle null array in coerce_types

* make array value throw todo error

* fix clippy

* simpify the error tests
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
Labels
sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Implement a scalar function for creating ScalarValue::Map
3 participants