
Fix get_type for higher-order array functions #13756

Merged
merged 4 commits into from
Dec 18, 2024

Conversation

findepi
Member

@findepi findepi commented Dec 13, 2024

Which issue does this PR close?

Fixes #13755

Rationale for this change

Fix a bug, see issue #13755
TL;DR: fix the incorrect result of ExprSchemable::get_type for an array function invoked on an array of lists

What changes are included in this PR?

Just the fix

Are these changes tested?

unit test

Are there any user-facing changes?

yes

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Dec 13, 2024
Comment on lines 1071 to 1074
assert_eq!(
ExprSchemable::get_type(&udf_expr, &schema).unwrap(),
complex_type
);
Member Author

This didn't pass before the change. The assertions above did pass.

The fix is covered by the recursive flatten test case in array.slt
@findepi findepi force-pushed the findepi/array-get-type branch from 1bd311a to 6d81418 on December 13, 2024 13:55
}
}

fn recursive_array(array_type: &DataType) -> Option<DataType> {
Contributor

Can we extend the existing array function to handle nested arrays instead of creating another signature for nested arrays?

Member Author

I don't know how to do this, please advise!
But this function should go away with #13757.

Contributor

But this function should go away with #13757.

I don't understand -- if the goal is to remove recursive flattening, should we be adding new code to support it 🤔

Member Author

The pre-existing array signature implied recursive array-ification (replacing FixedSizeList with List, recursively); it didn't imply flattening.

The recursive type normalization matters only for flatten, because it (currently) operates recursively and would otherwise need extra code to handle FixedSizeList inputs.

The recursive array-ification was useless for the other array functions, so it was made non-recursive.
To compensate for this change, a new RecursiveArray signature was added for the flatten case.
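The difference between the two behaviors can be sketched with toy types. This is a model for illustration only: `Ty` and the two functions below are made up, not the actual arrow-rs `DataType` or DataFusion's signature code.

```rust
// Toy model of the two normalization strategies discussed above.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    List(Box<Ty>),
    FixedSizeList(Box<Ty>, usize),
}

/// Single-step array-ification: only the outermost FixedSizeList
/// becomes a List; nested element types are left untouched.
fn array_single_step(t: &Ty) -> Ty {
    match t {
        Ty::FixedSizeList(elem, _) => Ty::List(elem.clone()),
        other => other.clone(),
    }
}

/// Recursive array-ification: every FixedSizeList at any depth
/// becomes a List (the pre-existing behavior).
fn array_recursive(t: &Ty) -> Ty {
    match t {
        Ty::FixedSizeList(elem, _) => Ty::List(Box::new(array_recursive(elem))),
        Ty::List(elem) => Ty::List(Box::new(array_recursive(elem))),
        other => other.clone(),
    }
}

fn main() {
    // List(FixedSizeList(Int64, 3)) -- the shape from the unit test.
    let t = Ty::List(Box::new(Ty::FixedSizeList(Box::new(Ty::Int64), 3)));
    // Single-step leaves the inner FixedSizeList intact...
    assert_eq!(array_single_step(&t), t);
    // ...while the recursive variant rewrites it to List(List(Int64)),
    // which is why get_type reported List where FixedSizeList was expected.
    assert_eq!(
        array_recursive(&t),
        Ty::List(Box::new(Ty::List(Box::new(Ty::Int64))))
    );
}
```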

use std::collections::HashMap;

#[test]
fn test_array_element_return_type() {
Contributor

I think we can add tests in an slt file that cover the array signature test cases, so we can avoid creating a Rust test here.

Member Author

The Rust test allows explicitly exercising the various ways of getting an expression's type.
Before I wrote it, I wasn't even sure whether this was a bug or a feature.

I can add an slt test; what would it look like?

Member Author

I did try to write some slt regression tests, but I couldn't expose the bug. Yet the unit test proves the bug exists.
I trust you have better intuition for how a signature-related bug can be exposed in SLT. Please advise.

Contributor

@alamb alamb left a comment

Thanks @findepi and @jayzhan211

From what I can see, the point of this PR is to make array_element_udf have different (non-recursive) type resolution rules, which seems reasonable.

However, as you both mention, I can't seem to trigger the problem from SQL; element access works correctly (e.g. the [[20]] isn't flattened on main):

> create table t as values ([[[10]], [[20]]]);
0 row(s) fetched.
Elapsed 0.007 seconds.

> explain select column1[2] from t;
+---------------+---------------------------------------------------------------------------+
| plan_type     | plan                                                                      |
+---------------+---------------------------------------------------------------------------+
| logical_plan  | Projection: array_element(t.column1, Int64(2))                            |
|               |   TableScan: t projection=[column1]                                       |
| physical_plan | ProjectionExec: expr=[array_element(column1@0, 2) as t.column1[Int64(2)]] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                           |
|               |                                                                           |
+---------------+---------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.001 seconds.

> select column1[2] from t;
+---------------------+
| t.column1[Int64(2)] |
+---------------------+
| [[20]]              |
+---------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

And the type seems good too: List(List(Int64))

> select arrow_typeof(column1[2]) from t;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(t.column1[Int64(2)])                                                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So this problem is quite strange. How is it working today (without this change) 🤔


@findepi
Member Author

findepi commented Dec 13, 2024

So this problem is quite strange. How is it working today (without this change) 🤔

I believe the bug (if we agree this is a bug) is compensated for by other factors.
For example, at an early planning stage it's totally OK to change expression types.
Later, such a change triggers a schema-change assertion.

I found this bug in a case where array_element was inserted into the plan as a result of ScalarUDFImpl::simplify. At that stage its "loose typing" is no longer OK.

@alamb @jayzhan211 can you please review the attached unit test?
Does it look sound, i.e. should it pass?
Does it pass for you without the other changes from this PR?

@alamb
Contributor

alamb commented Dec 13, 2024

I am checking this out in more detail

Contributor

@alamb alamb left a comment

I am still digging. This is so weird.

I messed with the test, and it seems like the failure only happens when the complex type is a FixedSizeList for some reason...

fn array(array_type: &DataType) -> Option<DataType> {
match array_type {
Contributor

So this says that if the type is a List, keep the type, but if the type is a LargeList / FixedSizeList then take the field type?

Why doesn't it also take the field type for List 🤔? (Aka it doesn't make sense to me that List is treated differently than LargeList and FixedSizeList.)

Member Author

For backwards compat I should keep LargeList as LargeList; will push shortly.

Aka it doesn't make sense to me that List is treated differently than LargeList and FixedSizeList

Not my invention; it was like this before.
I think the intention is to converge List, LargeList, and FixedSizeList into one type (or maybe two) to keep UDF implementations simpler.

I am not attached to this approach, but I think code may rely on it.


#[test]
fn test_array_element_return_type() {
let complex_type = DataType::FixedSizeList(
Contributor

When I change this complex type to DataType::List the test passes 🤔

        let complex_type = DataType::List(
            Field::new("some_arbitrary_test_field", DataType::Int32, false).into(),
        );

It also passes when complex_type is a Struct

        let complex_type = DataType::Struct(Fields::from(vec![
            Arc::new(Field::new("some_arbitrary_test_field", DataType::Int32, false)),
        ]));

It seems to me like there is something about FixedSizeList that is causing issues

Contributor

Weird, when I remove this line in expr schema the test passes (with FixedSizeList):

diff --git a/datafusion/expr/src/expr_schema.rs b/datafusion/expr/src/expr_schema.rs
index 3317deafb..50aeb222f 100644
--- a/datafusion/expr/src/expr_schema.rs
+++ b/datafusion/expr/src/expr_schema.rs
@@ -152,6 +152,7 @@ impl ExprSchemable for Expr {
                     .map(|e| e.get_type(schema))
                     .collect::<Result<Vec<_>>>()?;

+
                 // Verify that function is invoked with correct number and type of arguments as defined in `TypeSignature`
                 let new_data_types = data_types_with_scalar_udf(&arg_data_types, func)
                     .map_err(|err| {
@@ -168,7 +169,7 @@ impl ExprSchemable for Expr {

                 // Perform additional function arguments validation (due to limited
                 // expressiveness of `TypeSignature`), then infer return type
-                Ok(func.return_type_from_exprs(args, schema, &new_data_types)?)
+                Ok(func.return_type_from_exprs(args, schema, &arg_data_types)?)
             }
             Expr::WindowFunction(window_function) => self
                 .data_type_and_nullable_with_window_function(schema, window_function)

Which basically says: pass the input data types directly to the function call rather than calling data_types_with_scalar_udf first (which applies type coercion).

Ok(func.return_type_from_exprs(args, schema, &new_data_types)?)

🤔 this looks like it was added in Sep via 1b3608d (before that the input types were passed directly) 🤔

Contributor

It doesn't seem right to me that ExprSchema is coercing the arguments (implicitly) 🤔

Member Author

It seems like there is something about FixedSizeList that is causing issues to me

correct, #13756 (comment)

Member Author

Weird, when I remove this line in expr schema the test passes (with FixedSizedList):

I did the same, basically removing this block:

// Verify that function is invoked with correct number and type of arguments as defined in `TypeSignature`
let new_data_types = data_types_with_scalar_udf(&arg_data_types, func)
.map_err(|err| {
plan_datafusion_err!(
"{} {}",
err,
utils::generate_signature_error_msg(
func.name(),
func.signature().clone(),
&arg_data_types,
)
)
})?;

It's enough to fix the unit test in this PR, but other things start to fail.

It doesn't seem right to me that ExprSchema is coercing the arguments (implicitly) to me 🤔

agreed

Contributor

@jayzhan211 jayzhan211 Dec 17, 2024

the function arguments should already be of the right coerced type

I don't know the context of why we needed to apply coercion rules in the first place

The reason is that we can't guarantee the input is already coerced.

To determine the return type of a function for a given set of inputs, we follow these steps:

  1. Input Validation: Check if the number of inputs is correct and whether their types match the expected types.
  2. Type Coercion: If the input types don't match exactly, attempt to coerce them into compatible types.
  3. Return Type Decision: Once coercion is complete (if applicable), decide the return type based on the resulting input types.

That is why we have coercion in get_type for return_type. We could move the coercion out of get_type into ScalarFunction::new_udf.
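The three steps above can be sketched with toy types. `Ty`, `Udf`, and `get_type` here are illustrative placeholders, not the real DataFusion API; the coercion rule (widen numerics to Float64) is an arbitrary stand-in.

```rust
// Toy model of: 1) input validation, 2) type coercion, 3) return-type decision.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    Float64,
    Utf8,
}

struct Udf {
    expected_arity: usize,
}

impl Udf {
    /// Steps 1 and 2: validate arity, then coerce each argument toward
    /// a type this (hypothetical) function accepts.
    fn coerce(&self, args: &[Ty]) -> Result<Vec<Ty>, String> {
        if args.len() != self.expected_arity {
            return Err(format!(
                "expected {} args, got {}",
                self.expected_arity,
                args.len()
            ));
        }
        args.iter()
            .map(|t| match t {
                Ty::Int64 | Ty::Float64 => Ok(Ty::Float64),
                other => Err(format!("unsupported argument type {other:?}")),
            })
            .collect()
    }

    /// Step 3: the return type is decided from the *coerced* input types.
    fn return_type(&self, coerced: &[Ty]) -> Ty {
        coerced[0].clone()
    }
}

/// Analogue of ExprSchemable::get_type: coerce first, then ask for the
/// return type, because the inputs may not be coerced yet.
fn get_type(udf: &Udf, args: &[Ty]) -> Result<Ty, String> {
    let coerced = udf.coerce(args)?; // steps 1 + 2
    Ok(udf.return_type(&coerced)) // step 3
}

fn main() {
    let udf = Udf { expected_arity: 1 };
    assert_eq!(get_type(&udf, &[Ty::Int64]).unwrap(), Ty::Float64);
    assert!(get_type(&udf, &[Ty::Utf8]).is_err());
    assert!(get_type(&udf, &[]).is_err());
}
```

The bug discussed in this PR lives in step 2: when the coercion rule is more aggressive than intended (recursive instead of single-step), step 3 sees the wrong input types and reports the wrong return type.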

Member Author

How about we compute the return_type when the function is created, and have get_type read the value?

I like the idea in principle.

It should be combined with a new ScalarUDFImpl sub-trait that doesn't have return type-related methods at all, since they are not to be used once the plan is constructed.

The reason is that we can't guarantee the input is already coerced.

In a logical plan we can.

My understanding is that the coercing analyzer also calls the get_type functions.
This can be solved by changing how the coercing analyzer tracks its internal state.

But the real problem is that the same types, LogicalPlan & Expr, have two meanings: syntactic and semantic. So in the code we go back and forth about what can and what cannot be guaranteed for an Expr or LogicalPlan instance.

Contributor

the LogicalPlan & Expr, have two meanings: syntactic and semantic.

Is there an example of the difference between these two, especially for functions? For Expr::ScalarFunction there is no difference in LogicalPlan; we don't do anything special, but I think this is what you don't expect. What should we have in LogicalPlan: Expr::ScalarFunction but with coerced input?

Contributor

since they are not to be used once the plan is constructed.

Why is get_type not supposed to be available after the plan is constructed from Expr?

Member Author

Is there an example of the difference between these two, especially for functions?

The difference is more apparent for duplicate syntax (such as IS NULL vs IS UNKNOWN) and syntax sugar (ORDER BY 1, ORDER BY ALL, SELECT *).
For a function call, the difference is about the function being resolved (typed, with inputs coerced) or not.

since they are not to be used once the plan is constructed.

Why is get_type not supposed to be available after the plan is constructed from Expr?

For a fully resolved logical plan it's a fair question to ask what the type of an expression is (and the answer may or may not be available in O(1)).

However, there is no point in asking a UDF what its type is, since we already asked it.

Think of the engine and the UDF as implemented by independent parties, with the UDF interface being a contract layer.
You cross the contract layer when you have to (at analysis time), but crossing it multiple times with the same question should be avoided.

@findepi
Member Author

findepi commented Dec 13, 2024

I messed with the test and it seems like the failure only happens when the complex type is a FixedSizeList for some reason..

Because coerced_fixed_size_list_to_list, called here, is recursive:

let array_type = coerced_fixed_size_list_to_list(array_type);

@jayzhan211
Contributor

ExprSchemable::get_type for a ScalarFunction is basically asking the function for its return_type. Given that we coerce FixedSizeList to List, it makes sense for the return type of array_element(FixedSizeList) to be List. Therefore I think the unit test is expected to fail, since the input is coerced to List.

@findepi
Member Author

findepi commented Dec 14, 2024

Given that we coerce fixed size list to list, the return type of array_element(fixed size list) makes sense to be list.

In the unit test, we ask for array_element(list(fixed size list)) and we expect the return type to be fixed size list.
In the fix, we make it so that array_element(list(T)) always returns T.
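A minimal model of that rule, using toy types for illustration (not the DataFusion API): under the fix, the element type comes back unchanged, even when it is itself a FixedSizeList.

```rust
// Toy model of the fixed behavior: array_element(List(T)) returns T.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    List(Box<Ty>),
    FixedSizeList(Box<Ty>, usize),
}

/// Return type of a hypothetical array_element under the fix:
/// the element type of the input list, unchanged.
fn array_element_return_type(input: &Ty) -> Option<Ty> {
    match input {
        Ty::List(elem) | Ty::FixedSizeList(elem, _) => Some((**elem).clone()),
        _ => None, // not a list: no sensible return type
    }
}

fn main() {
    // array_element(List(FixedSizeList(Int64, 3))) -> FixedSizeList(Int64, 3):
    // the element keeps its FixedSizeList-ness; it is not coerced to List.
    let input = Ty::List(Box::new(Ty::FixedSizeList(Box::new(Ty::Int64), 3)));
    assert_eq!(
        array_element_return_type(&input),
        Some(Ty::FixedSizeList(Box::new(Ty::Int64), 3))
    );
}
```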

@jayzhan211
Contributor

Given that we coerce fixed size list to list, the return type of array_element(fixed size list) makes sense to be list.

in the unit test, we ask for array_element(list(fixed size list)) and we expect the return type to be fixed size list. in the fix, we make it so that array_element(list(T)) always returns T.

The idea of coercing fixed size list to list is to simplify the logic for handling both kinds of list. Unless this leads to issues, I think we should keep this aggressive coercion.

@findepi
Member Author

findepi commented Dec 17, 2024

The idea to coerce fixed size list to list is to simplify the logic to handle both kinds of list.

100% agreed

Unless this leads to issue otherwise I think we should keep this aggressive coercion.

It does, because the logic was too eager (recursive where only a single step is needed), as proven by the unit test attached to the issue.

I am naturally biased towards merging this PR, as it solves a real-life problem I encountered and had to work around.
@alamb @jayzhan211 what problem are we solving by not merging it?

@findepi findepi requested a review from alamb December 17, 2024 12:45
@jayzhan211
Contributor

jayzhan211 commented Dec 17, 2024

it solves a real-life problem

I hope we can have an end-to-end test in slt if this is a real issue. I can help find such a test when I have time.

(recursive where only single-step is needed).

I expect such a test to show the mentioned issue.

Can you explain more about why this eager coercion is an issue? I don't think the given unit test is correct, because the return type List is what I expect, not FixedSizeList.

I think an example where coercing an inner fixed size list to list produces an incorrect result for a valid SQL query (valid in Postgres or DuckDB) would help a lot.

Contributor

@alamb alamb left a comment

I am naturally biased towards merging this PR, as it solves a real-life problem I encountered and had to workaround.
@alamb @jayzhan211 what problem are we solving by not merging it?

I had two concerns with this PR:

  1. It introduces a new API that initially I thought was going to be removed again, which sounded confusing
  2. It may introduce errors / other bugs or potentially mask additional problems

After more time to think about it, however, I am convinced that this PR is a step forwards.

  1. The split between Array and RecursiveArray I think makes more sense as they are doing two fundamentally different things (aka flatten flattens some arbitrary number of levels)
  2. While this may mask other bugs, all the existing tests pass and thus this PR seems to be a step forward. If we have a gap in test coverage we should fix that

In terms of @jayzhan211 's concerns:

Can you explain more on the reason this eager coercion is an issue? The given unit test I don't think is correct, because the return type List is what I expect not FixedSizeList.

In my mind, selecting an element of a list should return the same type as the element. For example, an element of List(FixedSizeList) is a FixedSizeList, which is what this PR implements.

I tried quite hard to construct a List(FixedSizeList) via SQL and could not. This suggests to me that we have some sort of gap / over-eager conversion to List:

> create table t as values (arrow_cast([1,2,3], 'FixedSizeList(3, Int64)'), arrow_cast([3,4,5], 'FixedSizeList(3, Int64)') );
0 row(s) fetched.
Elapsed 0.004 seconds.

> select * from t;
+-----------+-----------+
| column1   | column2   |
+-----------+-----------+
| [1, 2, 3] | [3, 4, 5] |
+-----------+-----------+
1 row(s) fetched.
Elapsed 0.001 seconds.

-- The elements are FixedSizedList
> select arrow_typeof(column1), arrow_typeof(column2) from t;
+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(t.column1)                                                                                                      | arrow_typeof(t.column2)                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| FixedSizeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 3) | FixedSizeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 3) |
+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> select [column1, column2] from t;
+---------------------------------+
| make_array(t.column1,t.column2) |
+---------------------------------+
| [[1, 2, 3], [3, 4, 5]]          |
+---------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.

-- Note making a list of the two fixed sized lists converts them into lists
> select arrow_typeof([column1, column2]) from t;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(t.column1,t.column2))                                                                                                                                                                               |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.

I will file a ticket about the over-eager coercion to List.

fn array(array_type: &DataType) -> Option<DataType> {
match array_type {
DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),
Contributor

Suggested change:

-DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),
+// Note array functions can often change the number of elements
+// so convert from FixedSize --> variable
+DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),

@alamb
Contributor

alamb commented Dec 17, 2024

@findepi
Member Author

findepi commented Dec 17, 2024

Looks related indeed
thank you @alamb

@findepi
Member Author

findepi commented Dec 17, 2024

Since the bug turned out to be specific to a list of fixed size lists, I updated the test naming (and the variable naming inside the test).

@findepi findepi merged commit 7e0fc14 into apache:main Dec 18, 2024
25 checks passed
@findepi findepi deleted the findepi/array-get-type branch December 18, 2024 07:15
Labels
logical-expr Logical plan and expressions

Successfully merging this pull request may close these issues.

expr.get_type (ExprSchemable::get_type) returns wrong type for array functions on nested lists