-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add nullable in StateFieldArgs
#11433
Comments
Hold on, I think it is outdated, since we found that most of the sql database returns null if no row qualified. #11299 |
For comparison I ran a few queries from #11299 with DataFusion CLI v40.0.0
> create table t(a int, b float, c bigint) as values (1, 1.2, 2);
0 row(s) fetched.
> select nth_value(a, 1) from t where a > 2;
+-------------------------+
| nth_value(t.a,Int64(1)) |
+-------------------------+
| |
+-------------------------+
1 row(s) fetched.
> select nth_value(a, 1), count(1) from t where a > 3 group by a;
+-------------------------+-----------------+
| nth_value(t.a,Int64(1)) | count(Int64(1)) |
+-------------------------+-----------------+
+-------------------------+-----------------+
0 row(s) fetched.
> select nth_value(a, 1), count(1) from t where a > 3;
+-------------------------+-----------------+
| nth_value(t.a,Int64(1)) | count(Int64(1)) |
+-------------------------+-----------------+
| | 0 |
+-------------------------+-----------------+
1 row(s) fetched.
> select nth_value(a, 1 order by a desc), count(1) from t where a > 2;
+---------------------------------------------------------+-----------------+
| nth_value(t.a,Int64(1)) ORDER BY [t.a DESC NULLS FIRST] | count(Int64(1)) |
+---------------------------------------------------------+-----------------+
| | 0 |
+---------------------------------------------------------+-----------------+
1 row(s) fetched. |
This comment is also outdated because of #11299. datafusion/datafusion/functions-aggregate/src/nth_value.rs Lines 136 to 139 in 1dfac86
|
This one looks good to me too. But, it is nice to have a use case that is benefit on the nullability of the element (like the query optimized based on nullability) let mut fields = vec![Field::new_list(
format_state_name(self.name(), "nth_value"),
Field::new("item", args.input_type.clone(), self.nullable),
true)] Otherwise for simplicity, we can assume it is nullable let mut fields = vec![Field::new_list(
format_state_name(self.name(), "nth_value"),
Field::new("item", args.input_type.clone(), true),
true)] |
Oh, so actually the current code is incorrect 😕 I thought that is the one you proposed to change. But why the result is expected 🤔 |
Sorry, I do not follow. Could you please elaborate on the incorrect part. |
Like what we have in #11299 . To returns null if row is not qualified, the |
For
2 & 3 is covered by the else branch in datafusion/datafusion/functions-aggregate/src/nth_value.rs Lines 337 to 342 in 1dfac86
The datafusion/datafusion/functions-aggregate/src/nth_value.rs Lines 134 to 142 in 1dfac86
This possibly explains passing tests. Thoughts? |
My guess is because the query is not multiple phase aggregate so there is no result that check with |
When adding trace statements in
SELECT C1
, COUNT(C1) as n
, NTH_VALUE(C13, 2 ORDER BY C1, C13 ASC) as nth -- get 2nd row
FROM aggregate_test_100
GROUP BY C1
ORDER BY C1;
+----+----+--------------------------------+
| c1 | n | nth |
+----+----+--------------------------------+
| a | 21 | Amn2K87Db5Es3dFQO9cw9cvpAM6h35 |
| b | 19 | 6FPJlLAcaQ5uokyOWZ9HGdLZObFvOZ |
| c | 21 | 6WfVFBVGJSQb7FhA7E0lBwdvjfZnSW |
| d | 18 | 1aOcrEGd0cOqZe2I5XBOm0nDcwtBZO |
| e | 21 | 3BEOHQsMEFZ58VcNTOJYShTBpAPzbt |
+----+----+--------------------------------+
5 row(s) fetched.
SELECT C1
, COUNT(C1) as n
, NTH_VALUE(C13, 20 ORDER BY C1, C13 ASC) as nth -- get 20th row
FROM aggregate_test_100
GROUP BY C1
ORDER BY C1;
+----+----+--------------------------------+
| c1 | n | nth |
+----+----+--------------------------------+
| a | 21 | waIGbOGl1PM6gnzZ4uuZt4E2yDWRHs |
| b | 19 | |
| c | 21 | pLk3i59bZwd5KBZrI1FiweYTd5hteG |
| d | 18 | |
| e | 21 | ukOiFGGFnQJDHFgZxHMpvhD3zybF0M |
+----+----+--------------------------------+
5 row(s) fetched.
Create TableCREATE EXTERNAL TABLE aggregate_test_100 (
c1 VARCHAR NOT NULL,
c2 TINYINT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT,
c5 INT,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 BIGINT UNSIGNED NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
-- path is relative to `datafusion-cli`
LOCATION '../testing/data/csv/aggregate_test_100.csv'
OPTIONS ('format.has_header' 'true'); |
This is a recent change introduced in #11093. datafusion/datafusion/functions-aggregate/src/nth_value.rs Lines 404 to 407 in 0965455
In It is surprising to me that the schema defined in |
Maybe the schema matching is not correct |
I tested this out in multiple ways. Each time making sure it's multi-phase grouping. So that state fields are being used. This is the understanding I've reached about the correct
let mut fields = vec![Field::new_list(
format_state_name(self.name(), "nth_value"),
Field::new("item", args.input_type.clone(), true), // See [2]
false, // See [1]
)]; It will be beneficial to future readers of this code to add some comments here. It took me more time than I think was necessary to reach the conclusions stated above because of how they are now implicit in code. But other than that the good news is no code changes are required. |
cc @eejbyfeldt and @alamb I think @jcsherin find something interesting that nullable for list element should be fn state_fields(&self, args: StateFieldsArgs) -> Result<Vec<Field>> {
if args.is_distinct {
return Ok(vec![Field::new_list(
format_state_name(args.name, "distinct_array_agg"),
Field::new("item", args.input_type.clone(), true), // keep as always true for nulls
// true
false,
)]);
}
Ok(vec![Field::new_list(
format_state_name(args.name, "array_agg"),
Field::new("item", args.input_type.clone(), true), // keep as always true for nulls
// true,
false,
)])
} I think we should definitely add docs about this, we had several PRs dealing with nullability already. It would be helpful for others too. |
@jcsherin would you be willing to draft a PR updating the docs? |
Yes, I'll be happy to take care of this. |
Thanks again @jayzhan211. |
@eejbyfeldt is working on it in #11063
Originally posted by @jayzhan211 in #11287 (comment)
The text was updated successfully, but these errors were encountered: