Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve null handling for to_char #9689

Merged
merged 7 commits into from
Mar 24, 2024
Merged

Conversation

tinfoil-knight
Copy link
Contributor

@tinfoil-knight tinfoil-knight commented Mar 19, 2024

Which issue does this PR close?

Closes #9217.

Rationale for this change

Before this PR: On using null format with to_char, DataFusion would return the original data.

select to_char('2024-01-01 06:00:00'::timestamp, null);

> datafusion-cli

+-------------------------------------------+
| to_char(Utf8("2024-01-01 06:00:00"),NULL) |
+-------------------------------------------+
| 2024-01-01T06:00:00Z                      |
+-------------------------------------------+
1 row in set. Query took 0.021 seconds.

This is different from what Postgres does which is to return null.

> psql

to_char
---------

(1 row)

This PR intends to make the behaviour for Datafusion the same as Postgres for handling null format in to_char. So now, we''ll get null output too when the specified format is null.

> datafusion-cli

+-------------------------------------------+
| to_char(Utf8("2024-01-01 06:00:00"),NULL) |
+-------------------------------------------+
|                                           |
+-------------------------------------------+
1 row in set. Query took 0.022 seconds.

What changes are included in this PR?

We're passing null instead of the original data when the format is None.

Are these changes tested?

  • Existing tests have been updated.
  • 1 new test has been added.

Are there any user-facing changes?

Yes. Earlier to_char(timestamp, null) would result in getting the original value back. Now it results in a null string.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 19, 2024
@@ -171,7 +171,11 @@ fn _to_char_scalar(
// of the implementation in arrow-rs we need to convert it to an array
let data_type = &expression.data_type();
let is_scalar_expression = matches!(&expression, ColumnarValue::Scalar(_));
let array = expression.into_array(1)?;
let array_from_expr = expression.into_array(1)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for Reviewer:
We can't directly return null when we see that the format is None because format_options (passed to ArrayFormatter::try_new) allow specifying the string to show for null & we need to respect this configuration option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it is true that ArrayFormatter::try_new allows specifying the string to use for null values, I don't think that functionality is exposed via to_char

Thus I think this actually should simply return a new StringArray of all null values (which confusingly is different than NullArray)

So in this case I think if the format is None the code should return a null string value (namely ColumnarValue::Scalar(ScalarValue::Utf8(None)))

Copy link
Contributor Author

@tinfoil-knight tinfoil-knight Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this change. Slight difference in the type though. Please see #9689 (comment).

@tinfoil-knight
Copy link
Contributor Author

tinfoil-knight commented Mar 19, 2024

@Omega359 tagging you since you added the to_char implementations.

@Omega359
Copy link
Contributor

I'll do a full review of this tonight - at first look it looks good though.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tinfoil-knight -- this is looking like a good start

Thank you @Omega359 for offering to review as well

@@ -171,7 +171,11 @@ fn _to_char_scalar(
// of the implementation in arrow-rs we need to convert it to an array
let data_type = &expression.data_type();
let is_scalar_expression = matches!(&expression, ColumnarValue::Scalar(_));
let array = expression.into_array(1)?;
let array_from_expr = expression.into_array(1)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it is true that ArrayFormatter::try_new allows specifying the string to use for null values, I don't think that functionality is exposed via to_char

Thus I think this actually should simply return a new StringArray of all null values (which confusingly is different than NullArray)

So in this case I think if the format is None the code should return a null string value (namely ColumnarValue::Scalar(ScalarValue::Utf8(None)))

let formatter = ArrayFormatter::try_new(arrays[0].as_ref(), &format_options)?;
let result = formatter.value(idx).try_to_string();
let result = match format {
Some(_) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case we should also be formatting to None, as in change

let mut results: Vec<String> = vec![];

to

let mut results: Vec<Option<String>> = vec![];

so that it can properly represent nulls

Copy link
Contributor Author

@tinfoil-knight tinfoil-knight Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing to use the Vec<String> type. Reason is same as #9689 (comment) .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated:
Type has been changed to Vec<Option<String>> now.

ArrayFormatter::try_new(null_array.as_ref(), &format_options)?;
formatter.value(0).try_to_string()
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something else I noticed while looking at this code is that

        ColumnarValue::Array(_) => Ok(ColumnarValue::Array(Arc::new(StringArray::from(
            results,
        )) as ArrayRef)),

Effectively means the strings are copied twice (once to results and then again to the array).

As a follow on PR we could potentially use StringBuilder to build the final string array directly rather than allocating a bunch of small strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For someone whose role is to find ways to optimize things I seem to be rather poor at it sometimes :) Nice catch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well there are different degrees of optimization for sure ;)

@Omega359
Copy link
Contributor

The only additions I have beyond what @alamb covered is that we probably want to update the comment in invoke

// null format, use default formats

Since that wouldn't be true anymore.

In addition it would be nice to have the tests in to_char.rs updated to include a test that passes in None for the format and expecting a None result.

if format.is_none() {
if is_scalar_expression {
return Ok(ColumnarValue::Scalar(ScalarValue::Utf8(
Some(String::new()),
Copy link
Contributor Author

@tinfoil-knight tinfoil-knight Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using None instead of Some(String::new()) was causing NULL to appear instead of empty in SQL Logic Tests. (Output was same for CLI though.)

[SQL] select to_char(arrow_cast(123456, 'Duration(Second)'), null);
[Diff] (-expected|+actual)
-   (empty)
+   NULL
at test_files/timestamps.slt:2661

I don't think this is what we want. (empty) is the expected value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought None is the correct value (as it will semantically be aNULL, which is the correct results)

The sqllogictests have special formatting for NULL values (to distinguish them from empty strings)

https://github.com/apache/arrow-datafusion/blob/1d8a41bc8e08b56e90d6f8e6ef20e39a126987e4/datafusion/sqllogictest/src/engines/datafusion_engine/normalize.rs#L198-L200

However, when I double checked, the behavior of to_date in spark seems to be different yet (passing in a null format seems to simply ignore the format string and instead parses with the default -- it doesn't return null)

>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(functions.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df.select(functions.to_date(df.t, None).alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df = spark.createDataFrame([('1997-02-2ddddd',)], ['t'])
>>> df.select(functions.to_date(df.t, None).alias('date')).collect()
[Row(date=None)]
>>> df.select(functions.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
[Row(date=None)]

Any hints @Omega359 ?

)));
} else {
return Ok(ColumnarValue::Array(Arc::new(StringArray::from(
vec![String::new(); array.len()],

This comment was marked as outdated.

@tinfoil-knight

This comment was marked as outdated.

@tinfoil-knight tinfoil-knight requested a review from Omega359 March 21, 2024 13:08
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if format.is_none() {
if is_scalar_expression {
return Ok(ColumnarValue::Scalar(ScalarValue::Utf8(
Some(String::new()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought None is the correct value (as it will semantically be aNULL, which is the correct results)

The sqllogictests have special formatting for NULL values (to distinguish them from empty strings)

https://github.com/apache/arrow-datafusion/blob/1d8a41bc8e08b56e90d6f8e6ef20e39a126987e4/datafusion/sqllogictest/src/engines/datafusion_engine/normalize.rs#L198-L200

However, when I double checked, the behavior of to_date in spark seems to be different yet (passing in a null format seems to simply ignore the format string and instead parses with the default -- it doesn't return null)

>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(functions.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df.select(functions.to_date(df.t, None).alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df = spark.createDataFrame([('1997-02-2ddddd',)], ['t'])
>>> df.select(functions.to_date(df.t, None).alias('date')).collect()
[Row(date=None)]
>>> df.select(functions.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
[Row(date=None)]

Any hints @Omega359 ?

@Omega359
Copy link
Contributor

Omega359 commented Mar 21, 2024

I was targeting postgres with this as best as possible vs spark though null/empty testing was unfortunately not a big focus. I think the previous behavior (use default) was the spark behaviour - though that wasn't on purpose on my part.

Postgresql:

SELECT to_char('1997-02-28 10:30:00'::TIMESTAMP, null);
+-------+
|to_char|
+-------+
|null   |
+-------+

I knew about the behaviour difference with spark but it's not something that I can see can be fixed without a flag or dedicated functions per dialect. It actually affects me directly which I'll handle with preprocessing the formats to strip null/empty formats.

@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

I was targeting postgres with this as best as possible vs spark though null/empty testing was unfortunately not a big focus. I think the previous behavior (use default) was the spark behaviour - though that wasn't on purpose on my part.

That makes sense -- thank you

So in that case I do agree we should return None (aka NULL) when the format string is NULL.

@tinfoil-knight
Copy link
Contributor Author

I've made the change. We're returning None instead of empty strings now.

@tinfoil-knight tinfoil-knight requested a review from alamb March 23, 2024 12:06
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @tinfoil-knight

return Ok(ColumnarValue::Scalar(ScalarValue::Utf8(None)));
} else {
let result: Vec<Option<String>> = vec![None; array.len()];
return Ok(ColumnarValue::Array(Arc::new(StringArray::from(result))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance of this likely doesn't matter, but you can do the same thing more efficiently via new_null_array:

https://docs.rs/arrow/latest/arrow/array/fn.new_null_array.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@tinfoil-knight tinfoil-knight requested a review from alamb March 23, 2024 21:36
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @tinfoil-knight -- this PR now makes the function compatible with postgres (though it does something different than spark)

I think this is inevitable until we have some notion of "spark compatible" function packages e.g. #5600

@alamb alamb merged commit 227d1f8 into apache:main Mar 24, 2024
23 checks passed
@tinfoil-knight tinfoil-knight deleted the 9217-to_char-null branch March 24, 2024 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_char(.., null) should be NULL, not ..
3 participants