-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve null handling for to_char #9689
Conversation
@@ -171,7 +171,11 @@ fn _to_char_scalar( | |||
// of the implementation in arrow-rs we need to convert it to an array | |||
let data_type = &expression.data_type(); | |||
let is_scalar_expression = matches!(&expression, ColumnarValue::Scalar(_)); | |||
let array = expression.into_array(1)?; | |||
let array_from_expr = expression.into_array(1)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for Reviewer:
We can't directly return null when we see that the format is None
because format_options
(passed to ArrayFormatter::try_new
) allow specifying the string to show for null & we need to respect this configuration option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it is true that ArrayFormatter::try_new
allows specifying the string to use for null values, I don't think that functionality is exposed via to_char
Thus I think this actually should simply return a new StringArray
of all null values (which confusingly is different than NullArray
)
So in this case I think if the format is None
the code should return a null string value (namely ColumnarValue::Scalar(ScalarValue::Utf8(None))
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented this change. Slight difference in the type though. Please see #9689 (comment).
@Omega359 tagging you since you added the |
I'll do a full review of this tonight - at first look it looks good though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tinfoil-knight -- this is looking like a good start
Thank you @Omega359 for offering to review as well
@@ -171,7 +171,11 @@ fn _to_char_scalar( | |||
// of the implementation in arrow-rs we need to convert it to an array | |||
let data_type = &expression.data_type(); | |||
let is_scalar_expression = matches!(&expression, ColumnarValue::Scalar(_)); | |||
let array = expression.into_array(1)?; | |||
let array_from_expr = expression.into_array(1)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it is true that ArrayFormatter::try_new
allows specifying the string to use for null values, I don't think that functionality is exposed via to_char
Thus I think this actually should simply return a new StringArray
of all null values (which confusingly is different than NullArray
)
So in this case I think if the format is None
the code should return a null string value (namely ColumnarValue::Scalar(ScalarValue::Utf8(None))
)
let formatter = ArrayFormatter::try_new(arrays[0].as_ref(), &format_options)?; | ||
let result = formatter.value(idx).try_to_string(); | ||
let result = match format { | ||
Some(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case we should also be formatting to None, as in change
let mut results: Vec<String> = vec![];
to
let mut results: Vec<Option<String>> = vec![];
so that it can properly represent nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing to use the Vec<String>
type. Reason is same as #9689 (comment) .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated:
Type has been changed to Vec<Option<String>>
now.
ArrayFormatter::try_new(null_array.as_ref(), &format_options)?; | ||
formatter.value(0).try_to_string() | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something else I noticed while looking at this code is that
ColumnarValue::Array(_) => Ok(ColumnarValue::Array(Arc::new(StringArray::from(
results,
)) as ArrayRef)),
Effectively means the strings are copied twice (once to results
and then again to the array).
As a follow on PR we could potentially use StringBuilder
to build the final string array directly rather than allocating a bunch of small strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For someone whose role is to find ways to optimize things I seem to be rather poor at it sometimes :) Nice catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well there are different degrees of optimization for sure ;)
The only additions I have beyond what @alamb covered is that we probably want to update the comment in invoke
Since that wouldn't be true anymore. In addition it would be nice to have the tests in to_char.rs updated to include a test that passes in None for the format and expecting a None result. |
if format.is_none() { | ||
if is_scalar_expression { | ||
return Ok(ColumnarValue::Scalar(ScalarValue::Utf8( | ||
Some(String::new()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using None
instead of Some(String::new())
was causing NULL
to appear instead of empty
in SQL Logic Tests. (Output was same for CLI though.)
[SQL] select to_char(arrow_cast(123456, 'Duration(Second)'), null);
[Diff] (-expected|+actual)
- (empty)
+ NULL
at test_files/timestamps.slt:2661
I don't think this is what we want. (empty)
is the expected value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought None
is the correct value (as it will semantically be aNULL
, which is the correct results)
The sqllogictests have special formatting for NULL
values (to distinguish them from empty strings)
However, when I double checked, the behavior of to_date
in spark seems to be different yet (passing in a null format seems to simply ignore the format string and instead parses with the default -- it doesn't return null
)
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(functions.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df.select(functions.to_date(df.t, None).alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df = spark.createDataFrame([('1997-02-2ddddd',)], ['t'])
>>> df.select(functions.to_date(df.t, None).alias('date')).collect()
[Row(date=None)]
>>> df.select(functions.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
[Row(date=None)]
Any hints @Omega359 ?
))); | ||
} else { | ||
return Ok(ColumnarValue::Array(Arc::new(StringArray::from( | ||
vec![String::new(); array.len()], |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tinfoil-knight
if format.is_none() { | ||
if is_scalar_expression { | ||
return Ok(ColumnarValue::Scalar(ScalarValue::Utf8( | ||
Some(String::new()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought None
is the correct value (as it will semantically be aNULL
, which is the correct results)
The sqllogictests have special formatting for NULL
values (to distinguish them from empty strings)
However, when I double checked, the behavior of to_date
in spark seems to be different yet (passing in a null format seems to simply ignore the format string and instead parses with the default -- it doesn't return null
)
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(functions.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df.select(functions.to_date(df.t, None).alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df = spark.createDataFrame([('1997-02-2ddddd',)], ['t'])
>>> df.select(functions.to_date(df.t, None).alias('date')).collect()
[Row(date=None)]
>>> df.select(functions.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
[Row(date=None)]
Any hints @Omega359 ?
I was targeting postgres with this as best as possible vs spark though null/empty testing was unfortunately not a big focus. I think the previous behavior (use default) was the spark behaviour - though that wasn't on purpose on my part. Postgresql:
I knew about the behaviour difference with spark but it's not something that I can see can be fixed without a flag or dedicated functions per dialect. It actually affects me directly which I'll handle with preprocessing the formats to strip null/empty formats. |
That makes sense -- thank you So in that case I do agree we should return |
I've made the change. We're returning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @tinfoil-knight
return Ok(ColumnarValue::Scalar(ScalarValue::Utf8(None))); | ||
} else { | ||
let result: Vec<Option<String>> = vec![None; array.len()]; | ||
return Ok(ColumnarValue::Array(Arc::new(StringArray::from(result)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance of this likely doesn't matter, but you can do the same thing more efficiently via new_null_array
:
https://docs.rs/arrow/latest/arrow/array/fn.new_null_array.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @tinfoil-knight -- this PR now makes the function compatible with postgres (though it does something different than spark)
I think this is inevitable until we have some notion of "spark compatible" function packages e.g. #5600
Which issue does this PR close?
Closes #9217.
Rationale for this change
Before this PR: On using null format with
to_char
, DataFusion would return the original data.select to_char('2024-01-01 06:00:00'::timestamp, null);
This is different from what Postgres does which is to return null.
This PR intends to make the behaviour for Datafusion the same as Postgres for handling null format in
to_char
. So now, we''ll get null output too when the specified format is null.What changes are included in this PR?
We're passing null instead of the original data when the format is
None
.Are these changes tested?
Are there any user-facing changes?
Yes. Earlier
to_char(timestamp, null)
would result in getting the original value back. Now it results in a null string.