-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(5975/5976): timezone handling for timestamps and date_trunc
, date_part
and date_bin
#7614
Conversation
…h datetime scalar functions. * These test cases also document how our scalar functions are currently not correct. * Extra comments documenting the logical plan will be removed on test cleanup (after code fixes).
* Prior to this change, the outcome was always coerced to Timestamp(Nanoseconds, None) and the tz was dropped.
We need to support geographic as well as offset-based timezones, could you please add some test that set the timezone to "Europe/Athens" for example. Along with that comes daylight savings times so some tests to check the behviour around March and October would also be needed. In addition it would be good to add tests for the more funky timezones, such as "Australia/Adelaide" with its +09:30 offset. Making sure that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this contribution @wiedld 🏅 -- I think this PR is wonderfully written and tested. I think it is really close to mergeable. The only thing blocking it in my mind is the potential performance regression, but I think that will be straightforward to fix as I mentioned.
) -> Result<i64> { | ||
// Use chrono DateTime<Tz> to clear the various fields because need to clear per timezone, | ||
// and NaiveDateTime (ISO 8601) has no concept of timezones | ||
let tz = arrow_array::timezone::Tz::from_str(tz.as_deref().unwrap_or("+00"))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as written this will invoke as_datetime_with_timezone
for each row. This will effectively re-parse the same string for all rows which is likely to be expensive
This per-row parsing overhead will happen even for timestamps that have a tz of None
which the version on master doesn't do and thus I think this change would result in a performance regression if we merged this code as is.
However I think the fix should be relatively straightforward.
What would you think about parsing the timezone once per batch (basically parse it once in date_trunc
and then pass Option<Tz>
down through _date_trunc
rather than Option<Arc<str>>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an excellent idea. Tz parsing is now applied per batch.
Additionally, there was an inference of Tz (when None) in order to use DateTime<Tz>
. That has been replaced with the use of generics to accept either DateTime<Tz>
or NaiveDateTime
. Let me know how that looks!
cc @waitingkuo and @Weijun-H |
date_trunc
, date_part
and date_bin
# will not accept non-GMT geo abv | ||
# postgresql: accepts | ||
statement error | ||
SELECT TIMESTAMPTZ '2022-01-01 01:10:00 AEST' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
concerning the non-GMT vs GMT:
- there are some standards that accept only
GMT
and no other abbreviations (e.g. web standards, not ansi sql). - postgresql does accept the non-GMT abbreviations. But right now we get errors in the arrow parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
postgres also mentions that timezone abbreviations are not well standardized. Not advocating for a change here; just documenting in the tests that we don't support timezone abbr (outside of GMT
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe arrow-rs uses https://docs.rs/chrono-tz/latest/chrono_tz/ so DataFusion will inherit the same behavior
2023-03-11T10:00:00Z | ||
|
||
# will error if provide geo longform with time not possible due to daylight savings | ||
# Arrow error: Parser error: Error parsing timestamp from '2023-03-12 02:00:00 America/Los_Angeles': error computing timezone offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Daylight savings is applied in postgresql via (at minimum) two ways:
- the non-GMT abbreviations (which the parse does not accepted)
- in the geo longform
Since we do support the geo longform, but selectively error for an invalid time (due to daylight savings) -- is this something we wish to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note: offsets are just offsets, and do not consider daylight savings. It's only the geo information which includes this weird construct. 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of timezone names to support, if there are forms of timezones that we need supported that aren't supported by chrono-tz, I think we should file a ticket / fix it upstream in arrow-rs (not try to add specific timezones in DataFusion)
Reference: #7614 (comment) and @mhilton 's comments in #7614 (comment)
…n, should have already failed in parser if invalid
* move parsing up to date_trunc() to apply per batch, not per value. * do not infer a default UTC timezone for missing tz. Instead use the appropriate type for with, or without, tz.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- this looks great
|
||
# ok to use geo longform | ||
query P rowsort | ||
SELECT TIMESTAMPTZ '2022-01-01 01:10:00 Australia/Sydney' as ts_geo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty crazy -- I wonder what the output type is (whatever the session type is, I suppose 🤔 )
# will not accept non-GMT geo abv | ||
# postgresql: accepts | ||
statement error | ||
SELECT TIMESTAMPTZ '2022-01-01 01:10:00 AEST' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe arrow-rs uses https://docs.rs/chrono-tz/latest/chrono_tz/ so DataFusion will inherit the same behavior
@@ -280,6 +323,7 @@ fn date_trunc_coarse(granularity: &str, value: i64) -> Result<i64> { | |||
fn _date_trunc( | |||
tu: TimeUnit, | |||
value: &Option<i64>, | |||
tz: Arc<Option<Tz>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a Tz is an integer
println!("Size of a Tz: {}", std::mem::size_of::<arrow_array::timezone::Tz>());
Prints:
Size of a Tz: 4
So I think we can avoid this Arc. I'll make it a follow on PR: Update PR in #7630
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you @wiedld
sorry, didn't review in time. this is great, solved lots of issues. thank you @wiedld |
…ate_part` and `date_bin` (apache#7614) * test: enforce timestamptz contract * test(5975/5976): demonstrate what logical plan casting must occur with datetime scalar functions. * These test cases also document how our scalar functions are currently not correct. * Extra comments documenting the logical plan will be removed on test cleanup (after code fixes). * fix(5975/5976): enable type coercion to include specific timezones * Prior to this change, the outcome was always coerced to Timestamp(Nanoseconds, None) and the tz was dropped. * fix(5975/5976): have date_trunc use DateTime<Tz>, instead of NaiveDateTime * chore(5975/5976): test cleanup -- consolidate into the single timestamps test file * fix(5975/5976): enable all valid timezones to be supported in type coercion * chore: update cargo.lock in datafusion-cli * test(5975/5976): tests to document the bounds of timezone acceptance * test(5975/5976): document irregular offsets and daylight savings time * refactor(5975/5976): do not parse timezone string during type coersion, should have already failed in parser if invalid * chore: properly abbreviate abbreviations * fix(5975/5976): apply tz string parsing per batch. * move parsing up to date_trunc() to apply per batch, not per value. * do not infer a default UTC timezone for missing tz. Instead use the appropriate type for with, or without, tz.
Which issue does this PR close?
Closes #5975
Closes #5976
Closes #6818
Rationale for this change
Need builtin scalar functions (accepting timeszones) to properly handle the tz.
What changes are included in this PR?
Systematically(😅 😆) went through the logical plan, type coercion, then simplification steps — for the builtin scalar functions accepting timestamptz. Added a series of test cases, including 16 failing test cases (using Postgres as a control) to document the impact of the fixes.
Fixes occurred in three phases:
Timestamp(Nanosecond, None)
.Are these changes tested?
Tested for these scenarios:
Also added more date_trunc tests, since we had another fix there.
Are there any user-facing changes?
No APIs are changing, but we are now more accurately fulfilling the expected contract.