Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(5975/5976): timezone handling for timestamps and date_trunc, date_part and date_bin #7614

Merged
merged 13 commits into from
Sep 23, 2023

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Sep 21, 2023

Which issue does this PR close?

Closes #5975
Closes #5976
Closes #6818

Rationale for this change

Need builtin scalar functions (accepting timeszones) to properly handle the tz.

What changes are included in this PR?

Systematically(😅 😆) went through the logical plan, type coercion, then simplification steps — for the builtin scalar functions accepting timestamptz. Added a series of test cases, including 16 failing test cases (using Postgres as a control) to document the impact of the fixes.

Fixes occurred in three phases:

  1. Fix the type coercion, to no longer coerce to Timestamp(Nanosecond, None).
  2. Fix the simplification step where date_trunc is applied. It was using NaiveDateTime which does not have a concept of timezone.
  3. Generalize the signatures() and timestamp type coercion, such that each timezone does not need to be listed.

Are these changes tested?

Tested for these scenarios:

fun server is utc? src of tz
date_trunc no server
timestamp
yes server
timestamp
-------------- ---------------- -----------
date_bin no server
timestamp
yes server
timestamp
-------------- ---------------- -----------
date_part no server
timestamp
yes server
timestamp

Also added more date_trunc tests, since we had another fix there.

Are there any user-facing changes?

No APIs are changing, but we are now more accurately fulfilling the expected contract.

…h datetime scalar functions.

  * These test cases also document how our scalar functions are currently not correct.

  * Extra comments documenting the logical plan will be removed on test cleanup (after code fixes).
  * Prior to this change, the outcome was always coerced to Timestamp(Nanoseconds, None)
and the tz was dropped.
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Sep 21, 2023
@wiedld wiedld marked this pull request as ready for review September 21, 2023 01:27
@mhilton
Copy link
Contributor

mhilton commented Sep 21, 2023

We need to support geographic as well as offset-based timezones, could you please add some test that set the timezone to "Europe/Athens" for example. Along with that comes daylight savings times so some tests to check the behviour around March and October would also be needed.

In addition it would be good to add tests for the more funky timezones, such as "Australia/Adelaide" with its +09:30 offset. Making sure that date_bin still divides the hours correctly.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this contribution @wiedld 🏅 -- I think this PR is wonderfully written and tested. I think it is really close to mergeable. The only thing blocking it in my mind is the potential performance regression, but I think that will be straightforward to fix as I mentioned.

datafusion/expr/src/type_coercion/functions.rs Outdated Show resolved Hide resolved
datafusion/expr/src/type_coercion/functions.rs Outdated Show resolved Hide resolved
) -> Result<i64> {
// Use chrono DateTime<Tz> to clear the various fields because need to clear per timezone,
// and NaiveDateTime (ISO 8601) has no concept of timezones
let tz = arrow_array::timezone::Tz::from_str(tz.as_deref().unwrap_or("+00"))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as written this will invoke as_datetime_with_timezone for each row. This will effectively re-parse the same string for all rows which is likely to be expensive

This per-row parsing overhead will happen even for timestamps that have a tz of None which the version on master doesn't do and thus I think this change would result in a performance regression if we merged this code as is.

However I think the fix should be relatively straightforward.

What would you think about parsing the timezone once per batch (basically parse it once in date_trunc and then pass Option<Tz> down through _date_trunc rather than Option<Arc<str>>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an excellent idea. Tz parsing is now applied per batch.

Additionally, there was an inference of Tz (when None) in order to use DateTime<Tz>. That has been replaced with the use of generics to accept either DateTime<Tz> or NaiveDateTime. Let me know how that looks!

@alamb
Copy link
Contributor

alamb commented Sep 21, 2023

cc @waitingkuo and @Weijun-H

@alamb alamb changed the title fix(5975/5976): timezone handling fix(5975/5976): timezone handling for timestamps and date_trunc, date_part and date_bin Sep 21, 2023
@github-actions github-actions bot added the core Core DataFusion crate label Sep 22, 2023
# will not accept non-GMT geo abv
# postgresql: accepts
statement error
SELECT TIMESTAMPTZ '2022-01-01 01:10:00 AEST'
Copy link
Contributor Author

@wiedld wiedld Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concerning the non-GMT vs GMT:

  • there are some standards that accept only GMT and no other abbreviations (e.g. web standards, not ansi sql).
  • postgresql does accept the non-GMT abbreviations. But right now we get errors in the arrow parser.

Copy link
Contributor Author

@wiedld wiedld Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postgres also mentions that timezone abbreviations are not well standardized. Not advocating for a change here; just documenting in the tests that we don't support timezone abbr (outside of GMT).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe arrow-rs uses https://docs.rs/chrono-tz/latest/chrono_tz/ so DataFusion will inherit the same behavior

2023-03-11T10:00:00Z

# will error if provide geo longform with time not possible due to daylight savings
# Arrow error: Parser error: Error parsing timestamp from '2023-03-12 02:00:00 America/Los_Angeles': error computing timezone offset
Copy link
Contributor Author

@wiedld wiedld Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Daylight savings is applied in postgresql via (at minimum) two ways:

  • the non-GMT abbreviations (which the parse does not accepted)
  • in the geo longform

Since we do support the geo longform, but selectively error for an invalid time (due to daylight savings) -- is this something we wish to change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note: offsets are just offsets, and do not consider daylight savings. It's only the geo information which includes this weird construct. 😆

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of timezone names to support, if there are forms of timezones that we need supported that aren't supported by chrono-tz, I think we should file a ticket / fix it upstream in arrow-rs (not try to add specific timezones in DataFusion)

Reference: #7614 (comment) and @mhilton 's comments in #7614 (comment)

…n, should have already failed in parser if invalid
  * move parsing up to date_trunc() to apply per batch, not per value.
  * do not infer a default UTC timezone for missing tz. Instead use the appropriate type for with, or without, tz.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- this looks great


# ok to use geo longform
query P rowsort
SELECT TIMESTAMPTZ '2022-01-01 01:10:00 Australia/Sydney' as ts_geo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty crazy -- I wonder what the output type is (whatever the session type is, I suppose 🤔 )

# will not accept non-GMT geo abv
# postgresql: accepts
statement error
SELECT TIMESTAMPTZ '2022-01-01 01:10:00 AEST'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe arrow-rs uses https://docs.rs/chrono-tz/latest/chrono_tz/ so DataFusion will inherit the same behavior

@@ -280,6 +323,7 @@ fn date_trunc_coarse(granularity: &str, value: i64) -> Result<i64> {
fn _date_trunc(
tu: TimeUnit,
value: &Option<i64>,
tz: Arc<Option<Tz>>,
Copy link
Contributor

@alamb alamb Sep 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a Tz is an integer

    println!("Size of a Tz: {}", std::mem::size_of::<arrow_array::timezone::Tz>());

Prints:

Size of a Tz: 4

So I think we can avoid this Arc. I'll make it a follow on PR: Update PR in #7630

@alamb alamb merged commit d19e9d6 into apache:main Sep 23, 2023
21 checks passed
Copy link
Member

@Weijun-H Weijun-H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you @wiedld

@waitingkuo
Copy link
Contributor

sorry, didn't review in time. this is great, solved lots of issues. thank you @wiedld

Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
…ate_part` and `date_bin` (apache#7614)

* test: enforce timestamptz contract

* test(5975/5976): demonstrate what logical plan casting must occur with datetime scalar functions.

  * These test cases also document how our scalar functions are currently not correct.

  * Extra comments documenting the logical plan will be removed on test cleanup (after code fixes).

* fix(5975/5976): enable type coercion to include specific timezones

  * Prior to this change, the outcome was always coerced to Timestamp(Nanoseconds, None)
and the tz was dropped.

* fix(5975/5976): have date_trunc use DateTime<Tz>, instead of NaiveDateTime

* chore(5975/5976): test cleanup -- consolidate into the single timestamps test file

* fix(5975/5976): enable all valid timezones to be supported in type coercion

* chore: update cargo.lock in datafusion-cli

* test(5975/5976): tests to document the bounds of timezone acceptance

* test(5975/5976): document irregular offsets and daylight savings time

* refactor(5975/5976): do not parse timezone string during type coersion, should have already failed in parser if invalid

* chore: properly abbreviate abbreviations

* fix(5975/5976): apply tz string parsing per batch.

  * move parsing up to date_trunc() to apply per batch, not per value.
  * do not infer a default UTC timezone for missing tz. Instead use the appropriate type for with, or without, tz.
@wiedld wiedld deleted the 5975/5976/timezone-handling branch October 24, 2023 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

date_trunc drops timezone date_part doesn't consider timezone
5 participants