fix(5975/5976): timezone handling for timestamps and `date_trunc`, `date_part` and `date_bin` #7614

wiedld · 2023-09-21T01:19:28Z

Which issue does this PR close?

Closes #5975
Closes #5976
Closes #6818

Rationale for this change

The builtin scalar functions that accept timezones need to handle the tz properly.

What changes are included in this PR?

Systematically(😅 😆) went through the logical plan, type coercion, then simplification steps — for the builtin scalar functions accepting timestamptz. Added a series of test cases, including 16 failing test cases (using Postgres as a control) to document the impact of the fixes.

Fixes occurred in three phases:

1. Fix the type coercion, to no longer coerce to Timestamp(Nanosecond, None).
2. Fix the simplification step where date_trunc is applied. It was using NaiveDateTime which does not have a concept of timezone.
3. Generalize the signatures() and timestamp type coercion, such that each timezone does not need to be listed.
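
The third phase can be pictured with a small, self-contained sketch. It is not DataFusion's code: `DataType`, `TimeUnit`, `TZ_WILDCARD`, and `coerce_timestamp` are hypothetical names chosen for illustration. The idea is that a signature names a single wildcard timezone instead of enumerating every zone, and coercion keeps the timezone carried by the argument rather than dropping it.

```rust
// Hypothetical, simplified model of timestamp type coercion with a timezone
// wildcard. None of these names are taken from DataFusion.

#[derive(Debug, Clone, PartialEq)]
enum TimeUnit {
    Millisecond,
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Timestamp(TimeUnit, Option<String>),
    Utf8,
}

/// Placeholder meaning "any timezone" inside a function signature (assumed name).
const TZ_WILDCARD: &str = "+TZ";

/// Returns the type the argument should be coerced to, or None if the
/// argument does not match the accepted type.
fn coerce_timestamp(arg: &DataType, accepted: &DataType) -> Option<DataType> {
    match (arg, accepted) {
        // Wildcard signature: accept any tz-aware timestamp and keep its timezone.
        (DataType::Timestamp(_, Some(tz)), DataType::Timestamp(unit, Some(want)))
            if want.as_str() == TZ_WILDCARD =>
        {
            Some(DataType::Timestamp(unit.clone(), Some(tz.clone())))
        }
        // Signature without a timezone: accept only tz-less timestamps.
        (DataType::Timestamp(_, None), DataType::Timestamp(unit, None)) => {
            Some(DataType::Timestamp(unit.clone(), None))
        }
        _ => None,
    }
}
```

In this sketch a `Timestamp(Millisecond, Some("Europe/Athens"))` argument matched against a `Timestamp(Nanosecond, Some("+TZ"))` signature coerces to `Timestamp(Nanosecond, Some("Europe/Athens"))`, so the unit changes but the timezone survives.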

Are these changes tested?

Tested for these scenarios:

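| Function | Server is UTC? | Source of tz |
|---|---|---|
| date_trunc | no | server |
| date_trunc | no | timestamp |
| date_trunc | yes | server |
| date_trunc | yes | timestamp |
| date_bin | no | server |
| date_bin | no | timestamp |
| date_bin | yes | server |
| date_bin | yes | timestamp |
| date_part | no | server |
| date_part | no | timestamp |
| date_part | yes | server |
| date_part | yes | timestamp |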

Also added more date_trunc tests, since we had another fix there.

Are there any user-facing changes?

No APIs are changing, but we are now more accurately fulfilling the expected contract.

mhilton · 2023-09-21T06:53:44Z

We need to support geographic as well as offset-based timezones. Could you please add some tests that set the timezone to "Europe/Athens", for example? That also brings in daylight saving time, so some tests that check the behaviour around the March and October transitions would be needed as well.

In addition it would be good to add tests for the more funky timezones, such as "Australia/Adelaide" with its +09:30 offset. Making sure that date_bin still divides the hours correctly.
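
To make the +09:30 concern concrete, here is a minimal sketch using plain integer arithmetic rather than DataFusion or chrono. `bin_hour_local` is a hypothetical helper, and the offset is treated as fixed (a real implementation would look up the offset in effect at each instant). It shows that hour boundaries in such a zone fall on the half hour in UTC, so binning in UTC and binning in local time give different answers.

```rust
const HOUR: i64 = 3_600;

/// Floor `utc_secs` to the start of its hour as seen in a zone with a fixed
/// offset (seconds east of UTC), and return the bin start as UTC seconds.
/// Hypothetical helper, for illustration only.
fn bin_hour_local(utc_secs: i64, offset_secs: i64) -> i64 {
    // Shift to local wall-clock time.
    let local = utc_secs + offset_secs;
    // Floor to the local hour; div_euclid also floors negative values.
    let local_bin = local.div_euclid(HOUR) * HOUR;
    // Shift the bin start back to UTC.
    local_bin - offset_secs
}
```

For an instant ten minutes past a UTC hour, the local time at +09:30 is forty minutes past the hour, so `bin_hour_local(600, 34_200)` yields a bin that starts thirty minutes before that UTC hour, while the same call with an offset of `0` yields the UTC hour itself.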

alamb

Thank you so much for this contribution @wiedld 🏅 -- I think this PR is wonderfully written and tested. I think it is really close to mergeable. The only thing blocking it in my mind is the potential performance regression, but I think that will be straightforward to fix as I mentioned.

alamb · 2023-09-21T16:32:38Z

datafusion/physical-expr/src/datetime_expressions.rs

+) -> Result<i64> {
+    // Use chrono DateTime<Tz> to clear the various fields because need to clear per timezone,
+    // and NaiveDateTime (ISO 8601) has no concept of timezones
+    let tz = arrow_array::timezone::Tz::from_str(tz.as_deref().unwrap_or("+00"))?;


I think as written this will invoke as_datetime_with_timezone for each row. This will effectively re-parse the same string for all rows, which is likely to be expensive.

This per-row parsing overhead will happen even for timestamps that have a tz of None, which the version on master doesn't do, and thus I think this change would result in a performance regression if we merged this code as is.

However I think the fix should be relatively straightforward.

What would you think about parsing the timezone once per batch (basically parse it once in date_trunc and then pass Option<Tz> down through _date_trunc rather than Option<Arc<str>>)?

wiedld · 2023-09-23

That's an excellent idea. Tz parsing is now applied per batch.

Additionally, there was an inference of Tz (when None) in order to use DateTime<Tz>. That has been replaced with the use of generics to accept either DateTime<Tz> or NaiveDateTime. Let me know how that looks!
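
The shape of that change can be sketched without arrow or chrono. In the snippet below, `Offset`, `parse_offset`, `trunc_day`, and `trunc_day_batch` are hypothetical stand-ins (the PR itself works with `Tz` and chrono date types, and uses generics rather than the `match` shown here). The point is only that the timezone string is parsed once per batch, and that a missing timezone is handled as such rather than being replaced by an inferred UTC.

```rust
const DAY: i64 = 86_400;

/// Stand-in for a parsed timezone: a fixed offset in seconds east of UTC.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Offset(i64);

/// Parse strings such as "+09:30" or "-08:00". Deliberately minimal.
fn parse_offset(s: &str) -> Result<Offset, String> {
    let err = || format!("invalid offset: {}", s);
    let sign = match s.as_bytes().first() {
        Some(b'+') => 1,
        Some(b'-') => -1,
        _ => return Err(err()),
    };
    let (h, m) = s[1..].split_once(':').ok_or_else(err)?;
    let h: i64 = h.parse().map_err(|_| err())?;
    let m: i64 = m.parse().map_err(|_| err())?;
    Ok(Offset(sign * (h * 3_600 + m * 60)))
}

/// Per-value work: floor to midnight, in local time when an offset is known.
fn trunc_day(value: i64, tz: Option<Offset>) -> i64 {
    match tz {
        // tz-aware: shift to local time, floor, shift back to UTC.
        Some(Offset(off)) => (value + off).div_euclid(DAY) * DAY - off,
        // No timezone: operate on the naive value directly.
        None => value.div_euclid(DAY) * DAY,
    }
}

/// Per-batch entry point: the timezone string is parsed exactly once here,
/// and the parsed value is passed down to the per-value function.
fn trunc_day_batch(
    values: &[Option<i64>],
    tz: Option<&str>,
) -> Result<Vec<Option<i64>>, String> {
    let tz = tz.map(parse_offset).transpose()?;
    Ok(values.iter().map(|v| v.map(|v| trunc_day(v, tz))).collect())
}
```

Because the parsed `Offset` is a small `Copy` value, handing it to every row costs nothing beyond the single parse at the top of the batch.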

alamb · 2023-09-21T16:40:40Z

cc @waitingkuo and @Weijun-H

wiedld · 2023-09-22T22:10:10Z

datafusion/sqllogictest/test_files/timestamps.slt

+# will not accept non-GMT geo abv
+# postgresql: accepts
+statement error
+SELECT TIMESTAMPTZ '2022-01-01 01:10:00 AEST'


Concerning non-GMT vs. GMT abbreviations:

There are some standards that accept only GMT and no other abbreviations (e.g. web standards, not ANSI SQL).

PostgreSQL does accept the non-GMT abbreviations, but right now we get errors in the arrow parser.

Postgres also mentions that timezone abbreviations are not well standardized. Not advocating for a change here; just documenting in the tests that we don't support timezone abbreviations (outside of GMT).

wiedld · 2023-09-22T22:11:29Z

datafusion/sqllogictest/test_files/timestamps.slt

+2023-03-11T10:00:00Z
+
+# will error if provide geo longform with time not possible due to daylight savings
+# Arrow error: Parser error: Error parsing timestamp from '2023-03-12 02:00:00 America/Los_Angeles': error computing timezone offset


Daylight savings is applied in postgresql via (at minimum) two ways:

* the non-GMT abbreviations (which the parser does not accept)
* the geo longform

Since we do support the geo longform, but selectively error for an invalid time (due to daylight savings) -- is this something we wish to change?

Also note: offsets are just offsets, and do not consider daylight savings. It's only the geo information which includes this weird construct. 😆
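
Why 02:00 on 2023-03-12 cannot be resolved for America/Los_Angeles can be shown with a toy model. The sketch below is not how arrow-rs or chrono-tz work internally; `offset_at` and `resolve_local` are hypothetical helpers, and the zone is reduced to the single spring transition from UTC-8 to UTC-7. A wall-clock time is valid only if some offset maps it to an instant at which that same offset is actually in effect.

```rust
const HOUR: i64 = 3_600;
/// 2023-03-12T00:00:00 expressed as seconds since the Unix epoch.
const MAR_12_MIDNIGHT: i64 = 1_678_579_200;
/// Instant at which the toy zone switches from UTC-8 to UTC-7 (10:00 UTC).
const TRANSITION_UTC: i64 = MAR_12_MIDNIGHT + 10 * HOUR;
const PST: i64 = -8 * HOUR;
const PDT: i64 = -7 * HOUR;

/// Offset in effect at a given UTC instant (single-transition toy model).
fn offset_at(utc: i64) -> i64 {
    if utc < TRANSITION_UTC {
        PST
    } else {
        PDT
    }
}

/// Map a naive local wall-clock time to UTC, or return None if that
/// wall-clock time never occurs in this zone.
fn resolve_local(naive_local: i64) -> Option<i64> {
    for off in [PST, PDT] {
        let utc = naive_local - off;
        // The candidate is valid only if the zone really uses `off` at `utc`.
        if offset_at(utc) == off {
            return Some(utc);
        }
    }
    None
}
```

Here 01:00 and 03:00 local both resolve, but 02:00 local matches neither offset and comes back as `None`, which corresponds to the "error computing timezone offset" case in the test above. A plain offset such as `-08:00` has no transition, so it never produces such a gap. The model ignores the autumn transition, where a wall-clock time occurs twice.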

In terms of timezone names to support, if there are forms of timezones that we need supported that aren't supported by chrono-tz, I think we should file a ticket / fix it upstream in arrow-rs (not try to add specific timezones in DataFusion)

Reference: #7614 (comment) and @mhilton 's comments in #7614 (comment)

alamb

Thank you @wiedld -- this looks great

alamb · 2023-09-23T00:12:50Z

datafusion/sqllogictest/test_files/timestamps.slt

+
+# ok to use geo longform
+query P rowsort
+SELECT TIMESTAMPTZ '2022-01-01 01:10:00 Australia/Sydney' as ts_geo


this is pretty crazy -- I wonder what the output type is (whatever the session type is, I suppose 🤔 )

alamb · 2023-09-23T10:07:43Z

datafusion/sqllogictest/test_files/timestamps.slt

+# will not accept non-GMT geo abv
+# postgresql: accepts
+statement error
+SELECT TIMESTAMPTZ '2022-01-01 01:10:00 AEST'


I believe arrow-rs uses https://docs.rs/chrono-tz/latest/chrono_tz/ so DataFusion will inherit the same behavior

alamb · 2023-09-23T10:17:22Z

datafusion/physical-expr/src/datetime_expressions.rs

@@ -280,6 +323,7 @@ fn date_trunc_coarse(granularity: &str, value: i64) -> Result<i64> {
 fn _date_trunc(
    tu: TimeUnit,
    value: &Option<i64>,
+    tz: Arc<Option<Tz>>,


I think a Tz is an integer

println!("Size of a Tz: {}", std::mem::size_of::<arrow_array::timezone::Tz>());

Prints:

Size of a Tz: 4

So I think we can avoid this Arc. ~~I'll make it a follow on PR~~: Update PR in #7630
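
The reasoning generalizes to any small `Copy` value. The sketch below uses a hypothetical 4-byte `SmallTz` in place of the real `Tz` (it assumes, as the measurement above suggests, that the real type is similarly cheap to copy) and compares passing it by value with passing it behind an `Arc`.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical 4-byte, `Copy` stand-in for a parsed timezone value.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SmallTz(i32);

/// Passing the value directly: no allocation, no reference counting.
fn offset_by_value(tz: Option<SmallTz>) -> i32 {
    tz.map(|t| t.0).unwrap_or(0)
}

/// Passing through an Arc: same result, plus a heap allocation when the Arc
/// is created and a pointer dereference on every access.
fn offset_by_arc(tz: Arc<Option<SmallTz>>) -> i32 {
    (*tz).map(|t| t.0).unwrap_or(0)
}

/// Sizes of the bare value and of the Arc handle that would wrap it.
fn sizes() -> (usize, usize) {
    (size_of::<SmallTz>(), size_of::<Arc<Option<SmallTz>>>())
}
```

The `Arc` handle is pointer-sized, so on a 64-bit target it is larger than the 4-byte value it points to, and it buys nothing when the value can simply be copied.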

Weijun-H

LGTM! Thank you @wiedld

waitingkuo · 2023-09-25T04:25:56Z

sorry, didn't review in time. this is great, solved lots of issues. thank you @wiedld

…ate_part` and `date_bin` (apache#7614)

* test: enforce timestamptz contract
* test(5975/5976): demonstrate what logical plan casting must occur with datetime scalar functions.
  * These test cases also document how our scalar functions are currently not correct.
  * Extra comments documenting the logical plan will be removed on test cleanup (after code fixes).
* fix(5975/5976): enable type coercion to include specific timezones
  * Prior to this change, the outcome was always coerced to Timestamp(Nanoseconds, None) and the tz was dropped.
* fix(5975/5976): have date_trunc use DateTime<Tz>, instead of NaiveDateTime
* chore(5975/5976): test cleanup -- consolidate into the single timestamps test file
* fix(5975/5976): enable all valid timezones to be supported in type coercion
* chore: update cargo.lock in datafusion-cli
* test(5975/5976): tests to document the bounds of timezone acceptance
* test(5975/5976): document irregular offsets and daylight savings time
* refactor(5975/5976): do not parse timezone string during type coersion, should have already failed in parser if invalid
* chore: properly abbreviate abbreviations
* fix(5975/5976): apply tz string parsing per batch.
  * move parsing up to date_trunc() to apply per batch, not per value.
  * do not infer a default UTC timezone for missing tz. Instead use the appropriate type for with, or without, tz.

wiedld added 6 commits September 20, 2023 12:22

test: enforce timestamptz contract

2ea5007

fix(5975/5976): enable type coercion to include specific timezones

9e1f590

* Prior to this change, the outcome was always coerced to Timestamp(Nanoseconds, None) and the tz was dropped.

fix(5975/5976): have date_trunc use DateTime<Tz>, instead of NaiveDateTime

fe8357d

chore(5975/5976): test cleanup -- consolidate into the single timestamps test file

a075403

fix(5975/5976): enable all valid timezones to be supported in type coercion

c53b6a4

github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Sep 21, 2023

wiedld marked this pull request as ready for review September 21, 2023 01:27

chore: update cargo.lock in datafusion-cli

a92fbd5

alamb mentioned this pull request Sep 21, 2023

fix: date_trunc support timezone #6818

Closed

alamb changed the title ~~fix(5975/5976): timezone handling~~ fix(5975/5976): timezone handling for timestamps and date_trunc, date_part and date_bin Sep 21, 2023

wiedld added 3 commits September 22, 2023 14:47

test(5975/5976): tests to document the bounds of timezone acceptance

8672c46

test(5975/5976): document irregular offsets and daylight savings time

c6a6735

Merge branch 'main' into 5975/5976/timezone-handling

40b39cf

github-actions bot added the core Core DataFusion crate label Sep 22, 2023

wiedld added 3 commits September 22, 2023 15:31

refactor(5975/5976): do not parse timezone string during type coersion, should have already failed in parser if invalid

16757ee

chore: properly abbreviate abbreviations

9f6bc7f

fix(5975/5976): apply tz string parsing per batch.

9f4cd11

* move parsing up to date_trunc() to apply per batch, not per value.
* do not infer a default UTC timezone for missing tz. Instead use the appropriate type for with, or without, tz.

alamb approved these changes Sep 23, 2023

alamb mentioned this pull request Sep 23, 2023

Minor: remove unecessary Arcs in datetime_expressions #7630

Merged

alamb merged commit d19e9d6 into apache:main Sep 23, 2023
21 checks passed

Weijun-H reviewed Sep 23, 2023

wiedld deleted the 5975/5976/timezone-handling branch October 24, 2023 06:27
