Fix parquet statistics for ListingTable and Utf8View with `schema_force_string_view`, rename config option to `schema_force_view_types` #12232

wiedld · 2024-08-29T10:38:19Z

Which issue does this PR close?

Rationale for this change

On write: the parquet file is written with a utf8/large-utf8 and binary/large-binary schema (so that schema is what is recorded in the metadata).
On read: we would like to be able to read these columns as the more performant view types.

Previous work has already used the schema_force_string_view to read into utf8view and binaryview, by passing around a bool to the ParquetOpener.

This work is focused on getting the parquet statistics, on read, to compute properly when reading as view types.

What changes are included in this PR?

- Move `schema_force_string_view` up a few lines, to sit with the "read" (not write) config options.
- Remove the passing around of bools.
  - This was done by merging table_schema (with views) and file_schema (without views).
- Add tests which run with true|false for `schema_force_string_view`.
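
The schema merge described above can be sketched roughly as follows. This is a minimal, std-only model and not the code from this PR: `DataType`, `Field`, `field`, and `coerce_file_schema_to_view_types` are hypothetical stand-ins written for illustration, and the real implementation works on Arrow schemas. The idea is that the file schema (without views) takes the view type for any string or binary column that the table schema (with views) declares as a view.

```rust
use std::collections::HashMap;

/// Simplified stand-in for Arrow's data types (hypothetical, illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
    LargeUtf8,
    Utf8View,
    Binary,
    LargeBinary,
    BinaryView,
}

/// A named column; a schema here is just an ordered slice of fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

/// Convenience constructor used by the examples.
pub fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Hypothetical helper: returns a copy of `file_schema` in which string and
/// binary columns use the view type wherever `table_schema` declares that
/// column as a view. Returns `None` when no column needed to change.
pub fn coerce_file_schema_to_view_types(
    table_schema: &[Field],
    file_schema: &[Field],
) -> Option<Vec<Field>> {
    // Look up table-side types by column name.
    let table_types: HashMap<&str, &DataType> = table_schema
        .iter()
        .map(|f| (f.name.as_str(), &f.data_type))
        .collect();

    let mut changed = false;
    let mut coerced = Vec::with_capacity(file_schema.len());
    for f in file_schema {
        let table_type = table_types.get(f.name.as_str()).copied();
        let target = match (table_type, &f.data_type) {
            (Some(DataType::Utf8View), DataType::Utf8 | DataType::LargeUtf8) => {
                DataType::Utf8View
            }
            (Some(DataType::BinaryView), DataType::Binary | DataType::LargeBinary) => {
                DataType::BinaryView
            }
            // Every other column keeps the type recorded in the file.
            _ => f.data_type.clone(),
        };
        changed |= target != f.data_type;
        coerced.push(Field { name: f.name.clone(), data_type: target });
    }

    if changed { Some(coerced) } else { None }
}
```

With a coerced schema like this, the types used to look up statistics line up with the types the reader produces, so no separate boolean has to travel alongside the schema.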

There are two tests marked as incomplete: an expected panic and a commented-out test. Both tests will pass once the next (anticipated) arrow release occurs. To prove this, there is a PoC over here which branches off the in-progress WIP for the arrow upgrade and then adds the commits from this PR.

I merged in the latest main with the arrow-rs release (thank you!) and those tests are now passing.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

alamb

Thank you @wiedld -- this PR is really nicely thought out and implemented.

I think the idea of putting the "should we use view types" on to the ParquetFormat is much nicer than passing around boolean flags
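
As a rough illustration of that design choice, the sketch below keeps the view-type decision on the format object and sets it through a builder-style method. `ParquetFormatSketch` and `read_type_for` are hypothetical names made up for this example, and the `false` default is only an assumption of the sketch; `with_force_view_types` mirrors the method name mentioned in this PR's commits, but the body here is not DataFusion's implementation.

```rust
/// Hypothetical, simplified stand-in for a parquet file-format object.
/// It owns the "read as view types" decision instead of receiving a
/// boolean argument at every call site.
#[derive(Debug, Clone, Default)]
pub struct ParquetFormatSketch {
    // Assumption of this sketch: view types are off unless requested.
    force_view_types: bool,
}

impl ParquetFormatSketch {
    /// Builder-style setter, named after the method mentioned in this PR.
    pub fn with_force_view_types(mut self, enabled: bool) -> Self {
        self.force_view_types = enabled;
        self
    }

    /// Whether string/binary columns should be read as view types.
    pub fn force_view_types(&self) -> bool {
        self.force_view_types
    }

    /// Illustrative only: the type name a column is read as, given the
    /// type name recorded in the file.
    pub fn read_type_for<'a>(&self, file_type: &'a str) -> &'a str {
        match (self.force_view_types, file_type) {
            (true, "Utf8" | "LargeUtf8") => "Utf8View",
            (true, "Binary" | "LargeBinary") => "BinaryView",
            // Flag off, or a type that has no view counterpart here.
            _ => file_type,
        }
    }
}
```

Any code that already holds the format can ask it directly, for example `ParquetFormatSketch::default().with_force_view_types(true).read_type_for("Utf8")`, rather than threading a flag through each function signature.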

I left some suggestions on this PR for naming and comments.

I think now that the next arrow release is in #12032, perhaps you can merge up this branch and fix up the tests in preparation for merge?

alamb · 2024-09-06T21:47:07Z

datafusion/common/src/config.rs

@@ -380,6 +380,10 @@ config_namespace! {
        /// the filters are applied in the same order as written in the query
        pub reorder_filters: bool, default = false

+        /// (reading) If true, parquet reader will read columns of `Utf8/Utf8Large` with `Utf8View`,


Is this change so that all the reading configuration values are before the writing ones?

While reviewing / considering this PR, I wonder if we should (in a follow on PR) rename this config flag to be schema_force_view_types as it also applies to binary columns 🤔

Ah...I just did it all in this PR because it felt weird to have 2 naming conventions at once. Hopefully that's ok. 😅

datafusion/core/src/datasource/file_format/mod.rs

datafusion/core/src/datasource/file_format/parquet.rs

alamb

Thank you @wiedld -- I think this PR looks good to merge from my perspective

datafusion/core/src/datasource/file_format/parquet.rs

alamb · 2024-09-10T11:32:07Z

Thanks again @wiedld

wiedld added 4 commits August 28, 2024 17:27

chore: move schema_force_string_view upwards to be listed with other …

a20897a

…reading props

refactor(12123): have file schema be merged on view types with table …

cb425b3

…schema

test(12123): test for with, and without schema_force_string_view

ad7898b

test(12123): demonstrate current upstream failure when reading page s…

0992ce1

…tats

github-actions bot added core Core DataFusion crate common Related to common crate labels Aug 29, 2024

chore(12123): update config.md

e697cdb

github-actions bot added the documentation Improvements or additions to documentation label Aug 29, 2024

chore: cleanup

57a9543

wiedld added a commit to influxdata/arrow-datafusion that referenced this pull request Aug 29, 2024

test: proof that once apache#12032 merges, the PR for apache#12232 wi…

a1edaa0

…ll solve the issue

wiedld mentioned this pull request Aug 29, 2024

Proof of issue fix -- demonstrated with rebase on in-progress arrow upgrade. influxdata/arrow-datafusion#37

Closed

chore(12123): temporarily remove test until next arrow release

8a19936

wiedld marked this pull request as ready for review August 29, 2024 15:34

Merge branch 'main' into 12123/view-type-on-parquet-read

3744e89

alamb changed the title ~~Parquet statistic on read with schema_force_string_view.~~ Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view. Sep 6, 2024

alamb reviewed Sep 6, 2024

Merge branch 'main' into 12123/view-type-on-parquet-read

68cc745

github-actions bot added sqllogictest SQL Logic Tests (.slt) proto Related to proto crate functions labels Sep 9, 2024

wiedld force-pushed the 12123/view-type-on-parquet-read branch from cb30be7 to 2438136 Compare September 9, 2024 16:16

alamb changed the title ~~Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view.~~ Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view, rename config option to schema_force_view_types Sep 9, 2024

alamb approved these changes Sep 9, 2024

wiedld added 6 commits September 9, 2024 09:31

chore(12123): rename all variables to force_view_types

04f471b

refactor(12123): make interface ParquetFormat::with_force_view_types …

34c9ae6

…public

chore(12123): rename helper method which coerces the schema (not merg…

1a84734

…ing fields)

chore(12123): add dosc to ParquetFormat to clarify exactly how the vi…

604ef1e

…ew types are used

test(12123): cleanup tests to be more explicit with ForceViews enum

64b62c9

test(12123): update tests to pass now that latest arrow-rs release is in

02036fb

wiedld force-pushed the 12123/view-type-on-parquet-read branch from 2438136 to 02036fb Compare September 9, 2024 16:31

fix: use proper naming on benchmark

b5c725f

alamb merged commit 3ece7a7 into apache:main Sep 10, 2024
27 checks passed

alamb deleted the 12123/view-type-on-parquet-read branch September 10, 2024 11:32

alamb mentioned this pull request Sep 11, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 2, 2024 #12336

Closed

4 tasks
