Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view, rename config option to schema_force_view_types #12232

Merged on Sep 10, 2024 (16 commits)

@wiedld wiedld commented Aug 29, 2024

Which issue does this PR close?

Closes #12123

Rationale for this change

On write: the parquet file is written with a utf8/large-utf8 and binary/large-binary schema (so that is the schema recorded in the file metadata).
On read: we would like to be able to read those columns as the more performant view types.

Previous work has already used the schema_force_string_view to read into utf8view and binaryview, by passing around a bool to the ParquetOpener.

This work is focused on getting the parquet statistics to compute properly on read when reading as view types.

What changes are included in this PR?

  • move the schema_force_string_view up a few lines, to be with the "read" (not write) config options.
  • remove the passing around of bools
    • this was done by merging table_schema (with views) and file_schema (without views)
  • add tests which run with true|false for schema_force_string_view
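
The schema merge described in the bullets above can be sketched roughly as follows. This is a toy model under stated assumptions, not the code in this PR: `DataType`, `Field`, `to_view_type`, `table_schema`, and `merge_schemas` are hypothetical stand-ins defined here for illustration, not arrow or DataFusion APIs. The idea is that the table schema carries the view types, the file schema carries the types recorded in the parquet metadata, and the merged schema adopts the view type only where the table schema asks for one.

```rust
// Toy model only: these types are hypothetical stand-ins,
// not arrow's `DataType` / `Field`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Binary,
    LargeBinary,
    Utf8View,
    BinaryView,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Map string/binary types to their view equivalents; leave other types alone.
fn to_view_type(dt: &DataType) -> DataType {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 => DataType::Utf8View,
        DataType::Binary | DataType::LargeBinary => DataType::BinaryView,
        other => other.clone(),
    }
}

/// Schema as seen by the query: the file schema, with view types
/// substituted when the (hypothetical) flag is on.
fn table_schema(file_schema: &[Field], force_view_types: bool) -> Vec<Field> {
    file_schema
        .iter()
        .map(|f| Field {
            name: f.name.clone(),
            data_type: if force_view_types {
                to_view_type(&f.data_type)
            } else {
                f.data_type.clone()
            },
        })
        .collect()
}

/// Merge the two schemas: for each file field, adopt the table's type when
/// the table asks for a view type for the same column name; otherwise keep
/// the type recorded in the file.
fn merge_schemas(table_schema: &[Field], file_schema: &[Field]) -> Vec<Field> {
    file_schema
        .iter()
        .map(|file_field| {
            let wanted = table_schema
                .iter()
                .find(|t| t.name == file_field.name)
                .map(|t| &t.data_type);
            let data_type = match wanted {
                Some(DataType::Utf8View) => DataType::Utf8View,
                Some(DataType::BinaryView) => DataType::BinaryView,
                _ => file_field.data_type.clone(),
            };
            Field { name: file_field.name.clone(), data_type }
        })
        .collect()
}
```

With the flag off, the merged schema in this sketch is identical to the file schema, which mirrors the idea of running the same tests with both `true` and `false`.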

There were two tests marked as incomplete: an expected panic and a commented-out test. Both were expected to pass once the next (anticipated) arrow release occurred. To prove this, there is a PoC over here which branches off the in-progress WIP for the arrow upgrade and then adds the commits from this PR. I have since merged in the latest main with the arrow-rs release (thank you!) and those tests are now passing.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate labels Aug 29, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 29, 2024
@wiedld wiedld marked this pull request as ready for review August 29, 2024 15:34
@alamb alamb changed the title Parquet statistic on read with schema_force_string_view. Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view. Sep 6, 2024
@alamb alamb left a comment

Thank you @wiedld -- this PR is really nicely thought out and implemented.

I think the idea of putting the "should we use view types" on to the ParquetFormat is much nicer than passing around boolean flags

I left some suggestions on this PR for naming and comments

I think now that the next arrow release is in #12032, perhaps you can merge up this branch and fix up the tests in preparation for merge?

@@ -380,6 +380,10 @@ config_namespace! {
/// the filters are applied in the same order as written in the query
pub reorder_filters: bool, default = false

/// (reading) If true, parquet reader will read columns of `Utf8/Utf8Large` with `Utf8View`,

is this change so that all the reading configuration values are before the writing ones?

While reviewing / considering this PR, I wonder if we should (in a follow on PR) rename this config flag to be schema_force_view_types as it also applies to binary columns 🤔

@wiedld wiedld Sep 9, 2024

Ah...I just did it all in this PR because it felt weird to have 2 naming conventions at once. Hopefully that's ok. 😅

datafusion/core/src/datasource/file_format/mod.rs (outdated, resolved)
datafusion/core/src/datasource/file_format/parquet.rs (outdated, resolved)
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) proto Related to proto crate functions labels Sep 9, 2024
@wiedld wiedld force-pushed the 12123/view-type-on-parquet-read branch from cb30be7 to 2438136 Compare September 9, 2024 16:16
@alamb alamb changed the title Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view. Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view, rename config option to schema_force_view_types Sep 9, 2024
@alamb alamb left a comment

Thank you @wiedld -- I think this PR looks good to merge from my perspective

@wiedld wiedld force-pushed the 12123/view-type-on-parquet-read branch from 2438136 to 02036fb Compare September 9, 2024 16:31
@alamb alamb merged commit 3ece7a7 into apache:main Sep 10, 2024
27 checks passed
alamb commented Sep 10, 2024

Thanks again @wiedld

@alamb alamb deleted the 12123/view-type-on-parquet-read branch September 10, 2024 11:32
Labels: common, core, documentation, functions, proto, sqllogictest
Linked issue: Parquet statistics missing when reading Utf8 as Utf8View
2 participants