Ensure statistic defaults in parquet writers are in sync #11656

Merged
merged 8 commits into from
Jul 27, 2024

@wiedld wiedld commented Jul 25, 2024

Which issue does this PR close?

Closes #11367

Rationale for this change

Final step to ensure that all default configuration settings, between the parquet session options and the arrow writer options, remain in alignment.

What changes are included in this PR?

Document that the compression defaults are intentionally different.
Make the statistics_enabled defaults match.
Fix the bloom filter tests.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Jul 25, 2024

alamb commented Jul 25, 2024

The clippy failure can likely be resolved by updating from main

@alamb alamb marked this pull request as ready for review July 26, 2024 13:53
@alamb alamb changed the title from "Ensure statistic defaults in parquet writers are in sync, and note the intitial difference in compression's default setting." to "Ensure statistic defaults in parquet writers are in sync" Jul 26, 2024

@alamb alamb left a comment

Thank you @wiedld -- this looks good to me. 🙏

@@ -202,7 +202,7 @@ datafusion.execution.parquet.pruning true
datafusion.execution.parquet.pushdown_filters false
datafusion.execution.parquet.reorder_filters false
datafusion.execution.parquet.skip_metadata true
-datafusion.execution.parquet.statistics_enabled NULL
+datafusion.execution.parquet.statistics_enabled page

I think this is an improvement -- it doesn't change the default value (NULL means use the arrow-rs default, which is page), but now the default value is explicit in the config settings.

Also, there is a test to ensure the defaults don't accidentally drift from the arrow-rs defaults.
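
To make the NULL-versus-explicit point concrete, here is a minimal sketch in Rust. Every name in it (`StatsLevel`, `WRITER_DEFAULT_STATS`, `SessionParquetOptions`, `defaults_in_sync`) is a hypothetical stand-in, not the DataFusion or arrow-rs API. It only models the idea that an unset option falls back to the writer's default, and that a test can pin the explicit session default to that writer default.

```rust
// Hypothetical model only: these types are NOT the real DataFusion or arrow-rs types.

/// Stand-in for the statistics level a parquet writer can record.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StatsLevel {
    Off,
    Chunk,
    Page,
}

/// Stand-in for the writer library's built-in default (page, per the comment above).
const WRITER_DEFAULT_STATS: StatsLevel = StatsLevel::Page;

/// Stand-in for the session-level option; `None` models the old NULL setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SessionParquetOptions {
    statistics_enabled: Option<StatsLevel>,
}

impl SessionParquetOptions {
    /// Old shape of the default: unset, so the writer default applies implicitly.
    fn implicit_default() -> Self {
        Self { statistics_enabled: None }
    }

    /// New shape of the default: the same value, spelled out in the config.
    fn explicit_default() -> Self {
        Self { statistics_enabled: Some(StatsLevel::Page) }
    }

    /// The level the writer ends up using.
    fn effective_stats(&self) -> StatsLevel {
        self.statistics_enabled.unwrap_or(WRITER_DEFAULT_STATS)
    }
}

/// True while the explicit session default still equals the writer default.
fn defaults_in_sync() -> bool {
    SessionParquetOptions::explicit_default().statistics_enabled == Some(WRITER_DEFAULT_STATS)
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Fails if either default is changed without updating the other.
    #[test]
    fn explicit_default_matches_writer_default() {
        assert!(defaults_in_sync());
        assert_eq!(
            SessionParquetOptions::implicit_default().effective_stats(),
            SessionParquetOptions::explicit_default().effective_stats()
        );
    }
}
```

In this model both defaults resolve to the same effective level, so switching from the implicit to the explicit form changes what the config listing shows, not what the writer does.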

@alamb alamb merged commit a598739 into apache:main Jul 27, 2024
26 checks passed
@alamb alamb deleted the 11367/parquet-statistics-defaults branch July 27, 2024 11:19
Successfully merging this pull request may close these issues.

Inconsistent value for data_page_max_rows setting in DataFusion ParquetOptions and in ArrowWriterOptions