Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests that show the different defaults for ArrowWriter and TableParquetOptions #11524

Merged
merged 2 commits into from
Jul 17, 2024

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Jul 17, 2024

Which issue does this PR close?

Part of #11367
This PR defines the current state.

Rationale for this change

When we switched from using the parquet's ArrowWriter (with options) to the parallelized parquet writer (with it's own options), we ran into unintended behaviors due to different default settings.

Here are the places where the current defaults differ:

setting_name applied default, datafusion (ParquetOptions) default, parquet (ArrowWriterOptions)
data_page_row_count_limit file usize::MAX 20_000
column_index_truncate_length file None Some(64)
compression column default zstd uncompressed
dictionary_enabled column default None or true † true
statistics_enabled column default None or page † page
max_statistics_size column default None or 4096 † 4096

† For these settings, datafusion has no default (None). However, once datafusion's ParquetOptions are used by the extern parquet (a.k.a. converted to parquet's ArrowWriterOptions) then it uses the extern parquet's defaults. Refer to the newly added tests.

.

Additionally, there are differences in the bloom filter configurations based upon partial definition (a.k.a. leaving some as default, and some as defined):

bloom_filter_on fpp ndv if build from datafusion's ParquetOptions if build from (arrow-rs) parquet's WriterPropertiesBuilder
false none none None None
TRUE none none Some(BloomFilterProperties::default) None
true SOME none Some(BloomFilterProperties: fpp,default_ndv) None
true none SOME Some(BloomFilterProperties: ndv,default_fpp) None

What changes are included in this PR?

There are no functional changes in this PR.

This PR is adding tests to define and demonstrate the differences in the defaults. After discussion and consensus, we'll have a follow up PR for implementing any desired changes.

Are these changes tested?

Yes, it's only tests.

Are there any user-facing changes?

No.

@wiedld wiedld marked this pull request as ready for review July 17, 2024 20:40
@wiedld
Copy link
Contributor Author

wiedld commented Jul 17, 2024

Coincidentally, the treatment of these defaults just came up in discussion regarding memory debugging. What these writer props are (when actually doing the encoding), have an impact.

This PR is defining the current behavior only.
Not defining what it should be (altho yes the test cases use "should" lingo 😆 ).

@alamb alamb changed the title fix(11367): parquet writer defaults Add tests that show the different defaults for ArrowWriter and TableParquetOptions Jul 17, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- I think this looks great.

Since this is a test only change, I am going to merge it quickly so then we can make a PR to change the actual values

@alamb
Copy link
Contributor

alamb commented Jul 17, 2024

This PR is defining the current behavior only.
Not defining what it should be (altho yes the test cases use "should" lingo 😆 ).

I changed the title to reflect what this PR does 🙏 (as that shows up as the final commit message in the logs)

In general it is my opinion that the DataFusion defaults should mirror the arrow defaults -- I believe that was the intention when @devinjdangelo made the first version of this code. Over time the defaults in arrow have evolved and DataFusion hasn't kept up.

@alamb alamb merged commit 4b840c0 into apache:main Jul 17, 2024
24 checks passed
@alamb alamb deleted the 11367/parquet-writer-defaults-v2 branch July 17, 2024 21:05
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
…leParquetOptions` (apache#11524)

* test(11367): define current behavior of parquet writer configuration defaults

* chore(11367): update code comments to make it more explicit on the mismatches
wiedld added a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
…leParquetOptions` (apache#11524)

* test(11367): define current behavior of parquet writer configuration defaults

* chore(11367): update code comments to make it more explicit on the mismatches
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants