Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default Parquet writer settings to match arrow-rs (except for compression & statistics) #11558

Merged
merged 7 commits into from
Jul 23, 2024

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Jul 19, 2024

Which issue does this PR close?

Part of #11367

Rationale for this change

Depending on how we construct our parquet writer settings, we are using a different set of default configurations as outlined here. This PR is the first step in reducing this misalignment.

What changes are included in this PR?

In order of commits:

  1. define a set of datafusion constants based upon parquet (arrow-rs) defaults.
  2. use the constants which are the same in each (a.k.a. no test changes)
  3. use the constants which do change the datafusion defaults, and update tests to match
  4. show the remaining defaults which datafusion is not yet using (a.k.a. what still doesn't match)

Are these changes tested?

Yes.

Are there any user-facing changes?

No, except if anyone is relying upon this configuration misalignment in their code.

Comment on lines 351 to 361
#[allow(dead_code)]
/// Default value for [`parquet::WriterProperties::statistics_enabled`]
pub const DEFAULT_STATISTICS_ENABLED: Option<&str> = Some("page");
#[allow(dead_code)]
/// Default value for [`parquet::BloomFilterProperties::fpp`]
pub const DEFAULT_BLOOM_FILTER_FPP: Option<f64> =
Some(props::DEFAULT_BLOOM_FILTER_FPP);
#[allow(dead_code)]
/// Default value for [`parquet::BloomFilterProperties::ndv`]
pub const DEFAULT_BLOOM_FILTER_NDV: Option<u64> =
Some(props::DEFAULT_BLOOM_FILTER_NDV);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These will be addressed in the next PR.

Comment on lines 363 to 365
#[allow(dead_code)]
/// Default value for [`parquet::WriterProperties::compression`]
pub const DEFAULT_COMPRESSION: Compression = Compression::UNCOMPRESSED;
Copy link
Contributor Author

@wiedld wiedld Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow-rs parquet's default is uncompressed. Whereas the datafusion default is zstd(3).

I think it makes sense for datafusion's default to be compressed, which argues to change the other default instead. But I'm unclear as to why the parquet (arrow-rs) default is uncompressed. Is this leftover? Or is intentional for the ArrowWriterOptions?

Copy link
Contributor

@devinjdangelo devinjdangelo Jul 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this in #7692 and decided that DataFusion would have parquet compression by default. I am not sure if arrow-rs has different factors that make uncompressed by default preferable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a tension between raw read performance (uncompressed) and storage efficiency (compressed). @tustvold I think preferred arrow-rs to be on the read performance side of the equation

@wiedld wiedld force-pushed the 11367/update-defaults branch from 4828be1 to d863f01 Compare July 19, 2024 23:39
@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Jul 19, 2024
@wiedld wiedld force-pushed the 11367/update-defaults branch from d863f01 to 38124ba Compare July 19, 2024 23:47
@wiedld wiedld marked this pull request as ready for review July 20, 2024 00:56
@alamb alamb changed the title Minor: have datafusion use the same defaults as arrow-rs's parquet (except for compression & statistics) Change default Parquet writer settings to match arrow-rs (except for compression & statistics) Jul 20, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld

I changed the title of this PR to more accurately reflect its contents / effects for others

In general I think DataFusion should be following the arrow writer's default values so I think this is a good step forward.

I left a suggestion that I think would dramatically simplify this PR and reduce its size, but I think we could also do it as a follow on PR

datafusion/common/src/config.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Jul 20, 2024

cc @devinjdangelo -- I don't recall any reason the defaults for the parquet writer in datafusion were meant to be different than the defaults in parquet-rs. Do you?

@devinjdangelo
Copy link
Contributor

cc @devinjdangelo -- I don't recall any reason the defaults for the parquet writer in datafusion were meant to be different than the defaults in parquet-rs. Do you?

The only default we changed intentionally after discussion was compression. I believe all others were intended to inherit the arrow-rs default.

@alamb
Copy link
Contributor

alamb commented Jul 21, 2024

The only default we changed intentionally after discussion was compression. I believe all others were intended to inherit the arrow-rs default.

@wiedld could you make sure this rationale is documented somewhere (comments and the option's description?)

@wiedld wiedld force-pushed the 11367/update-defaults branch from 3bd55de to 3e85dc4 Compare July 22, 2024 03:08
…rquet feature, instead rely upon regression testing
@wiedld wiedld force-pushed the 11367/update-defaults branch from 3e85dc4 to d972416 Compare July 22, 2024 03:21
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- I think this PR nicely improves the system's behavior and is clear what has changed. ❤️

cc @hveiga

docs/source/user-guide/configs.md Show resolved Hide resolved
// Confirm all other settings are equal.
// First resolve the known discrepancies, (set as the same).
// TODO: once we fix the above mis-matches, we should be able to remove this.
let mut from_extern_parquet =
session_config_from_writer_props(&default_writer_props);
from_extern_parquet.global.compression = Some("zstd(3)".into());
from_extern_parquet.global.data_page_row_count_limit = usize::MAX;
from_extern_parquet.global.column_index_truncate_length = None;
from_extern_parquet.global.dictionary_enabled = None;
from_extern_parquet.global.statistics_enabled = None;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any rationale for not setting the statistics value to the same as in the arrow-rs writer?

Copy link
Contributor Author

@wiedld wiedld Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The statistics_enabled and the bloom filter setting will require a bit more updates to do. In order to keep the diff small, (and more easily reviewable), I was going to do in a follow up PR. Is that ok @alamb ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SOunds good to me 👍 thank you

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

I restarted the failed pyarrow check as it seemed to be an infrastructure problem

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

I'll plan to merge this tomorrow unless there are any other comments

@alamb alamb merged commit 9f74dcc into apache:main Jul 23, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 23, 2024

Thanks again @wiedld and @devinjdangelo

@alamb alamb deleted the 11367/update-defaults branch July 23, 2024 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants