Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor: more clearly delineate between TableParquetOptions and ParquetWriterOptions #11444

Merged
merged 11 commits into from
Jul 17, 2024

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Jul 12, 2024

Which issue does this PR close?

Is a cleanup before doing #11367.

This PR delineates btwn the session-level config vs the per-writer config.
It also adds tests to demonstrate how these two configurations get out-of-sync. Tests in the followup PR.

Rationale for this change

  • We have two session-level configurations for parquet writes:
    • TableParquetOptions and ParquetOptions.
  • We have two writer-level options for writer (a.k.a. props to be handed to a writer, per a single write action):
    • WriterProperties and ParquetWriterOptions (which wraps WriterProperties).

At first, it looks like we should be able to get the writer props from either session-level config options. (In fact, we made exactly this suggestion ourselves.) But this is misleading; as the ParquetOptions has an incomplete set of all the config needed for the arrow writer (it's missing the kv_metadata & col-specific configurations).

What changes are included in this PR?

Instead, I've done some cleanup and added documentation to hopefully make the relationship more clear. Additionally, there's a new ParquetOptions::writer_props_from_global_opts() that does fulfill our suggestion -- while still making the difference clear.

Are these changes tested?

Yes.

Are there any user-facing changes?

A new ParquetOptions::writer_props_from_global_opts().

@alamb
Copy link
Contributor

alamb commented Jul 12, 2024

Possibly related: #11367

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld. I think this PR is a good step towards #11367 (it almost does it all).

I had a few structural / naming suggestions but otherwise I think the basic ideas here

as the ParquetOptions has an incomplete set of all the config needed for the arrow writer (it's missing the kv_metadata & col-specific configurations).

I wonder if we should make ParquetOptions be consistent with the WriterOptions? It seems like missing kv_metadata and column overrides might be an oversight rather than an intentional design)

datafusion/common/src/config.rs Outdated Show resolved Hide resolved
datafusion/common/src/config.rs Outdated Show resolved Hide resolved
}
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could also add a test here that described in #11367 -- that takes a default ParquetOptions, creates the ArrowWriterOption and then ensures the fields are the same as ArrowWriterOptions::default()

Copy link
Contributor Author

@wiedld wiedld Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this^^ requested test in this commit, but did the equivalency assertion a bit differently. Parquet's WriterProperties and it's builder do not implement Eq or PartialEq.

Rather than asserting each prop independently based upon the getter, I elected to (a) explicitly define the props that do not match, and then (b) use the same identify function (session_config_from_writer_props) for the assertion.

Copy link
Contributor Author

@wiedld wiedld Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make a slight change to avoid a test regression in the sqllogictests.

That^^ change (to avoid test regression) also brings us further from the matching-default-settings between the extern parquet vs datafusion. I added more tests to demonstrate this divergence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the follow up PR, I plan to propose (and open for discussion) which defaults should be changed in datafusion vs parquet.

Copy link
Contributor Author

@wiedld wiedld Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per request, moving these into a follow up PR. (Make more reviewable.)

datafusion/common/src/config.rs Outdated Show resolved Hide resolved
@wiedld
Copy link
Contributor Author

wiedld commented Jul 13, 2024

I wonder if we should make ParquetOptions be consistent with the WriterOptions? It seems like missing kv_metadata and column overrides might be an oversight rather than an intentional design)

The ParquetOptions are explicitly stated as the global part of TableParquetOptions.

pub struct TableParquetOptions {
/// Global Parquet options that propagates to all columns.
pub global: ParquetOptions,

Is there any reason to have the global ParquetOptions, without the TableParquetOptions?

  • the two additional fields are kv_metadata a col-specific hashmap, both of those can be default/empty.
  • proposal: should we make ParquetOptions be private to TableParquetOptions?

@wiedld
Copy link
Contributor Author

wiedld commented Jul 13, 2024

The SessionState contains multiple copies of the ParquetOptions:

  • ( denotes "contained within")
  • SessionState.config ⊃ SessionConfig ⊃ ConfigOptions ⊃ ExecutionOptions ⊃ ParquetOptions
  • SessionState.file_formats ⊃ ParquetFormat ⊃ TableParquetOptions ⊃ ParquetOptions
  • SessionState.table_options ⊃ TableOptions ⊃ TableParquetOptions ⊃ ParquetOptions

The SessionState.config is used as a default for SessionState.table_options.

.

But what is the intended use case for table_options vs file_formats?

  • From what I can see, it appears that the table_options duplicates the information within the file_formats (e.g. CsvFormat ⊃ CsvOptions, JsonFormat ⊃ JsonOptions, etc).
  • table_options does have two unique fields: the assumed-file-type and extensions. But the rest is all file_formats.
  • file_formats was added 2 weeks ago.
  • proposal: unify to a single file_formats (with csv, json, parquet factories added) to be contained within the TableOptions.

@alamb
Copy link
Contributor

alamb commented Jul 14, 2024

But what is the intended use case for table_options vs file_formats?

I am not entirely sure. Given it is not clear from the code, what I would normally do in these circumstances is make a new PR with documentation to those structures and try and explain the differences

It would likely take me a while, but the process of trying to document the differences would force me to do the code archeology / spelunking to figure it out, and then once I did I would have a good artifact (new docs) that would

  1. Ensure I didn't forget the answer
  2. Help others who probably have the same questions
  3. A vehicle for others to help correct and improve

I of course would love to help review such a PR 🎣

@wiedld
Copy link
Contributor Author

wiedld commented Jul 15, 2024

ok, I'm going to:

Then in separate tickets, and separate followup PRs, will handle these two proposals: cleanup ParquetOptions vs TableParquetOptions, and the proposed cleanup of file_formats.

@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

Thanks @wiedld

A suggestion that I think will help speed up reviews / merging is to try and keep the PRs smaller (like for example make one PR to add docs, and other PRs to add new apis (like conversion from ParquetOptions to WriterOptions, etc)

@wiedld wiedld force-pushed the refactor/session-options-ve-writer-props branch from a484974 to 9c43625 Compare July 16, 2024 01:57
wiedld added 2 commits July 15, 2024 20:37
…ired to avoid test regression), result in different default behavior than parquet crate
@wiedld wiedld force-pushed the refactor/session-options-ve-writer-props branch from ab950c5 to 4f43390 Compare July 16, 2024 03:47
@wiedld wiedld marked this pull request as ready for review July 16, 2024 22:25
@alamb alamb changed the title Refactor: more clearly delineate btwn writer options vs session configuration Refactor: more clearly delineate between writer options vs session configuration Jul 17, 2024
@alamb alamb changed the title Refactor: more clearly delineate between writer options vs session configuration Refactor: more clearly delineate between TableParquetOptions and ParquetWriterOptions Jul 17, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld

I think this is a step forward and improves the situation

I had a few API suggestions that I think would make it even better but I don't think they are required

@alamb alamb merged commit c95556d into apache:main Jul 17, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 17, 2024

Thanks @wiedld

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
…arquetWriterOptions` (apache#11444)

* refactor: make more explicit the relationship btwn TableParquetOptions vs ParquetOptions vs WriterProperties

* test: demonstrate the relationship btwn session configs and writer props

* refactor: move parquet-format specific functionality to the parquet submodule, leaving only the config options in the config module.

* test: update test fixtures to use the ParquetOptions::default

* test: update test helper session_config_from_writer_props, to not add column configuration when none exists

* test(11367): write test to demonstrate issue 11367

* fix: existing sqllogictests require specific ParquetOptions settings to be left as None

* test(11367): demonstrate how the require bloom filter defaults, (required to avoid test regression), result in different default behavior than parquet crate

* chore: make more reviewable, by pulling tests for issue 11367 into followup PR

* refactor: move all parquet-associated features into parquet-writer mod

* chore: better function naming convention
wiedld added a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
…arquetWriterOptions` (apache#11444)

* refactor: make more explicit the relationship btwn TableParquetOptions vs ParquetOptions vs WriterProperties

* test: demonstrate the relationship btwn session configs and writer props

* refactor: move parquet-format specific functionality to the parquet submodule, leaving only the config options in the config module.

* test: update test fixtures to use the ParquetOptions::default

* test: update test helper session_config_from_writer_props, to not add column configuration when none exists

* test(11367): write test to demonstrate issue 11367

* fix: existing sqllogictests require specific ParquetOptions settings to be left as None

* test(11367): demonstrate how the require bloom filter defaults, (required to avoid test regression), result in different default behavior than parquet crate

* chore: make more reviewable, by pulling tests for issue 11367 into followup PR

* refactor: move all parquet-associated features into parquet-writer mod

* chore: better function naming convention
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants