forked from apache/datafusion
WIP: changes to upstream DF, in order to enable parallelized writes with ParquetSink #11
Closed
Changes from all commits (4 commits):
391e074  wiedld  chore: make explicit what ParquetWriterOptions are created from a sub…
60fbdac  wiedld  refactor: restore the ability to add kv metadata into the generated f…
c964df5  wiedld  refactor: use hashmap instead of KeyValue, to avoid dependency requir…
8beb16a  wiedld  test: demonstrate API contract for metadata TableParquetOptions
@@ -17,11 +17,17 @@
 //! Options related to how parquet files should be written

-use crate::{config::TableParquetOptions, DataFusionError, Result};
+use crate::{
+    config::{ParquetOptions, TableParquetOptions},
+    DataFusionError, Result,
+};

 use parquet::{
     basic::{BrotliLevel, GzipLevel, ZstdLevel},
-    file::properties::{EnabledStatistics, WriterProperties, WriterVersion},
+    file::{
+        metadata::KeyValue,
+        properties::{EnabledStatistics, WriterProperties, WriterVersion},
+    },
     schema::types::ColumnPath,
 };

@@ -47,53 +53,87 @@ impl TryFrom<&TableParquetOptions> for ParquetWriterOptions {
     type Error = DataFusionError;

     fn try_from(parquet_options: &TableParquetOptions) -> Result<Self> {
-        let parquet_session_options = &parquet_options.global;
-        let mut builder = WriterProperties::builder()
-            .set_data_page_size_limit(parquet_session_options.data_pagesize_limit)
-            .set_write_batch_size(parquet_session_options.write_batch_size)
-            .set_writer_version(parse_version_string(
-                &parquet_session_options.writer_version,
-            )?)
-            .set_dictionary_page_size_limit(
-                parquet_session_options.dictionary_page_size_limit,
-            )
-            .set_max_row_group_size(parquet_session_options.max_row_group_size)
-            .set_created_by(parquet_session_options.created_by.clone())
-            .set_column_index_truncate_length(
-                parquet_session_options.column_index_truncate_length,
-            )
-            .set_data_page_row_count_limit(
-                parquet_session_options.data_page_row_count_limit,
-            )
-            .set_bloom_filter_enabled(parquet_session_options.bloom_filter_enabled);
+        let ParquetOptions {
+            data_pagesize_limit,
+            write_batch_size,
+            writer_version,
+            dictionary_page_size_limit,
+            max_row_group_size,
+            created_by,
+            column_index_truncate_length,
+            data_page_row_count_limit,
+            bloom_filter_enabled,
+            encoding,
+            dictionary_enabled,
+            compression,
+            statistics_enabled,
+            max_statistics_size,
+            bloom_filter_fpp,
+            bloom_filter_ndv,
+            // below is not part of ParquetWriterOptions
+            enable_page_index: _,
+            pruning: _,
+            skip_metadata: _,
+            metadata_size_hint: _,
+            pushdown_filters: _,
+            reorder_filters: _,
+            allow_single_file_parallelism: _,
+            maximum_parallel_row_group_writers: _,
+            maximum_buffered_record_batches_per_stream: _,
+        } = &parquet_options.global;
+
+        let key_value_metadata = if !parquet_options.key_value_metadata.is_empty() {
+            Some(
+                parquet_options
+                    .key_value_metadata
+                    .clone()
+                    .drain()
+                    .map(|(key, value)| KeyValue { key, value })
+                    .collect::<Vec<_>>(),
+            )
+        } else {
+            None
+        };
+
+        let mut builder = WriterProperties::builder()
+            .set_data_page_size_limit(*data_pagesize_limit)
+            .set_write_batch_size(*write_batch_size)
+            .set_writer_version(parse_version_string(writer_version.as_str())?)
+            .set_dictionary_page_size_limit(*dictionary_page_size_limit)
+            .set_max_row_group_size(*max_row_group_size)
+            .set_created_by(created_by.clone())
+            .set_column_index_truncate_length(*column_index_truncate_length)
+            .set_data_page_row_count_limit(*data_page_row_count_limit)
+            .set_bloom_filter_enabled(*bloom_filter_enabled)
+            .set_key_value_metadata(key_value_metadata);

-        if let Some(encoding) = &parquet_session_options.encoding {
+        if let Some(encoding) = &encoding {
             builder = builder.set_encoding(parse_encoding_string(encoding)?);
         }

-        if let Some(enabled) = parquet_session_options.dictionary_enabled {
-            builder = builder.set_dictionary_enabled(enabled);
+        if let Some(enabled) = dictionary_enabled {
+            builder = builder.set_dictionary_enabled(*enabled);
         }

-        if let Some(compression) = &parquet_session_options.compression {
+        if let Some(compression) = &compression {
             builder = builder.set_compression(parse_compression_string(compression)?);
         }

-        if let Some(statistics) = &parquet_session_options.statistics_enabled {
+        if let Some(statistics) = &statistics_enabled {
             builder =
                 builder.set_statistics_enabled(parse_statistics_string(statistics)?);
         }

-        if let Some(size) = parquet_session_options.max_statistics_size {
-            builder = builder.set_max_statistics_size(size);
+        if let Some(size) = max_statistics_size {
+            builder = builder.set_max_statistics_size(*size);
         }

-        if let Some(fpp) = parquet_session_options.bloom_filter_fpp {
-            builder = builder.set_bloom_filter_fpp(fpp);
+        if let Some(fpp) = bloom_filter_fpp {
+            builder = builder.set_bloom_filter_fpp(*fpp);
         }

-        if let Some(ndv) = parquet_session_options.bloom_filter_ndv {
-            builder = builder.set_bloom_filter_ndv(ndv);
+        if let Some(ndv) = bloom_filter_ndv {
+            builder = builder.set_bloom_filter_ndv(*ndv);
         }

         for (column, options) in &parquet_options.column_specific_options {
@@ -141,6 +181,8 @@ impl TryFrom<&TableParquetOptions> for ParquetWriterOptions {
                 builder.set_column_max_statistics_size(path, max_statistics_size);
             }
         }
+
+        // ParquetWriterOptions will have defaults for the remaining fields (e.g. key_value_metadata & sorting_columns)
         Ok(ParquetWriterOptions {
             writer_options: builder.build(),
         })

[Review comment on the ParquetOptions destructuring above: "This is a nice change."]
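The core of the metadata change in the diff above is converting the map held by TableParquetOptions into the `Option<Vec<KeyValue>>` form that the parquet writer expects, with an empty map mapping to `None`. A minimal self-contained sketch of that conversion follows; the `KeyValue` struct here is a stand-in mirroring `parquet::file::metadata::KeyValue` so the example compiles without the parquet crate:

```rust
use std::collections::HashMap;

/// Stand-in mirroring `parquet::file::metadata::KeyValue`.
#[derive(Debug, Clone, PartialEq)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Convert user-provided metadata into the shape expected by
/// `WriterPropertiesBuilder::set_key_value_metadata`: `None` when the map
/// is empty, otherwise a `Vec<KeyValue>`.
fn to_kv_metadata(map: &HashMap<String, Option<String>>) -> Option<Vec<KeyValue>> {
    if map.is_empty() {
        return None;
    }
    Some(
        map.iter()
            .map(|(key, value)| KeyValue {
                key: key.clone(),
                value: value.clone(),
            })
            .collect(),
    )
}

fn main() {
    let mut map = HashMap::new();
    map.insert("writer".to_string(), Some("datafusion".to_string()));
    let kv = to_kv_metadata(&map).unwrap();
    assert_eq!(kv.len(), 1);
    assert_eq!(kv[0].key, "writer");
    // An empty map must produce no metadata at all, not an empty Vec.
    assert!(to_kv_metadata(&HashMap::new()).is_none());
    println!("ok");
}
```

Using a `HashMap<String, Option<String>>` on the options struct (per commit c964df5) keeps the public config API free of a direct dependency on parquet's `KeyValue` type; the conversion happens only at write time.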
Review comment: The ParquetOptions are the configuration which can be provided within a SQL query, and are therefore intended for an easily parsable format (refer to the ConfigField trait and associated macros in the linked file). The sorting_columns may lend itself to this use case of being provided within a SQL query and being easier to parse. However, the same is not true for the user-provided kv_metadata.
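To make the "easily parsable" point concrete, here is a toy illustration of the shape that trait encourages: every option is addressed by a string key and parsed from a string value, which is exactly what a SQL `OPTIONS`-style clause can supply. This is an assumption-laden sketch (the `ToyParquetOptions` struct and `set` method are hypothetical, not DataFusion's actual ConfigField code):

```rust
// Toy illustration of ConfigField-style flat options: each field is
// settable by string key with a string value parsed into its real type.
#[derive(Debug, Default)]
struct ToyParquetOptions {
    max_row_group_size: usize,
    created_by: String,
}

impl ToyParquetOptions {
    /// Set one option from a (key, string-value) pair, as a SQL-level
    /// config mechanism would.
    fn set(&mut self, key: &str, value: &str) -> Result<(), String> {
        match key {
            "max_row_group_size" => {
                self.max_row_group_size =
                    value.parse().map_err(|e| format!("invalid usize: {e}"))?;
            }
            "created_by" => self.created_by = value.to_string(),
            other => return Err(format!("unknown option: {other}")),
        }
        Ok(())
    }
}

fn main() {
    let mut opts = ToyParquetOptions::default();
    opts.set("max_row_group_size", "8192").unwrap();
    opts.set("created_by", "datafusion").unwrap();
    assert_eq!(opts.max_row_group_size, 8192);
    // Arbitrary user metadata has no fixed key to enumerate here,
    // which is why kv_metadata does not fit this pattern.
    assert!(opts.set("kv_metadata", "{...}").is_err());
}
```

The failing `kv_metadata` call shows the comment's point: a fixed enumeration of typed keys parses cleanly from SQL, while open-ended user metadata does not.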
Review comment: I have a question regarding the use case for the WriterProperties sorting_columns. It's listed in the parquet interface; is this referring to a per-row-group sort order that is only applied on write? Is there a use case for DataFusion, given that we already sort earlier in the batch stream?
Review comment: In theory it is supposed to let readers infer sort information from the file. I don't know how widely it is written or used by other parquet readers/writers. IOx stores its sort information in its own metadata, so I think setting these fields in the parquet metadata could be a separate project.
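For context on what that metadata encodes, the sketch below uses a stand-in struct mirroring parquet's `SortingColumn` (an assumption for self-containment; the real type lives in the parquet crate): it declares, per row group, which leaf columns the rows are already sorted by, so readers can infer ordering without re-sorting.

```rust
/// Stand-in mirroring parquet's `SortingColumn`: declares that rows in a
/// row group are sorted by the leaf column at `column_idx`.
#[derive(Debug, Clone, PartialEq)]
struct SortingColumn {
    column_idx: i32,
    descending: bool,
    nulls_first: bool,
}

/// Build sort metadata from (column index, descending) pairs, nulls first,
/// as a writer might before attaching it to the writer properties.
fn sort_metadata(order: &[(i32, bool)]) -> Option<Vec<SortingColumn>> {
    if order.is_empty() {
        return None;
    }
    Some(
        order
            .iter()
            .map(|&(column_idx, descending)| SortingColumn {
                column_idx,
                descending,
                nulls_first: true,
            })
            .collect(),
    )
}

fn main() {
    // e.g. data already ordered by col0 ASC, col2 DESC
    let cols = sort_metadata(&[(0, false), (2, true)]).unwrap();
    assert_eq!(cols.len(), 2);
    assert!(cols[1].descending);
    println!("{cols:?}");
}
```

Note the metadata is purely declarative: writing it does not sort anything, which matches the comment above that DataFusion's sorting already happens earlier in the batch stream.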