-
Notifications
You must be signed in to change notification settings - Fork 820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write null counts in parquet files when they are present #6257
Draft
alamb
wants to merge
5
commits into
apache:main
Choose a base branch
from
alamb:alamb/parquet_null_counts
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
e00e160
Write null counts in parquet files when they are present
alamb 09c4047
Merge remote-tracking branch 'apache/master' into alamb/parquet_null_…
alamb 6c05e60
Merge remote-tracking branch 'apache/master' into alamb/parquet_null_…
alamb 60aec4e
Merge remote-tracking branch 'apache/master' into alamb/parquet_null_…
alamb da86f6c
Update parquet/src/file/statistics.rs
alamb File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -125,22 +125,35 @@ pub fn from_thrift( | |||||
) -> Result<Option<Statistics>> { | ||||||
Ok(match thrift_stats { | ||||||
Some(stats) => { | ||||||
// Number of nulls recorded, when it is not available, we just mark it as 0. | ||||||
// TODO this should be `None` if there is no information about NULLS. | ||||||
// see https://github.com/apache/arrow-rs/pull/6216/files | ||||||
let null_count = stats.null_count.unwrap_or(0); | ||||||
|
||||||
if null_count < 0 { | ||||||
return Err(ParquetError::General(format!( | ||||||
"Statistics null count is negative {}", | ||||||
null_count | ||||||
))); | ||||||
} | ||||||
// transform null count to u64 | ||||||
let null_count = stats | ||||||
.null_count | ||||||
.map(|null_count| { | ||||||
if null_count < 0 { | ||||||
return Err(ParquetError::General(format!( | ||||||
"Statistics null count is negative {}", | ||||||
null_count | ||||||
))); | ||||||
} | ||||||
Ok(null_count as u64) | ||||||
}) | ||||||
.transpose()?; | ||||||
|
||||||
// Generic null count. | ||||||
let null_count = Some(null_count as u64); | ||||||
// Generic distinct count (count of distinct values occurring) | ||||||
let distinct_count = stats.distinct_count.map(|value| value as u64); | ||||||
let distinct_count = stats | ||||||
.distinct_count | ||||||
.map(|distinct_count| { | ||||||
if distinct_count < 0 { | ||||||
return Err(ParquetError::General(format!( | ||||||
"Statistics distinct count is negative {}", | ||||||
distinct_count | ||||||
))); | ||||||
} | ||||||
|
||||||
Ok(distinct_count as u64) | ||||||
}) | ||||||
.transpose()?; | ||||||
|
||||||
// Whether or not statistics use deprecated min/max fields. | ||||||
let old_format = stats.min_value.is_none() && stats.max_value.is_none(); | ||||||
// Generic min value as bytes. | ||||||
|
@@ -416,9 +429,20 @@ impl Statistics { | |||||
/// Returns number of null values for the column, if known. | ||||||
/// Note that this includes all nulls when column is part of the complex type. | ||||||
/// | ||||||
/// Note this API returns Some(0) even if the null count was not present | ||||||
/// in the statistics. | ||||||
/// See <https://github.com/apache/arrow-rs/pull/6216/files> | ||||||
/// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
/// not available. This method returns `None` in that case. | ||||||
/// | ||||||
/// Also, versions of this library prior to `53.1.0` did not store the null count in the | ||||||
/// statistics if the null count was `0`. | ||||||
/// | ||||||
/// To preserve the prior behavior and read null counts properly from older files | ||||||
/// you should default to zero: | ||||||
Comment on lines
+438
to
+439
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps we should make it clearer that this behaviour is actually incorrect, it will claim a null count of 0, when it actually isn't known |
||||||
/// | ||||||
/// ```no_run | ||||||
/// # use parquet::file::statistics::Statistics; | ||||||
/// # let statistics: Statistics = todo!(); | ||||||
/// let null_count = statistics.null_count_opt().unwrap_or(0); | ||||||
/// ``` | ||||||
pub fn null_count_opt(&self) -> Option<u64> { | ||||||
statistics_enum_func![self, null_count_opt] | ||||||
} | ||||||
|
@@ -1081,6 +1105,19 @@ mod tests { | |||||
assert_eq!(thrift_stats.null_count, None); // can' store u64 max --> null | ||||||
} | ||||||
|
||||||
#[test] | ||||||
fn test_count_decoding_distinct_invalid() { | ||||||
let tstatistics = TStatistics { | ||||||
distinct_count: Some(-42), | ||||||
..Default::default() | ||||||
}; | ||||||
let err = from_thrift(Type::BOOLEAN, Some(tstatistics)).unwrap_err(); | ||||||
assert_eq!( | ||||||
err.to_string(), | ||||||
"Parquet error: Statistics distinct count is negative -42" | ||||||
); | ||||||
} | ||||||
|
||||||
#[test] | ||||||
fn test_count_decoding_null_invalid() { | ||||||
let tstatistics = TStatistics { | ||||||
|
@@ -1110,21 +1147,7 @@ mod tests { | |||||
let round_tripped = from_thrift(Type::BOOLEAN, Some(thrift_stats)) | ||||||
.unwrap() | ||||||
.unwrap(); | ||||||
// TODO: remove branch when we no longer support assuming null_count==None in the thrift | ||||||
// means null_count = Some(0) | ||||||
if null_count.is_none() { | ||||||
assert_ne!(round_tripped, statistics); | ||||||
assert!(round_tripped.null_count_opt().is_some()); | ||||||
assert_eq!(round_tripped.null_count_opt(), Some(0)); | ||||||
assert_eq!(round_tripped.min_bytes_opt(), statistics.min_bytes_opt()); | ||||||
assert_eq!(round_tripped.max_bytes_opt(), statistics.max_bytes_opt()); | ||||||
assert_eq!( | ||||||
round_tripped.distinct_count_opt(), | ||||||
statistics.distinct_count_opt() | ||||||
); | ||||||
} else { | ||||||
assert_eq!(round_tripped, statistics); | ||||||
} | ||||||
assert_eq!(round_tripped, statistics); | ||||||
} | ||||||
|
||||||
fn make_bool_stats(distinct_count: Option<u64>, null_count: Option<u64>) -> Statistics { | ||||||
|
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the removal of this
unwrap_or
is what changes the semantics while reading