Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write null counts in parquet files when they are present #6257

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 55 additions & 32 deletions parquet/src/file/statistics.rs
Original file line number Diff line number Diff line change
Expand Up @@ -125,22 +125,35 @@ pub fn from_thrift(
) -> Result<Option<Statistics>> {
Ok(match thrift_stats {
Some(stats) => {
// Number of nulls recorded, when it is not available, we just mark it as 0.
// TODO this should be `None` if there is no information about NULLS.
// see https://github.com/apache/arrow-rs/pull/6216/files
let null_count = stats.null_count.unwrap_or(0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the removal of this unwrap_or is what changes the semantics while reading


if null_count < 0 {
return Err(ParquetError::General(format!(
"Statistics null count is negative {}",
null_count
)));
}
// transform null count to u64
let null_count = stats
.null_count
.map(|null_count| {
if null_count < 0 {
return Err(ParquetError::General(format!(
"Statistics null count is negative {}",
null_count
)));
}
Ok(null_count as u64)
})
.transpose()?;

// Generic null count.
let null_count = Some(null_count as u64);
// Generic distinct count (count of distinct values occurring)
let distinct_count = stats.distinct_count.map(|value| value as u64);
let distinct_count = stats
.distinct_count
.map(|distinct_count| {
if distinct_count < 0 {
return Err(ParquetError::General(format!(
"Statistics distinct count is negative {}",
distinct_count
)));
}

Ok(distinct_count as u64)
})
.transpose()?;

// Whether or not statistics use deprecated min/max fields.
let old_format = stats.min_value.is_none() && stats.max_value.is_none();
// Generic min value as bytes.
Expand Down Expand Up @@ -416,9 +429,20 @@ impl Statistics {
/// Returns number of null values for the column, if known.
/// Note that this includes all nulls when column is part of the complex type.
///
/// Note this API returns Some(0) even if the null count was not present
/// in the statistics.
/// See <https://github.com/apache/arrow-rs/pull/6216/files>
/// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was
/// Note: Versions of this library prior to `54.0.0` returned 0 if the null count was

/// not available. This method returns `None` in that case.
///
/// Also, versions of this library prior to `53.1.0` did not store the null count in the
/// statistics if the null count was `0`.
///
/// To preserve the prior behavior and read null counts properly from older files
/// you should default to zero:
Comment on lines +438 to +439
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should make it clearer that this behaviour is actually incorrect, it will claim a null count of 0, when it actually isn't known

///
/// ```no_run
/// # use parquet::file::statistics::Statistics;
/// # let statistics: Statistics = todo!();
/// let null_count = statistics.null_count_opt().unwrap_or(0);
/// ```
pub fn null_count_opt(&self) -> Option<u64> {
statistics_enum_func![self, null_count_opt]
}
Expand Down Expand Up @@ -1081,6 +1105,19 @@ mod tests {
assert_eq!(thrift_stats.null_count, None); // can' store u64 max --> null
}

#[test]
fn test_count_decoding_distinct_invalid() {
let tstatistics = TStatistics {
distinct_count: Some(-42),
..Default::default()
};
let err = from_thrift(Type::BOOLEAN, Some(tstatistics)).unwrap_err();
assert_eq!(
err.to_string(),
"Parquet error: Statistics distinct count is negative -42"
);
}

#[test]
fn test_count_decoding_null_invalid() {
let tstatistics = TStatistics {
Expand Down Expand Up @@ -1110,21 +1147,7 @@ mod tests {
let round_tripped = from_thrift(Type::BOOLEAN, Some(thrift_stats))
.unwrap()
.unwrap();
// TODO: remove branch when we no longer support assuming null_count==None in the thrift
// means null_count = Some(0)
if null_count.is_none() {
assert_ne!(round_tripped, statistics);
assert!(round_tripped.null_count_opt().is_some());
assert_eq!(round_tripped.null_count_opt(), Some(0));
assert_eq!(round_tripped.min_bytes_opt(), statistics.min_bytes_opt());
assert_eq!(round_tripped.max_bytes_opt(), statistics.max_bytes_opt());
assert_eq!(
round_tripped.distinct_count_opt(),
statistics.distinct_count_opt()
);
} else {
assert_eq!(round_tripped, statistics);
}
assert_eq!(round_tripped, statistics);
}

fn make_bool_stats(distinct_count: Option<u64>, null_count: Option<u64>) -> Statistics {
Expand Down
Loading