Write null counts in parquet files when they are present #6257

alamb · 2024-08-15T17:03:22Z

Which issue does this PR close?

Closes #6256

Note this PR contains some of the code that was originally in @Michael-J-Ward 's PR #6256

Rationale for this change

See #6256.

Current behavior:

parquet-rs writer always has the null count when writing statistics, but writes None to thrift when the null count is zero
parquet-rs reader treats a missing null count (None) as Some(0) (aka that it is known there are no nulls)
parquet-rs will write negative numbers if the null count or distinct count is greater than what fits in i64 (e.g. u64::MAX) -- this is likely a theoretical concern only

This is inconsistent with the parquet spec as well as what parquet-java and parquet-cpp do

What changes are included in this PR?

Update parquet reader/writer to follow the spec
Add error checking for values that are too large to fit into i64
Documented that older versions of parquet-rs wrote None.
Added tests

Are there any user-facing changes?

Yes

Changes

parquet-rs writer always writes Some(..) to thrift
parquet-rs reader returns None (aka that it is unknown if there are nulls) if there are no null counts in the thrift
parquet-rs writer writes None if the null count / distinct count is too large to fit in i64
Documented that older versions of parquet-rs wrote None.
Changed the StatisticsConverter code to read statistics consistently with older versions of parquet-rs (treat missing null counts as known zero) and added a flag to alter the behavior

This change means the generated parquet files are slightly larger (as now they encode Some(0) for null counts) but the behavior is more correct and consistent.
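The list above can be sketched with plain `Option` values. This is a minimal model, not the parquet-rs implementation: `encode_null_count` and `decode_null_count` are hypothetical names, and treating a negative stored value as unknown is an assumption of this sketch (real code may prefer to report an error).

```rust
/// Hypothetical stand-in for the writer side: the in-memory count is a `u64`,
/// while the thrift field is an optional `i64`.
fn encode_null_count(count: Option<u64>) -> Option<i64> {
    // A known count is always written, including zero. A count that does not
    // fit in i64 is dropped rather than written as a negative number.
    count.and_then(|c| i64::try_from(c).ok())
}

/// Hypothetical stand-in for the reader side: a missing field stays `None`
/// ("unknown") instead of being turned into `Some(0)`.
/// Assumption: a negative stored value is also treated as unknown here.
fn decode_null_count(field: Option<i64>) -> Option<u64> {
    field.and_then(|v| u64::try_from(v).ok())
}

fn main() {
    for count in [None, Some(0), Some(7), Some(u64::MAX)] {
        let written = encode_null_count(count);
        let read = decode_null_count(written);
        println!("{count:?} -> written {written:?} -> read {read:?}");
    }
}
```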

alamb · 2024-08-15T17:05:13Z

parquet/tests/arrow_writer_layout.rs

@@ -189,7 +189,7 @@ fn test_primitive() {
                    pages: (0..8)
                        .map(|_| Page {
                            rows: 250,
-                            page_header_size: 36,
+                            page_header_size: 38,


The page headers (and files) are larger now because Some(0) takes more space than None
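A rough sketch of where the two extra bytes could come from, assuming the thrift compact protocol with a short-form field header (one byte) followed by a zigzag varint for the i64 value. The helper names are made up for illustration and are not part of parquet-rs.

```rust
/// Zigzag-encode a signed 64-bit value so that small magnitudes map to
/// small unsigned values.
fn zigzag(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Number of bytes needed to store `v` as an unsigned varint
/// (7 payload bits per byte).
fn varint_len(mut v: u64) -> usize {
    let mut len = 1;
    while v >= 0x80 {
        v >>= 7;
        len += 1;
    }
    len
}

/// Bytes added by one present optional i64 field.
/// Assumption: the field header fits in one byte (small field-id delta).
fn optional_i64_field_len(value: i64) -> usize {
    1 + varint_len(zigzag(value))
}

fn main() {
    // An absent optional field contributes no bytes at all.
    println!("null_count = Some(0) adds {} bytes", optional_i64_field_len(0));
}
```

Under these assumptions a present zero costs two bytes, which is consistent with the header size going from 36 to 38 in the diff above.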

alamb · 2024-08-15T18:30:08Z

parquet/src/file/statistics.rs

+
+    #[test]
+    fn test_count_encoding() {
+        statistics_count_test(None, None);


This test fails like this without the changes in this PR:

assertion `left == right` failed left: Boolean({min: Some(true), max: Some(false), distinct_count: None, null_count: Some(0), ... right: Boolean({min: Some(true), max: Some(false), distinct_count: None, null_count: None, ...

alamb · 2024-08-15T18:32:34Z

parquet/src/arrow/arrow_reader/statistics.rs

@@ -1195,6 +1197,23 @@ impl<'a> StatisticsConverter<'a> {
        self.arrow_field
    }

+    /// Set the statistics converter to treat missing null counts as missing


By default reading null counts will work with files written with older versions of parquet-rs

For a breaking / major release, is it acceptable to include an upgrade instruction of "add this config to maintain the old behavior"?

Absolutely

It is important to note that the default behavior in this PR is the old behavior (in other words, there should be no changes needed in downstream consumers of this code)

The default in this PR is missing_null_counts_as_zero = true, which maintains the old behavior, right?

If "add this config to maintain old behavior" is acceptable for a breaking release, then I would expect the default to be the new behavior.

IOW, I'd expect what you said on the parquet mailing list

Applications that use parquet-rs to read parquet_files and interpret the
null_count will need to be changed after the upgrade to explicitly continue
the old behavior of "treat no null_count as 0" which is also documented
now.

The default in this PR is missing_null_counts_as_zero = true, which maintains the old behavior, right?

Yes

then I would expect the default to be the new behavior.

My thinking was

Since there are two different apis:

Statistics::null_count would now return Option<..> so users of the library will need to update their code anyway and thus can choose at that time which behavior they want

StatisticsConverter's API didn't change and thus it keeps the previous behavior. This is what I would personally want for a system -- no change for reading parquet files that were written with previous versions of the rust writer.
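The two choices discussed in this thread can be modeled with a small stand-in type. `NullCountReader` and its methods are hypothetical and only mirror the `missing_null_counts_as_zero` flag named above; they are not the `StatisticsConverter` API.

```rust
/// Hypothetical stand-in for a statistics converter; not the parquet-rs type.
struct NullCountReader {
    /// When true, a missing null count is reported as a known zero
    /// (the behavior of older parquet-rs versions).
    missing_null_counts_as_zero: bool,
}

impl NullCountReader {
    /// The default keeps the old behavior, as described for this PR.
    fn new() -> Self {
        Self { missing_null_counts_as_zero: true }
    }

    /// Builder-style setter for the flag (name assumed for this sketch).
    fn with_missing_null_counts_as_zero(mut self, value: bool) -> Self {
        self.missing_null_counts_as_zero = value;
        self
    }

    /// Interpret the raw, possibly missing, null count read from a file.
    fn null_count(&self, raw: Option<u64>) -> Option<u64> {
        match raw {
            Some(n) => Some(n),
            None if self.missing_null_counts_as_zero => Some(0),
            None => None,
        }
    }
}

fn main() {
    let old_style = NullCountReader::new();
    let strict = NullCountReader::new().with_missing_null_counts_as_zero(false);
    println!("old-style: {:?}", old_style.null_count(None));
    println!("strict:    {:?}", strict.null_count(None));
}
```

A caller of the lower-level API that now returns `Option<..>` makes the same choice explicitly, for example by calling `unwrap_or(0)` on the result or by propagating `None`.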

alamb · 2024-08-15T18:39:30Z

parquet/src/file/statistics.rs

-            // Number of nulls recorded, when it is not available, we just mark it as 0.
-            // TODO this should be `None` if there is no information about NULLS.
-            // see https://github.com/apache/arrow-rs/pull/6216/files
-            let null_count = stats.null_count.unwrap_or(0);


the removal of this unwrap_or is what changes the semantics while reading

alamb · 2024-08-15T18:40:13Z

parquet/src/file/statistics.rs

    let null_count = stats
        .null_count_opt()
-        .map(|value| value as i64)
-        .filter(|&x| x > 0);


The removal of this filter is what fixes the statistics while writing

Maybe it was intended to be x >= 0 originally 🤔

I was intrigued, so went to do some code archaeology, but that filter was only introduced a few days ago in #6216, by you! 😄

For 6 years before it was:

null_count: if stats.has_nulls() { Some(stats.null_count() as i64) } else { None },

Your commit:
7d4e650

Prior code:
https://github.com/apache/arrow-rs/blame/25bfccca58ff219d9f59ba9f4d75550493238a4f/parquet/src/file/statistics.rs#L228-L242
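To compare the writer expressions mentioned in this thread, here is a small sketch over plain integers. It assumes `has_nulls()` means "null count greater than zero"; the function names are made up for illustration.

```rust
/// Long-standing form: write the count only when it is non-zero.
/// Assumption: `has_nulls()` is equivalent to `null_count > 0`.
fn write_has_nulls(null_count: u64) -> Option<i64> {
    if null_count > 0 { Some(null_count as i64) } else { None }
}

/// Form with the `x > 0` filter: drops zero, and also drops values that
/// wrapped to a negative number in the `as i64` cast.
fn write_filter_gt_zero(null_count: Option<u64>) -> Option<i64> {
    null_count.map(|value| value as i64).filter(|&x| x > 0)
}

/// The `x >= 0` variant floated above: keeps zero, still drops wrapped values.
fn write_filter_ge_zero(null_count: Option<u64>) -> Option<i64> {
    null_count.map(|value| value as i64).filter(|&x| x >= 0)
}

fn main() {
    for n in [0u64, 5, u64::MAX] {
        println!(
            "{n}: has_nulls={:?} gt_zero={:?} ge_zero={:?}",
            write_has_nulls(n),
            write_filter_gt_zero(Some(n)),
            write_filter_ge_zero(Some(n)),
        );
    }
}
```

Only the `x >= 0` variant writes `Some(0)` for a known zero. The cast-only form is the one that can emit a negative count for a value such as `u64::MAX`.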

alamb · 2024-08-20T17:10:37Z

@etseidl I wonder if you have any thoughts on this code / the writing of null counts

etseidl · 2024-08-20T17:37:01Z

@etseidl I wonder if you have any thoughts on this code / the writing of null counts

Sorry, I was away when this discussion started (interesting things always happen when I'm on vacation 😉). I think this PR is heading the right direction. Writing Some(0) when the null count is known is the right behavior IMO. On the read side I've always treated missing null counts as unknown rather than 0, so the changes here are welcome. I think it's fine for the read side to continue with the old behavior for some time.

alamb · 2024-08-28T11:06:10Z

Unless someone else has time to review this PR and thinks it should go into 53.0.0 (#6016) my personal preference would be to merge this PR early in the next major release cycle (e.g. 54.0.0) so it gets maximum bake / testing time before release

etseidl · 2024-08-28T16:48:06Z

Unless someone else has time to review this PR and thinks it should go into 53.0.0 (#6016) my personal preference would be to merge this PR early in the next major release cycle (e.g. 54.0.0) so it gets maximum bake / testing time before release

I've looked this PR over several times and haven't found any issues with it. The only thing giving me pause is this comment on apache/parquet-format#449.

It seems java and cpp ignore the definition levels and write Some(0) regardless, so I'm fine with merging this and worrying about micro optimizations down the road (in step with the rest of the parquet community).

alamb · 2024-08-31T12:40:05Z

Thank you @etseidl -- I think

given the potential for this PR to cause unintended consequences
we haven't actually had any bug reports related to this issue,
the 53.0.0 release is imminent and has several people waiting on features

I am not going to merge this PR until after we have released 53.0.0

alamb · 2024-09-18T20:13:20Z

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

etseidl

Took another look and found no nits. I'd say ship it.

andygrove

LGTM. Thanks @alamb

alamb · 2024-09-22T12:28:31Z

I go back and forth on this PR -- it isn't technically an API change in the sense of a breaking API change, but also it changes the content of written parquet files. Arguably the content better matches the parquet spec, I am just really worried about unintended consequences of doing this

Maybe I am overly concerned.

etseidl · 2024-09-22T19:24:00Z

I go back and forth on this PR -- it isn't technically an API change in the sense of a breaking API change, but also it changes the content of written parquet files. Arguably the content better matches the parquet spec, I am just really worried about unintended consequences of doing this

Maybe I am overly concerned.

Then how about splitting this up? First change the read behavior so None isn't treated as 0, keep the missing_null_counts_as_zero setting, but default to false. If anyone is broken by that behavior they can workaround by setting it true. After that's been around a release, then change the behavior on write.

alamb · 2024-09-24T18:31:24Z

I go back and forth on this PR -- it isn't technically an API change in the sense of a breaking API change, but also it changes the content of written parquet files. Arguably the content better matches the parquet spec, I am just really worried about unintended consequences of doing this
Maybe I am overly concerned.

Then how about splitting this up? First change the read behavior so None isn't treated as 0, keep the missing_null_counts_as_zero setting, but default to false. If anyone is broken by that behavior they can workaround by setting it true. After that's been around a release, then change the behavior on write.

That is a (very) good idea -- I will plan to do that when I can find some time

etseidl · 2024-09-24T18:57:55Z

That is a (very) good idea -- I will plan to do that when I can find some time

@alamb Let me know if you'd like some help with this. I have spare cycles right now.

alamb · 2024-09-25T16:23:48Z

That is a (very) good idea -- I will plan to do that when I can find some time

@alamb Let me know if you'd like some help with this. I have spare cycles right now.

that would be super helpful @etseidl -- I do not have many spare cycles now (as you have probably guessed). All your help is most appreciated

etseidl · 2024-10-01T17:26:01Z

To clear up my own thinking on this, I made a table of what happens with a round trip of the statistics.

null_count read and write behaviors
|-------|-------------------|-------------------|-------------------|-------------------|
| start |       current     |    change write   |    change read    |    change both    |
| value |  write  |   read  |  write  |   read  |  write  |   read  |  write  |   read  |
|-------|---------|---------|---------|---------|---------|---------|---------|---------|
| None  |  None   | Some(0) |  None   | Some(0) |  None   |  None   |  None   |  None   |
| 0     |  None   | Some(0) | Some(0) | Some(0) |  None   |  None   | Some(0) | Some(0) |
| n > 0 | Some(n) | Some(n) | Some(n) | Some(n) | Some(n) | Some(n) | Some(n) | Some(n) |
|-------|---------|---------|---------|---------|---------|---------|---------|---------|

write means what ends up in parquet file
read is return value of null_count_opt()

Currently we write None for None or 0, and Some(n) otherwise, and on read these will become either Some(0) or Some(n). If we just change the current write behavior, the written Parquet will be consistent with other writers (parquet-java, parquet-cpp), and we get the same result on a round trip. This should not break old parquet-rs readers, and will allow written files to be more in line with the spec (albeit a few bytes larger). Changing both read and write will return None when the original value was None, which is correct, but could potentially break old readers.

So in theory we can write the null counts properly now, but should wait until 54.0.0 to make the last change on the read side. Does this sound right to you @alamb?
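The table above can be reproduced with a small model. This is only a sketch with made-up names (`WriteMode`, `ReadMode`, `round_trip`); "write" is what lands in the file and "read" is what the reader hands back.

```rust
#[derive(Clone, Copy)]
enum WriteMode { Old, New }

#[derive(Clone, Copy)]
enum ReadMode { Old, New }

/// What ends up in the file for a given starting null count.
fn write(mode: WriteMode, start: Option<u64>) -> Option<u64> {
    match mode {
        // Old writer: zero is not written at all.
        WriteMode::Old => start.filter(|&n| n > 0),
        // New writer: any known count is written, including zero.
        WriteMode::New => start,
    }
}

/// What the reader reports for a stored value.
fn read(mode: ReadMode, stored: Option<u64>) -> Option<u64> {
    match mode {
        // Old reader: a missing count is reported as zero.
        ReadMode::Old => Some(stored.unwrap_or(0)),
        // New reader: a missing count stays unknown.
        ReadMode::New => stored,
    }
}

/// Returns (value in file, value seen by the reader).
fn round_trip(w: WriteMode, r: ReadMode, start: Option<u64>) -> (Option<u64>, Option<u64>) {
    let stored = write(w, start);
    (stored, read(r, stored))
}

fn main() {
    let combos = [
        ("current", WriteMode::Old, ReadMode::Old),
        ("change write", WriteMode::New, ReadMode::Old),
        ("change read", WriteMode::Old, ReadMode::New),
        ("change both", WriteMode::New, ReadMode::New),
    ];
    for start in [None, Some(0), Some(3)] {
        for (label, w, r) in combos {
            let (stored, seen) = round_trip(w, r, start);
            println!("{start:?} | {label}: write {stored:?}, read {seen:?}");
        }
    }
}
```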

alamb · 2024-10-01T20:11:19Z

Changing both read and write will return None when the original value was None, which is correct, but could potentially break old readers.

I think it also would break reading of OLD files (aka files written with the older software / older versions of arrow-rs would now be interpreted as "unknown null count" rather than "0 null count") which I worry about a lot.

So in theory we can write the null counts properly now, but should wait until 54.0.0 to make the last change on the read side.

Yes, I suppose writing the null counts properly seems like a good idea. I am (likely overly) paranoid about the read side changes

etseidl · 2024-10-01T20:46:57Z

Changing both read and write will return None when the original value was None, which is correct, but could potentially break old readers.

I think it also would break reading of OLD files (aka files written with the older software / older versions of arrow-rs would now be interpreted as "unknown null count" rather than "0 null count") which I worry about a lot.

Yes, that's the "change read" column from my table, I suppose. So we agree it's the read side where the danger lies.

So in theory we can write the null counts properly now, but should wait until 54.0.0 to make the last change on the read side.

Yes, I suppose writing the null counts properly seems like a good idea. I am (likely overly) paranoid about the read side changes

As I am often told, I am not paranoid enough, so maybe we balance out 😆

alamb · 2024-10-01T22:20:15Z

Update here is that @etseidl has broken out the non backwards compatible changes in two PRs:

Add configuration option to StatisticsConverter to control interpretation of missing null counts in Parquet statistics #6485
Write null counts in Parquet statistics when they are known #6490

etseidl · 2024-11-25T23:44:04Z

Gentle bump...do we want to merge this now?

alamb · 2024-11-26T11:32:19Z

I am still quite worried about the subtle semantic implications of this change -- see for example the discussion on

[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) datafusion#13525

This is the kind of change that could easily lead to lots of debugging / head scratching I think, and even worse would be hard to catch with tests as it would only affect files written in the past

So I am torn, to be honest

etseidl · 2024-11-26T18:39:31Z

Not my call, but I would just throw out there that with this change, there is still a way to get the old behavior if desired. But without this change, there is no way for a user to figure out if null_count is truly 0 or is not present. Food for thought.

tustvold · 2024-11-29T17:38:57Z

This is inconsistent with the parquet spec as well as what parquet-java and parquet-cpp do

FWIW the current behaviour is a bug IMO and so my vote would be to proceed with this change. We've postponed it to a breaking release, and are calling it out as a major change in the changelog, so I think we've done all that we can reasonably be expected to.

tustvold · 2024-11-29T17:40:28Z

parquet/src/file/statistics.rs

+    /// To preserve the prior behavior and read null counts properly from older files
+    /// you should default to zero:


Perhaps we should make it clearer that this behaviour is actually incorrect, it will claim a null count of 0, when it actually isn't known
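One way to see why defaulting to zero can be incorrect is a pruning decision. The sketch below is purely illustrative: `can_skip_for_is_null` is a hypothetical helper, not part of parquet-rs, that decides whether a row group can be skipped for a filter such as `col IS NULL`.

```rust
/// A row group can be skipped for `col IS NULL` only when it is *known*
/// that the column contains no nulls.
fn can_skip_for_is_null(null_count: Option<u64>) -> bool {
    null_count == Some(0)
}

fn main() {
    // Statistics where the null count was never written.
    let raw: Option<u64> = None;

    // Defaulting to zero claims "no nulls", so the row group would be skipped
    // even though the real count is unknown.
    let defaulted = Some(raw.unwrap_or(0));
    println!("defaulted: skip = {}", can_skip_for_is_null(defaulted));

    // Keeping `None` leaves the row group in place, which is the safe choice.
    println!("unknown:   skip = {}", can_skip_for_is_null(raw));
}
```

For files written by older parquet-rs versions a missing count did mean zero, which is why the default-to-zero path is offered at all; for files from other writers the same default may claim something that is not known.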

etseidl

One last nit if we're doing this

etseidl · 2024-11-29T18:29:46Z

parquet/src/file/statistics.rs

-    /// Note this API returns Some(0) even if the null count was not present
-    /// in the statistics.
-    /// See <https://github.com/apache/arrow-rs/pull/6216/files>
+    /// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was


Suggested change

-    /// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was
+    /// Note: Versions of this library prior to `54.0.0` returned 0 if the null count was

alamb · 2024-12-10T19:09:06Z

After more thought, I would like to hold off merging this into arrow 54.0.0 because:

It isn't super urgent as I understand: It has been wrong for many years, and I don't know anyone who is actually affected by this
It has the potential for causing subtle downstream performance issues.

Maybe it is just that I am burned out on the fallout from the DataFusion 43 release (where there were a bunch of regressions and other challenges) but I don't want to introduce some other potentially subtle issue for a while longer

(I agree this may just be me being weak)

etseidl · 2024-12-10T21:19:10Z

Fair enough...I'll check back in February or March 😉

github-actions bot added the parquet Changes to the parquet crate label Aug 15, 2024

alamb force-pushed the alamb/parquet_null_counts branch from 6070619 to 2d70413 Compare August 15, 2024 18:20

Write null counts in parquet files when they are present

e00e160

alamb force-pushed the alamb/parquet_null_counts branch from 2d70413 to e00e160 Compare August 15, 2024 18:31

alamb marked this pull request as ready for review August 15, 2024 18:40

alamb mentioned this pull request Aug 15, 2024

parquet writer does not encode null count = 0 correctly #6256

Open

Michael-J-Ward approved these changes Aug 20, 2024

mapleFU mentioned this pull request Aug 25, 2024

Clarify num-nulls handling in Statistics and ColumnIndex apache/parquet-format#449

Merged

Merge remote-tracking branch 'apache/master' into alamb/parquet_null_counts

09c4047

etseidl approved these changes Sep 19, 2024

andygrove approved these changes Sep 19, 2024

alamb added next-major-release the PR has API changes and it waiting on the next major version api-change Changes to the arrow API and removed next-major-release the PR has API changes and it waiting on the next major version api-change Changes to the arrow API labels Sep 22, 2024

alamb added the next-major-release the PR has API changes and it waiting on the next major version label Sep 24, 2024

etseidl mentioned this pull request Sep 30, 2024

Add configuration option to StatisticsConverter to control interpretation of missing null counts in Parquet statistics #6485

Merged

Merge remote-tracking branch 'apache/master' into alamb/parquet_null_counts

6c05e60

etseidl mentioned this pull request Oct 1, 2024

Write null counts in Parquet statistics when they are known #6490

Merged

Merge remote-tracking branch 'apache/master' into alamb/parquet_null_counts

60aec4e

alamb mentioned this pull request Oct 13, 2024

Parquet Statistics null_count does not distinguish between 0 and not specified #6215

Closed

tustvold reviewed Nov 29, 2024

etseidl reviewed Nov 29, 2024

Update parquet/src/file/statistics.rs

da86f6c

Co-authored-by: Ed Seidl <[email protected]>

tustvold assigned tustvold and unassigned tustvold Dec 15, 2024

tustvold marked this pull request as draft December 15, 2024 12:06