Support 'NULL' as Null in csv parser. #13228

dhegberg · 2024-11-01T23:23:53Z

Which issue does this PR close?

Rationale for this change

Bring consistency between handling of NULL in CSV and SQL parsing.

What changes are included in this PR?

Treat a literal NULL field in a CSV file as a null entry. Continue to treat a zero-length entry as null.

Are these changes tested?

Added unit tests that test NULL and 0 length entries in the data CSV. Associated change in the testing data adds null columns: apache/arrow-testing#103

Are there any user-facing changes?

Yes, a NULL field in a .csv would no longer be treated as a string.

jayzhan211 · 2024-11-02T00:19:39Z

I think we can add such a test under datafusion/core/tests/data/ instead of under testing

jayzhan211

👍

alamb

Thanks @dhegberg and @jayzhan211

alamb · 2024-11-03T11:27:28Z

datafusion/core/Cargo.toml

@@ -127,6 +127,7 @@ parquet = { workspace = true, optional = true, default-features = true }
 paste = "1.0.15"
 pin-project-lite = "^0.2.7"
 rand = { workspace = true }
+regex = { workspace = true }


I believe this is not actually a new dependency, as it is already a dependency of arrow-csv

When I remove it I have this error during cargo build:

error[E0432]: unresolved import `regex`
  --> datafusion/core/src/datasource/file_format/csv.rs:59:5
   |
59 | use regex::Regex;
   |     ^^^^^ help: a similar path exists: `datafusion_functions::regex`

For more information about this error, try `rustc --explain E0432`.
error: could not compile `datafusion` (lib) due to 1 previous error

Yeah, the "better" thing to do would be for arrow-rs to re-export the Regex struct it uses, since it appears in the public API

Dandandan · 2024-11-04T08:31:24Z

datafusion/core/src/datasource/file_format/csv.rs

-                .with_delimiter(self.options.delimiter);
+                .with_delimiter(self.options.delimiter)
+                // Support literal NULL or empty string as null
+                .with_null_regex(Regex::new(r"^NULL$|^$").unwrap());
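
For readers skimming the diff, here is a minimal sketch of which fields the pattern `^NULL$|^$` selects. It deliberately avoids the `regex` crate and uses plain string comparison as a stand-in, which should be equivalent for this particular pattern because both alternatives are anchored at the start and end of the field. The function names `is_null_field` and `parse_fields` are hypothetical and are not part of DataFusion or arrow-csv, and the comma split ignores quoting, so this is an illustration rather than a CSV parser.

```rust
/// Std-only stand-in for the pattern `^NULL$|^$`: with both alternatives
/// anchored at start and end, only the exact text `NULL` or an empty
/// field can match. (Hypothetical helper, not a DataFusion API.)
fn is_null_field(field: &str) -> bool {
    field.is_empty() || field == "NULL"
}

/// Maps each raw field of one line to `None` (null) or `Some(text)`.
/// Assumption: a naive split on ',' with no quoting or escaping support.
fn parse_fields(line: &str) -> Vec<Option<&str>> {
    line.split(',')
        .map(|f| if is_null_field(f) { None } else { Some(f) })
        .collect()
}
```

Under these assumptions, `parse_fields("a,NULL,,null")` treats the second and third fields as null, while lowercase `null` and values such as `NULLS` stay ordinary strings, because the comparison is case-sensitive and exact.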


Shouldn't this be a configuration option instead?
Most csv writers will use an empty string (,,) for null which also makes sense as default.
Having the regex enabled might also impact performance?

A configuration option makes sense to me

I poked around for a csv querying benchmark and I can't find one. Perhaps we should write one 🤔 I don't know if this would have any effect

So, to confirm:

1. I'll open up a separate PR to implement a benchmark.
2. When that gets merged I'll update this PR to use a configuration option and add the new config to the benchmark test to confirm if there is a regression.

Sounds like a good plan to me!

Agreed -- thank you @dhegberg

I think it might make sense to make it a configuration option regardless of whether there is a performance degradation. The new behavior would be a breaking change and potentially undesired for some users.
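
The opt-in design argued for in this thread can be sketched as follows. This is only an illustration under stated assumptions: `NullHandling`, `extra_null_marker`, and `with_extra_null_marker` are hypothetical names, not DataFusion's configuration API, and a plain string marker is used in place of a regex to keep the example std-only. The point is that the default value preserves the existing behavior (only empty fields are null), so nothing changes for users who do not set the option.

```rust
/// Hypothetical options struct (not DataFusion's API): `None` keeps the
/// existing behavior, `Some(marker)` opts in to one extra null marker.
#[derive(Debug, Default, Clone, PartialEq)]
struct NullHandling {
    extra_null_marker: Option<String>,
}

impl NullHandling {
    /// Builder-style setter, so callers must request the new behavior explicitly.
    fn with_extra_null_marker(mut self, marker: &str) -> Self {
        self.extra_null_marker = Some(marker.to_string());
        self
    }

    /// Empty fields are always null; the extra marker only applies when set.
    fn is_null(&self, field: &str) -> bool {
        field.is_empty() || self.extra_null_marker.as_deref() == Some(field)
    }
}
```

With `NullHandling::default()`, a field containing `NULL` remains a string; only after calling `with_extra_null_marker("NULL")` does it become null, which is what keeps the change non-breaking.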

alamb · 2024-11-05T16:46:18Z

Marking as draft so we don't accidentally merge this

Co-authored-by: Andrew Lamb <[email protected]>

dhegberg · 2024-12-17T03:16:16Z

Updated to move Null parsing regex to a config.

Benchmark comparison when using regex does show some regression:

Before:

     Running benches/csv_load.rs (/Users/dhegberg/workplace/datafusion/target/release/deps/csv_load-0ca64cec5e99a8c3)
Gnuplot not found, using plotters backend
Generated test dataset with 69642 rows
Benchmarking load csv testing/default csv read options
Benchmarking load csv testing/default csv read options: Warming up for 3.0000 s
Benchmarking load csv testing/default csv read options: Collecting 100 samples in estimated 20.457 s (1200 iterations)
Benchmarking load csv testing/default csv read options: Analyzing
load csv testing/default csv read options
                        time:   [20.305 ms 20.536 ms 20.763 ms]
mean   [20.305 ms 20.763 ms] std. dev.      [1.0513 ms 1.2800 ms]
median [20.127 ms 21.042 ms] med. abs. dev. [1.0398 ms 1.6551 ms]

After:

Gnuplot not found, using plotters backend
Generated test dataset with 69642 rows
Benchmarking load csv testing/default csv read options
Benchmarking load csv testing/default csv read options: Warming up for 3.0000 s
Benchmarking load csv testing/default csv read options: Collecting 100 samples in estimated 21.606 s (1200 iterations)
Benchmarking load csv testing/default csv read options: Analyzing
load csv testing/default csv read options
                        time:   [21.583 ms 21.856 ms 22.166 ms]
                        change: [+1.9609% +3.6130% +5.3682%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
mean   [21.583 ms 22.166 ms] std. dev.      [988.76 µs 2.0538 ms]
median [21.438 ms 21.965 ms] med. abs. dev. [776.50 µs 1.3261 ms]

dhegberg · 2024-12-17T04:36:30Z

@jayzhan211 @alamb

I've revised this to set the null import via config.

The benchmark shows a small regression when this config is used (Criterion reports roughly +3.6%), but that should be acceptable since the option is opt-in.

alamb

Thanks @dhegberg -- this looks great to me. Thank you for your diligence on this. I will plan to merge the PR tomorrow unless there are additional comments or others would like time to review

alamb · 2024-12-18T23:26:13Z

🚀

github-actions bot added the core Core DataFusion crate label Nov 1, 2024

dhegberg mentioned this pull request Nov 1, 2024

Add csv with nulls. apache/arrow-testing#103

Closed

jayzhan211 approved these changes Nov 3, 2024

alamb approved these changes Nov 3, 2024

Dandandan reviewed Nov 4, 2024

alamb marked this pull request as draft November 5, 2024 16:46

dhegberg force-pushed the support_nulls_in_csv branch from a6224b3 to f615857 Compare December 17, 2024 03:09

github-actions bot added common Related to common crate proto Related to proto crate labels Dec 17, 2024

Support Null regex override in csv parser options.

d67c27c

Co-authored-by: Andrew Lamb <[email protected]>

dhegberg force-pushed the support_nulls_in_csv branch from f615857 to d67c27c Compare December 17, 2024 03:13

dhegberg marked this pull request as ready for review December 17, 2024 12:29

alamb approved these changes Dec 17, 2024

alamb merged commit 01ffb64 into apache:main Dec 18, 2024
27 checks passed
