Support 'NULL' as Null in csv parser. #13228

Merged: 1 commit, Dec 18, 2024

@dhegberg dhegberg commented Nov 1, 2024

Which issue does this PR close?

Closes #12904.

Rationale for this change

Bring consistency between handling of NULL in CSV and SQL parsing.

What changes are included in this PR?

Treat a literal NULL field in CSV as a null entry. Continue to treat a zero-length entry as null.
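
To make the intended behavior concrete, here is a minimal sketch in plain Rust. It is not DataFusion or arrow-csv code, and the helper `parse_i64_field` is hypothetical. It only illustrates why a literal NULL in a numeric column has to be recognized before type conversion: without that check, parsing `NULL` as an integer fails, which is the situation described in the linked issue.

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not part of DataFusion): convert one raw CSV field
/// into a nullable integer. An empty field or the literal `NULL` becomes
/// `None`; anything else must parse as an i64.
fn parse_i64_field(field: &str) -> Result<Option<i64>, ParseIntError> {
    if field.is_empty() || field == "NULL" {
        return Ok(None);
    }
    field.parse::<i64>().map(Some)
}

fn main() {
    // "abc" is included to show that non-null, non-numeric text still errors.
    for raw in ["42", "", "NULL", "abc"] {
        println!("{raw:?} -> {:?}", parse_i64_field(raw));
    }
}
```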

Are these changes tested?

Added unit tests that test NULL and 0 length entries in the data CSV. Associated change in the testing data adds null columns: apache/arrow-testing#103

Are there any user-facing changes?

Yes, a NULL field in a .csv would no longer be treated as a string.

@github-actions github-actions bot added the core Core DataFusion crate label Nov 1, 2024

jayzhan211 commented Nov 2, 2024

I think we can add such a test under datafusion/core/tests/data/ instead of in testing

@jayzhan211 jayzhan211 left a comment

👍

@alamb alamb left a comment

Thanks @dhegberg and @jayzhan211

@@ -127,6 +127,7 @@ parquet = { workspace = true, optional = true, default-features = true }
paste = "1.0.15"
pin-project-lite = "^0.2.7"
rand = { workspace = true }
regex = { workspace = true }
Contributor

I believe this is not actually a new dependency, as it is already a dependency of arrow-csv

Contributor Author

When I remove it, I get this error during cargo build:

error[E0432]: unresolved import `regex`
  --> datafusion/core/src/datasource/file_format/csv.rs:59:5
   |
59 | use regex::Regex;
   |     ^^^^^ help: a similar path exists: `datafusion_functions::regex`

For more information about this error, try `rustc --explain E0432`.
error: could not compile `datafusion` (lib) due to 1 previous error

Contributor

Yeah, the "better" thing to do would be for arrow-rs to re-export the Regex struct it uses, since it appears in the public API

datafusion/core/src/datasource/file_format/csv.rs (outdated)
- .with_delimiter(self.options.delimiter);
+ .with_delimiter(self.options.delimiter)
+ // Support literal NULL or empty string as null
+ .with_null_regex(Regex::new(r"^NULL$|^$").unwrap());
Contributor

Shouldn't this be a configuration option instead?
Most CSV writers use an empty string (,,) for null, which also makes sense as the default.
Could having the regex enabled also impact performance?

@alamb alamb Nov 4, 2024

A configuration option makes sense to me

I poked around for a CSV querying benchmark and couldn't find one. Perhaps we should write one 🤔 I don't know whether this would have any effect

Contributor Author

So, to confirm:

I'll open up a separate PR to implement a benchmark.

When that gets merged I'll update this PR to use a configuration option and add the new config to the benchmark test to confirm if there is a regression.

Contributor

Sounds like a good plan to me!

Contributor

Agreed -- thank you @dhegberg

Contributor

I think it might make sense to make it a configuration option regardless of whether there is a performance degradation, because the new behavior would be a breaking change and potentially undesired for some users.
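
The opt-in design discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the configuration added by the PR: the struct `NullOptions`, its field `null_literal`, and its methods are hypothetical, and a plain literal stands in for the regex so the example needs only the standard library. The point is that the default keeps the old behavior (only empty fields are null), so existing users see no change unless they opt in.

```rust
/// Hypothetical options struct; names are illustrative only and do not
/// mirror the DataFusion API.
#[derive(Debug, Clone, Default)]
struct NullOptions {
    /// Extra literal to treat as null. `None` (the default) keeps the
    /// previous behavior, where only empty fields are null.
    null_literal: Option<String>,
}

impl NullOptions {
    /// Opt in to treating `literal` as null.
    fn with_null_literal(mut self, literal: &str) -> Self {
        self.null_literal = Some(literal.to_string());
        self
    }

    /// Empty fields are always null. The configured literal must match the
    /// whole field exactly, similar in spirit to the anchors in `^NULL$|^$`.
    fn is_null(&self, field: &str) -> bool {
        field.is_empty() || self.null_literal.as_deref() == Some(field)
    }
}

fn main() {
    let default_opts = NullOptions::default();
    let opt_in = NullOptions::default().with_null_literal("NULL");
    for field in ["", "NULL", "null", "NULLABLE"] {
        println!(
            "{field:?}: default={} opt_in={}",
            default_opts.is_null(field),
            opt_in.is_null(field)
        );
    }
}
```

Because the match is exact and case-sensitive, strings such as `null` or `NULLABLE` remain ordinary string values even after opting in.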

@alamb alamb marked this pull request as draft November 5, 2024 16:46

alamb commented Nov 5, 2024

Marking as draft so we don't accidentally merge this

@dhegberg dhegberg force-pushed the support_nulls_in_csv branch from a6224b3 to f615857 Compare December 17, 2024 03:09
@github-actions github-actions bot added common Related to common crate proto Related to proto crate labels Dec 17, 2024
@dhegberg dhegberg force-pushed the support_nulls_in_csv branch from f615857 to d67c27c Compare December 17, 2024 03:13

dhegberg commented Dec 17, 2024

Updated to move the null-parsing regex into a configuration option.

A benchmark comparison with the regex enabled does show some regression:

Before:

     Running benches/csv_load.rs (/Users/dhegberg/workplace/datafusion/target/release/deps/csv_load-0ca64cec5e99a8c3)
Gnuplot not found, using plotters backend
Generated test dataset with 69642 rows
Benchmarking load csv testing/default csv read options
Benchmarking load csv testing/default csv read options: Warming up for 3.0000 s
Benchmarking load csv testing/default csv read options: Collecting 100 samples in estimated 20.457 s (1200 iterations)
Benchmarking load csv testing/default csv read options: Analyzing
load csv testing/default csv read options
                        time:   [20.305 ms 20.536 ms 20.763 ms]
mean   [20.305 ms 20.763 ms] std. dev.      [1.0513 ms 1.2800 ms]
median [20.127 ms 21.042 ms] med. abs. dev. [1.0398 ms 1.6551 ms]

After:

Gnuplot not found, using plotters backend
Generated test dataset with 69642 rows
Benchmarking load csv testing/default csv read options
Benchmarking load csv testing/default csv read options: Warming up for 3.0000 s
Benchmarking load csv testing/default csv read options: Collecting 100 samples in estimated 21.606 s (1200 iterations)
Benchmarking load csv testing/default csv read options: Analyzing
load csv testing/default csv read options
                        time:   [21.583 ms 21.856 ms 22.166 ms]
                        change: [+1.9609% +3.6130% +5.3682%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
mean   [21.583 ms 22.166 ms] std. dev.      [988.76 µs 2.0538 ms]
median [21.438 ms 21.965 ms] med. abs. dev. [776.50 µs 1.3261 ms]
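
As a rough sanity check on these numbers, the sketch below derives the relative change and the per-row cost from the two middle `time` estimates (20.536 ms and 21.856 ms) and the 69642-row dataset size. The function names are hypothetical. Comparing the two point estimates directly gives roughly 6.4%, which need not equal the `change` interval Criterion prints, presumably because Criterion compares against its own stored baseline rather than the pasted "Before" run.

```rust
/// Relative change of `after` with respect to `before`, in percent.
fn relative_change_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Average time per row in nanoseconds, given a total in milliseconds.
fn ns_per_row(total_ms: f64, rows: u64) -> f64 {
    total_ms * 1_000_000.0 / rows as f64
}

fn main() {
    // Point estimates and row count copied from the benchmark output above.
    let (before_ms, after_ms, rows) = (20.536, 21.856, 69_642);
    println!(
        "relative change: {:.2}%",
        relative_change_pct(before_ms, after_ms)
    );
    println!("before: {:.1} ns/row", ns_per_row(before_ms, rows));
    println!("after:  {:.1} ns/row", ns_per_row(after_ms, rows));
}
```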

@dhegberg
Contributor Author

@jayzhan211 @alamb

I've revised this so that null handling on import is set via config.

The benchmark shows a small regression when this config is used, which should be acceptable since it is opt-in.

@dhegberg dhegberg marked this pull request as ready for review December 17, 2024 12:29
@alamb alamb left a comment

Thanks @dhegberg -- this looks great to me. Thank you for your diligence on this. I will plan to merge the PR tomorrow unless there are additional comments or others would like time to review

@alamb alamb merged commit 01ffb64 into apache:main Dec 18, 2024
27 checks passed

alamb commented Dec 18, 2024

🚀

Labels
common (Related to common crate), core (Core DataFusion crate), proto (Related to proto crate)
Development

Successfully merging this pull request may close these issues.

CSV can't parse null value for non-string type (i32, i64, float)
5 participants