-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support 'NULL' as Null in csv parser. #13228
Conversation
I think we can add such test under |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dhegberg and @jayzhan211
@@ -127,6 +127,7 @@ parquet = { workspace = true, optional = true, default-features = true } | |||
paste = "1.0.15" | |||
pin-project-lite = "^0.2.7" | |||
rand = { workspace = true } | |||
regex = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is not a new actual dependency as it is already a dependency of arrow-csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I remove it I have this error during cargo build
:
error[E0432]: unresolved import `regex`
--> datafusion/core/src/datasource/file_format/csv.rs:59:5
|
59 | use regex::Regex;
| ^^^^^ help: a similar path exists: `datafusion_functions::regex`
For more information about this error, try `rustc --explain E0432`.
error: could not compile `datafusion` (lib) due to 1 previous error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the "better" think to do would be for arrow-rs to re-export the RegEx structure it used as it appears in the public API
.with_delimiter(self.options.delimiter); | ||
.with_delimiter(self.options.delimiter) | ||
// Support literal NULL or empty string as null | ||
.with_null_regex(Regex::new(r"^NULL$|^$").unwrap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a configuration option instead?
Most csv writers will use an empty string (,,
) for null which also makes sense as default.
Having the regex enabled might also impact performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A configuration option makes sense to me
I poked around for a csv querying benchmark and I can't find one. Perhaps we should write one 🤔 I don't know if this would have any effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, to confirm:
I'll open up a separate PR to implement a benchmark.
When that gets merged I'll update this PR to use a configuration option and add the new config to the benchmark test to confirm if there is a regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good plan to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed -- thank you @dhegberg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might make sense to make it an configuration regardless if there is a performance degradation or not. This is because the new behavior will be a breaking change and potentially undesired behavior for some users.
Marking as draft so we don't accidentally merge this |
a6224b3
to
f615857
Compare
Co-authored-by: Andrew Lamb <[email protected]>
f615857
to
d67c27c
Compare
Updated to move Null parsing regex to a config. Benchmark comparison when using regex does show some regression: Before:
After:
|
I've revised this to set the null import via config. Shows a small regression when this config is used, but it should be acceptable since this is opt in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dhegberg -- this looks great to me. Thank you for your diligence on this. I will plan to merge the PR tomorrow unless there are additional comments or others would like time to review
🚀 |
Which issue does this PR close?
Closes #12904.
Rationale for this change
Bring consistency between handling of
NULL
in CSV and SQL parsing.What changes are included in this PR?
Treat a
NULL
field in CSV as a null entry. Maintain treatment of 0 length entry as a null.Are these changes tested?
Added unit tests that test
NULL
and 0 length entries in the data CSV. Associated change in the testing data adds null columns: apache/arrow-testing#103Are there any user-facing changes?
Yes, a
NULL
field in a .csv would no longer be treated as a string.