Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional regexp function regexp_count() #12080

Closed
wants to merge 15 commits into from

Conversation

xinlifoobar
Copy link
Contributor

@xinlifoobar xinlifoobar commented Aug 20, 2024

Which issue does this PR close?

Closes #12079 and part of #11946

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) functions labels Aug 20, 2024
@github-actions github-actions bot added development-process Related to development process of DataFusion core Core DataFusion crate substrait proto Related to proto crate labels Aug 21, 2024

# NULL tests

query I
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly different from PostgreSQL. Datafusion treat NULL literary as StringArray of 1 element NULL instead of null array or empty array

@xinlifoobar xinlifoobar force-pushed the dev/xinli/regexp_count branch from 45afcf0 to 08343dd Compare August 22, 2024 06:53
@xinlifoobar xinlifoobar marked this pull request as draft August 23, 2024 09:33
@github-actions github-actions bot removed core Core DataFusion crate substrait proto Related to proto crate labels Aug 29, 2024
@xinlifoobar xinlifoobar marked this pull request as ready for review August 29, 2024 04:15
@github-actions github-actions bot removed the development-process Related to development process of DataFusion label Aug 29, 2024
@xinlifoobar
Copy link
Contributor Author

Finalizing the details takes a lot more effort than expected. Would you like to take a look? Thanks! @alamb @jayzhan211

Benchmark

regexp_count_1000 string
                        time:   [6.6158 ms 6.6634 ms 6.7108 ms]

regexp_count_1000 utf8view
                        time:   [6.7117 ms 6.7647 ms 6.8183 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

regexp_like_1000        time:   [3.7056 ms 3.7170 ms 3.7289 ms]
                        change: [-5.9843% -5.0861% -4.1952%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

regexp_match_1000       time:   [4.4132 ms 4.4287 ms 4.4466 ms]
                        change: [-9.8318% -8.4331% -7.0280%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

regexp_replace_1000     time:   [3.4351 ms 3.4697 ms 3.5142 ms]
                        change: [-13.734% -11.848% -10.019%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

@@ -52,7 +52,7 @@ encoding_expressions = ["base64", "hex"]
# enable math functions
math_expressions = []
# enable regular expressions
regex_expressions = ["regex"]
regex_expressions = ["regex", "string_expressions"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have to do this for StringArrayType. Maybe we should consider relocating it to a common package?

Copy link
Contributor

@Omega359 Omega359 Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be very open to moving the StringArrayType, StringArrayBuilder, StringViewArrayBuilder, etc to a file in functions such as string_array.rs as they are used in multiple modules (unicode, regex, likely datetime in the future, etc) and are quite useful in general for any external UDF that might need them.

)));
}

let mut regex_cache = HashMap::new();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want a global regex cache?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be quite hesitant about adding any global cache for a UDF.

signature: Signature::one_of(
vec![
Uniform(2, vec![Utf8, LargeUtf8, Utf8View]),
Exact(vec![Utf8, Utf8, Int64]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to invert the order of these:

Planner attempts coercion to the target type starting with the most preferred candidate.
For example, given input `(Utf8View, Utf8)`, it first tries coercing to `(Utf8View, Utf8View)`.
If that fails, it proceeds to `(Utf8, Utf8)`.

- [regexp_like](#regexp_like)
- [regexp_match](#regexp_match)
- [regexp_replace](#regexp_replace)

[pcre-like]: https://en.wikibooks.org/wiki/Regular_Expressions/Perl-Compatible_Regular_Expressions
[syntax]: https://docs.rs/regex/latest/regex/#syntax

### `regexp_count`

Returns the number of matchs that a [regular expression] has in a string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matches*

@Omega359
Copy link
Contributor

Thanks for the very nice PR @xinlifoobar ! I'll try and take the time to do a full review of this PR next week if no one beats me to it.

if values.len() != regex_array.len() {
return Err(ArrowError::ComputeError(format!(
"regex_array must be the same length as values array; got {} and {}",
values.len(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest aligning the parameters with the text by having regex_array.len() first

Exact(vec![LargeUtf8, LargeUtf8, Int64, LargeUtf8]),
Exact(vec![Utf8View, Utf8View, Int64]),
Exact(vec![Utf8View, Utf8View, Int64, Utf8View]),
],
Copy link
Contributor

@Omega359 Omega359 Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note on this signature. The postgresql version of this has start as optional - I think it would be nice to allow that as well in this UDF. The UDF could just use a scalar of 1 if it's not present to make the existing count functions work as is.
Sorry, I missed the Uniform portion of the signature - I see the tests cover this as well :)

));
}

let find_slice = value.chars().skip(start as usize - 1).collect::<String>();
Copy link
Contributor

@Omega359 Omega359 Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be worth it to have a fast path if start is 1? Untested suggestion:

Suggested change
let find_slice = value.chars().skip(start as usize - 1).collect::<String>();
let count = if start == 1 { pattern.find_iter(value).count() } else {
let find_slice = value.chars().skip(start as usize - 1).collect::<String>()
pattern.find_iter(find_slice.as_str()).count()
} ;

@Omega359
Copy link
Contributor

Omega359 commented Oct 2, 2024

I think this is a good PR and worthy of merging into main. My only thoughts are some small things noted in my comments and the fact that counts seems to be twice as slow as like and replace. I was able to reproduce the benchmark but unfortunately I cannot run flamegraph on my machine (perf and WSL is a black art) so I wasn't really able to narrow down the cause.

@Omega359
Copy link
Contributor

@alamb - since @xinlifoobar seems to be dormant my thoughts on this PR is to merge it in and file a couple of tickets to improve it, primarily the performance discrepancy.

@alamb
Copy link
Contributor

alamb commented Oct 11, 2024

Makes sense to me -- thank you @Omega359 -- would you be willing to make a new PR (merged/ rebased to fix the conflicts on main)? I can help with the review / merge / file follow on tickets.

@Omega359
Copy link
Contributor

Sure thing. I'll hopefully work on it this weekend after #12149

@alamb
Copy link
Contributor

alamb commented Oct 18, 2024

Merged in #12970

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add additional regexp function regexp_count()
3 participants