Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional regexp functions #11946

Open
6 tasks
timsaucer opened this issue Aug 12, 2024 · 10 comments
Open
6 tasks

Add additional regexp functions #11946

timsaucer opened this issue Aug 12, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@timsaucer
Copy link
Contributor

Is your feature request related to a problem or challenge?

I would like to see the following regexp functions implemented. These exist in some, but not all, versions of PostgreSQL.

Describe the solution you'd like

Implement these functions.

Describe alternatives you've considered

These operations can be performed using the existing functions, so I am currently unblocked for my immediate use case but having these functions built in would be convenient.

Additional context

We currently have the following regexp functions implemented. The source is in datafusion/functions/src/regex/mod.rs

regexp_like()
regexp_match()
regexp_replace()

@xinlifoobar
Copy link
Contributor

I could work on this. The only concern is whether we implement the regexp function in this project or in arrow-rs.

Hey @alamb, would you prefer implement function in arrow-rs directly or we put them in datafusion and port later?

@alamb
Copy link
Contributor

alamb commented Aug 17, 2024

I could work on this. The only concern is whether we implement the regexp function in this project or in arrow-rs.

Hey @alamb, would you prefer implement function in arrow-rs directly or we put them in datafusion and port later?

Thanks @xinlifoobar

I would personally recommend we start implementing them in datafusion as that will avoid the need to wait for coordinated releases of arrow-rs, and then port backupstream to arrow-rs as a follow on step.

@alamb
Copy link
Contributor

alamb commented Aug 17, 2024

@xinlifoobar I suspect there will be several other contribtuors interested in helping out and learning during the process. If we have a good example to follow the work would be straightforward to scale I think

One way to do this might be:

  1. You implement one of these functions in a PR, along with good docs, tests, etc
  2. Then we can file additional tickets for the other functions, linking to your first implementation

@nrc
Copy link
Contributor

nrc commented Aug 20, 2024

Related to this, substring in Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it?

The currently accepted argument types are:

                    Exact(vec![Utf8, Int64]),
                    Exact(vec![LargeUtf8, Int64]),
                    Exact(vec![Utf8, Int64, Int64]),
                    Exact(vec![LargeUtf8, Int64, Int64]),
                    Exact(vec![Utf8View, Int64]),
                    Exact(vec![Utf8View, Int64, Int64]),

Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.

@Omega359
Copy link
Contributor

Related to this, substring in Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it?
Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.

Spark's version of this is https://spark.apache.org/docs/latest/api/sql/#regexp_substr

@timsaucer
Copy link
Contributor Author

Based on prior conversations it sounds like the group is most interested in making sure we are supporting Postgresql so I think adding this is a very good idea. We can also have regexp_substr as an alias.

@hengfeiyang
Copy link
Contributor

hengfeiyang commented Oct 18, 2024

It is a good idea to support these functions, especially for regexp_matches, Thank you.

@Omega359
Copy link
Contributor

It is a good idea to support these functions, especially for regexp_matches, Thank you.

Adding regexp_matches would be trivial - it's essentially the current regexp_match but the 'g' (global) flag is always present (whereas with the regexp_match it's disallowed)

@Omega359
Copy link
Contributor

Related to this, substring in Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it?

The currently accepted argument types are:

                    Exact(vec![Utf8, Int64]),
                    Exact(vec![LargeUtf8, Int64]),
                    Exact(vec![Utf8, Int64, Int64]),
                    Exact(vec![LargeUtf8, Int64, Int64]),
                    Exact(vec![Utf8View, Int64]),
                    Exact(vec![Utf8View, Int64, Int64]),

Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.

I took a look at the syntax for that pg function, and frankly it's awful. Personally I think that is a function best ignored.

@timsaucer
Copy link
Contributor Author

I see your point for the postgres - since it essentially does either substring from this character index to that OR regex matching, I think it would probably add more confusion than value to support in exactly the way they do. However I'm far from a SQL expert.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants