RegEx expressions are un-evaluatable due to missing (pc)re imports #421

Closed
woodly0 opened this issue May 27, 2024 · 8 comments

Comments

@woodly0

woodly0 commented May 27, 2024

Hello Villu,

I am coming back with a topic that we have already discussed here.
Was wondering if meanwhile there is a possibility to match one string with another dynamically, e.g.

transformer = ExpressionTransformer("re.search(X['name'], X['email'])")

Maybe you remember that I wanted to create a binary feature reflecting whether or not the name could be found within the email address.

Thanks in advance!

@vruusmann
Member

> Was wondering if meanwhile there is a possibility to match one string with another dynamically

Right now, what happens? Does your code raise some sort of exception?

Looking at the (J)PMML source code, there don't seem to be any restrictions on how the RegEx pattern is specified: it can be a string literal or a string variable (feature).

@vruusmann
Member

> Right now, what happens? Does your code raise some sort of exception?

The conversion to PMML definitely works:

from pandas import DataFrame
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.preprocessing import ExpressionTransformer

X = DataFrame([["Hello World", "Hello"], ["One Two Three", "Zero"], ["alpha omega", "beta"]], columns = ["sentence", "word"])
print(X)

#transformer = ExpressionTransformer("X[0] + X[1]")
transformer = ExpressionTransformer("re.search(X[1], X[0])")
transformer.n_features_in_ = 2

sklearn2pmml(transformer, "Expression.pmml")

However, there seems to be an imports issue on the Python side:

import re

Xt = transformer.transform(X)
print(Xt)

The above raises a NameError, because the "re" module is not defined in the environment where the expression is evaluated:

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    Xt = transformer.transform(X)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn/utils/_set_output.py", line 295, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/preprocessing/__init__.py", line 286, in transform
    Xt = self._eval(X)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/preprocessing/__init__.py", line 272, in _eval
    Xt = eval_rows(X, _eval_row, to_numpy = (not is_1d(X)), shape = (-1, 1))
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/util/__init__.py", line 205, in eval_rows
    Xt = X.apply(func, axis = 1)
  File "/trunk/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 9565, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/trunk/.local/lib/python3.9/site-packages/pandas/core/apply.py", line 746, in apply
    return self.apply_standard()
  File "/trunk/.local/lib/python3.9/site-packages/pandas/core/apply.py", line 873, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/trunk/.local/lib/python3.9/site-packages/pandas/core/apply.py", line 889, in apply_series_generator
    results[i] = self.f(v)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/preprocessing/__init__.py", line 257, in _eval_row
    xt = expr_func(x)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/util/__init__.py", line 188, in evaluate
    return eval(expr, env)
  File "<string>", line 1, in <module>
NameError: name 're' is not defined
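
The `import re` in the calling script does not help, because `eval()` resolves names against the environment dict it is given, not against the caller's module. A minimal sketch of that behaviour, assuming an environment that only pre-imports `math` (the `evaluate` helper below is a hypothetical stand-in, not the sklearn2pmml function of the same name):

```python
import math
import re  # imported by the caller, but never placed into `env` below

def evaluate(expr, X):
    # Hypothetical stand-in: names in `expr` are resolved against `env` only
    env = {"math": math, "X": X}
    return eval(expr, env)

row = ["Hello World", "Hello"]

# Works, because `math` is part of the environment
print(evaluate("math.floor(len(X[0]) / 2)", row))  # 5

try:
    evaluate("re.search(X[1], X[0])", row)
except NameError as e:
    # `re` exists in this script, but not inside `env`
    print(e)
```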

vruusmann changed the title from "RegEx Expression based on multiple input columns" to "RegEx expressions are un-evaluatable due to missing (pc)re imports" on May 27, 2024
@vruusmann
Member

vruusmann commented May 27, 2024

Right now, Python expressions are evaluated in an environment that imports math, numpy and pandas modules:
https://github.com/jpmml/sklearn2pmml/blob/0.108.0/sklearn2pmml/util/__init__.py#L176

To make the current example work, it would need to include the re module (or even better, the pcre module).

Unfortunately, there is no way to override/customize the module list from the ExpressionTransformer object itself. Could be implemented as a new ExpressionTransformer.modules attribute.

@woodly0
Author

woodly0 commented May 27, 2024

Maybe RegEx is overkill for what I am trying to do. It would be enough to have some sort of True if "bc" in "abcd" else False logic. However, I don't know how that works behind the scenes.

@vruusmann
Member

> Maybe RegEx is overkill for what I am trying to do.

RegEx is definitely expensive, but it is the only tool for custom string manipulation in PMML.

> It would be enough to have some sort of True if "bc" in "abcd" else False logic.

There is no built-in method for "find index of substring in string" in PMML (analogous to Python's find() method). It's RegEx or bust!

Anyway, if string (pre-)processing functionality is available in the form of standalone Python and Java libraries, then it's possible to use the UDF approach - the JPMML-Evaluator library can invoke a 3rd party Java library function.

But the UDF approach loses portability, and is best avoided if RegExes will do.
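
On the Python side, a plain substring test can be emulated with a RegEx by escaping the needle first, so that characters such as `.` or `+` in a name are matched literally rather than as pattern syntax. This is only a sketch of the idea in plain Python; whether `re.escape()` is understood by the PMML converter is not established in this thread, and the `contains` helper is a hypothetical name.

```python
import re

def contains(needle, haystack):
    # re.escape() makes every character in `needle` match literally
    return re.search(re.escape(needle), haystack) is not None

print(contains("bc", "abcd"))     # True
print(contains("a.b", "axb"))     # False: the dot is not a wildcard here
print(contains("a.b", "xa.by"))   # True
```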

@woodly0
Author

woodly0 commented May 27, 2024

OK, I see.

> Unfortunately, there is no way to override/customize the module list from the ExpressionTransformer object itself. Could be implemented as a new ExpressionTransformer.modules attribute.

Sounds good but how would I achieve this?

@vruusmann
Member

> Could be implemented as a new ExpressionTransformer.modules attribute.

> Sounds good but how would I achieve this?

See https://github.com/jpmml/sklearn2pmml/blob/0.108.0/sklearn2pmml/preprocessing/__init__.py#L250

Replace this:

expr_func = to_expr_func(self.expr)

With this:

expr_func = to_expr_func(self.expr, modules = ["math", "re"])

The long-term solution is to make the list of modules easily customizable via an ExpressionTransformer constructor parameter.
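
What a constructor-level module list might look like can be sketched with a toy class. Everything here is hypothetical: `ToyExpressionTransformer` and its `modules` parameter are illustrative names only, the class works on plain lists of rows rather than DataFrames, and it is not the sklearn2pmml implementation.

```python
import importlib

class ToyExpressionTransformer:
    """Hypothetical illustration only; not the sklearn2pmml class."""

    def __init__(self, expr, modules=("math",)):
        self.expr = expr
        self.modules = list(modules)

    def transform(self, rows):
        # Import each requested top-level module by name and expose it to the expression
        env = {name: importlib.import_module(name) for name in self.modules}
        code = compile(self.expr, "<expr>", "eval")
        results = []
        for row in rows:
            env["X"] = row  # the expression refers to the current row as X
            results.append(eval(code, env))
        return results

rows = [["Hello World", "Hello"], ["One Two Three", "Zero"], ["alpha omega", "beta"]]

transformer = ToyExpressionTransformer("re.search(X[1], X[0]) is not None", modules=["math", "re"])
print(transformer.transform(rows))  # [True, False, False]
```

With the default `modules=("math",)` the same expression fails with a NameError, which mirrors the traceback reported earlier in this thread.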

@vruusmann
Member

For simple re.search() functionality you can use the sklearn2pmml.preprocessing.MatchesTransformer class: https://github.com/jpmml/sklearn2pmml/blob/0.108.0/sklearn2pmml/preprocessing/__init__.py#L583-L598

It doesn't suffer from missing imports, because it has everything hard-coded.
