RegEx expressions are un-evaluatable due to missing `(pc)re` imports #421

woodly0 · 2024-05-27T08:37:52Z

Hello Villu,

I am coming back with a topic that we have already discussed here.
Was wondering if meanwhile there is a possibility to match one string with another dynamically, e.g.

transformer = ExpressionTransformer("re.search(X['name'], X['email'])")

Maybe you remember that I wanted to create a binary feature reflecting whether or not the name could be found within the email address.

Thanks in advance!

vruusmann · 2024-05-27T09:45:49Z

> Was wondering if meanwhile there is a possibility to match one string with another dynamically

Right now, what happens? Does your code raise some sort of exception?

Looking at the (J)PMML source code, there don't seem to be any restrictions on how the RegEx pattern is specified: it can be a string literal or a string variable (feature).

vruusmann · 2024-05-27T09:58:32Z

> Right now, what happens? Does your code raise some sort of exception?

The conversion to PMML definitely works:

from pandas import DataFrame
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.preprocessing import ExpressionTransformer

X = DataFrame([["Hello World", "Hello"], ["One Two Three", "Zero"], ["alpha omega", "beta"]], columns = ["sentence", "word"])
print(X)

#transformer = ExpressionTransformer("X[0] + X[1]")
transformer = ExpressionTransformer("re.search(X[1], X[0])")
transformer.n_features_in_ = 2

sklearn2pmml(transformer, "Expression.pmml")

However, there seems to be an import issue on the Python side:

import re

Xt = transformer.transform(X)
print(Xt)

The above raises a NameError, because the "re" module is not available in the environment where the expression is evaluated:

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    Xt = transformer.transform(X)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn/utils/_set_output.py", line 295, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/preprocessing/__init__.py", line 286, in transform
    Xt = self._eval(X)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/preprocessing/__init__.py", line 272, in _eval
    Xt = eval_rows(X, _eval_row, to_numpy = (not is_1d(X)), shape = (-1, 1))
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/util/__init__.py", line 205, in eval_rows
    Xt = X.apply(func, axis = 1)
  File "/trunk/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 9565, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/trunk/.local/lib/python3.9/site-packages/pandas/core/apply.py", line 746, in apply
    return self.apply_standard()
  File "/trunk/.local/lib/python3.9/site-packages/pandas/core/apply.py", line 873, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/trunk/.local/lib/python3.9/site-packages/pandas/core/apply.py", line 889, in apply_series_generator
    results[i] = self.f(v)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/preprocessing/__init__.py", line 257, in _eval_row
    xt = expr_func(x)
  File "/trunk/.local/lib/python3.9/site-packages/sklearn2pmml/util/__init__.py", line 188, in evaluate
    return eval(expr, env)
  File "<string>", line 1, in <module>
NameError: name 're' is not defined

vruusmann · 2024-05-27T10:10:18Z

Right now, Python expressions are evaluated in an environment that imports math, numpy and pandas modules:
https://github.com/jpmml/sklearn2pmml/blob/0.108.0/sklearn2pmml/util/__init__.py#L176

To make the current example work, it would need to include the re module (or even better, the pcre module).

Unfortunately, there is no way to override/customize the module list from the ExpressionTransformer object itself. Could be implemented as a new ExpressionTransformer.modules attribute.
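The failure can be reproduced without sklearn2pmml. The following is a minimal sketch, assuming only what the traceback shows (the expression is passed to `eval(expr, env)` with an explicit environment); the `evaluate` function and the `env_*` dictionaries are illustrative names, not the library's code. It shows that an `import re` in the calling script does not reach an environment that is built separately:

```python
import re  # imported by the caller, like the `import re` in the script above


def evaluate(expr, env):
    """Evaluate an expression string against an explicit globals dict."""
    return eval(expr, env)


row = ["Hello World", "Hello"]

# Environment without `re`: the caller's own import is not visible in here.
env_without_re = {"X": row}
try:
    evaluate("re.search(X[1], X[0])", env_without_re)
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

# Environment with `re` added explicitly: the same expression now evaluates.
env_with_re = {"X": row, "re": re}
match = evaluate("re.search(X[1], X[0])", env_with_re)
found = match is not None
```

Here `error_name` ends up as `"NameError"` and `found` as `True`, which is why the module has to be added to the evaluation environment itself.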

woodly0 · 2024-05-27T11:15:16Z

Maybe RegEx is overkill for what I am trying to do. It would be enough to have some sort of True if "bc" in "abcd" else False logic. However, I don't know how that works behind the scenes.

vruusmann · 2024-05-27T11:33:09Z

> Maybe RegEx is overkill for what I am trying to do.

RegEx is definitely expensive, but it is the only tool for custom string manipulation in PMML.

> It would be enough to have some sort of True if "bc" in "abcd" else False logic.

There is no built-in method for "find index of substring in string" in PMML (analogous to Python's find() method). It's RegEx or bust!
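On the Python side only, the substring test can be written as a RegEx search. This is a minimal sketch: the helper name `contains` is made up for illustration, and `re.escape()` is used on the assumption that the name should be matched literally, even if it contains RegEx metacharacters such as `.` or `+`. Whether an expression using `re.escape()` converts to PMML is not covered here.

```python
import re


def contains(haystack, needle):
    """Return True if `needle` occurs literally anywhere in `haystack`."""
    # re.escape() neutralises RegEx metacharacters in the needle.
    return re.search(re.escape(needle), haystack) is not None


# The "bc" in "abcd" case from the question.
simple_hit = contains("abcd", "bc")

# Name-in-email style checks (made-up sample values).
name_hit = contains("john.doe@example.com", "john")
name_miss = contains("jdoe@example.com", "john")

# Without escaping, "+" would be a RegEx quantifier, not a literal character.
literal_plus = contains("a+b", "+")
```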

Anyway, if string (pre-)processing functionality is available in the form of standalone Python and Java libraries, then it's possible to use the UDF approach - the JPMML-Evaluator library can invoke a 3rd party Java library function.

However, the UDF approach sacrifices portability, so it is best avoided if RegExes will do.

woodly0 · 2024-05-27T11:48:42Z

OK, I see.

> Unfortunately, there is no way to override/customize the module list from the ExpressionTransformer object itself. Could be implemented as a new ExpressionTransformer.modules attribute.

Sounds good but how would I achieve this?

vruusmann · 2024-05-27T11:57:31Z

> Could be implemented as a new ExpressionTransformer.modules attribute.
>
> Sounds good but how would I achieve this?

See https://github.com/jpmml/sklearn2pmml/blob/0.108.0/sklearn2pmml/preprocessing/__init__.py#L250

Replace this:

expr_func = to_expr_func(self.expr)

With this:

expr_func = to_expr_func(self.expr, modules = ["math", "re"])

The long-term solution is to make the list of modules easily customizable via an ExpressionTransformer constructor parameter.
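To illustrate the idea of a customizable module list, here is a standalone sketch. `make_expr_func` is a hypothetical helper written for this example, not the library's `to_expr_func` and not the eventual constructor parameter; it only shows how a list of module names could be resolved into an evaluation environment.

```python
import importlib


def make_expr_func(expr, modules=("math",)):
    """Hypothetical helper: evaluate `expr` with the named modules in scope."""
    code = compile(expr, "<string>", "eval")
    # Resolve each module name once, up front.
    base_env = {name: importlib.import_module(name) for name in modules}

    def evaluate(X):
        env = dict(base_env)  # fresh copy per call, so calls do not share state
        env["X"] = X
        return eval(code, env)

    return evaluate


# Same sample rows as the "sentence"/"word" data frame above.
rows = [["Hello World", "Hello"], ["One Two Three", "Zero"], ["alpha omega", "beta"]]

expr_func = make_expr_func("re.search(X[1], X[0]) is not None", modules=["math", "re"])
results = [expr_func(row) for row in rows]
```

With `"re"` in the module list the expression evaluates for every row; leaving it out brings back the NameError from the traceback.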

vruusmann · 2024-05-27T12:17:05Z

For simple re.search() functionality you can use the sklearn2pmml.preprocessing.MatchesTransformer class: https://github.com/jpmml/sklearn2pmml/blob/0.108.0/sklearn2pmml/preprocessing/__init__.py#L583-L598

It doesn't suffer from missing imports, because it has everything hard-coded.

vruusmann changed the title ~~RegEx Expression based on multiple input columns~~ RegEx expressions are un-evaluatable due to missing (pc)re imports May 27, 2024

vruusmann closed this as completed in 6a7997c Jun 19, 2024
