
PMML pipeline not working as expected after version upgrade #434

Closed
mzeres opened this issue Sep 19, 2024 · 6 comments

@mzeres

mzeres commented Sep 19, 2024

Hello,

After upgrading to the latest version (0.110.0), my pipeline isn't working as expected anymore. The version that I was using previously, which was working fine, was 0.95.1.

The situation is as follows. I am creating a pipeline to prepare the data, in order to later train a classification model on it. The relevant part here is:

  dataprep = ColumnTransformer(
      [....
          (
              "cleaned_name",
              get_name_transformer(),
              "name",
          ),
      ....],
  )

The following worked fine under version 0.95.1:

from sklearn.pipeline import make_pipeline

def get_name_transformer() -> Pipeline:
    return make_pipeline(
        StringNormalizer("lower"),
        ReplaceTransformer(...),
        ReplaceTransformer(...),
        DataFrameConstructor(["cleaned_name"], str),
        make_column_transformer(
            (
                CountVectorizer(lowercase=False),
                "cleaned_name",
            ),
        ),
    )

Under version 0.110.0, this doesn't work anymore. I get the following error message:

TypeError                                 Traceback (most recent call last)
....
   [1034](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1034) """
   [1035](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1035) a_arr = numpy.asarray(a)
-> [1036](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1036) return _vec_string(a_arr, a_arr.dtype, 'lower')

TypeError: string operation on non-string array

It seems like the StringNormalizer can no longer properly handle the dataframe column input, since the dtype stays object because the string column contains None values. A rather ugly solution that seems to be working for me is as follows, where I'm making use of DataFrameMapper from sklearn-pandas:

def get_name_transformer() -> Pipeline:
    return make_pipeline(
        DataFrameMapper(
            [
                (
                    "name",
                    [
                        StringNormalizer("lower"),
                        ReplaceTransformer(...),
                        ReplaceTransformer(...),
                    ],
                ),
            ],
        ),
        SeriesConstructor("cleaned_name", str),
        CountVectorizer(lowercase=False),
    )

Although this seems to be working, I would rather have cleaner code, without the dependency on the DataFrameMapper. Do you have any suggestions on how to improve this? Thanks!

@vruusmann
Member

After upgrading to the latest version (0.110.0), my pipeline isn't working as expected anymore. The version that I was using previously, which was working fine, was 0.95.1.

Your issue's description matches changes that happened in the 0.103.2 version (pay attention to "Breaking changes"):
https://github.com/jpmml/sklearn2pmml/blob/master/NEWS.md#01032

So, for starters, you can upgrade to the 0.103.1 version.

It seems like the StringNormalizer can no longer properly handle the dataframe column input, since the dtype stays object because the string column contains None values.

I would expect the StringNormalizer transformer to support both numpy.ndarray and pandas.Series input, especially in 0.103.2 and newer versions. If it doesn't, then it's a bug that will be fixed.

def get_name_transformer() -> Pipeline

In your final pipeline, why do you use DataFrameMapper and SeriesConstructor steps at all? They shouldn't be needed, as all the other steps (i.e. the StringNormalizer, ReplaceTransformer and CountVectorizer transformers) should support numpy.ndarray input.

The goal (of fixing this issue) should be to make the above statement hold true. That is, there should be no need for an explicit Numpy-to-Pandas data container conversion operation.
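
For example, a sketch of what get_name_transformer() should eventually look like (the ReplaceTransformer arguments are elided here, exactly as in your snippets above):

def get_name_transformer():
    # No DataFrameMapper, SeriesConstructor or other conversion steps needed;
    # each transformer consumes the Numpy array produced by the previous one.
    return make_pipeline(
        StringNormalizer("lower"),
        ReplaceTransformer(...),  # elided, as in the original snippet
        ReplaceTransformer(...),  # elided, as in the original snippet
        CountVectorizer(lowercase=False),
    )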

@vruusmann
Member

In your final pipeline, why do you use DataFrameMapper and SeriesConstructor steps at all?

In other words, what's the Python error if you omit these meta-transformers? Is there something wrong in the interaction between the (last) ReplaceTransformer and CountVectorizer steps?

If it's anything related to the size/shape of Numpy arrays, then you can address that using the sklearn2pmml.util.Reshaper transformer.
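
For example, a sketch along these lines, assuming the mismatch is a 2-D (n, 1) column going into CountVectorizer (which expects a flat sequence of documents), and assuming Reshaper takes the target shape as its constructor argument, like numpy.reshape:

from sklearn2pmml.util import Reshaper

# Hypothetical placement: flatten the (n, 1) output of the last ReplaceTransformer
# into a 1-D array of strings right before the CountVectorizer step
reshaper = Reshaper((-1, ))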

@mzeres
Author

mzeres commented Sep 22, 2024

Hi vruusmann, thanks for your quick responses!

I would expect the StringNormalizer transformer to support both numpy.ndarray and pandas.Series input, especially in 0.103.2 and newer versions. If it doesn't, then it's a bug that will be fixed.

On my end, it does indeed seem that the changes in 0.103.2 caused the problems with my pipeline. I see that, before version 0.103.2, the transform function of the StringNormalizer contained the following line: X = to_numpy(X). When I add this conversion to numpy explicitly in my pipeline, the DataFrameMapper isn't necessary anymore and the following pipeline works:

def _transform_to_numpy(X):
    return X.to_numpy()

def get_name_transformer():
    return make_pipeline(
        FunctionTransformer(_transform_to_numpy),
        StringNormalizer("lower"),
        ReplaceTransformer(...),
        ReplaceTransformer(...),
        SeriesConstructor("cleaned_name", str),
        CountVectorizer(lowercase=False),
    )

However, the problem here is that inclusion of this FunctionTransformer over _transform_to_numpy doesn't allow for PMML conversion anymore. In case I don't include the _transform_to_numpy in my pipeline, I get the error as mentioned before:

TypeError                                 Traceback (most recent call last)
....
   [1034](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1034) """
   [1035](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1035) a_arr = numpy.asarray(a)
-> [1036](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1036) return _vec_string(a_arr, a_arr.dtype, 'lower')

TypeError: string operation on non-string array

Something seems to go wrong in the conversion from Series to numpy within the transformer.

In your final pipeline, why do you use DataFrameMapper and SeriesConstructor steps at all? They shouldn't be needed, as all the other steps (i.e. the StringNormalizer, ReplaceTransformer and CountVectorizer transformers) should support numpy.ndarray input.

The DataFrameMapper was just an (ugly) fix to at least get my pipeline working again, without any clear thought behind it; that is also why I would rather have it removed from my pipeline.

Ah, nice find! You are indeed correct about the SeriesConstructor: it is not necessary, and removing it from the pipeline doesn't affect the results.

@vruusmann
Member

vruusmann commented Sep 22, 2024

However, the problem here is that inclusion of this FunctionTransformer over _transform_to_numpy doesn't allow for PMML conversion anymore

The simplest way to perform "from Pandas to Numpy" data container conversion is using sklearn.compose.ColumnTransformer. Simply create a passthrough transformer, and set its output to default:

numpyfier = ColumnTransformer([], remainder = "passthrough")
# THIS!
numpyfier.set_output(transform = "default")
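
For example (hypothetical input data, just to show the conversion):

import pandas

df = pandas.DataFrame({"name": ["Foo", None, "Bar"]})
Xt = numpyfier.fit_transform(df)
print(type(Xt))  # <class 'numpy.ndarray'>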

In case I don't include the _transform_to_numpy in my pipeline, I get the error as mentioned before:

Looks like StringNormalizer doesn't support pandas.Series input?

It calls Numpy string utility functions, without first verifying that the X argument is a Numpy array:
https://github.com/jpmml/sklearn2pmml/blob/0.110.0/sklearn2pmml/preprocessing/__init__.py#L631-L645
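
A quick way to see this is to call the underlying Numpy function directly (hypothetical values):

import numpy
import pandas

names = ["Foo", "Bar"]
numpy.char.lower(numpy.asarray(names))   # works: the array has a Unicode string dtype
numpy.char.lower(pandas.Series(names))   # TypeError: string operation on non-string array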

@mzeres
Author

mzeres commented Sep 23, 2024

The simplest way to perform "from Pandas to Numpy" data container conversion is using sklearn.compose.ColumnTransformer. Simply create a passthrough transformer, and set its output to default:

Ah alright, that makes sense. I tried to make it work but I think I'm doing something wrong. I tried the following:

dataprep = ColumnTransformer(
    [....
        (
            "cleaned_name",
            get_name_transformer(),
            "name",
        ),
    ....],
)

where

def get_name_transformer():
    numpyfier = ColumnTransformer([], remainder = "passthrough")
    numpyfier.set_output(transform = "default")

    return make_pipeline(
        numpyfier,
        StringNormalizer("lower"),
        ReplaceTransformer(...),
        ReplaceTransformer(...),
        CountVectorizer(lowercase=False),
    )

In that case I get the following error message while calling dataprep.fit_transform(X): AttributeError: 'ColumnTransformer' object has no attribute 'n_features_in_'.

Am I using your suggestion in the wrong way?

@mzeres
Author

mzeres commented Sep 23, 2024

It seems like I have found a solution by adding brackets around the input column. So, I replace:

dataprep = ColumnTransformer(
    [....
        (
            "cleaned_name",
            get_name_transformer(),
            "name",
        ),
    ....],
)

By:

dataprep = ColumnTransformer(
    [....
        (
            "cleaned_name",
            get_name_transformer(),
            ["name"],
        ),
    ....],
)
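
If I understand the scikit-learn docs correctly, the reason is that a plain string column selector makes ColumnTransformer pass a 1-D pandas.Series to the inner transformer, whereas a one-element list makes it pass a 2-D pandas.DataFrame, which the nested ColumnTransformer inside get_name_transformer() needs:

import pandas

df = pandas.DataFrame({"name": ["Foo", "Bar"]})

# selector "name"   -> the inner transformer receives df["name"]    (Series, shape (2,))
# selector ["name"] -> the inner transformer receives df[["name"]]  (DataFrame, shape (2, 1))
print(df["name"].shape, df[["name"]].shape)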
