
PMML pipeline not working as expected after version upgrade #434

Closed
mzeres opened this issue Sep 19, 2024 · 6 comments

@mzeres

mzeres commented Sep 19, 2024

Hello,

After upgrading to the latest version (0.110.0), my pipeline isn't working as expected anymore. The version that I was using previously, which was working fine, was 0.95.1.

The situation is as follows. I am creating a pipeline to prepare the data, in order to later train a classification model on it. The relevant part here is:

  dataprep = ColumnTransformer(
      [....
          (
              "cleaned_name",
              get_name_transformer(),
              "name",
          ),
      ....],
  )

The following worked fine under version 0.95.1:

from sklearn.pipeline import make_pipeline

def get_name_transformer() -> Pipeline:
    return make_pipeline(
        StringNormalizer("lower"),
        ReplaceTransformer(...),
        ReplaceTransformer(...),
        DataFrameConstructor(["cleaned_name"], str),
        make_column_transformer(
            (
                CountVectorizer(lowercase=False),
                "cleaned_name",
            ),
        ),
    )

Under version 0.110.0, this doesn't work anymore. I get the following error message:

TypeError                                 Traceback (most recent call last)
....
   [1034](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1034) """
   [1035](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1035) a_arr = numpy.asarray(a)
-> [1036](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1036) return _vec_string(a_arr, a_arr.dtype, 'lower')

TypeError: string operation on non-string array

It seems like the StringNormalizer can no longer properly handle the dataframe column input, since the dtype stays object because the string column contains None values. A rather ugly solution that seems to be working for me is as follows, where I'm making use of DataFrameMapper from sklearn-pandas:

def get_name_transformer() -> Pipeline:
    return make_pipeline(
        DataFrameMapper(
            [
                (
                    "name",
                    [
                        StringNormalizer("lower"),
                        ReplaceTransformer(...),
                        ReplaceTransformer(...),
                    ],
                ),
            ],
        ),
        SeriesConstructor("cleaned_name", str),
        CountVectorizer(lowercase=False),
    )

Although this seems to be working, I would rather have cleaner code, without the dependency on the DataFrameMapper. Do you have any suggestions on how to improve this? Thanks!

@vruusmann
Member

After upgrading to the latest version (0.110.0), my pipeline isn't working as expected anymore. The version that I was using previously, which was working fine, was 0.95.1.

Your issue's description matches changes that happened in the 0.103.2 version (pay attention to "Breaking changes"):
https://github.com/jpmml/sklearn2pmml/blob/master/NEWS.md#01032

So, for starters, you can upgrade to the 0.103.1 version.

It seems like the StringNormalizer can no longer properly handle the dataframe column input, since the dtype stays object because the string column contains None values.

I would expect the StringNormalizer transformer to support both numpy.ndarray and pandas.Series input, especially in 0.103.2 and newer versions. If it doesn't, then it's a bug that will be fixed.

def get_name_transformer() -> Pipeline

In your final pipeline, why do you use DataFrameMapper and SeriesConstructor steps at all? They shouldn't be needed, as all the other steps (i.e. the StringNormalizer, ReplaceTransformer and CountVectorizer transformers) should support numpy.ndarray input.

The goal (of fixing this issue) should be to make the above statement hold true. That is, there should be no need for an explicit Numpy-to-Pandas data container conversion operation.
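
For example, a sketch of what get_name_transformer() should eventually look like (the ReplaceTransformer arguments are elided here, exactly as in your snippets above):

def get_name_transformer():
    # No DataFrameMapper, SeriesConstructor or other conversion steps needed;
    # each transformer consumes the Numpy array produced by the previous one.
    return make_pipeline(
        StringNormalizer("lower"),
        ReplaceTransformer(...),  # elided, as in the original snippet
        ReplaceTransformer(...),  # elided, as in the original snippet
        CountVectorizer(lowercase=False),
    )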

@vruusmann
Member

In your final pipeline, why do you use DataFrameMapper and SeriesConstructor steps at all?

In other words, what's the Python error if you omit these meta-transformers? Is there something wrong in the interaction between the (last) ReplaceTransformer and CountVectorizer steps?

If it's anything related to the size/shape of Numpy arrays, then you can address that using the sklearn2pmml.util.Reshaper transformer.
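
For example, a sketch along these lines, assuming the mismatch is a 2-D (n, 1) column going into CountVectorizer (which expects a flat sequence of documents), and assuming Reshaper takes the target shape as its constructor argument, like numpy.reshape:

from sklearn2pmml.util import Reshaper

# Hypothetical placement: flatten the (n, 1) output of the last ReplaceTransformer
# into a 1-D array of strings right before the CountVectorizer step
reshaper = Reshaper((-1, ))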

@mzeres
Author

mzeres commented Sep 22, 2024

Hi vruusmann, thanks for your quick responses!

I would expect the StringNormalizer transformer to support both numpy.ndarray and pandas.Series input, especially in 0.103.2 and newer versions. If it doesn't, then it's a bug that will be fixed.

On my end, it does indeed seem that the changes in 0.103.2 caused the problems with my pipeline. I see that, before version 0.103.2, the transform function of the StringNormalizer contained the following line: X = to_numpy(X). When I add this conversion to numpy explicitly in my pipeline, the DataFrameMapper isn't necessary anymore and the following pipeline works:

def _transform_to_numpy(X):
    return X.to_numpy()

def get_name_transformer():
    return make_pipeline(
        FunctionTransformer(_transform_to_numpy),
        StringNormalizer("lower"),
        ReplaceTransformer(...),
        ReplaceTransformer(...),
        SeriesConstructor("cleaned_name", str),
        CountVectorizer(lowercase=False),
    )

However, the problem here is that inclusion of this FunctionTransformer over _transform_to_numpy doesn't allow for PMML conversion anymore. In case I don't include the _transform_to_numpy in my pipeline, I get the error as mentioned before:

TypeError                                 Traceback (most recent call last)
....
   [1034](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1034) """
   [1035](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1035) a_arr = numpy.asarray(a)
-> [1036](.../lib/python3.10/site-packages/numpy/core/defchararray.py:1036) return _vec_string(a_arr, a_arr.dtype, 'lower')

TypeError: string operation on non-string array

Something seems to go wrong in the conversion from Series to numpy within the transformer.

In your final pipeline, why do you use DataFrameMapper and SeriesConstructor steps at all? They shouldn't be needed, as all the other steps (i.e. the StringNormalizer, ReplaceTransformer and CountVectorizer transformers) should support numpy.ndarray input.

The DataFrameMapper was just an (ugly) fix to at least get my pipeline working again, without any clear thought behind it; that is also why I would rather have it removed from my pipeline.

Ah, nice find! You are indeed correct about the SeriesConstructor: it is not necessary, and removing it from the pipeline doesn't affect the results.

@vruusmann
Member

vruusmann commented Sep 22, 2024

However, the problem here is that inclusion of this FunctionTransformer over _transform_to_numpy doesn't allow for PMML conversion anymore

The simplest way to perform "from Pandas to Numpy" data container conversion is using sklearn.compose.ColumnTransformer. Simply create a passthrough transformer, and set its output to default:

numpyfier = ColumnTransformer([], remainder = "passthrough")
# THIS!
numpyfier.set_output(transform = "default")
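
For example (hypothetical input data, just to show the conversion):

import pandas

df = pandas.DataFrame({"name": ["Foo", None, "Bar"]})
Xt = numpyfier.fit_transform(df)
print(type(Xt))  # <class 'numpy.ndarray'>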

In case I don't include the _transform_to_numpy in my pipeline, I get the error as mentioned before:

Looks like StringNormalizer doesn't support pandas.Series input?

It calls Numpy string utility functions, without first verifying that the X argument is a Numpy array:
https://github.com/jpmml/sklearn2pmml/blob/0.110.0/sklearn2pmml/preprocessing/__init__.py#L631-L645
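
A quick way to see this is to call the underlying Numpy function directly (hypothetical values):

import numpy
import pandas

names = ["Foo", "Bar"]
numpy.char.lower(numpy.asarray(names))   # works: the array has a Unicode string dtype
numpy.char.lower(pandas.Series(names))   # TypeError: string operation on non-string array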

@mzeres
Author

mzeres commented Sep 23, 2024

The simplest way to perform "from Pandas to Numpy" data container conversion is using sklearn.compose.ColumnTransformer. Simply create a passthrough transformer, and set its output to default:

Ah alright, that makes sense. I tried to make it work but I think I'm doing something wrong. I tried the following:

dataprep = ColumnTransformer(
    [....
        (
            "cleaned_name",
            get_name_transformer(),
            "name",
        ),
    ....],
)

where

def get_name_transformer():
    numpyfier = ColumnTransformer([], remainder = "passthrough")
    numpyfier.set_output(transform = "default")

    return make_pipeline(
        numpyfier,
        StringNormalizer("lower"),
        ReplaceTransformer(...),
        ReplaceTransformer(...),
        CountVectorizer(lowercase=False),
    )

In that case I get the following error message while calling dataprep.fit_transform(X): AttributeError: 'ColumnTransformer' object has no attribute 'n_features_in_'.

Am I using your suggestion in the wrong way?

@mzeres
Author

mzeres commented Sep 23, 2024

It seems like I have found a solution by adding brackets around the input column. So, I replace:

dataprep = ColumnTransformer(
    [....
        (
            "cleaned_name",
            get_name_transformer(),
            "name",
        ),
    ....],
)

By:

dataprep = ColumnTransformer(
    [....
        (
            "cleaned_name",
            get_name_transformer(),
            ["name"],
        ),
    ....],
)
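
If I understand the scikit-learn docs correctly, the reason is that a plain string column selector makes ColumnTransformer pass a 1-D pandas.Series to the inner transformer, whereas a one-element list makes it pass a 2-D pandas.DataFrame, which the nested ColumnTransformer inside get_name_transformer() needs:

import pandas

df = pandas.DataFrame({"name": ["Foo", "Bar"]})

# selector "name"   -> the inner transformer receives df["name"]    (Series, shape (2,))
# selector ["name"] -> the inner transformer receives df[["name"]]  (DataFrame, shape (2, 1))
print(df["name"].shape, df[["name"]].shape)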
