Can I keep the category dtype in a pa.Category? #740

a-berg · 2022-01-18T13:02:46Z

Hello, I am trying to build a schema for a dataframe that has one string categorical column, two int64 binary columns and four int64 categorical columns.
My use case is to use information from the schema to build a data preprocessor: the string column will get a vocabulary lookup, binary columns won't be preprocessed, integer categorical columns will be one-hot encoded, and the rest (numeric integer and float variables) will just be normalized. However, when I inspect a categorical column in the schema, it only reports the type as category:

>>> pa_schema.columns['sex']
<Schema Column(name=sex, type=category)>

So I have lost the dtype of that column's categories, which is int64. In pandas I can get not only that dtype but also how many values the category has (two here, which makes it binary):

>>> print(len(df_cat['sex'].cat.categories.values), df_cat['sex'].cat.categories.dtype)
2 int64
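
For reference, here is a self-contained version of the pandas check above. The contents of df_cat are made up for illustration; only the column name sex comes from the question.

```python
import pandas as pd

# Hypothetical sample data: an integer-coded binary column stored as a categorical.
df_cat = pd.DataFrame({"sex": pd.Categorical([0, 1, 1, 0])})

# The categories live in an Index, which carries its own dtype.
categories = df_cat["sex"].cat.categories
n_categories = len(categories)
categories_dtype = categories.dtype

print(n_categories, categories_dtype)
```

The column itself reports its dtype as category; the int64 is only visible on the categories index.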

I know the primary objective of pandera is checking data frames, but schemas are also kind of a metadata representation of the dataset that can be used for other purposes, and I found this small detail makes it difficult to use it like that.

Any suggestions?

jeffzi · 2022-01-18T23:23:52Z

Collaborator

Hi @a-berg. Interesting use case!

The datatype layer was refactored in pandera 0.7.0. After the refactor, the native pandas dtype is stored in the .type attribute of the pandera.DataType object.

Here is how you can get the dtype of the categories:

import pandera as pa
import pandera.typing as P
import pandas as pd


class SexSchema(pa.SchemaModel):
    sex: P.Series[pd.CategoricalDtype] = pa.Field(dtype_kwargs={"categories": [0, 1]})


sex = SexSchema.to_schema().columns["sex"]
print(sex)
#> <Schema Column(name=sex, type=DataType(category))>

# Retrieve the pandera DataType of the column
sex_column = SexSchema.to_schema().columns["sex"]
pandera_categorical = sex_column.dtype
print(repr(pandera_categorical))
#> DataType(category)
print(pandera_categorical.categories)
#> (0, 1)
print(pandera_categorical.type)
#> category
print(pandera_categorical.type.categories) # should append .dtype to get int64
#> Int64Index([0, 1], dtype='int64')

I suspect you are using a version older than 0.7.0, given the format of your output.

Let me know if you have other questions. The datatype layer is relatively new and we can always improve it if we discover new usages.
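
To connect this back to the preprocessing use case, here is a rough sketch of how the native pandas dtype could drive the choice of preprocessing step. It uses only pandas, and assumes you have already obtained a pd.CategoricalDtype (for example from the .type attribute mentioned above). The function name pick_preprocessing and the returned labels are hypothetical, not part of pandera.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def pick_preprocessing(cat_dtype: pd.CategoricalDtype) -> str:
    """Map a categorical dtype to a (hypothetical) preprocessing label."""
    categories = cat_dtype.categories
    if categories is None:
        # A CategoricalDtype created without explicit categories has nothing to inspect.
        raise ValueError("categories are not defined on this dtype")
    if not is_integer_dtype(categories.dtype):
        # Non-integer (e.g. string) categories: vocabulary lookup.
        return "vocabulary_lookup"
    if len(categories) == 2:
        # Integer categories with exactly two values: treat as binary, no preprocessing.
        return "passthrough"
    # Any other integer categorical: one-hot encode.
    return "one_hot"


print(pick_preprocessing(pd.CategoricalDtype(categories=[0, 1])))
print(pick_preprocessing(pd.CategoricalDtype(categories=[0, 1, 2, 3])))
print(pick_preprocessing(pd.CategoricalDtype(categories=["a", "b", "c"])))
```

The branching mirrors the rules described in the question. Adjust the binary test if a two-valued integer column should still be one-hot encoded in your pipeline.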
