Hi @a-berg. Interesting use case! The datatype layer was refactored in pandera 0.7.0. After the refactor, the native pandas dtype is stored on the pandera DataType (the `type` attribute used below). Here is how you can get the dtype of the categories:

    import pandera as pa
    import pandera.typing as P
    import pandas as pd

    class SexSchema(pa.SchemaModel):
        sex: P.Series[pd.CategoricalDtype] = pa.Field(dtype_kwargs={"categories": [0, 1]})

    sex = SexSchema.to_schema().columns["sex"]
    print(sex)
    #> <Schema Column(name=sex, type=DataType(category))>

    # Retrieve the pandera DataType of the column
    sex_column = SexSchema.to_schema().columns["sex"]
    pandera_categorical = sex_column.dtype
    print(repr(pandera_categorical))
    #> DataType(category)
    print(pandera_categorical.categories)
    #> (0, 1)
    print(pandera_categorical.type)
    #> category
    print(pandera_categorical.type.categories)  # append .dtype to get int64
    #> Int64Index([0, 1], dtype='int64')

I suspect you are using a version lower than 0.7.0, given the format of your output. Let me know if you have other questions. The datatype layer is relatively new, and we can always improve it if we discover new use cases.
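As a rough sketch of how this could feed the preprocessor described in the question, the native pandas dtype (for example, what `pandera_categorical.type` returns) might be mapped to a preprocessing step. The helper `choose_preprocessing` and its string labels are hypothetical and not part of pandera or pandas; only the pandas calls are real:

```python
import pandas as pd


def choose_preprocessing(dtype) -> str:
    """Hypothetical helper: map a native pandas dtype to a preprocessing label."""
    if isinstance(dtype, pd.CategoricalDtype):
        categories = dtype.categories
        if categories is None:
            raise ValueError("categorical dtype has no declared categories")
        if pd.api.types.is_integer_dtype(categories.dtype):
            # Two integer categories are treated as binary and left untouched;
            # any other count of integer categories is one-hot encoded.
            return "passthrough" if len(categories) == 2 else "one_hot"
        # Non-integer (e.g. string) categories get a vocabulary lookup.
        return "vocabulary_lookup"
    if pd.api.types.is_numeric_dtype(dtype):
        # Plain integer and float columns are normalized.
        return "normalize"
    return "passthrough"


print(choose_preprocessing(pd.CategoricalDtype([0, 1])))
print(choose_preprocessing(pd.CategoricalDtype([0, 1, 2, 3])))
print(choose_preprocessing(pd.CategoricalDtype(["a", "b"])))
print(choose_preprocessing(pd.Series([0.5, 1.5]).dtype))
```

Treating exactly two integer categories as binary is an assumption taken from the question; adjust the rule if your binary columns are encoded differently.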
-
Hello, I am trying to build a schema for a dataframe that has one string categorical column, 2 `int64` binary columns and 4 `int64` categorical columns.
Essentially, my use case is to use information from the schema to build a data preprocessor: the string column will get a vocabulary lookup, binary columns won't be preprocessed, integer categorical columns will be one-hot encoded, and the rest (numeric integer and float variables) will just be normalized. E.g.:
So basically I lost information on the dtype of that column, which is "int64". In pandas I can get not only the dtype but also how many values this category has (making it binary):
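A minimal sketch of that kind of pandas lookup, assuming a hypothetical series named `sex` with integer categories 0 and 1:

```python
import pandas as pd

# Hypothetical binary column stored as an integer categorical.
sex = pd.Series([0, 1, 1, 0], dtype=pd.CategoricalDtype(categories=[0, 1]))

categories = sex.dtype.categories
category_dtype = categories.dtype  # dtype of the category values themselves
n_categories = len(categories)     # two categories means the column is binary

print(category_dtype)
#> int64
print(n_categories)
#> 2
```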
I know the primary objective of pandera is validating data frames, but schemas are also a kind of metadata representation of the dataset that can be used for other purposes, and this small detail makes it difficult to use them that way.
Any suggestions?