add tolerance: float argument in Check class #183
Comments
Leaving this without elaborating on implementation details for now; I'm still not sure it's really needed.
@sfczekalski let's discuss implementation details a little further here. The big question is how to implement this option at multiple levels of the API:
# built-in checks
pa.Check.lt(0, tolerance=0.1)
# custom checks
pa.Check(lambda x: x < 0, tolerance=0.1)
# this should raise a SchemaDefinitionError
pa.Check(lambda x: x.mean() < 0, tolerance=0.1)
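One way to read the three cases above, as a minimal sketch and not pandera's implementation: assume `tolerance` is the fraction of elements that may fail an element-wise check. The helper name `passes_with_tolerance` is hypothetical, and a plain `TypeError` stands in for the `SchemaDefinitionError` mentioned above, since a check that returns a single boolean has no per-element failure rate to compare against.

```python
import pandas as pd


def passes_with_tolerance(check_fn, series, tolerance=0.0):
    """Hypothetical helper: True if the share of failing elements is <= tolerance."""
    result = check_fn(series)
    # An aggregate check such as `x.mean() < 0` yields one boolean, so a
    # failure fraction is undefined; reject the combination outright.
    if not isinstance(result, pd.Series):
        raise TypeError("tolerance requires an element-wise check")
    # Fraction of elements for which the check is False.
    failure_rate = (~result.astype(bool)).mean()
    return bool(failure_rate <= tolerance)


values = pd.Series([-1, -2, -3, -4, -5, -6, -7, -8, -9, 1])
# One element in ten fails `x < 0`.
strict = passes_with_tolerance(lambda x: x < 0, values)
lenient = passes_with_tolerance(lambda x: x < 0, values, tolerance=0.1)
```

Under this assumption `strict` is `False` and `lenient` is `True`, while passing `lambda x: x.mean() < 0` raises.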
@cosmicBboy now I started wondering, if you mean that the |
So for #390 the dataframe-level unique option would behave like this:
# case 1: the combination of these three columns must be unique
pa.DataFrameSchema(unique=["col1", "col2", "col3"])
# case 2: different sets of column combinations must be unique
pa.DataFrameSchema(
unique=[
["col1", "col2", "col3"],
["col4", "col5"]
]
)
And to support a tolerance float there are several options:
pa.DataFrameSchema(
unique={
("col1", "col2", "col3"): 0.0,
("col4", "col5"): 0.1,
}
)
pa.DataFrameSchema(
unique=[
(["col1", "col2", "col3"], 0.0),
(["col4", "col5"], 0.1),
]
)
pa.DataFrameSchema(
unique=[
{"columns": [...], "tolerance": ...},
{"columns": [...], "tolerance": ...}
]
)
My preference would be for (1), the dict keyed by column tuples, though I can see the merits of (3), the list of dicts, as well.
I'm not sure what you mean by allowing the tolerance fraction to be duplicated. Can you elaborate on that?
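To make option (1) concrete, here is a rough sketch under one possible reading that the thread does not settle: the tolerance is the largest fraction of rows allowed to share their values with another row for a given column combination. `unique_within_tolerance` is a hypothetical helper built on pandas' `DataFrame.duplicated`, not part of pandera.

```python
import pandas as pd


def unique_within_tolerance(df, unique):
    """Hypothetical helper: map each column combination to a pass/fail flag."""
    outcome = {}
    for columns, tolerance in unique.items():
        # keep=False flags every row that shares its values with another row.
        duplicated = df.duplicated(subset=list(columns), keep=False)
        outcome[columns] = bool(duplicated.mean() <= tolerance)
    return outcome


df = pd.DataFrame(
    {
        "col1": [1, 1, 2, 3],
        "col2": ["a", "a", "b", "c"],
        "col3": [10, 20, 30, 40],
    }
)
report = unique_within_tolerance(
    df,
    {
        ("col1", "col2"): 0.0,  # two of four rows collide, none allowed
        ("col1", "col3"): 0.0,  # every combination is distinct
        ("col2",): 0.5,         # half of the rows may be duplicated
    },
)
```

With this data the first combination fails and the other two pass. Counting only the repeated occurrences (`keep="first"`) would be an equally defensible definition and would halve the measured fraction here.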
I'd also opt for the first option as it's the most concise.
Yes, let me give an example:
import pandas as pd
import pandera as pa

df = pd.DataFrame.from_dict(
{
"col1": [1, 2, 3],
"col2": [4, 5, 6],
"col3": ["a", "b", "c"],
"col4": ["d", "e", "f"],
}
)
schema = pa.DataFrameSchema(
unique={
("col1", "col2"): 1.0,
("col3", "col4"): 1.0,
}
)
The DataFrame has unique values in columns
you mean yes, this is exactly what I'm thinking. I think we can scope this particular issue to cover:
Once #390 is done we can make another issue to tackle the
PR would be very welcome! 🙏
tolerance should be a float between 0 and 1 and allow users to express the fact that a check only needs to be true some percentage of the time.
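The range constraint could be enforced when the check is defined. The following is only a sketch: `validate_tolerance` is a hypothetical name, and raising `ValueError` is an assumption, not something the issue specifies.

```python
def validate_tolerance(tolerance):
    """Hypothetical validator: return tolerance as a float or raise ValueError."""
    # bool is a subclass of int, so reject it explicitly along with non-numbers.
    if isinstance(tolerance, bool) or not isinstance(tolerance, (int, float)):
        raise ValueError(f"tolerance must be a number, got {type(tolerance).__name__}")
    # NaN fails this comparison as well, so it is rejected here too.
    if not 0.0 <= tolerance <= 1.0:
        raise ValueError(f"tolerance must be between 0 and 1, got {tolerance}")
    return float(tolerance)


accepted = [validate_tolerance(t) for t in (0, 0.1, 1)]
```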