
add tolerance: float argument in Check class #183

Open
cosmicBboy opened this issue Mar 8, 2020 · 7 comments
Labels: enhancement (New feature or request), proposal

Comments

@cosmicBboy (Collaborator) commented Mar 8, 2020

tolerance should be a float between 0 and 1 that allows users to express that a check only needs to hold for some fraction of the values being validated.
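For concreteness, a minimal sketch of the intended semantics (the helper below is purely illustrative, not part of the pandera API): a check with tolerance=t would pass as long as the fraction of failing elements is at most t.

# illustrative sketch only -- passes_with_tolerance is a hypothetical helper
import pandas as pd

def passes_with_tolerance(check_output: pd.Series, tolerance: float) -> bool:
    """Pass if the fraction of failing elements is at most `tolerance`."""
    failure_rate = 1.0 - check_output.mean()  # mean of a boolean Series = pass rate
    return failure_rate <= tolerance

s = pd.Series([-1, -2, 3, -4, -5, -6, -7, -8, -9, -10])
passes_with_tolerance(s < 0, tolerance=0.1)   # True: 1 of 10 elements fails
passes_with_tolerance(s < 0, tolerance=0.05)  # False: failure rate 0.1 exceeds 0.05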

cosmicBboy added the enhancement label on Jul 5, 2020
@cosmicBboy (Collaborator, Author)

leaving this without elaboration on implementation details for now, still not sure if it's really needed

@cosmicBboy (Collaborator, Author) commented Apr 8, 2021

@sfczekalski let's discuss implementation details a little further here:

The big question is how to implement this option at the multiple levels of the API:

  1. Check: this makes sense for checks with an output that has the same shape as the original series/dataframe being validated
# built-in checks
pa.Check.lt(0, tolerance=0.1)

# custom checks
pa.Check(lambda x: x < 0, tolerance=0.1)

# this should raise a SchemaDefinitionError
pa.Check(lambda x: x.mean() < 0, tolerance=0.1)  
  2. Column: how to apply different tolerances to the first-class checks?

    • allow a float between 0 and 1 (inclusive) for the nullable and allow_duplicates options (and eventually the unique option from #390, "Add unique keyword option to all schemas and schema components", which will replace allow_duplicates). The semantics of the float input depend on the option (see the sketch after this list), e.g.
      • nullable=0.0 means no nulls are allowed, nullable=1.0 means all values are allowed to be null
      • allow_duplicates=0.0 means no duplicates are allowed, allow_duplicates=1.0 means all values are allowed to be duplicates
  3. DataFrame
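A minimal sketch of what those float semantics could mean in practice, assuming hypothetical helpers (not pandera API) that just compute the null/duplicate fractions:

# illustrative sketch only -- these helpers are hypothetical, not pandera API
import pandas as pd

def null_fraction_ok(series: pd.Series, nullable: float) -> bool:
    """nullable=0.0 -> no nulls allowed; nullable=1.0 -> all values may be null."""
    return series.isna().mean() <= nullable

def duplicate_fraction_ok(series: pd.Series, allow_duplicates: float) -> bool:
    """allow_duplicates=0.0 -> no duplicates; 1.0 -> everything may be duplicated."""
    return series.duplicated().mean() <= allow_duplicates

s = pd.Series([1.0, 2.0, 2.0, None])
null_fraction_ok(s, nullable=0.25)              # True: 1 of 4 values is null
duplicate_fraction_ok(s, allow_duplicates=0.0)  # False: 1 of 4 values is a duplicate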

@sfczekalski

@cosmicBboy

  1. Fair point, totally agree with the SchemaDefinitionError being raised in the third example.
  2. It makes sense to me!
  3. What about Dict[str, float] dictionaries with column names as keys and tolerance floats as values?

@sfczekalski

@cosmicBboy now I started wondering: do you mean that the unique option for DataFrameSchema, in the list-of-columns form, checks for uniqueness of all values in those columns combined? If so, then my point above of course doesn't make much sense. In such a situation it might make sense to allow a tolerance fraction of the values to be duplicated, no matter whether the duplicates are across the columns or only in one. What do you think?

@cosmicBboy (Collaborator, Author)

So for #390 the dataframe-level unique option would behave like this:

# case 1: combination of these three columns must be unique
pa.DataFrameSchema(unique=["col1", "col2", "col3"])

# case 2: different sets of column combinations must be unique
pa.DataFrameSchema(
    unique=[
        ["col1", "col2", "col3"],
        ["col4", "col5"]
    ]
)

And to support a tolerance float there are several options:

  1. A dict with tuple keys specifying columns and float values for tolerance
pa.DataFrameSchema(
    unique={
        ("col1", "col2", "col3"): 0.0,
        ("col4", "col5"): 0.1,
    }
)
  2. A list of 2-tuples where the first element is the column list and the second is the tolerance value
pa.DataFrameSchema(
    unique=[
        (["col1", "col2", "col3"], 0.0),
        (["col4", "col5"], 0.1),
    ]
)
  3. A list of dictionaries with columns and tolerance keys
pa.DataFrameSchema(
    unique=[
       {"columns": [...], "tolerance": ...},
       {"columns": [...], "tolerance": ...}
    ]
)

My preference would be for (1), though I can see the merits of (3) as well.
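To make (1) concrete, a rough sketch of how such a dict could be validated under the proposed semantics (check_unique_with_tolerance is a hypothetical helper, not the actual implementation):

# illustrative sketch only -- check_unique_with_tolerance is hypothetical
from typing import Dict, Tuple

import pandas as pd

def check_unique_with_tolerance(
    df: pd.DataFrame, unique: Dict[Tuple[str, ...], float]
) -> bool:
    """For each column combination, the fraction of duplicated rows must not
    exceed the associated tolerance."""
    return all(
        df.duplicated(subset=list(columns)).mean() <= tolerance
        for columns, tolerance in unique.items()
    )

df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": ["a", "a", "b"],
    "col3": [0.1, 0.1, 0.2],
    "col4": [1, 2, 3],
    "col5": [4, 5, 6],
})
# False: 1 of 3 rows duplicates another over (col1, col2, col3), exceeding tolerance 0.0
check_unique_with_tolerance(
    df, {("col1", "col2", "col3"): 0.0, ("col4", "col5"): 0.1}
)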

> In such a situation it might make sense to allow a tolerance fraction of the values to be duplicated, no matter whether the duplicates are across the columns or only in one. What do you think?

I'm not sure what you mean by allowing a tolerance fraction to be duplicated, can you elaborate on that?

@sfczekalski

> My preference would be for (1), though I can see the merits of (3) as well.

I'd also opt for the first option as it's the most concise.

> I'm not sure what you mean by allowing a tolerance fraction to be duplicated, can you elaborate on that?

Yes, let me give an example:

df = pd.DataFrame.from_dict(
  {
    "col1": [1, 2, 3],
    "col2": [4, 5, 6],
    "col3": ["a", "b", "c"],
    "col4": ["d", "e", "f"],
    "col5": ["g", "h", "i"],
  }
)

schema = pa.DataFrameSchema(
    unique={
        ("col1", "col2"): 1.0,
        ("col3", "col4", "col5"): 1.0,
    }
)

The DataFrame has unique values in columns col1 and col2 when concatenated; it also has unique values in columns col3, col4 when concatenated. Is this what you mean as well?

@cosmicBboy (Collaborator, Author)

> it also has unique values in columns col3, col4 when concatenated

you mean col3, col4, and col5 right?

Yes, this is exactly what I'm thinking. I think we can scope this particular issue to cover:

  • adding the tolerance argument to Check objects,
  • supporting a float in [0, 1] for the nullable and allow_duplicates keyword arguments.

Once #390 is done we can open another issue to tackle the unique tolerance implementation at the dataframe level.

PR would be very welcome! 🙏
