
add tolerance: float argument in Check class #183

Open
cosmicBboy opened this issue Mar 8, 2020 · 7 comments
Labels: enhancement (New feature or request), proposal

Comments

@cosmicBboy (Collaborator) commented Mar 8, 2020

tolerance should be a float between 0 and 1 that allows users to express that a check only needs to hold for some fraction of the values being validated.
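For concreteness, a minimal sketch of the intended semantics (the helper below is purely illustrative, not part of the pandera API): a check with tolerance=t would pass as long as the fraction of failing elements is at most t.

# illustrative sketch only -- passes_with_tolerance is a hypothetical helper
import pandas as pd

def passes_with_tolerance(check_output: pd.Series, tolerance: float) -> bool:
    """Pass if the fraction of failing elements is at most `tolerance`."""
    failure_rate = 1.0 - check_output.mean()  # mean of a boolean Series = pass rate
    return failure_rate <= tolerance

s = pd.Series([-1, -2, 3, -4, -5, -6, -7, -8, -9, -10])
passes_with_tolerance(s < 0, tolerance=0.1)   # True: 1 of 10 elements fails
passes_with_tolerance(s < 0, tolerance=0.05)  # False: failure rate 0.1 exceeds 0.05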

cosmicBboy added the enhancement label on Jul 5, 2020
@cosmicBboy (Collaborator, Author)

leaving this without elaboration on implementation details for now, still not sure if it's really needed

@cosmicBboy (Collaborator, Author) commented Apr 8, 2021

@sfczekalski let's discuss implementation details a little further here:

The big question is how to implement this option at the multiple levels of the API:

  1. Check: this makes sense for checks with an output that has the same shape as the original series/dataframe being validated
# built-in checks
pa.Check.lt(0, tolerance=0.1)

# custom checks
pa.Check(lambda x: x < 0, tolerance=0.1)

# this should raise a SchemaDefinitionError
pa.Check(lambda x: x.mean() < 0, tolerance=0.1)  
  2. Column: how to apply different tolerances to the first-class checks?

    • allow a float between 0 and 1 (inclusive) for the nullable and allow_duplicates options (and eventually the unique option from #390, "Add unique keyword option to all schemas and schema components", which will replace allow_duplicates). The semantics of the float input depend on the option (see the sketch after this list), e.g.
      • nullable=0.0 means no nulls are allowed, nullable=1.0 means all values are allowed to be null
      • allow_duplicates=0.0 means no duplicates are allowed, allow_duplicates=1.0 means all values are allowed to be duplicates
  3. DataFrame
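A minimal sketch of what those float semantics could mean in practice, assuming hypothetical helpers (not pandera API) that just compute the null/duplicate fractions:

# illustrative sketch only -- these helpers are hypothetical, not pandera API
import pandas as pd

def null_fraction_ok(series: pd.Series, nullable: float) -> bool:
    """nullable=0.0 -> no nulls allowed; nullable=1.0 -> all values may be null."""
    return series.isna().mean() <= nullable

def duplicate_fraction_ok(series: pd.Series, allow_duplicates: float) -> bool:
    """allow_duplicates=0.0 -> no duplicates; 1.0 -> everything may be duplicated."""
    return series.duplicated().mean() <= allow_duplicates

s = pd.Series([1.0, 2.0, 2.0, None])
null_fraction_ok(s, nullable=0.25)              # True: 1 of 4 values is null
duplicate_fraction_ok(s, allow_duplicates=0.0)  # False: 1 of 4 values is a duplicate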

@sfczekalski

@cosmicBboy

  1. Fair point, totally agree with the SchemaDefinitionError being raised in the third example.
  2. It makes sense to me!
  3. What about Dict[str, float] dictionaries with column names as keys and tolerance floats as values?

@sfczekalski

@cosmicBboy now I started wondering: do you mean that the unique option for DataFrameSchema, in the list-of-columns form, checks for uniqueness of all values in those columns combined? If so, then my point above of course doesn't make much sense. In such a situation it might make sense to allow a tolerance fraction of the values to be duplicated, no matter whether the duplicates are across the columns or only in one. What do you think?

@cosmicBboy (Collaborator, Author)

So for #390 the dataframe-level unique option would behave like this:

# case 1: combination of these three columns must be unique
pa.DataFrameSchema(unique=["col1", "col2", "col3"])

# case 2: different sets of column combinations must be unique
pa.DataFrameSchema(
    unique=[
        ["col1", "col2", "col3"],
        ["col4", "col5"]
    ]
)

And to support a tolerance float there are several options:

  1. A dict with tuple keys specifying columns and float values for tolerance
pa.DataFrameSchema(
    unique={
        ("col1", "col2", "col3"): 0.0,
        ("col4", "col5"): 0.1,
    }
)
  2. A list of 2-tuples where the first element is the column list and the second is the tolerance value
pa.DataFrameSchema(
    unique=[
        (["col1", "col2", "col3"], 0.0),
        (["col4", "col5"], 0.1),
    ]
)
  3. A list of dictionaries with columns and tolerance keys
pa.DataFrameSchema(
    unique=[
       {"columns": [...], "tolerance": ...},
       {"columns": [...], "tolerance": ...}
    ]
)

My preference would be for (1), though I can see the merits of (3) as well.
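To make (1) concrete, a rough sketch of how such a dict could be validated under the proposed semantics (check_unique_with_tolerance is a hypothetical helper, not the actual implementation):

# illustrative sketch only -- check_unique_with_tolerance is hypothetical
from typing import Dict, Tuple

import pandas as pd

def check_unique_with_tolerance(
    df: pd.DataFrame, unique: Dict[Tuple[str, ...], float]
) -> bool:
    """For each column combination, the fraction of duplicated rows must not
    exceed the associated tolerance."""
    return all(
        df.duplicated(subset=list(columns)).mean() <= tolerance
        for columns, tolerance in unique.items()
    )

df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": ["a", "a", "b"],
    "col3": [0.1, 0.1, 0.2],
    "col4": [1, 2, 3],
    "col5": [4, 5, 6],
})
# False: 1 of 3 rows duplicates another over (col1, col2, col3), exceeding tolerance 0.0
check_unique_with_tolerance(
    df, {("col1", "col2", "col3"): 0.0, ("col4", "col5"): 0.1}
)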

> In such a situation it might make sense to allow a tolerance fraction of the values to be duplicated, no matter whether the duplicates are across the columns or only in one. What do you think?

I'm not sure what you mean by allowing a tolerance fraction to be duplicated, can you elaborate on that?

@sfczekalski

> My preference would be for (1), though I can see the merits of (3) as well.

I'd also opt for the first option as it's the most concise.

> I'm not sure what you mean by allowing a tolerance fraction to be duplicated, can you elaborate on that?

Yes, let me give an example:

df = pd.DataFrame.from_dict(
  {
    "col1": [1, 2, 3],
    "col2": [4, 5, 6],
    "col3": ["a", "b", "c"],
    "col4": ["d", "e", "f"],
    "col5": ["g", "h", "i"],
  }
)

schema = pa.DataFrameSchema(
    unique={
        ("col1", "col2"): 1.0,
        ("col3", "col4", "col5"): 1.0,
    }
)

The DataFrame has unique values in columns col1 and col2 when concatenated; it also has unique values in columns col3, col4 when concatenated. Is this what you mean as well?

@cosmicBboy (Collaborator, Author)

> it also has unique values in columns col3, col4 when concatenated

you mean col3, col4, and col5 right?

Yes, this is exactly what I'm thinking. I think we can scope this particular issue to cover:

  • adding the tolerance argument to Check objects,
  • supporting a float in [0, 1] for the nullable and allow_duplicates keyword arguments.

Once #390 is done we can open another issue to tackle the unique tolerance implementation at the dataframe level.

PR would be very welcome! 🙏
