Parse Hypothesis strategies for schema inference, validation, and synthesis #399
Comments
hey @crypdick, yes getting data synthesis right is pretty challenging and I'd like to smooth out the rough edges. Just so I get a better sense of the issue, can you provide an example schema that's causing the health check issues?
@cosmicBboy sure! I had to obfuscate sensitive details but hopefully this gives you a rough idea:

import re
import pandera as pa

# simplified for security
IMG_URL_REGEX = re.compile(
    r"""
    ^s3://              # prefix
    [-a-z0-9]+/
    [A-Za-z0-9-]+/
    20[1-2][0-9]+/      # years 2010-2029
    [A-Za-z0-9-_,.]+/   # including _, commas, period
    v[0-9]/             # versions v0-v9
    2[4-5]/
    [0-9]+/
    [0-9]+              # file name
    [.]png$             # extension, escape dot
    """,
    re.X,
)

valid_ids = {str(i) for i in range(1, 1000)}  # placeholder for more complicated logic

is_valid_img_url = pa.Check.str_matches(IMG_URL_REGEX)
is_not_empty_series = pa.Check(lambda series_: len(series_) > 0, name="Series not empty")
is_valid_id = pa.Check.isin(valid_ids)

# class IDs joined by |
# this later gets filtered to ensure each element is a valid ID (invalid IDs throw Exceptions downstream)
numeric_str_delim_by_pipes = re.compile(
    r"""[0-9]{1,4}       # in reality, this is more complicated
    (\|[0-9]{1,4})*      # 0 or more IDs
    """,
    re.X,
)

grouped_by_s3uri_schema = pa.DataFrameSchema(
    columns={
        "img_url": pa.Column(
            pa.String,
            allow_duplicates=False,
            nullable=False,
            checks=[is_valid_img_url, is_not_empty_series],  # most restrictive check first
        ),
        "labels": pa.Column(
            pa.String,
            checks=[
                pa.Check.str_matches(numeric_str_delim_by_pipes),
                pa.Check(
                    lambda str_delim_pipes: all(id_ in valid_ids for id_ in str_delim_pipes.split("|")),
                    element_wise=True,
                ),
            ],
        ),
    }
)
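
For reference, here's roughly how I'm invoking synthesis (a minimal sketch; .strategy() and .example() are the standard pandera/Hypothesis calls, and the size argument is arbitrary). This is the call that runs into the health-check failures:

# draw a small synthetic dataframe from the schema above; with the regex and
# membership checks in place, Hypothesis's rejection sampling discards most
# draws and the health checks fire
df = grouped_by_s3uri_schema.strategy(size=3).example()
print(df)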
This sounds complicated 😅 (I did sort of look into this when building out the data synthesis functionality but found it way over my head; I'm not sure there's a straightforward way of parsing complex multi-part strategies.)

Note that rejection sampling only occurs after the first check in a column's list of checks.

Another potential solution: custom checks and strategies

The extensions API was designed to give users a way to register custom checks together with the strategies that generate data for them. You can do something like:

import re
from typing import Optional
import hypothesis
import hypothesis.strategies as st
import pandas as pd
import pandera as pa
import pandera.extensions as extensions

# simplified for security
IMG_URL_REGEX = re.compile(
    r"""
    ^s3://              # prefix
    [-a-z0-9]+/
    [A-Za-z0-9-]+/
    20[1-2][0-9]+/      # years 2010-2029
    [A-Za-z0-9-_,.]+/   # including _, commas, period
    v[0-9]/             # versions v0-v9
    2[4-5]/
    [0-9]+/
    [0-9]+              # file name
    [.]png$             # extension, escape dot
    """,
    re.X,
)

NUMERIC_STR_DELIM_REGEX = re.compile(
    r"""[0-9]{1,4}       # in reality, this is more complicated
    (\|[0-9]{1,4})*      # 0 or more IDs
    """,
    re.X,
)

VALID_IDS = [str(i) for i in range(1, 1000)]


# Define custom url strategy and check
def url_strategy(
    pandas_dtype: pa.PandasDtype,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    url_regex,
):
    if strategy is None:
        # replace this with more efficient Hypothesis strategy if desired
        return st.from_regex(url_regex, fullmatch=True)
    raise pa.errors.BaseStrategyOnlyError(
        "'url_strategy' must be a base strategy"
    )


@extensions.register_check_method(
    statistics=["url_regex"],
    strategy=url_strategy,
    supported_types=pd.Series,
)
def valid_url(pandas_obj, *, url_regex):
    """Url regex check."""
    return pandas_obj.str.match(url_regex, na=False)


# Define custom label strategy and check
def labels_strategy(
    pandas_dtype: pa.PandasDtype,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    valid_ids,
):
    if strategy is None:
        # replace this with more efficient Hypothesis strategy if desired
        return st.lists(
            st.sampled_from(VALID_IDS), unique=True, min_size=1
        ).map("|".join)
    raise pa.errors.BaseStrategyOnlyError(
        "'labels_strategy' must be a base strategy"
    )


@extensions.register_check_method(
    statistics=["valid_ids"],
    strategy=labels_strategy,
    supported_types=pd.Series,
)
def valid_labels(pandas_obj, *, valid_ids):
    # combines the regex match check and the valid_ids check
    valid_ids = set(valid_ids)
    return pandas_obj.str.match(NUMERIC_STR_DELIM_REGEX) & (
        pandas_obj.map(lambda x: all(id_ in valid_ids for id_ in x.split("|")))
    )


schema = pa.DataFrameSchema(
    columns={
        "img_url": pa.Column(
            pa.String,
            allow_duplicates=False,
            nullable=False,
            checks=[
                pa.Check.valid_url(url_regex=IMG_URL_REGEX),
                pa.Check(
                    lambda series_: len(series_) > 0, name="Series not empty"
                ),
            ],
        ),
        "labels": pa.Column(
            pa.String,
            checks=pa.Check.valid_labels(valid_ids=VALID_IDS),
        ),
    }
)


@hypothesis.given(schema.strategy(size=5))
def test_schema(df):
    print(df)
    # test something

Phew! Thanks for bearing with me. All that said, I did find some inefficiencies in the way the dataframe strategy was being constructed: #400 << this PR should address them.
Other types of solutions would be fairly heavy lifts, i.e. going into the guts of the strategy internals.
@cosmicBboy Tyvm for the detailed answer! When I copy-paste your code, I get an error.
woops! just edited the code snippet, should work now
@cosmicBboy you beat me to it :) ty for pointing me to the extensions docs, I hadn't noticed this feature before
Ah, yeah. As a Hypothesis core dev I wouldn't try this; complex strategies aren't really designed to be parsed back into constraints. A better approach, at least for element-wise checks, would be to get our "efficient filter rewriting" (HypothesisWorks/hypothesis#2701) done - I think it would be reasonably simple to support numeric bounds and string regex patterns for an initial pass, and that would probably cover a lot of your use-cases.
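
To make the rejection-sampling point concrete, a hypothetical sketch (the URL pattern below is just illustrative) contrasting a .filter()-based strategy, which is satisfied by discarding non-matching draws, with a constructive one that only ever generates valid values — the latter is what the filter-rewriting work in HypothesisWorks/hypothesis#2701 aims to do automatically for simple predicates:

import hypothesis.strategies as st

# Filter-based: draw arbitrary text, then reject anything that doesn't match.
# For a restrictive predicate almost every draw is rejected, which is what
# trips the filter_too_much health check.
filtered_urls = st.text().filter(
    lambda s: s.startswith("s3://") and s.endswith(".png")
)

# Constructive: only generate values that already satisfy the constraint,
# so nothing needs to be rejected.
constructive_urls = st.from_regex(r"s3://[a-z0-9-]+/[0-9]+\.png", fullmatch=True)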
Thanks for the pointers @Zac-HD! Also, just wanted to say I'm a big fan of hypothesis. The current implementation of check strategy chaining in pandera does heavily use .filter().
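
To sketch what that chaining looks like (a hypothetical rendering for illustration, not pandera's actual internals; IMG_URL_REGEX refers to the regex defined earlier in this thread): the first check seeds a base strategy and each additional check tacks on a .filter(), so every extra constraint is handled by rejection sampling:

import hypothesis.strategies as st

# first check (the regex) seeds the base strategy
base = st.from_regex(IMG_URL_REGEX, fullmatch=True)

# each subsequent check becomes another .filter(), i.e. more rejection sampling
chained = base.filter(lambda s: len(s) > 0)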
@cosmicBboy the solution you posted seems to break down when used with indexes. In particular, if you edit the schema to use an index like this:

    index=pa.Index(
        pa.String,
        allow_duplicates=False,
        nullable=False,
        name="img_url",
        checks=[
            is_valid_img_url,
            is_not_empty_series,
        ]
    ),

then run .strategy().example() on the schema, it errors out.
@crypdick do you have a stacktrace of the error? this is definitely a bug
@cosmicBboy sure, here you go:

  File "/home/richard/.config/JetBrains/PyCharm2021.1/scratches/scratch_2.py", line 85, in <module>
    print(multihot_dataset_schema.strategy().example())
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py", line 319, in example
    example_generating_inner_function()
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py", line 307, in example_generating_inner_function
    @settings(
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/core.py", line 1163, in wrapped_test
    raise the_error_hypothesis_found
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandera/strategies.py", line 117, in set_pandas_index
    df_or_series.index = index
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/generic.py", line 5154, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/generic.py", line 564, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 226, in set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 6 elements, new values have 0 elements
hey @crypdick, #410 should fix the issue! there was a bug leading to the length mismatch where the generated df and index would not have the same length. Will merge it in soon.
Sounds great! Cc @daavidstein
@crypdick let me know if you come across any other problems relating to this issue! we can reopen it if needed |
Is your feature request related to a problem? Please describe.
Pandera's new synthesis feature is very attractive, but it breaks if the constraints are non-trivial (e.g. pa.Check.str_matches(complicated_regex)). Related discussion here.

Describe the solution you'd like
Ideally, we could use complex schemas for both validation and synthesis. For that to be possible, Pandera needs to generate data more efficiently, which means not using rejection sampling. Luckily, Hypothesis has solved the efficient synthesis problem, but Hypothesis strategies can't be used for validation.

It would be great if Pandera could parse a Hypothesis data_frames strategy into a Schema for validation:

my_schema = pa.infer_schema(hypothesis.extra.pandas.data_frames(...))

The Schema, then, could use Hypothesis for efficient data generation.
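
For illustration, here is a sketch of what that workflow might look like. data_frames, column, and range_indexes are real Hypothesis APIs; passing a strategy to pa.infer_schema is the hypothetical part (today infer_schema only accepts a dataframe), and the patterns below are simplified stand-ins for the real ones:

import hypothesis.strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

# constructive Hypothesis strategy for the dataframe: generation is efficient,
# no rejection sampling needed
df_strategy = data_frames(
    columns=[
        column(
            "img_url",
            elements=st.from_regex(r"s3://[a-z0-9-]+/[0-9]+\.png", fullmatch=True),
        ),
        column(
            "labels",
            elements=st.lists(
                st.sampled_from([str(i) for i in range(1, 1000)]),
                min_size=1,
                unique=True,
            ).map("|".join),
        ),
    ],
    index=range_indexes(min_size=1),
)

# proposed (hypothetical) API: derive a validation schema from the strategy
# my_schema = pa.infer_schema(df_strategy)  # not supported today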
Describe alternatives you've considered
I've tried disabling Hypothesis's health checks, but:

"Hypothesis literally ran out of random bytes to parse into your dataframe, and there's not really anything that [Hypothesis] can do about that"
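
For reference, suppressing the health checks looks something like the sketch below (settings and HealthCheck are real Hypothesis APIs; the schema name refers to the one defined earlier in this thread). Even with the checks suppressed, generation can still exhaust its entropy budget on a heavily-filtered strategy, which is the failure quoted above:

from hypothesis import HealthCheck, given, settings

@given(grouped_by_s3uri_schema.strategy(size=5))
@settings(suppress_health_check=[HealthCheck.filter_too_much, HealthCheck.too_slow])
def test_schema(df):
    # even a trivial assertion is enough to exercise data generation
    assert not df.empty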
Additional context
We were attracted to Pandera over Great Expectations due to the ability to create hypothesis strategies directly from schemas. This saves us the labor of maintaining separate validation schemas and Hypothesis strategies.