Make Hypothesis strategies more efficient with statistics resolver and reducing use of .filter()
#404
You also mention that ... that's just because nobody has asked for them yet; and if you're working on categorical or object support, we'd love to have that upstream rather than Pandera-specific, if that would work for you 😄
thanks @Zac-HD!
Yes, this is something I noticed when building out the strategy wrappers. Building off of your suggestion, I think there are a couple of possible approaches on the pandera side:
```python
# approach (1): chain strategies, check by check
pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])  # the schema

# currently, the first check is the "base strategy": its statistics are used
# to build the first strategy, and subsequent checks are applied as filters
st.floats(min_value=0).filter(lambda x: x <= 1)

# using map instead of filter
st.floats(min_value=0).map(lambda x: x if x <= 1 else 1)
```
```python
# approach (2): resolve statistics across all checks, then build one strategy
pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])  # the schema

# the statistics resolver would collect all relevant constraints up front
agg_stats = {"min_value": 0, "max_value": 1, ...}
st.floats(agg_stats["min_value"], agg_stats["max_value"], exclude_min=True)
```
```python
# this should raise an error in the statistics resolver if the
# aggregated constraints are incompatible
pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])
```

The pro of (1) is ease of implementation: the subsequent strategies in a chain don't necessarily have to know anything about the constraints of the prior strategies. The con would be potential oversampling of values from the first strategy that don't agree with the second strategy (leading to a lot of `.filter()` rejections and re-draws).

On the other hand, the pro of (2) is that it elegantly handles multiple constraints and can catch incompatible sets of checks off the bat. The con would be potentially more complex logic for implementing the "statistics resolver".
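To make (2) concrete, here's a minimal sketch of what a statistics resolver could look like. Note that `check.statistics` is an assumed mapping and the merge rules are illustrative, not pandera's actual internals:

```python
def resolve_statistics(checks):
    """Merge per-check statistics into one set of kwargs for a base strategy."""
    agg = {}
    for check in checks:
        for key, value in check.statistics.items():  # assumed attribute
            if key == "min_value":
                agg[key] = max(agg.get(key, value), value)  # tightest lower bound wins
            elif key == "max_value":
                agg[key] = min(agg.get(key, value), value)  # tightest upper bound wins
            else:
                agg[key] = value
    lo = agg.get("min_value", float("-inf"))
    hi = agg.get("max_value", float("inf"))
    if lo > hi:
        # incompatible checks are caught before any data is generated
        raise ValueError(f"no value can satisfy all checks: {agg}")
    return agg

# usage sketch: st.floats(**resolve_statistics(column.checks))
```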
Agreed, this feature would be amazing! Let me know if there's anything we can do on our end to help (or even contribute to the hypothesis codebase 😀)
I wouldn't recommend the `.map()` version, as it skews the distribution very badly - you'll tend to end up testing the same forced-to-an-endpoint value essentially every time.
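To see the skew concretely, a quick sampling sketch (the counting harness is illustrative, not part of either library):

```python
from collections import Counter
from hypothesis import given, settings, strategies as st

# approach (1)'s map-based clamp: every draw above 1 becomes exactly 1.0
clamped = st.floats(min_value=0).map(lambda x: min(x, 1.0))

counts = Counter()

@settings(max_examples=200, database=None)
@given(clamped)
def sample(x):
    counts["endpoint" if x == 1.0 else "interior"] += 1

sample()
print(counts)  # the forced-to-1.0 endpoint is heavily over-represented
```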
The statistics resolver is the "right way to do it", though it can also get complicated - which is why I've suggested that we do it upstream 😉
As it happens, there is! I've just opened HypothesisWorks/hypothesis#2853, and HypothesisWorks/hypothesis#2701 (comment) describes the next steps - if I could "just" refactor the strategies and let you write the filter methods and tests for the other integer and float strategies, that would be so helpful! After that we can look at string/regex strategies, and handling lambdas in addition to …
@cosmicBboy - we've just released our first filter rewriting, for … If you change the comparison and equality checks to use e.g. …
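The kind of change being suggested, as a sketch (exactly which predicate forms get rewritten depends on the Hypothesis version; lambda support arrived in later releases):

```python
import operator
from functools import partial
from hypothesis import strategies as st

# an opaque lambda can't be introspected, so Hypothesis falls back to
# rejection sampling: draw, test the predicate, re-draw on failure
opaque = st.integers().filter(lambda x: x > 0)

# partial(operator.lt, 0) means operator.lt(0, x), i.e. 0 < x; predicates
# like this can be recognized and rewritten into st.integers(min_value=1)
rewritten = st.integers().filter(partial(operator.lt, 0))
```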
thanks @Zac-HD! will update the minimum hypothesis version …
That's okay - we're using lambda functions for these right now, so we're not type-annotating anyway
Looks like @Zac-HD followed up with another release! https://github.com/HypothesisWorks/hypothesis/releases/tag/hypothesis-python-6.12.0
Yep! Support for floats is probably a month or two off, and then …

Unfortunately we'll never be able to "rewrite" length filters (😢), because it's possible that …

On another note, you might want to add the …
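Even without rewriting, length filters are easy to avoid by putting size constraints in the strategy itself; a sketch:

```python
from hypothesis import strategies as st

# a length filter rejects whole lists, so complete draws get thrown away
# (and it can't be rewritten into size bounds)
slow = st.lists(st.integers()).filter(lambda xs: len(xs) >= 2)

# passing size bounds to the strategy means every draw is already valid
fast = st.lists(st.integers(), min_size=2)
```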
Heya! I'm a Hypothesis core dev, and stoked that you've found it useful enough to promote to your users 🥰 I also have some suggestions for how to make things more efficient - the short version is "avoid using `.filter()` wherever possible"; the long version is... well, longer. And probably a lot of work for all of us 😅

Filtering is very convenient, but can also cause some performance issues - because it works by rejecting and re-drawing whatever data did not pass the predicate. This is OK (ish, usually) for scalars, but adds up really fast when you're rejecting whole columns or dataframes. So "try to filter elements, not columns" is the first piece of advice, and one that makes sense to implement immediately.
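A sketch of the difference, using plain list strategies to stand in for columns:

```python
from hypothesis import strategies as st

# column-level filter: one bad element rejects the entire 100-element draw;
# in practice this would quickly fail Hypothesis's health checks
column_level = st.lists(st.floats(), min_size=100).filter(
    lambda col: all(x > 0 for x in col)
)

# element-level filter: only the offending element is re-drawn
element_level = st.lists(st.floats().filter(lambda x: x > 0), min_size=100)
```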
More generally, in many cases it's possible to define a strategy which always passes the check, instead of filtering:
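For instance, the gt(0)/le(1) column from earlier can be generated without any filtering at all (a sketch):

```python
from hypothesis import strategies as st

# rejection sampling: draw from all floats, discard everything outside (0, 1]
filtered = st.floats().filter(lambda x: 0 < x <= 1)

# constructive: every generated value already satisfies gt(0) and le(1)
constructive = st.floats(min_value=0, max_value=1, exclude_min=True)
```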
Now... it's basically fine to say this if you're writing it all by hand, but I agree that it's going to be painful to manage from library code. That's why we have HypothesisWorks/hypothesis#2701, a plan to automatically rewrite strategies when simple filter functions are applied - I'd propose that we work on that together rather than have everyone implement it separately, to split the work and share the benefits.