
Make Hypothesis strategies more efficient with statistics resolver and reducing use of .filter() #404

Closed
Zac-HD opened this issue Feb 6, 2021 · 7 comments · Fixed by #1503
Labels
enhancement New feature or request


@Zac-HD
Contributor

Zac-HD commented Feb 6, 2021

Heya! I'm a Hypothesis core dev, and stoked that you've found it useful enough to promote to your users 🥰 I also have some suggestions for how to make things more efficient - the short version is "avoid using .filter() wherever possible"; the long version is... well, longer. And probably a lot of work for all of us 😅

Filtering is very convenient, but can also cause some performance issues - because it works by rejecting and re-drawing whatever data did not pass the predicate. This is OK (ish, usually) for scalars, but adds up really fast when you're rejecting whole columns or dataframes. So "try to filter elements, not columns" is the first piece of advice, and one that makes sense to implement immediately.
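
To make the "filter elements, not columns" advice concrete, here's a minimal sketch using Hypothesis's pandas extra (this is not pandera's actual code, and the 0-to-1 bounds are purely illustrative):

import hypothesis.extra.pandas as pdst
import hypothesis.strategies as st

# Slow: draw a whole column, then reject it if *any* element fails the check,
# so a single bad element throws away the entire series.
column_level = pdst.series(elements=st.floats(allow_nan=False)).filter(
    lambda s: ((s > 0) & (s <= 1)).all()
)

# Better: filter each element as it is drawn, so only individual scalars are rejected.
element_level = pdst.series(
    elements=st.floats(allow_nan=False).filter(lambda x: 0 < x <= 1)
)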

More generally, in many cases it's possible to define a strategy which always passes the check, instead of filtering:

pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])      # the schema
st.floats().filter(lambda x: x > 0).filter(lambda x: x <= 1)   # the current strategy - it's slow!
st.floats(0, 1, exclude_min=True)                              # the ideal strategy - #nofilter 😉

Now... it's basically fine to say this if you're writing it all by hand, but I agree that it's going to be painful to manage from library code. That's why we have HypothesisWorks/hypothesis#2701, a plan to automatically rewrite strategies when simple filter functions are applied - I'd propose that we work on that together rather than have everyone implement it separately, to split the work and share the benefits.

@Zac-HD
Contributor Author

Zac-HD commented Feb 6, 2021

You also mention that

https://github.com/pandera-dev/pandera/blob/637f2bf1745d9053eede2ac15655a03b29210178/pandera/strategies.py#L307-L308

...that's just because nobody has asked for them yet; and if you're working on categorical or object support we'd love to have that upstream rather than Pandera-specific if that would work for you 😄
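
(As an aside, a hypothetical stop-gap for categorical columns until there is upstream support could draw values from the allowed categories and wrap them; the names and category set below are made up, and this is not pandera's implementation:

import pandas as pd
import hypothesis.strategies as st

categories = ["a", "b", "c"]  # illustrative category set
categorical_series = st.lists(st.sampled_from(categories), min_size=1).map(
    lambda values: pd.Series(pd.Categorical(values, categories=categories))
)
)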

@cosmicBboy
Collaborator

thanks @Zac-HD!

This is OK (ish, usually) for scalars, but adds up really fast when you're rejecting whole columns or dataframes. So "try to filter elements, not columns" is the first piece of advice, and one that makes sense to implement immediately.

Yes, this is something I noticed when building out the strategy wrappers, so currently pandera mostly filters elements where it can and only falls back to filtering whole columns/dataframes for user-defined custom checks that haven't been registered with a user-provided strategy via the extensions API.

Building off of your suggestion, I think there are a couple of possible approaches on the pandera strategy-chaining implementation:

  1. use map for strategy chaining:
pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])     # the schema

# currently, the first check provides the "base strategy": its statistics seed the
# initial strategy, and any later checks are applied as filters
st.floats(min_value=0, exclude_min=True).filter(lambda x: x <= 1)

# using map instead
st.floats(min_value=0, exclude_min=True).map(lambda x: x if x <= 1 else 1.0)
  2. introduce the concept of "statistics resolvers" in the backend to aggregate multiple checks into a single strategy. This would throw an error if the resolver finds incompatible statistics.
pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])     # the schema

# statistics resolver would collect all relevant constraints
agg_stats = {"min_value": 0, "max_value": 1, ...}

st.floats(agg_stats["min_value"], agg_stats["max_value"], exclude_min=True)

# an incompatible set of checks, on the other hand, should raise an error in the
# statistics resolver, e.g. x > 1 and x <= 0 is unsatisfiable
pa.Column(float, checks=[pa.Check.gt(1), pa.Check.le(0)])

The pro of (1) is ease of implementation: subsequent strategies in the chain don't need to know anything about the constraints of the prior strategies. The con is the potential to oversample values where the output of the first strategy doesn't agree with the later constraint, since all of those values get forced to the same endpoint (leading to a lot of 1s in the above example).

On the other hand, the pro of (2) is that it elegantly handles multiple constraints and can catch incompatible sets of checks right off the bat. The con is the potentially more complex logic needed to implement the "statistics resolver".
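
For illustration, here is a rough sketch of what a numeric "statistics resolver" could look like; this is not pandera's implementation, and the (op, value) representation of checks and the function name are hypothetical:

import hypothesis.strategies as st

def resolve_float_statistics(checks):
    """Aggregate (op, value) pairs like ("gt", 0), ("le", 1) into st.floats() kwargs."""
    stats = {"min_value": None, "max_value": None,
             "exclude_min": False, "exclude_max": False}
    for op, value in checks:
        if op in ("gt", "ge"):
            # keep the tightest lower bound seen so far
            if stats["min_value"] is None or value >= stats["min_value"]:
                stats["min_value"] = value
                stats["exclude_min"] = op == "gt"
        elif op in ("lt", "le"):
            # keep the tightest upper bound seen so far
            if stats["max_value"] is None or value <= stats["max_value"]:
                stats["max_value"] = value
                stats["exclude_max"] = op == "lt"
        else:
            raise ValueError(f"unsupported check: {op}")
    lo, hi = stats["min_value"], stats["max_value"]
    if lo is not None and hi is not None and lo > hi:
        raise ValueError(f"incompatible checks: min {lo} > max {hi}")
    return stats

# pa.Check.gt(0) and pa.Check.le(1) collapse into a single no-filter strategy
strategy = st.floats(**resolve_float_statistics([("gt", 0), ("le", 1)]))

(This only handles the simple case where both bounds end up set; a real resolver would also cover open-ended ranges, equality checks, and so on.)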

That's why we have HypothesisWorks/hypothesis#2701, a plan to automatically rewrite strategies when simple filter functions are applied - I'd propose that we work on that together rather than have everyone implement it separately, to split the work and share the benefits.

Agreed, this feature would be amazing! Let me know if there's anything we can do on our end to help (or even contribute to the hypothesis codebase 😀)

@Zac-HD
Contributor Author

Zac-HD commented Feb 8, 2021

  1. use .map for strategy chaining:

I wouldn't recommend this, as it skews the distribution very badly - you'll tend to end up testing the same forced-to-an-endpoint value essentially every time.
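
(To see that skew concretely, the snippet below - purely illustrative, not from the thread - samples the mapped strategy and counts how often the forced endpoint shows up:

from collections import Counter
from hypothesis import given, settings, strategies as st

samples = []

@settings(max_examples=500, database=None)
@given(st.floats(min_value=0, allow_nan=False).map(lambda x: x if x <= 1 else 1.0))
def collect(x):
    samples.append(x)

collect()
# typically a large share of draws come out as exactly 1.0, because every value above 1 is forced there
print(Counter(x == 1.0 for x in samples))
)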

introduce concept of "statistics resolvers" in the backend to aggregate multiple checks into a single strategy. This would throw an error if the resolver finds incompatible statistics.

This is the "right way to do it", though it can also get complicated - which is why I've suggested that we do it upstream 😉

Agreed, this feature [automatic filter rewriting] would be amazing! Let me know if there's anything we can do on our end to help (or even contribute to the hypothesis codebase 😀)

As it happens, there is! I've just opened HypothesisWorks/hypothesis#2853, and HypothesisWorks/hypothesis#2701 (comment) describes the next steps - if I could "just" refactor the strategies and let you write the filter methods and tests for other integers and floats strategies, that would be so helpful!

After that we can look at string/regex strategies and handling lambdas in addition to functools.partial(operator.foo, ...), or jump directly to ensuring that Pandera can take advantage of the new functionality for numeric filters 😀

@Zac-HD
Contributor Author

Zac-HD commented Feb 24, 2021

@cosmicBboy - we've just released our first filter rewriting, for st.integers().

If you change the comparison and equality checks to use e.g. functools.partial(operator.ge, min_value) instead of a custom less_than function, it'll just work. The downside is of course no keyword arguments or type annotations 😕
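
For example, a sketch of the pattern (assuming a Hypothesis version with integer filter rewriting; the bound and variable names here are illustrative):

import functools
import operator
import hypothesis.strategies as st

min_value = 0

# Hypothesis recognises partial(operator.le, 0) (i.e. "0 <= x") and rewrites this
# into st.integers(min_value=0), so nothing is rejected at draw time.
fast = st.integers().filter(functools.partial(operator.le, min_value))

# An opaque lambda couldn't be rewritten at the time, so this still rejects values.
slow = st.integers().filter(lambda x: x >= min_value)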

@cosmicBboy
Collaborator

thanks @Zac-HD! will update the minimum hypothesis version and make these improvements.

The downside is of course no keyword arguments or type annotations

That's okay, we're currently using lambda functions for these, so we're not type-annotating anyway.

@antonl
Contributor

antonl commented May 6, 2021

Looks like @Zac-HD followed up with another release! https://github.com/HypothesisWorks/hypothesis/releases/tag/hypothesis-python-6.12.0

@Zac-HD
Contributor Author

Zac-HD commented May 7, 2021

Yep! Support for floats is probably a month or two off, and then text()-filtering-by-regex sometime after that.

Unfortunately we'll never be able to "rewrite" length filters (😢), because it's possible that len would be redefined in the outer scope and then we'd give the wrong answer. There are still some useful ideas to explore here, but they're all quite a lot of work and well in the future.
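
For instance (an illustrative sketch, not code from this thread):

import hypothesis.strategies as st

# A length predicate like this can't be safely rewritten (len might be shadowed),
# so it works by rejecting whole lists...
filtered = st.lists(st.integers()).filter(lambda xs: len(xs) <= 5)

# ...whereas passing the bound to the strategy itself never rejects anything.
bounded = st.lists(st.integers(), max_size=5)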


On another note, you might want to add the Framework :: Hypothesis trove classifier?

@cosmicBboy cosmicBboy changed the title Make Hypothesis strategies more efficient by reducing use of .filter() Make Hypothesis strategies more efficient by with statistics resolver and reducing use of .filter() Jul 15, 2021
@cosmicBboy cosmicBboy changed the title Make Hypothesis strategies more efficient by with statistics resolver and reducing use of .filter() Make Hypothesis strategies more efficient with statistics resolver and reducing use of .filter() Jul 15, 2021