Integrate with Factory Boy? Or other solution? #470
Comments
hi @schlich, thanks for opening up this discussion. Before going any further, are you aware of the I understand the gist of what you're going for, though I'm still a little unclear on what an integration would look like. Since It also might help to provide a small reproducible code snippet illustrating (a) the problem and what you're currently doing to solve it and (b) your ideal solution.
Something like this would work:

import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    column1: Series[int]
    column2: Series[float]
    column3: Series[str]

strategy = Schema.strategy(size=5)
# replace column1 with 100s
new_strategy = strategy.map(lambda df: df.assign(column1=100))
print(new_strategy.example())
# use new_strategy in tests, etc.
I think the problem might come from the complexity of my dataframes -- I have upwards of 20 columns/index levels, many with dependencies on each other. I began trying to approach my problem with strategies, but not only did the strategies end up being very complex (for example, there is no existing check that MultiIndex rows are unique, or a check to guarantee that each element from a list of choices appears at least once, among other things), they were also prohibitively expensive for my TDD workflow, taking up to 30 seconds for each test, even after reducing the number of examples (reducing the size to about 5-10 worked fine, but that seems to defeat the purpose of hypothesis). I tried everything mentioned here but nothing seemed to fit. Additionally, registering all these custom checks and strategies with Pandera was not a simple task. Since simple is better than complex, I got frustrated and ditched the strategies for the Factory Boy approach. I would much rather be able to unit-test by example and integration-test by strategy. Happy to be proven wrong, but I need my test runs to be quick! When I have time, I will try the map/assign pattern above and see if I make any headway. I'll work on producing some reduced examples as well.
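The two constraints mentioned above (unique MultiIndex rows, and every element of a list of choices appearing at least once) are expensive to satisfy by generating random data and filtering, but cheap to satisfy by construction. The following is a minimal, library-free sketch of the constructive idea using only the standard library. The function name `sample_unique_index` and its parameters are hypothetical and are not part of pandera or hypothesis.

```python
import itertools
import random


def sample_unique_index(levels, n_rows, required_level=0, seed=0):
    """Draw n_rows unique index tuples in which every value of one level appears.

    levels: one list of allowed values per index level (values assumed distinct).
    required_level: position of the level whose values must all appear.
    """
    rng = random.Random(seed)
    required = levels[required_level]
    all_keys = list(itertools.product(*levels))
    if not len(required) <= n_rows <= len(all_keys):
        raise ValueError("n_rows must be between len(required) and len(all_keys)")

    # Step 1: pick one key per required value, so coverage holds by construction.
    chosen = []
    for value in required:
        candidates = [key for key in all_keys if key[required_level] == value]
        chosen.append(rng.choice(candidates))

    # Step 2: fill the remainder from unused keys, so uniqueness holds as well.
    used = set(chosen)
    remaining = [key for key in all_keys if key not in used]
    chosen.extend(rng.sample(remaining, n_rows - len(chosen)))

    rng.shuffle(chosen)
    return chosen


keys = sample_unique_index([["a", "b", "c"], [1, 2]], n_rows=5)
print(keys)
```

Because nothing is rejected after the fact, the cost is proportional to the number of candidate keys rather than to the number of failed draws. The same idea could presumably be wrapped in a custom hypothesis strategy, though that wiring is not shown here.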
Thanks for continuing the discussion @schlich! One thing I want to emphasize is that pandera's data synthesis capabilities are still in the early days, but there is one constraint that I think is important to preserve: the symmetry between defining a schema and generating data directly from the constraints defined in the schema itself. Of course the user can always decide to adapt a
Yeah, unfortunately pandera doesn't really lend itself to generating data with a lot of interdependencies between columns. There is this issue #371, but specifics haven't been figured out yet. Will need to think about how to do that. If you don't mind, could you expand on what kinds of dependencies you rely on? It would be helpful to inform the solution for #371.
After #390 is implemented, adding this functionality to
I think that could be implemented as a built-in Check (with an associated strategy), since I also wanted to point you to the extensions module that lets you register checks into the
how many samples do you need for your tests to be meaningful?
I wouldn't be opposed to a Factory Boy integration. I think we can further optimize the functionality (and API) of the hypothesis integration, see how far we can get, and then consider another integration if we still find performance lacking.
As a side note, I've been using the dataframe checks for dependencies between columns. You can add your own using the extensions API as @cosmicBboy said. That said, I think there's a fundamental mismatch between the bottom-up and filter approach that I wonder if we could abstract out the current strategy implementation as one high-level approach, but allow users to replace that with their own top-level dataframe strategy if they want? This proposed FactoryBoy integration could be implemented as a coarse-grained sampling strategy. Was that what you were thinking in #371?
Yes, I'm trying to see if I can accomplish what I need to with the "group" options in the checks (the documentation here could use a little bit of a facelift IMO) but my guess is it will only take me about 50% of the way there... stay tuned!
great discussion! okay, there are a few points that were brought up here worth decomposing:
Yes, the Re: Factory Boy integration, I think the fact that hypothesis and factory boy seem to play well together according to the post referenced by @schlich tells me that for now pandera's entrypoint to synthesizing data should stick with the
Not quite, I think the coarse-grained sampling strategy would be more in line with (2). For #371 I was thinking of native support for conditional checks with a default implementation for the hypothesis strategy. For (2) I'm thinking something like:

import pandera as pa
from hypothesis.strategies import SearchStrategy, builds

from factories import SomeFactory

# takes one argument, which is the schema object that this strategy is applied to,
# and returns a hypothesis strategy
def custom_strategy(schema: pa.DataFrameSchema) -> SearchStrategy:
    # e.g. using factory boy, but this could also be a hypothesis strategy
    return builds(SomeFactory.build, ...)

schema = pa.DataFrameSchema(
    ...,
    strategy=custom_strategy  # the keyword proposed in this comment
)

strategy = schema.strategy()  # to use in test suite
example = schema.example()  # to generate examples on the fly for debugging
Mostly for the purposes of testing and example generation, I would like to see Pandera's schema.example function incorporate patterns similar to Factory Boy:
While this might seem at first like a less-thorough version of what Hypothesis does, the bolded part above (emphasis mine) outlines the functionality I am looking for -- the ability to further constrain properties of a DataFrame in a manner appropriate for testing. While conceivably I can create a new schema with further restrictions, this seems like it would get quickly out of hand, and does not incorporate the advantages offered by Factory Boy's pattern.
Here's what the author of Hypothesis has to say about using Hypothesis with Factory Boy:
Essentially, I would like to see UserFactory.build replaced with schema.example. It might be feasible to allow example to take extra kwargs that may then set values for dataframe columns and indexes.
As is, I've had to resort to constructing my factories without Pandera -- my DF has a lot of columns, and thus I had to deal with a lot of code duplication that could have easily been handled by the schema.
If there is an existing pattern that allows for overriding of fields during example generation (preferably without needing to modify the schema), please let me know!
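To make the request concrete, here is a rough, library-free sketch of what "example with keyword overrides" could look like. It is not pandera's API: `make_example`, the `COLUMNS` mapping, and the per-column generator functions are hypothetical stand-ins for a schema, written with only the standard library so the pattern itself is visible.

```python
import random


def make_example(columns, size=3, seed=0, **overrides):
    """Build column-oriented example data, letting keyword arguments pin columns.

    columns: dict mapping column name -> function(rng, i) returning one value.
    overrides: column name -> a scalar (repeated for every row) or a
        list/tuple with exactly `size` values.
    """
    unknown = set(overrides) - set(columns)
    if unknown:
        raise KeyError(f"unknown columns: {sorted(unknown)}")

    rng = random.Random(seed)
    data = {}
    for name, generate in columns.items():
        if name in overrides:
            value = overrides[name]
            if isinstance(value, (list, tuple)):
                values = list(value)
            else:
                values = [value] * size
            if len(values) != size:
                raise ValueError(f"override for {name!r} must have {size} values")
        else:
            # No override given: fall back to the column's own generator.
            values = [generate(rng, i) for i in range(size)]
        data[name] = values
    return data


# Hypothetical stand-in for a schema: one default generator per column.
COLUMNS = {
    "column1": lambda rng, i: rng.randint(0, 10),
    "column2": lambda rng, i: rng.random(),
    "column3": lambda rng, i: f"row-{i}",
}

example = make_example(COLUMNS, size=3, column1=100)
print(example["column1"])
```

The column definitions live in one place and each test states only the values it cares about, which is the Factory Boy property described above. The resulting dict could be passed to a DataFrame constructor and validated against the real schema.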