Dask Integration #647
ghost started this conversation in Show and tell
This is fantastic @bphillips-exos! 🎉🚀 Native dask support has been on the roadmap for some time (#119, #381). The vision so far is to keep the user-facing API as simple as possible and take on the complexity in the backend, so that in theory users could define a single schema and validate various dataframe-like objects against it. I think we could do it in 2 steps:
def _validate(self, check_obj, **kwargs):
    # current validate function body
    ...

def validate(self, check_obj, **kwargs):
    if isinstance(check_obj, dd.DataFrame):
        # lazily validate each partition of a dask dataframe
        return check_obj.map_partitions(self._validate, **kwargs, meta=check_obj)
    return self._validate(check_obj, **kwargs)
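The dispatch above can be sketched end to end without a dask dependency. `LazyFrame` below is a hypothetical stand-in for `dd.DataFrame`; with real dask, `map_partitions` defers the validation until `.compute()` is called, whereas the stand-in applies it eagerly:

```python
class LazyFrame:
    """Hypothetical stand-in for dask.dataframe.DataFrame."""
    def __init__(self, partitions):
        self.partitions = partitions

    def map_partitions(self, func, **kwargs):
        # Real dask defers this until .compute(); the stand-in runs eagerly.
        return LazyFrame([func(part, **kwargs) for part in self.partitions])


class Schema:
    """Toy schema illustrating the eager/lazy dispatch pattern."""
    def _validate(self, check_obj, **kwargs):
        # Eager validation of a single in-memory object.
        if not all(isinstance(x, int) for x in check_obj):
            raise TypeError("expected a collection of ints")
        return check_obj

    def validate(self, check_obj, **kwargs):
        if isinstance(check_obj, LazyFrame):
            # Lazy path: validate each partition as it is materialized.
            return check_obj.map_partitions(self._validate, **kwargs)
        # Eager path: validate the whole object now.
        return self._validate(check_obj, **kwargs)


schema = Schema()
eager = schema.validate([1, 2, 3])              # validated immediately
lazy = schema.validate(LazyFrame([[1], [2]]))   # validated per partition
```

The key design point is that the user calls a single `validate` entry point; the schema, not the caller, decides whether validation happens now or per partition.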
Any interest in contributing first-class dask support to pandera for step (1)?
Hi, first off, thanks for a great library. I wanted to share how I have used Pandera to address my specific use case and get some feedback on whether there are currently better ways to do this, or whether better support is on the roadmap.
My primary use case was to integrate Pandera with Dask so that we could annotate and validate functions that operate on Dask dataframes. Secondarily, I wanted to support specific hypothesis strategies for data synthesis. This is most often used when testing functions that operate on data received from third parties.
I think this discussion #470, particularly the last comment, addresses the custom hypothesis strategies to some extent. I will open a different discussion for that topic (#648).
Dask Support
First, I defined an equivalent of Pandera's DataFrame type that can be used in similar ways and that Pandera will accept for type checking.
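Such an annotation type can be a thin generic class in the spirit of `pandera.typing.DataFrame`; the `DaskDataFrame` name below is hypothetical, not the actual class from the post:

```python
from typing import Generic, TypeVar, get_args

Schema = TypeVar("Schema")


class DaskDataFrame(Generic[Schema]):
    """Hypothetical annotation-only type: it carries the schema as a type
    parameter so a decorator can recover it at runtime via get_args()."""


# Example: the schema class is recoverable from the annotation itself.
hint = DaskDataFrame[int]
```

Because the class is never instantiated, it costs nothing at runtime; it exists purely so function signatures can name the schema they expect.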
I then subclassed Pandera's DataFrameSchema to modify its validation method. Because of Dask's lazy execution model, validation happens when the dataframe is computed rather than when the validated function is called.
Finally, I needed to subclass pandera's SchemaModel class in order to hook into the to_schema method.
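The hook can be as small as overriding the classmethod so that model-derived schemas come back as the dask-aware schema class. A dask-free sketch, where `SchemaModel`, `DataFrameSchema`, and `DaskDataFrameSchema` are simplified stand-ins for the pandera classes, not their real implementations:

```python
class DataFrameSchema:
    """Stand-in for pandera's DataFrameSchema."""
    def __init__(self, columns):
        self.columns = columns


class DaskDataFrameSchema(DataFrameSchema):
    """Stand-in for the dask-aware schema subclass from the previous step."""


class SchemaModel:
    """Stand-in for pandera's class-based SchemaModel API."""
    @classmethod
    def to_schema(cls):
        # Collect the model's annotated fields into a column mapping.
        return DataFrameSchema(dict(getattr(cls, "__annotations__", {})))


class DaskSchemaModel(SchemaModel):
    @classmethod
    def to_schema(cls):
        # Re-wrap the base schema in the dask-aware subclass.
        base = super().to_schema()
        return DaskDataFrameSchema(base.columns)


class MyModel(DaskSchemaModel):
    x: int
    y: float
```

With this in place, anything that calls `MyModel.to_schema()` transparently receives the lazy-validating schema.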
These classes then let me annotate and validate functions that operate on Dask dataframes.
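A hypothetical sketch of what that enables: annotation-driven validation via a decorator. The decorator below is a toy reimplementation in the spirit of pandera's `check_types`, and the schema and function names are illustrative, not taken from the post:

```python
import functools
import inspect


def check_types(fn):
    """Toy decorator: validates any argument whose annotation
    exposes a .validate() method (in the spirit of pandera.check_types)."""
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            annotation = sig.parameters[name].annotation
            if hasattr(annotation, "validate"):
                annotation.validate(value)
        return fn(*args, **kwargs)

    return wrapper


class IntListSchema:
    """Toy schema used as a function annotation."""
    @staticmethod
    def validate(obj):
        if not all(isinstance(x, int) for x in obj):
            raise TypeError("expected ints")
        return obj


@check_types
def double(xs: IntListSchema) -> list:
    return [x * 2 for x in xs]
```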
Because of Dask's use of Pandas DataFrames under the hood, this integration was fairly straightforward.