Dask Integration #647
ghost started this conversation in Show and tell
This is fantastic @bphillips-exos! 🎉🚀 Native dask support has been on the roadmap for some time (#119, #381). The vision so far is to keep the user-facing API as simple as possible and take on the complexity in the backend, so that in theory users could define a single schema and validate various dataframe-like objects against it. I think we could do it in 2 steps:
def _validate(self, check_obj, **kwargs):
    # current validate function body
    ...

def validate(self, check_obj, **kwargs):
    if isinstance(check_obj, dd.DataFrame):
        # lazily validate each partition of a dask dataframe
        return check_obj.map_partitions(self._validate, **kwargs, meta=check_obj)
    return self._validate(check_obj, **kwargs)
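The dispatch above can be sketched end to end without a dask dependency. `LazyFrame` below is a hypothetical stand-in for `dd.DataFrame`; with real dask, `map_partitions` defers the validation until `.compute()` is called, whereas the stand-in applies it eagerly:

```python
class LazyFrame:
    """Hypothetical stand-in for dask.dataframe.DataFrame."""
    def __init__(self, partitions):
        self.partitions = partitions

    def map_partitions(self, func, **kwargs):
        # Real dask defers this until .compute(); the stand-in runs eagerly.
        return LazyFrame([func(part, **kwargs) for part in self.partitions])


class Schema:
    """Toy schema illustrating the eager/lazy dispatch pattern."""
    def _validate(self, check_obj, **kwargs):
        # Eager validation of a single in-memory object.
        if not all(isinstance(x, int) for x in check_obj):
            raise TypeError("expected a collection of ints")
        return check_obj

    def validate(self, check_obj, **kwargs):
        if isinstance(check_obj, LazyFrame):
            # Lazy path: validate each partition as it is materialized.
            return check_obj.map_partitions(self._validate, **kwargs)
        # Eager path: validate the whole object now.
        return self._validate(check_obj, **kwargs)


schema = Schema()
eager = schema.validate([1, 2, 3])              # validated immediately
lazy = schema.validate(LazyFrame([[1], [2]]))   # validated per partition
```

The key design point is that the user calls a single `validate` entry point; the schema, not the caller, decides whether validation happens now or per partition.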
Any interest in contributing first-class dask support to pandera for step (1)?
Hi, first off, thanks for a great library. I wanted to share how I have used Pandera to address my specific use case and get some feedback on whether there are currently better ways to do this, or whether better support is on the roadmap.
My primary use case was to integrate Pandera with Dask so that we could annotate and validate functions that operate on Dask dataframes. Secondarily, I wanted to support specific hypothesis strategies for data synthesis. This is most often used when testing functions that operate on data received from third parties.
I think this discussion #470, particularly the last comment, addresses the custom hypothesis strategies to some extent. I will open a different discussion for that topic (#648).
Dask Support
First, I defined an equivalent of Pandera's DataFrame type that can be used in similar ways and that Pandera will accept for type checking.
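Such an annotation type can be a thin generic class in the spirit of `pandera.typing.DataFrame`; the `DaskDataFrame` name below is hypothetical, not the actual class from the post:

```python
from typing import Generic, TypeVar, get_args

Schema = TypeVar("Schema")


class DaskDataFrame(Generic[Schema]):
    """Hypothetical annotation-only type: it carries the schema as a type
    parameter so a decorator can recover it at runtime via get_args()."""


# Example: the schema class is recoverable from the annotation itself.
hint = DaskDataFrame[int]
```

Because the class is never instantiated, it costs nothing at runtime; it exists purely so function signatures can name the schema they expect.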
I then subclassed Pandera's DataFrameSchema to modify its validation method. Because of Dask's lazy execution model, validation happens when the dataframe is computed rather than when the validated function is called.
Finally, I needed to subclass pandera's SchemaModel class in order to hook into the to_schema method.
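The hook can be as small as overriding the classmethod so that model-derived schemas come back as the dask-aware schema class. A dask-free sketch, where `SchemaModel`, `DataFrameSchema`, and `DaskDataFrameSchema` are simplified stand-ins for the pandera classes, not their real implementations:

```python
class DataFrameSchema:
    """Stand-in for pandera's DataFrameSchema."""
    def __init__(self, columns):
        self.columns = columns


class DaskDataFrameSchema(DataFrameSchema):
    """Stand-in for the dask-aware schema subclass from the previous step."""


class SchemaModel:
    """Stand-in for pandera's class-based SchemaModel API."""
    @classmethod
    def to_schema(cls):
        # Collect the model's annotated fields into a column mapping.
        return DataFrameSchema(dict(getattr(cls, "__annotations__", {})))


class DaskSchemaModel(SchemaModel):
    @classmethod
    def to_schema(cls):
        # Re-wrap the base schema in the dask-aware subclass.
        base = super().to_schema()
        return DaskDataFrameSchema(base.columns)


class MyModel(DaskSchemaModel):
    x: int
    y: float
```

With this in place, anything that calls `MyModel.to_schema()` transparently receives the lazy-validating schema.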
These classes then let me annotate and validate functions that operate on Dask dataframes.
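A hypothetical sketch of what that enables: annotation-driven validation via a decorator. The decorator below is a toy reimplementation in the spirit of pandera's `check_types`, and the schema and function names are illustrative, not taken from the post:

```python
import functools
import inspect


def check_types(fn):
    """Toy decorator: validates any argument whose annotation
    exposes a .validate() method (in the spirit of pandera.check_types)."""
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            annotation = sig.parameters[name].annotation
            if hasattr(annotation, "validate"):
                annotation.validate(value)
        return fn(*args, **kwargs)

    return wrapper


class IntListSchema:
    """Toy schema used as a function annotation."""
    @staticmethod
    def validate(obj):
        if not all(isinstance(x, int) for x in obj):
            raise TypeError("expected ints")
        return obj


@check_types
def double(xs: IntListSchema) -> list:
    return [x * 2 for x in xs]
```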
Because of Dask's use of Pandas DataFrames under the hood, this integration was fairly straightforward.