Fast validation of large pyspark dataframes #1312
-
I have some large (many TB) pyspark dataframes which I'd like to validate using pandera and the new pyspark SQL interface. It's sufficient to mostly look at the datatypes and perhaps a few rows. To this end, I've been using something like the following:

```python
import json

import pandera.pyspark as pa
import pyspark.sql.types as T


class MySchema(pa.DataFrameModel):
    ts: T.IntegerType() = pa.Field(nullable=False)
    column1: T.StringType() = pa.Field(nullable=True)
    columns2: T.StringType() = pa.Field(nullable=False)


# dataframe with a couple billion rows
data = spark.read.format("delta").load("/some/path")

validation_output = MySchema.validate(data, head=100)
print(json.dumps(dict(validation_output.pandera.errors), indent=2))
```

This turns out to be very slow, unfortunately. Two questions now:
Thank you!
Replies: 3 comments 1 reply
-
So the `head` kwarg isn't actually used in the validate method (it's there for API compatibility)... this should really raise an error or warning.

A few questions:

- With `head=100` you don't actually want to validate the entire dataset. Is that correct?

The way you're using it is currently the recommended way to pull out the errors.

@NeerajMalhotra-QB @jaskaransinghsidana
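Since `head` is ignored, one hedged workaround is to truncate the input DataFrame yourself before validating. The sketch below is an assumption, not something recommended elsewhere in this thread; the schema and path are copied from the question, and the `limit(100)` call is the added piece:

```python
import json

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


class MySchema(pa.DataFrameModel):
    ts: T.IntegerType() = pa.Field(nullable=False)
    column1: T.StringType() = pa.Field(nullable=True)
    columns2: T.StringType() = pa.Field(nullable=False)


# `head` is ignored by the pyspark backend, so truncate the input explicitly;
# limit(100) lets Spark stop after a few rows instead of scanning every partition.
sample = spark.read.format("delta").load("/some/path").limit(100)

validated = MySchema.validate(sample)
print(json.dumps(dict(validated.pandera.errors), indent=2))
```

Whether a 100-row slice is representative enough is exactly the question the replies below get into.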
-
This is not an easy problem to solve because of how Spark handles schemas. Since Spark follows schema-on-read, the issue gets complicated; @NeerajMalhotra-QB and I thought about it in quite a bit of detail. Consider a non-nullable column: if a value is null, the Spark schema will simply mark the column as nullable, and no error is raised because schema-on-read is not enforced. Hence, from a pandera perspective, it validates all the data; if I read only a few partitions to validate, I may skip the partition that actually contains the null. But @cosmicBboy, I think now could be the right time to debate sample-based data validation for pyspark, including how we communicate it to users and the implications of this approach on big data. I have also seen some interest in this feature in my conversations with a few users, but I think some user research might be necessary to find the common pitfalls.
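To make the schema-on-read point concrete, a small self-contained illustration (not from the discussion itself; the column names are borrowed from the question):

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows, one of which violates the "ts must not be null" expectation.
df = spark.createDataFrame([Row(ts=1, column1="a"), Row(ts=None, column1="b")])

# Schema-on-read: Spark infers `ts` as nullable and raises no error on read,
# so the violation only surfaces when the data itself is scanned.
df.printSchema()

print(df.filter("ts IS NULL").count())  # 1 -- the full data does contain a null

# A sampled/partial validation can miss the offending row entirely...
sampled = df.sample(fraction=0.5, seed=42)
print(sampled.filter("ts IS NULL").count())  # 0 or 1, depending on the sample

# ...whereas validating the whole DataFrame against a `nullable=False` field
# has to look at every row to be sure the column really contains no nulls.
```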
-
@DanielLenz, the way validation is designed is to report the first error it encounters, if any. Subset-only validation via arguments such as `head` isn't something the pyspark backend does; we kept the interface the same as the pandas API for compatibility. If I understood your use case correctly, you are seeking a streaming pipeline with schema validation in real time. My recommendations for your use case would be:
I hope it helps.
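For illustration only, here is a hedged sketch of what per-micro-batch validation in such a streaming pipeline could look like, assuming a Delta Structured Streaming source and `foreachBatch` fit the use case; the checkpoint path and error handling are placeholders, and this is not one of the recommendations referred to above:

```python
import json

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


class MySchema(pa.DataFrameModel):
    # Same model as in the original question.
    ts: T.IntegerType() = pa.Field(nullable=False)
    column1: T.StringType() = pa.Field(nullable=True)
    columns2: T.StringType() = pa.Field(nullable=False)


def validate_batch(batch_df, batch_id):
    """Validate one micro-batch and report any schema errors."""
    validated = MySchema.validate(batch_df)
    errors = dict(validated.pandera.errors)
    if errors:
        # Route failures wherever makes sense: logging, a quarantine table, etc.
        print(f"batch {batch_id} failed validation:\n{json.dumps(errors, indent=2)}")


query = (
    spark.readStream.format("delta")
    .load("/some/path")
    .writeStream.foreachBatch(validate_batch)
    .option("checkpointLocation", "/some/checkpoint/path")  # placeholder path
    .start()
)
```

Validating every micro-batch keeps each validation small, at the cost of running the checks continuously rather than once over the full table.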