Fast validation of large pyspark dataframes #1312
-
I have some large (many TB) pyspark dataframes which I'd like to validate using pandera and the new pyspark SQL interface. It's sufficient to mostly look at the datatypes and perhaps a few rows. To this end, I've been using something like the following:

```python
import json

import pandera.pyspark as pa
import pyspark.sql.types as T


class MySchema(pa.DataFrameModel):
    ts: T.IntegerType() = pa.Field(nullable=False)
    column1: T.StringType() = pa.Field(nullable=True)
    columns2: T.StringType() = pa.Field(nullable=False)


# dataframe with a couple billion rows
data = spark.read.format("delta").load("/some/path")

validation_output = MySchema.validate(data, head=100)
print(json.dumps(dict(validation_output.pandera.errors), indent=2))
```

This turns out to be very slow, unfortunately. Two questions now:
Thank you!
Replies: 3 comments 1 reply
-
So the `head` kwarg isn't actually used in the validate method (it's there for API compatibility)... this should really raise an error or warning.

A few questions:

- With `head=100` you don't actually want to validate the entire dataset. Is that correct?

The way you're using it is currently the recommended way to pull out the errors.

@NeerajMalhotra-QB @jaskaransinghsidana
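Since `head` is ignored, one hedged workaround is to truncate the input DataFrame yourself before validating. The sketch below is an assumption, not something recommended elsewhere in this thread; the schema and path are copied from the question, and the `limit(100)` call is the added piece:

```python
import json

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


class MySchema(pa.DataFrameModel):
    ts: T.IntegerType() = pa.Field(nullable=False)
    column1: T.StringType() = pa.Field(nullable=True)
    columns2: T.StringType() = pa.Field(nullable=False)


# `head` is ignored by the pyspark backend, so truncate the input explicitly;
# limit(100) lets Spark stop after a few rows instead of scanning every partition.
sample = spark.read.format("delta").load("/some/path").limit(100)

validated = MySchema.validate(sample)
print(json.dumps(dict(validated.pandera.errors), indent=2))
```

Whether a 100-row slice is representative enough is exactly the question the replies below get into.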
-
This is not an easy problem to solve because of how Spark handles schemas. Since Spark follows schema-on-read, the issue gets complicated; @NeerajMalhotra-QB and I thought about it in quite a bit of detail. Consider a non-nullable column: if a value is null, the Spark schema will simply mark the column as nullable, and no error is raised because schema-on-read is not enforced. Hence, from a pandera perspective, it validates all the data; if I read only a few partitions to validate, I may skip the partition that actually contains the null. But @cosmicBboy, I think now could be the right time to debate sample-based data validation for pyspark, including how we communicate it to users and the implications of this approach on big data. I have also seen some interest in this feature in my conversations with a few users, but I think some user research might be necessary to find the common pitfalls.
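To make the schema-on-read point concrete, a small self-contained illustration (not from the discussion itself; the column names are borrowed from the question):

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows, one of which violates the "ts must not be null" expectation.
df = spark.createDataFrame([Row(ts=1, column1="a"), Row(ts=None, column1="b")])

# Schema-on-read: Spark infers `ts` as nullable and raises no error on read,
# so the violation only surfaces when the data itself is scanned.
df.printSchema()

print(df.filter("ts IS NULL").count())  # 1 -- the full data does contain a null

# A sampled/partial validation can miss the offending row entirely...
sampled = df.sample(fraction=0.5, seed=42)
print(sampled.filter("ts IS NULL").count())  # 0 or 1, depending on the sample

# ...whereas validating the whole DataFrame against a `nullable=False` field
# has to look at every row to be sure the column really contains no nulls.
```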
-
@DanielLenz, the way validation is designed is to report the first error it encounters, if any. Subset-only validation via arguments such as `head` isn't something the pyspark backend does; we kept the interface the same as the pandas API for compatibility. If I understood your use case correctly, you are seeking a streaming pipeline with schema validation in real time. My recommendations for your use case would be:
I hope it helps.
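For illustration only, here is a hedged sketch of what per-micro-batch validation in such a streaming pipeline could look like, assuming a Delta Structured Streaming source and `foreachBatch` fit the use case; the checkpoint path and error handling are placeholders, and this is not one of the recommendations referred to above:

```python
import json

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


class MySchema(pa.DataFrameModel):
    # Same model as in the original question.
    ts: T.IntegerType() = pa.Field(nullable=False)
    column1: T.StringType() = pa.Field(nullable=True)
    columns2: T.StringType() = pa.Field(nullable=False)


def validate_batch(batch_df, batch_id):
    """Validate one micro-batch and report any schema errors."""
    validated = MySchema.validate(batch_df)
    errors = dict(validated.pandera.errors)
    if errors:
        # Route failures wherever makes sense: logging, a quarantine table, etc.
        print(f"batch {batch_id} failed validation:\n{json.dumps(errors, indent=2)}")


query = (
    spark.readStream.format("delta")
    .load("/some/path")
    .writeStream.foreachBatch(validate_batch)
    .option("checkpointLocation", "/some/checkpoint/path")  # placeholder path
    .start()
)
```

Validating every micro-batch keeps each validation small, at the cost of running the checks continuously rather than once over the full table.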