Fix for inability to read some parquet files (issue #816) (#817)
* add polars to try to read some troublesome parquet files into arrow tables

Signed-off-by: David Wood <[email protected]>

* fix bug in convert_binary_to_arrow() by returning table from polars

Signed-off-by: David Wood <[email protected]>

* update convert_binary_to_arrow() by catching exceptions from polars

Signed-off-by: David Wood <[email protected]>

* change filter's duckdb setting to allow large buffers on arrow tables

Signed-off-by: David Wood <[email protected]>

* turn off changes to filter for now

Signed-off-by: David Wood <[email protected]>

* add polars to core library

Signed-off-by: David Wood <[email protected]>

* add comment to say why we're adding polars for reading some parquet files

Signed-off-by: David Wood <[email protected]>

* pin core lib polars>=1.16.0

Signed-off-by: David Wood <[email protected]>

* change failure on polars read from warning to error

Signed-off-by: David Wood <[email protected]>

* remove comments on duckdb settings for multimodal in FilterTransform.init().

Signed-off-by: David Wood <[email protected]>

* downgrade polars to >=1.9.0

Signed-off-by: David Wood <[email protected]>

---------

Signed-off-by: David Wood <[email protected]>
daw3rd authored Dec 2, 2024
1 parent ec89271 commit 4171dfa
Showing 2 changed files with 17 additions and 2 deletions.
1 change: 1 addition & 0 deletions data-processing-lib/python/requirements.txt
@@ -4,3 +4,4 @@
 argparse
 mmh3
 psutil
+polars>=1.9.0
@@ -11,6 +11,7 @@
 ################################################################################
 
 import hashlib
+import io
 import os
 import string
 import sys
@@ -144,8 +145,21 @@ def convert_binary_to_arrow(data: bytes, schema: pa.schema = None) -> pa.Table:
             table = pq.read_table(reader, schema=schema)
             return table
         except Exception as e:
-            logger.error(f"Failed to convert byte array to arrow table, exception {e}. Skipping it")
-            return None
+            logger.warning(f"Could not convert bytes to pyarrow: {e}")
+
+            # We have seen this exception before when using pyarrow, but polars does not throw it.
+            # "Nested data conversions not implemented for chunked array outputs"
+            # See issue 816 https://github.com/IBM/data-prep-kit/issues/816.
+            logger.info(f"Attempting read of pyarrow Table using polars")
+            try:
+                import polars
+
+                df = polars.read_parquet(io.BytesIO(data))
+                table = df.to_arrow()
+            except Exception as e:
+                logger.error(f"Could not convert bytes to pyarrow using polars: {e}. Skipping.")
+                table = None
+            return table
 
     @staticmethod
     def convert_arrow_to_binary(table: pa.Table) -> bytes:
