[Bug] Parquet files with columns containing large lists of byte arrays cannot be read by pyarrow. #816

Closed
2 tasks done
daw3rd opened this issue Nov 20, 2024 · 1 comment
Labels
bug Something isn't working

Comments

Member

daw3rd commented Nov 20, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/core

What happened + What you expected to happen

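I have a parquet file with columns that each hold a list of images stored as byte arrays (image_bins and blurred_images in the metadata below). Under some circumstances pyarrow cannot read such files: in the run below, the conversion fails with "Nested data conversions not implemented for chunked array outputs", the file is skipped, and the process ends with a segmentation fault.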

Reproduction script

Download https://ibm.ent.box.com/file/1684883605503?s=9qcne0iubeji6t6a77gh2sxgi29k1tjp and save it as test.parquet, then run:

cd transforms/universal/noop/python
mkdir input
cp .../test.parquet input
make venv
source venv/bin/activate
python src/noop_transform_python.py --data_local_config  "{'input_folder': 'input', 'output_folder':'output'}"
09:37:35 INFO - Launching noop transform
09:37:35 INFO - noop parameters are : {'sleep_sec': 1, 'pwd': 'nothing'}
09:37:35 INFO - pipeline id pipeline_id
09:37:35 INFO - code location None
09:37:35 INFO - data factory data_ is using local data access: input_folder - input output_folder - output
09:37:35 INFO - data factory data_ max_files -1, n_sample -1
09:37:35 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
09:37:35 INFO - orchestrator noop started at 2024-11-20 09:37:35
09:37:35 INFO - Number of files is 1, source profile {'max_file_size': 341.400185585022, 'min_file_size': 341.400185585022, 'total_file_size': 341.400185585022}
09:37:39 ERROR - Failed to convert byte array to arrow table, exception Nested data conversions not implemented for chunked array outputs. Skipping it
09:37:39 WARNING - Transformation of file to table failed
09:37:39 INFO - Completed 1 files (100.0%) in 0.07 min
09:37:39 INFO - Done processing 1 files, waiting for flush() completion.
09:37:39 INFO - done flushing in 0.0 sec
09:37:39 INFO - Completed execution in 0.07 min, execution result 0
Segmentation fault: 11

Anything else

(venv) dawood@MacBookPro:~/git/data-prep-kit/transforms/universal/noop/python$ parquet-tools inspect input/test.parquet 

############ file meta data ############
created_by: parquet-cpp-arrow version 16.1.0
num_columns: 16
num_rows: 1000
num_row_groups: 1
format_version: 2.6
serialized_size: 4643


############ Columns ############
id
contents
source
element
from
value
orig_json_cos_path
element
index
element
element
nfaces
nsfw
nsfw_score
prompted_child_score
prompted_nsfw_score

############ Column(id) ############
name: id
path: id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 82%)

############ Column(contents) ############
name: contents
path: contents
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 51%)

############ Column(source) ############
name: source
path: source
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -7%)

############ Column(element) ############
name: element
path: orig_image_fpaths.list.element
max_definition_level: 3
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 86%)

############ Column(from) ############
name: from
path: conversations.list.element.from
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 81%)

############ Column(value) ############
name: value
path: conversations.list.element.value
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 50%)

############ Column(orig_json_cos_path) ############
name: orig_json_cos_path
path: orig_json_cos_path
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 1%)

############ Column(element) ############
name: element
path: image_bins.list.element
max_definition_level: 3
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(index) ############
name: index
path: index
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 43%)

############ Column(element) ############
name: element
path: fixed_image_fpaths.list.element
max_definition_level: 3
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 81%)

############ Column(element) ############
name: element
path: blurred_images.list.element
max_definition_level: 3
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(nfaces) ############
name: nfaces
path: nfaces
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(nsfw) ############
name: nsfw
path: nsfw
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(nsfw_score) ############
name: nsfw_score
path: nsfw_score
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 14%)

############ Column(prompted_child_score) ############
name: prompted_child_score
path: prompted_child_score
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 11%)

############ Column(prompted_nsfw_score) ############
name: prompted_nsfw_score
path: prompted_nsfw_score
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)
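
The same metadata can also be pulled from Python to list the nested byte-array columns, which are the candidates for the "Nested data conversions" error above. This is only a sketch: `ColumnInfo`, `nested_byte_array_columns` and `columns_from_file` are hypothetical helper names, not part of data-prep-kit, and the filter (repeated BYTE_ARRAY columns) is an assumption about what to look for, not a confirmed diagnosis of which column triggers the failure.

```python
from typing import Iterable, List, NamedTuple


class ColumnInfo(NamedTuple):
    """The subset of per-column parquet metadata used by the filter below."""
    path: str
    physical_type: str
    max_repetition_level: int


def nested_byte_array_columns(columns: Iterable[ColumnInfo]) -> List[str]:
    """Return paths of BYTE_ARRAY columns that sit inside a repeated (list) field."""
    return [
        c.path
        for c in columns
        if c.physical_type == "BYTE_ARRAY" and c.max_repetition_level > 0
    ]


def columns_from_file(parquet_path: str) -> List[ColumnInfo]:
    """Read column metadata with pyarrow (imported lazily; only the footer is read)."""
    import pyarrow.parquet as pq

    meta = pq.read_metadata(parquet_path)
    infos = []
    for i in range(meta.num_columns):
        col = meta.schema.column(i)
        infos.append(ColumnInfo(col.path, col.physical_type, col.max_repetition_level))
    return infos


# A few rows copied from the parquet-tools output above.
reported = [
    ColumnInfo("id", "BYTE_ARRAY", 0),
    ColumnInfo("image_bins.list.element", "BYTE_ARRAY", 1),
    ColumnInfo("index", "INT64", 0),
    ColumnInfo("blurred_images.list.element", "BYTE_ARRAY", 1),
]
print(nested_byte_array_columns(reported))
# ['image_bins.list.element', 'blurred_images.list.element']
```

With the real file, `nested_byte_array_columns(columns_from_file("input/test.parquet"))` would also list the nested string columns such as orig_image_fpaths.list.element, since those are BYTE_ARRAY as well.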

OS

Ubuntu, MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
daw3rd added the bug (Something isn't working) label Nov 20, 2024
daw3rd added a commit that referenced this issue Dec 2, 2024
* add polars to try to read some troublesome parquet files into arrow tables
* fix bug in convert_binary_to_arrow() by returning table from polars
* update convert_binary_to_arrow() by catching exceptions from polars
* change filter's duckdb setting to allow large buffers on arrow tables
* turn off changes to filter for now
* add polars to core library
* add comment to say why we're adding polars for reading some parquet files
* pin core lib polars>=1.16.0
* change failure on polars read from warning to error
* remove comments on duckdb settings for multimodal in FilterTransform.init().
* downgrade polars to >=1.9.0

---------

Signed-off-by: David Wood <[email protected]>
Member Author

daw3rd commented Dec 12, 2024

Fixed in pr #817
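
For readers who hit the same error, the approach described in the commit notes (try pyarrow first, fall back to polars, catch exceptions from both) might look roughly like the sketch below. It is not the actual convert_binary_to_arrow() from the PR: the function name `bytes_to_arrow_table`, the `readers` parameter and the logging levels are assumptions made for illustration. The pyarrow and polars imports are done lazily inside the reader functions so the fallback logic can be exercised without either library installed.

```python
import io
import logging
from typing import Any, Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def _read_with_pyarrow(data: bytes) -> Any:
    """Parse parquet bytes with pyarrow; raises on files pyarrow cannot convert."""
    import pyarrow.parquet as pq

    return pq.read_table(io.BytesIO(data))


def _read_with_polars(data: bytes) -> Any:
    """Parse parquet bytes with polars, then export the frame as an Arrow table."""
    import polars as pl

    return pl.read_parquet(io.BytesIO(data)).to_arrow()


def bytes_to_arrow_table(
    data: bytes,
    readers: Sequence[Callable[[bytes], Any]] = (_read_with_pyarrow, _read_with_polars),
) -> Optional[Any]:
    """Try each reader in order; return the first table, or None if all fail."""
    for reader in readers:
        try:
            return reader(data)
        except Exception as exc:  # any reader failure just moves on to the next reader
            logger.warning("%s failed to read parquet bytes: %s", reader.__name__, exc)
    logger.error("No reader could convert the byte array to an arrow table")
    return None
```

A caller would pass the raw file contents, for example `bytes_to_arrow_table(open("input/test.parquet", "rb").read())`. A table that comes back through the polars path may use different Arrow column types than one read by pyarrow directly, so downstream code that depends on exact types should check the schema.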

daw3rd closed this as completed Dec 12, 2024