pyarrow.lib.ArrowInvalid: Value 2147486084 too large to fit in C integer type #1

Open
Jiam1ng opened this issue May 8, 2024 · 0 comments

Jiam1ng commented May 8, 2024

Hi there,

I am trying to create a dataset from my own data for Geneformer. When I run the tokenization, the following error occurs:

Creating dataset.
Traceback (most recent call last):
File "", line 1, in
File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/geneformer/tokenizer.py", line 137, in tokenize_data
tokenized_dataset = self.create_dataset(
File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/geneformer/tokenizer.py", line 330, in create_dataset
output_dataset = Dataset.from_dict(dataset_dict)
File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 910, in from_dict
pa_table = InMemoryTable.from_pydict(mapping=mapping)
File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/table.py", line 799, in from_pydict
return cls(pa.Table.from_pydict(*args, **kwargs))
File "pyarrow/table.pxi", line 3725, in pyarrow.lib.Table.from_pydict
File "pyarrow/table.pxi", line 5254, in pyarrow.lib._from_pydict
File "pyarrow/array.pxi", line 350, in pyarrow.lib.asarray
File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/arrow_writer.py", line 187, in arrow_array
out = list_of_np_array_to_pyarrow_listarray(data)
File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/features/features.py", line 1428, in list_of_np_array_to_pyarrow_listarray
return list_of_pa_arrays_to_pyarrow_listarray(
File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/features/features.py", line 1420, in list_of_pa_arrays_to_pyarrow_listarray
offsets = pa.array(offsets, type=pa.int32())
File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2147486084 too large to fit in C integer type
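
For reference, the value in the error is 2437 above the signed 32-bit maximum (2147483647). The last frames show `offsets` being cast to `pa.int32()`, so this looks consistent with the cumulative token count across all cells overflowing the int32 list offsets. Below is a minimal sketch of one possible workaround, assuming that reading is right: split the cells into index ranges whose summed lengths each stay within the limit. `chunk_bounds` and `lengths` are hypothetical names for illustration, not part of Geneformer or `datasets`.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value an int32 offset can hold

def chunk_bounds(lengths, limit=INT32_MAX):
    """Yield (start, stop) index ranges whose summed lengths stay <= limit.

    `lengths` is a list with the number of tokens per cell (hypothetical input).
    """
    start, total = 0, 0
    for i, n in enumerate(lengths):
        if n > limit:
            raise ValueError(f"item {i} alone exceeds the limit ({n} > {limit})")
        if total + n > limit:
            # Close the current range before this item would overflow it.
            yield start, i
            start, total = i, 0
        total += n
    if start < len(lengths):
        yield start, len(lengths)

# Small-scale illustration with an artificial limit of 7 tokens per chunk.
print(list(chunk_bounds([3, 4, 2, 5], limit=7)))
# The value from the traceback does not fit in int32.
print(2147486084 > INT32_MAX)
```

Each range could then, in principle, be passed to `Dataset.from_dict` on its own and the pieces combined with `datasets.concatenate_datasets`. I have not verified this on this data.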

I believe the original Geneformer dataset is larger than mine, so I would not expect the size alone to be the problem. Do you have any suggestions for solving this issue? Thanks in advance for your help.
