I am trying to create a dataset from my own data for Geneformer. During tokenization, the following error occurs:
Creating dataset.
Traceback (most recent call last):
  File "", line 1, in
  File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/geneformer/tokenizer.py", line 137, in tokenize_data
    tokenized_dataset = self.create_dataset(
  File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/geneformer/tokenizer.py", line 330, in create_dataset
    output_dataset = Dataset.from_dict(dataset_dict)
  File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 910, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/table.py", line 799, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow/table.pxi", line 3725, in pyarrow.lib.Table.from_pydict
  File "pyarrow/table.pxi", line 5254, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 350, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/arrow_writer.py", line 187, in arrow_array
    out = list_of_np_array_to_pyarrow_listarray(data)
  File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/features/features.py", line 1428, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
  File "/home/jiaming/miniconda3/envs/geneformer/lib/python3.10/site-packages/datasets/features/features.py", line 1420, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
  File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2147486084 too large to fit in C integer type
I believe the original Geneformer dataset is larger than mine. Do you have any suggestions for solving this issue? Thanks in advance for your help.
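For context, my reading of the traceback (an assumption, not a confirmed diagnosis) is that the list offsets are cast to `pa.int32()`, so the running total of tokens across all cells cannot exceed 2^31 - 1 in a single array. The sketch below only illustrates that arithmetic and one possible way to split the cells into chunks that stay under the limit. The helper `chunk_bounds` is a hypothetical name of my own, not part of Geneformer, `datasets`, or `pyarrow`.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed offset can hold


def chunk_bounds(lengths, limit=INT32_MAX):
    """Yield (start, stop) index pairs such that the summed lengths of the
    items in each chunk stay at or below `limit`.

    `lengths` is a list with the number of tokens per cell (hypothetical input).
    """
    start, total = 0, 0
    for i, n in enumerate(lengths):
        if n > limit:
            raise ValueError(f"item {i} alone exceeds the limit")
        if total + n > limit:
            # Close the current chunk before this item and start a new one.
            yield start, i
            start, total = i, 0
        total += n
    if start < len(lengths):
        yield start, len(lengths)


# The value reported in the error is just past the 32-bit limit.
print(2147486084 > INT32_MAX)   # True
print(2147486084 - INT32_MAX)   # 2437
```

Each `(start, stop)` slice could then be passed to `Dataset.from_dict` separately and the pieces joined with `datasets.concatenate_datasets`. I have not verified that this avoids the error in this setup.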