I've tried a few ways to use tokenizers with DataSet and DataBundle objects, but have not been successful.
Basically, just trying to do this:
1. Initialize DataSet object `ds` with data.
2. Initialize DataBundle object with DataSet object `ds`.
3. Define tokenizer.
4. Associate tokenizer with a field in the DataSet or DataBundle object.
5. Hope to see the tokenizer work when batches of data are extracted from the DataSet object.
```python
from functools import partial

from fastNLP import DataSet
from fastNLP import Vocabulary
from fastNLP.io import DataBundle
from transformers import GPT2Tokenizer

data = {'idx': [0, 1, 2],
        'sentence': ["This is an apple .", "I like apples .",
                     "Apples are good for our health ."],
        'words': [['This', 'is', 'an', 'apple', '.'],
                  ['I', 'like', 'apples', '.'],
                  ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],
        'num': [5, 4, 7]}

dataset = DataSet(data)                                # Initialize DataSet object with data.
data_bundle = DataBundle(datasets={'train': dataset})  # Initialize DataBundle object.

# Define tokenizer:
tokenizer_in = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer_in.pad_token, tokenizer_in.padding_side = tokenizer_in.eos_token, 'left'
tokenizer_in_fn = partial(tokenizer_in.encode_plus, padding=True, return_attention_mask=True)
print(tokenizer_in_fn)  # Ensure that settings are as expected.

# Associate tokenizer with field:
data_bundle.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')

ds = data_bundle.get_dataset('train')
print(ds[0:3])
# Gives:
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | idx | sentence       | words          | num | input_ids      | attention_mask     | length |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# |  0  | This is an ... | ['This', 'i... |  5  | [1212, 318,... | [1, 1, 1, 1, 1]... |   5    |
# |  1  | I like appl... | ['I', 'like... |  4  | [40, 588, 2... | [1, 1, 1, 1]       |   4    |
# |  2  | Apples are ... | ['Apples', ... |  7  | [4677, 829,... | [1, 1, 1, 1, 1,... |   8    |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+

# Try to obtain batch data:
print(ds['sentence'].get([0, 1, 2]))   # okay, no problem.
print(ds['input_ids'].get([0, 1, 2]))  # throws exception.
```
Thanks for your report! The example code works at numpy 1.21.6, so you can temporarily avoid this problem by downgrading to numpy 1.21.6 (`pip install numpy==1.21.6`).
For more detail: even at numpy 1.21.6 we still receive a warning:
```
/remote-home/shxing/anaconda3/envs/fastnlp/lib/python3.7/site-packages/fastNLP/core/dataset/field.py:77: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return np.array(contents)
```
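This is the numpy behavior the warning refers to: arrays built from ragged nested sequences only trigger a deprecation warning on numpy 1.21.x, but raise a `ValueError` on numpy >= 1.24 unless `dtype=object` is passed. A minimal sketch, independent of fastNLP:

```python
import numpy as np

# Rows of different lengths, like the tokenized 'input_ids' field above:
ragged = [[0, 1, 2, 3, 4], [5, 6, 7, 8]]

print(np.array(ragged, dtype=object))  # fine on all versions: a 1-D object array of lists

try:
    np.array(ragged)  # numpy 1.21.x: VisibleDeprecationWarning; numpy >= 1.24: ValueError
except ValueError as err:
    print("numpy >= 1.24 raises:", err)
```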
Therefore, another solution is to change the source code at `fastNLP/core/dataset/field.py:77` from
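```python
return np.array(contents)
```

to what the warning itself suggests, i.e. specifying `dtype=object`:

```python
return np.array(contents, dtype=object)
```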
We're sorry that we didn't take different package versions into consideration, and we apologize for the inconvenience. We will discuss this problem and provide a better solution in a future version.
I've also tried associating the tokenizer with the DataSet object directly, but the same exception is encountered:
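A minimal sketch of that attempt, reusing `tokenizer_in_fn` and `dataset` from the example above (`DataSet.apply_field_more` being the per-dataset counterpart of the DataBundle call):

```python
# Apply the tokenizer to the DataSet directly rather than through the DataBundle:
dataset.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')
print(dataset['input_ids'].get([0, 1, 2]))  # raises the same exception
```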
Python 3.10, numpy 1.24.1 (are there other Python packages whose versions I need to be careful about?)