The word order of the data contents is missing #1
Comments
Hi @CarlKilhart, thank you for your interest in our work! The current For example, suppose vocabulary is Did I answer your question clearly? Or do you mean that you need the original raw content of the documents (including stop words, punctuation, etc.)?
Maybe you uploaded a wrong version of
Hi @CarlKilhart, thank you for the reminder! The current datasets indeed don't have word order. But my model doesn't use word order for training either, so the current datasets are still valid and correct for reproducing the results in the paper. I just processed the datasets again to obtain the word order. You can download all 5 datasets with word order, including the Web dataset, using the Google Drive link below: https://drive.google.com/file/d/10sGsStbutM-e1XfM8uDwP354YcXdpmgj/view?usp=sharing Please note that for the Aminer dataset, I forgot how I preprocessed it last year. I recently rewrote the preprocessing code to produce the Aminer dataset, so the new version may deviate somewhat from the one uploaded to the GitHub repo.
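To illustrate why losing word order need not affect a model that ignores it, here is a minimal sketch. It assumes the model consumes only per-document word counts (a bag-of-words representation), which this thread suggests but does not spell out; the function name `bag_of_words` and the toy documents are hypothetical and not taken from the repository.

```python
from collections import Counter


def bag_of_words(doc):
    """Map a document (a list of word IDs) to its word-count representation."""
    return Counter(doc)


# Toy document in its original reading order (illustrative only).
original_order = [5, 2, 9, 2, 7]
# The same document with its word IDs sorted, as in the datasets discussed here.
sorted_by_id = sorted(original_order)

# The two versions differ as sequences but yield identical counts.
same_counts = bag_of_words(original_order) == bag_of_words(sorted_by_id)
```

Under that assumption, any statistic computed from counts alone is unchanged by reordering the words within a document.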
Hi @CarlKilhart, did I clearly answer your question? If you have no more questions, may I close this issue?
The description in the README indicates that the word order of the contents is included in the data, but it turns out that the words are ordered by their ID in the vocab, not by their real order in the original texts. Could you offer a version with the correct word order?
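A quick way to check for the symptom described above is to measure how many documents have non-decreasing word IDs. The sketch below assumes each document is already loaded as a list of integer word IDs; the actual file format of the datasets is not shown in this thread, and `fraction_sorted_by_id` is a hypothetical helper, not part of the repository.

```python
def fraction_sorted_by_id(docs):
    """Return the fraction of multi-word documents whose word IDs never decrease.

    A value close to 1.0 suggests the words were ordered by vocab ID
    rather than by their position in the original text.
    """
    multi = [d for d in docs if len(d) > 1]  # one-word docs are trivially sorted
    if not multi:
        return 0.0
    n_sorted = sum(all(a <= b for a, b in zip(d, d[1:])) for d in multi)
    return n_sorted / len(multi)


# Toy data (illustrative only, not from the datasets).
natural_order = [[5, 2, 9, 2], [7, 1, 3]]
id_order = [sorted(d) for d in natural_order]

frac_natural = fraction_sorted_by_id(natural_order)
frac_id = fraction_sorted_by_id(id_order)
```

Short documents can be sorted by chance, so the fraction is only meaningful over a reasonably large sample.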