The word order of the data contents is missing #1

CarlKilhart · 2023-10-21T08:48:12Z

The description in the README indicates that the word order of the contents is included in the data, but it turns out that the words are ordered by their ID in the vocab, not by their real order in the original texts. Could you offer a version with the correct word order?

cezhang01 · 2023-10-22T18:07:51Z

Hi @CarlKilhart ,

Thank you for your interest in our work!

The current contents.txt contains the word sequences after preprocessing: we removed stop words, punctuation, and other meaningless words. The vocabulary contains the remaining words, and contents.txt stores each document as the sequence of these remaining words, kept in the order in which they appear in the original raw content.

For example, suppose the vocabulary is [welcome, new, york, best, city, ...] and the original raw content is "welcome to the best city new york!". After preprocessing, we have [0, 3, 4, 1, 2] for this document. Here "to" and "the" are removed because they are stop words, but the remaining 5 words (welcome, best, city, new, york) are still in the same order as in the original raw content.
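The example above can be sketched in a few lines of Python. This is only an illustration, not the actual preprocessing code used for the datasets: the stop-word set, the whitespace tokenization, and the `encode` helper are all assumptions made for this sketch.

```python
import string

# Hypothetical vocabulary and stop-word set, taken from the example above.
vocab = ["welcome", "new", "york", "best", "city"]
stop_words = {"to", "the"}
word_to_id = {word: idx for idx, word in enumerate(vocab)}


def encode(raw_text):
    """Map raw text to vocabulary IDs while keeping the original word order."""
    ids = []
    for token in raw_text.lower().split():
        # Drop surrounding punctuation such as the trailing "!" in "york!".
        token = token.strip(string.punctuation)
        if token and token not in stop_words and token in word_to_id:
            ids.append(word_to_id[token])
    return ids


encoded = encode("welcome to the best city new york!")
print(encoded)  # [0, 3, 4, 1, 2]
```

The IDs come out in document order rather than sorted order, which is the property in question here.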

Does this answer your question? Or do you mean that you need the original raw content of the documents (including stop words, punctuation, etc.)?

CarlKilhart · 2023-10-23T01:47:36Z

Maybe you uploaded the wrong version of contents.txt? Taking the first row of the ml dataset as an example, 3 16 17 28 34 36 39 45 46 85 111 150 150 151 192 192 192 192 200 201 217 218 269 306 328 351 377 476 477 488 507 623 723 762 898 947 1270 1347 1494 1587 1697 is clearly ordered by word ID, not by the sequence of the words in the original raw content. I would appreciate it if you could check the data files.
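A quick way to check a row, as a rough sketch: test whether its IDs are non-decreasing. The `is_sorted_by_id` helper is a name made up for this sketch, and it assumes each row of contents.txt is a whitespace-separated list of integer word IDs, as in the row quoted above.

```python
def is_sorted_by_id(row):
    """Return True if the whitespace-separated IDs in a row never decrease."""
    ids = [int(token) for token in row.split()]
    return all(a <= b for a, b in zip(ids, ids[1:]))


# First row of the ml dataset, as quoted above.
first_row = (
    "3 16 17 28 34 36 39 45 46 85 111 150 150 151 192 192 192 192 200 201 "
    "217 218 269 306 328 351 377 476 477 488 507 623 723 762 898 947 1270 "
    "1347 1494 1587 1697"
)

print(is_sorted_by_id(first_row))    # True
print(is_sorted_by_id("0 3 4 1 2"))  # False
```

A row that keeps the original word order would almost never pass this check, especially with 41 tokens.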

cezhang01 · 2023-10-26T00:05:24Z

Hi @CarlKilhart ,

Thank you for the reminder!

The current datasets indeed don't have word order. However, my model doesn't use word order for training, so the current datasets are still valid and correct for reproducing the results in the paper.
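To illustrate why the ordering would not matter in that case, here is a small sketch. It assumes the model only consumes per-document word counts (a bag-of-words view); this thread only states that word order is not used, so the exact input representation is an assumption.

```python
from collections import Counter

# The same document in original word order and sorted by word ID.
in_original_order = [0, 3, 4, 1, 2]
sorted_by_id = sorted(in_original_order)

# Word counts ignore position, so both versions give identical counts.
counts_original = Counter(in_original_order)
counts_sorted = Counter(sorted_by_id)
print(counts_original == counts_sorted)  # True
```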

I just processed the datasets again to obtain the word order. You can download all 5 datasets with word order, including the Web dataset, from the Google Drive link below: https://drive.google.com/file/d/10sGsStbutM-e1XfM8uDwP354YcXdpmgj/view?usp=sharing

Please note that I no longer remember exactly how I preprocessed the Aminer dataset last year. I recently rewrote the preprocessing code to produce it, so the new version may deviate somewhat from the one uploaded to the GitHub repo.

cezhang01 · 2023-10-28T21:36:56Z

Hi @CarlKilhart ,

Did this answer your question? If you have no more questions, could I close this issue?
