Confusion about the amount of monolingual data used in the experiments #160

cbaziotis · 2020-07-20T19:36:02Z

Hi,

I would like to ask how much monolingual data was used in each experiment.

In the paper, as well as in this issue, you mention that you use...

... all of the monolingual data from WMT News Crawl datasets, which covers 190M, 62M and 270M sentences from the year 2007 to 2017 for English, French, German respectively.

However, in two other issues, "hyperparamters to reproduce paper result" #32 (comment) and "Has anybody pre-trained successfully on en-de translation with MASS ?" #62 (comment), you mention that you use a subsample (50 million sentences) of the full data.
In addition, in get-data-nmt.sh I see that the download links to the News Crawl data are commented out for many of the years, for each language.

I may have missed something or misread the issues, but I am confused about how much data you actually used. I would appreciate it if you could help clear up my confusion.

Thanks!

nxphi47 · 2020-09-18T13:07:43Z

Same issue. I am unable to tell which data was used to reproduce the experiments. Could you please specify exactly how you created the data for pre-training and fine-tuning, for en-fr, de-en and en-ro?
Thanks a lot.
@StillKeepTry

cbaziotis changed the title ~~Confusion about the amount of data use in the experiments~~ Confusion about the amount of monolingual data used in the experiments Jul 20, 2020
