Confusion about the amount of monolingual data used in the experiments #160

Open
cbaziotis opened this issue Jul 20, 2020 · 1 comment

@cbaziotis

Hi,

I would like to ask how much monolingual data was used in each experiment.

  1. In the paper, as well as in this issue, you mention that you use...

... all of the monolingual data from WMT News Crawl datasets, which covers 190M, 62M and 270M sentences from the year 2007 to 2017 for English, French, German respectively.

  2. In comments on two other issues, "hyperparamters to reproduce paper result" #32 (comment) and "Has anybody pre-trained successfully on en-de translation with MASS ?" #62 (comment), you mention that you use a subsample (50 million sentences) of the full data.
  3. In get-data-nmt.sh, I see that you have commented out the download links to the News Crawl data for many of the years for each language.

I may have missed something or misread the issues, but I am confused about how much data you actually used. I would appreciate it if you could help clear up my confusion.

Thanks!

@cbaziotis changed the title from "Confusion about the amount of data use in the experiments" to "Confusion about the amount of monolingual data used in the experiments" on Jul 20, 2020
@nxphi47

nxphi47 commented Sep 18, 2020

Same issue. I am unable to tell which data was used to reproduce the experiments. Could you please specify exactly how you created the data for pre-training and fine-tuning for en-fr, de-en and en-ro?
Thanks a lot.
@StillKeepTry
