Add support for Multilingual LibriSpeech dataset #92

Merged (2 commits) on Dec 31, 2020


monatis (Contributor) commented Dec 31, 2020

I added a script that downloads the MLS (Multilingual LibriSpeech) dataset for a given language and prepares its transcripts.
Example usage:

python ./scripts/create_mls_dataset.py --help
usage: create_mls_trans.py [-h] [--dataset-home DATASET_HOME] --language
                           {dutch,english,german,french,italian,portuguese,polish,spanish}
                           [--opus]

Download and prepare MLS dataset in a given language

optional arguments:
  -h, --help            show this help message and exit
  --dataset-home DATASET_HOME, -d DATASET_HOME
                        Path to home directory to download and prepare
                        dataset. Default to ~/.keras
  --language {dutch,english,german,french,italian,portuguese,polish,spanish}, -l {dutch,english,german,french,italian,portuguese,polish,spanish}
                        Any name of language included in MLS
  --opus                Whether to use dataset in opus format or not
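
For readers who want to see how a command-line interface like this is declared, here is a minimal sketch using the standard-library `argparse` module. It is reconstructed only from the help text above and is not the script's actual source; the names `LANGUAGES` and `build_parser` are hypothetical.

```python
import argparse
import os

# Language choices copied from the help output above.
LANGUAGES = [
    "dutch", "english", "german", "french",
    "italian", "portuguese", "polish", "spanish",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: declares the options listed in the help text."""
    parser = argparse.ArgumentParser(
        description="Download and prepare MLS dataset in a given language"
    )
    parser.add_argument(
        "--dataset-home", "-d",
        # Assumption: the "~/.keras" default is expanded to an absolute path.
        default=os.path.join(os.path.expanduser("~"), ".keras"),
        help="Path to home directory to download and prepare dataset.",
    )
    parser.add_argument(
        "--language", "-l",
        required=True,
        choices=LANGUAGES,
        help="Any name of language included in MLS",
    )
    parser.add_argument(
        "--opus",
        action="store_true",
        help="Whether to use dataset in opus format or not",
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["--language", "german", "--opus"])
print(args.language, args.opus, args.dataset_home)
```

Because `--language` is declared with `choices`, argparse rejects any value outside the eight listed languages before the script does any downloading.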

nglehuy (Collaborator) commented Dec 31, 2020

Thanks, @monatis
The blank index in CharacterFeaturizer is either 0 or num_classes - 1, and it isn't read from the file, so you don't need to add an extra \n when creating the character file.
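
To illustrate the point about the blank index, here is a minimal sketch of how a character vocabulary can reserve index 0 or num_classes - 1 for the blank without the blank ever appearing in the character file. This is not the CharacterFeaturizer implementation; the function `build_char_index` and its `blank_at_zero` parameter are hypothetical.

```python
def build_char_index(chars, blank_at_zero=True):
    """Hypothetical helper: map characters to indices, reserving one for blank.

    `chars` holds only the real characters from the character file;
    the blank is added here and is never read from the file.
    """
    num_classes = len(chars) + 1  # real characters plus the blank
    # Shift real characters up by one when the blank occupies index 0.
    offset = 1 if blank_at_zero else 0
    token_to_index = {ch: i + offset for i, ch in enumerate(chars)}
    blank = 0 if blank_at_zero else num_classes - 1
    return token_to_index, blank, num_classes


chars = ["a", "b", "c"]
print(build_char_index(chars, blank_at_zero=True))
print(build_char_index(chars, blank_at_zero=False))
```

Either way the file lists only the real characters, which is why a placeholder line for the blank is unnecessary.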

monatis (Contributor, Author) commented Dec 31, 2020

Thanks for the info, fixed it. I was just confused by other implementations.

nglehuy merged commit b131a7c into TensorSpeech:main on Dec 31, 2020