Suggests subword sizes for fastText language models using character n-gram frequency analysis.
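As a rough illustration of what a subword size means, the sketch below lists the character n-grams of a single word for a given `-minn`/`-maxn` range. It assumes fastText's convention of wrapping each word in `<` and `>` boundary markers and skipping n-grams that consist of a lone marker. The helper name `char_ngrams` is hypothetical and is not part of this repository, and the sketch works on Unicode code points rather than reproducing fastText's byte-level handling.

```python
def char_ngrams(word, minn, maxn):
    """Return the character n-grams of `word` for sizes minn..maxn (inclusive)."""
    wrapped = f"<{word}>"  # boundary markers, as fastText adds around each word
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            end = start + n
            if end > len(wrapped):
                break  # longer n-grams would run past the end of the word
            # Skip unigrams that are only a boundary marker.
            if n == 1 and (start == 0 or end == len(wrapped)):
                continue
            ngrams.append(wrapped[start:end])
    return ngrams


if __name__ == "__main__":
    print(char_ngrams("cat", 1, 3))
```

With `minn` and `maxn` both set to 6, a short word such as `cat` yields no n-grams at all, so only its whole-word vector would contribute.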
To suggest subword sizes for one or more languages, use the suggest_subword_sizes.sh tool:
$ git clone https://github.com/MIR-MU/fasttext-optimizer.git --recurse-submodules
$ cd fasttext-optimizer
$ TMPDIR=/var/tmp ./suggest_subword_sizes.sh en de cs it
Suggested subword sizes for en: -minn 1 -maxn 5 (4.52% n-gram coverage)
Suggested subword sizes for de: -minn 6 -maxn 6 (4.19% n-gram coverage)
Suggested subword sizes for cs: -minn 1 -maxn 4 (3.28% n-gram coverage)
Suggested subword sizes for it: -minn 6 -maxn 6 (5.11% n-gram coverage)
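If you want to reuse the suggestions in your own training scripts, the output lines above can be parsed mechanically. The following is a minimal sketch that assumes the output format is exactly as shown. The names `parse_suggestions` and `fasttext_args` are hypothetical helpers and are not provided by this repository.

```python
import re

# Matches lines such as:
# Suggested subword sizes for en: -minn 1 -maxn 5 (4.52% n-gram coverage)
LINE_RE = re.compile(
    r"Suggested subword sizes for (?P<lang>[\w-]+): "
    r"-minn (?P<minn>\d+) -maxn (?P<maxn>\d+) "
    r"\((?P<coverage>[\d.]+)% n-gram coverage\)"
)


def parse_suggestions(output):
    """Map each language code to its suggested minn, maxn and coverage."""
    suggestions = {}
    for line in output.splitlines():
        match = LINE_RE.search(line)
        if match:  # ignore prompts and any other unrelated lines
            suggestions[match["lang"]] = {
                "minn": int(match["minn"]),
                "maxn": int(match["maxn"]),
                "coverage_percent": float(match["coverage"]),
            }
    return suggestions


def fasttext_args(sizes):
    """Build the -minn/-maxn arguments for the fastText command line."""
    return ["-minn", str(sizes["minn"]), "-maxn", str(sizes["maxn"])]


if __name__ == "__main__":
    sample = (
        "Suggested subword sizes for en: -minn 1 -maxn 5 (4.52% n-gram coverage)\n"
        "Suggested subword sizes for de: -minn 6 -maxn 6 (4.19% n-gram coverage)\n"
    )
    for lang, sizes in parse_suggestions(sample).items():
        print(lang, " ".join(fasttext_args(sizes)))
```

The resulting argument list can be appended to a `fasttext skipgram` or `fasttext cbow` invocation. The train_fasttext_models.sh tool described below covers the common case without any parsing.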
To suggest subword sizes from Python instead, see our Python tutorial.
To train one or more fastText models with the suggested subword sizes, use the train_fasttext_models.sh tool:
$ git clone https://github.com/MIR-MU/fasttext-optimizer.git --recurse-submodules
$ cd fasttext-optimizer
$ TMPDIR=/var/tmp ./train_fasttext_models.sh cs de es fr
$ ls data/wikimedia/wiki.{cs,de,es,fr}.{default,suggested}.{bin,vec}
data/wikimedia/wiki.cs.default.bin data/wikimedia/wiki.cs.default.vec
data/wikimedia/wiki.cs.suggested.bin data/wikimedia/wiki.cs.suggested.vec
data/wikimedia/wiki.de.default.bin data/wikimedia/wiki.de.default.vec
data/wikimedia/wiki.de.suggested.bin data/wikimedia/wiki.de.suggested.vec
data/wikimedia/wiki.es.default.bin data/wikimedia/wiki.es.default.vec
data/wikimedia/wiki.es.suggested.bin data/wikimedia/wiki.es.suggested.vec
data/wikimedia/wiki.fr.default.bin data/wikimedia/wiki.fr.default.vec
data/wikimedia/wiki.fr.suggested.bin data/wikimedia/wiki.fr.suggested.vec
To use suggested subword sizes as a measure of distance between languages, and to see how this measure correlates with other language distance measures, see our Python tutorial.