This repository allows the creation of an eBook with text passages alternating between two languages for the purpose of language learning. Matching the two texts is done using vecalign, which in turn relies on Facebook's language-agnostic sentence representations (LASER).
It is recommended to clone the repository, create a virtual environment, and then install all requirements:

```shell
git clone https://github.com/pschonev/biBooks.git
cd biBooks
virtualenv -p python3 bibook_env
source bibook_env/bin/activate
```

Then install everything with:

```shell
pip install -e .[full]
```
Next, run `python src/download_models.py` to get the necessary LASER models.

Finally, install Calibre, which will be used to convert the generated HTML files to the desired eBook format.
The two books have to be provided as two files, each containing a list of sentences separated by newlines (see the books folder for examples). There are several tools available to split a text into sentences (sentence tokenization) and generate this output format (I used pySBD).
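To illustrate the expected one-sentence-per-line format, here is a naive stdlib splitter (this is not the pySBD API, just a rough stand-in; pySBD additionally handles abbreviations, ellipses, quotes and many language-specific edge cases):

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    """Very rough sentence splitter: breaks after ., ! or ? followed by
    whitespace. A real tokenizer such as pySBD is far more robust."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "Der Hund bellt. Die Katze schläft! Was nun?"
# One sentence per line, as the alignment step expects:
print("\n".join(naive_sentence_split(text)))
```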
Using these two files, you can simply run `python bilingual_books.py` and provide arguments either via a config file with `-c [config_path]` (see the configs folder for examples) or via the command line. The result is a finished eBook file.
The steps are described below. Note that steps 1-3 have examples in the notebooks folder, and steps 4-9 are done automatically when running `bilingual_books.py`.
1. Get the text data into two files, e.g. by web scraping (see examples in notebooks) or by converting an eBook to txt using Calibre: `ebook-convert [ebook_file] [output.txt]`
2. Clean the text data.
3. Run sentence tokenization, e.g. using pySBD.
4. Possible overlaps of n (e.g. 10) sentences are created with vecalign/overlap.py.
5. These overlapped sentences are embedded using LASER, making them comparable independently of their language.
6. All six files (original sentences, overlaps, and embeddings) are fed to the main vecalign algorithm to determine matching text passages.
7. The resulting alignment file indicates which lines of the sentence-tokenized original texts match each other. It is used to create a combined tab-separated (.tsv) file of matching text passages.
8. The .tsv file is converted into HTML and can be accompanied by a .css file for styling.
9. With Calibre installed, run `ebook-convert [HTML-file] [ebook-file]` to get an eBook file in the format of your choice (epub, mobi, etc.).
10. Finally, you may want to open the eBook in Calibre to fix remaining issues or add extras such as a cover picture.
Optional 6a: For Russian, use UDAR (https://github.com/reynoldsnlp/udar) to create a version of the sentence-tokenized source file with word stresses added, and use it instead of the unstressed source file.
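The overlap step (4) can be sketched as follows. This is only an illustration of the idea, not the actual vecalign/overlap.py script, which differs in details such as its output format; the function name `make_overlaps` is made up for this example:

```python
def make_overlaps(sentences: list[str], n: int = 10) -> list[str]:
    """For every start index, join 1..n consecutive sentences into one
    candidate segment. The aligner can then pick whichever combination of
    consecutive sentences matches best across the two languages."""
    overlaps = set()
    for i in range(len(sentences)):
        for size in range(1, n + 1):
            chunk = sentences[i : i + size]
            if len(chunk) == size:  # skip truncated windows at the end
                overlaps.add(" ".join(chunk))
    return sorted(overlaps)

sents = ["One.", "Two.", "Three."]
print(make_overlaps(sents, n=2))
```

Embedding all of these candidate segments with LASER (step 5) is what makes one-to-many and many-to-many passage matches possible.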
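Steps 7 and 8 can be sketched like this. The helper names are hypothetical and the sketch assumes the alignment file has already been parsed into pairs of source/target line-index lists; the real conversion script in this repository handles more cases:

```python
import html

def alignment_to_rows(align, src, tgt):
    """Turn parsed alignments (pairs of source and target line-index lists)
    into (source_passage, target_passage) string pairs."""
    rows = []
    for src_idx, tgt_idx in align:
        rows.append((" ".join(src[i] for i in src_idx),
                     " ".join(tgt[j] for j in tgt_idx)))
    return rows

def rows_to_html(rows):
    """Alternate the two languages passage by passage; the CSS classes can be
    styled by an accompanying .css file."""
    body = "\n".join(
        f'<p class="src">{html.escape(s)}</p>\n<p class="tgt">{html.escape(t)}</p>'
        for s, t in rows
    )
    return f"<html><body>\n{body}\n</body></html>"

src = ["Der Hund bellt.", "Die Katze schläft."]
tgt = ["The dog barks.", "The cat sleeps."]
align = [([0], [0]), ([1], [1])]
print(rows_to_html(alignment_to_rows(align, src, tgt)))
```

The resulting HTML file is what `ebook-convert` then turns into the final eBook.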
While researching how to handle this problem, I stumbled across this forum post from 2016 explaining how to do the same thing, except with hunalign instead of vecalign. When I tried hunalign, however, the results were very poor and the dictionary creation seemed tedious. Still, the forum post was helpful for my overall procedure, and it additionally linked to this helpful blog post where I found the HTML conversion script. So credit to the users slex and doviende, who also let me use his script.
- better handling of paragraphs (e.g. keep paragraphs together up to a certain length, or ensure a newline after a paragraph ends in the final document)
- dynamic layout inspired by Doppeltext
- automatically process files in eBook format (converting and cleaning newlines)
- create a MOBI dictionary from Wiktionary: https://github.com/nyg/wiktionary-to-kindle