1. Initial raw files:

A. Initial Intercorp files (not in the project structure)

intercorp_en2cs
intercorp_en
intercorp_cs

A1. Intercorp files: broken down into smaller files, so that one file corresponds to just one book/named collection.

a. The output files for the 151 books used in the corpus extraction are in:

correspondences_intercorp_en2cs
intercorp_cs
intercorp_en

b. big Corpus:

Some of the named collections (_ACQUIS, _EUROPARL, _PRESSEUROP, _SUBTITLES, _SYNDICATE) were too big to be processed on my computer (8GB and 12GB of working memory were not enough) and were processed at a later stage. It turned out that they contained just 555 unique verb occurrences, of which only 5 did not occur in the books, so they were not taken into account further.

The classes used for breaking down the initial Intercorp files are:

RepairXml: splits intercorp_en2cs
SplitCorpusXmlData: splits intercorp_en and intercorp_cs

The files from the "big Corpus" were extracted manually; formatting mistakes in them were also corrected manually.

B. Vallex files

vallex-2.8.3-work.xml
get_aspects.xslt -> resulting file with all verbs (5098) and all 4 aspect labels (pf, impf, biasp, iter)

2. Getting a dictionary: "cs verb, aspect value -> [en translations]"

VallexGlosbeDictionary:

Class containing methods that create a VallexGlosbeDictionary by looking up Czech Vallex verbs in Glosbe and collecting the list of English translations. Once created, an instance of the VallexGlosbeDictionary can be used for further processing.

* input: "vallex\\vallex_aspectOutput.txt"
* output: "vallex/dictionary.csv" -> Czech verbs with aspect value and English translations

The dictionary has 4221 entries (after removing 2 aspectual values and homographs from Vallex); of these, only 2915 have a matching translation.

3. Container classes for the corpus processing

Verb
Sentence

4. Processing the parallel texts

CorpusParser - parses (pre-split) .xml files from intercorp_en, intercorp_cs
CorrespondenceParser - parses .xml files from correspondences_intercorp_en2cs
SentenceProcessor(filename_en2cs, filename_en, filename_cs)

Initializes a CorrespondenceParser(filename_en2cs) and two CorpusParsers, CorpusParser(filename_en) and CorpusParser(filename_cs), then gets/initializes the corresponding sentence pairs.

SentencePair(sentence_en, sentence_cs, outputFile)

Checks verb correspondences, i.e. whether a VallexGlosbeDictionary translation of a verb from the Czech verb list corresponds to a verb from the English verb list.
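The correspondence check described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the class name `VerbCorrespondenceSketch`, the `lemma|aspect` key format and the method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class VerbCorrespondenceSketch {

    // Hypothetical key format: Czech lemma and aspect label joined by '|'.
    static String key(String csLemma, String aspect) {
        return csLemma + "|" + aspect;
    }

    // Returns the English lemmas of the sentence that are listed in the
    // dictionary as translations of the given Czech verb.
    static List<String> matches(Map<String, List<String>> dictionary,
                                String csLemma, String aspect,
                                List<String> enLemmas) {
        List<String> translations =
                dictionary.getOrDefault(key(csLemma, aspect), Collections.emptyList());
        List<String> result = new ArrayList<>();
        for (String en : enLemmas) {
            if (translations.contains(en)) {
                result.add(en);
            }
        }
        return result;
    }
}
```

A sentence pair would then be written to the output only when `matches` returns a non-empty list for one of the Czech verbs in the sentence.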

Mainapp:

Instantiates a VallexGlosbeDictionary
- input: vallex/dictionary.csv

Initializes an instance of SentenceProcessor, which internally instantiates a corpus parser and a correspondence parser, processes the corpora and writes the corresponding sentence pairs to files in output_sentences via sentenceProcessor.getSentencePairs(output_filename).

* input: correspondences_intercorp_en2cs, intercorp_en, intercorp_cs

* output: output_sentences -> collection of sentence pairs, a file per book, a sentence pair per verb

of the form:

token_en, inf_en, token_cs, inf_cs, aspect, English sentence, Czech sentence

"come","come","prišel","prijít","pf"," Zarquon has come again! ”"," Zarquon znovu prišel! """
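The example row above is consistent with a common CSV convention: every field is wrapped in double quotes and a double quote inside a field is doubled (hence the `"""` at the end of the Czech sentence). A minimal sketch of writing such a row is shown below; the class and method names are hypothetical and the project's own writer may differ.

```java
import java.util.List;
import java.util.StringJoiner;

class CsvRowSketch {

    // Wraps a field in double quotes and doubles any embedded double quote.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the quoted fields with commas into one output row.
    static String toRow(List<String> fields) {
        StringJoiner joiner = new StringJoiner(",");
        for (String field : fields) {
            joiner.add(quote(field));
        }
        return joiner.toString();
    }
}
```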

5. Ordering the processed output (from 4.) and extracting sample sentences for annotation

OutputVerbDataDictionary:

Processes the <verb: verb data, sentence occurrence> data from all output files of the processed corpus (folder output_sentences) and builds a dictionary with one entry per verb: <verb : [data, all sentence occurrences]>

WriteProcessedOutput

Initializes an OutputVerbDataDictionary and uses it to select, for each verb occurring in the corpus data, the first two (in this case) sentences in which it occurs.

* input: folder output_sentences 
* output: in folder processed_output: selected_1.csv and selected_2.csv

selected_1.csv (2774 verbs/sentences) -> the first sentence for each verb key in the OutputVerbDataDictionary; used for the annotation

selected_2.csv (2374) -> the second sentence for each verb key; additional sentences, not used further

full_output.txt - all entries of OutputVerbDataDictionary

verb_keys.txt - unique verb occurrences

verb_keys_occurrenceNumber.csv -> verb - #occurrences

verbKeyAspect.csv -> English verb with the aspect value of the corresponding Czech Vallex verb
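The grouping and selection steps above can be sketched as follows. This is a simplified illustration with hypothetical names (`SelectionSketch`, `group`, `nth`), and each row is reduced to a verb key and a sentence. In this sketch a verb with too few occurrences is simply skipped, which would be consistent with selected_2.csv containing fewer entries than selected_1.csv.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SelectionSketch {

    // Groups rows of the form [verbKey, sentence] by verb key,
    // keeping verbs and sentences in order of first appearance.
    static Map<String, List<String>> group(List<String[]> rows) {
        Map<String, List<String>> byVerb = new LinkedHashMap<>();
        for (String[] row : rows) {
            byVerb.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return byVerb;
    }

    // Returns the sentence at position index (0-based) for every verb
    // that has at least index + 1 occurrences.
    static List<String> nth(Map<String, List<String>> byVerb, int index) {
        List<String> selected = new ArrayList<>();
        for (List<String> sentences : byVerb.values()) {
            if (sentences.size() > index) {
                selected.add(sentences.get(index));
            }
        }
        return selected;
    }
}
```

Here `nth(byVerb, 0)` would correspond to the content of selected_1.csv and `nth(byVerb, 1)` to selected_2.csv.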

PrepareFinalFiles:

Reads a selected_n.csv file (in processed_output) and writes .txt and .json files; each .txt file contains at most 200 sentences (easier to process with the annotation tool).

* input: in folder processed_output
* output: in folder final_files
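
Splitting the selected sentences into files of at most 200 entries can be sketched as below. The class and method names are hypothetical; only the batch size of 200 comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSketch {

    // Splits a list into consecutive chunks of at most size elements;
    // the last chunk may be shorter.
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            int end = Math.min(start + size, items.size());
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }
}
```

Each chunk returned by `chunk(sentences, 200)` would then be written to its own .txt file.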
