The Vuk'uzenzele South African Multilingual Corpus

Github: https://github.com/dsfsi/vukuzenzele-nlp/

Zenodo: https://doi.org/10.5281/zenodo.7598539

Arxiv Preprint: arXiv

Give Feedback 📑: DSFSI Resource Feedback Form

About dataset

The dataset contains editions of the South African government magazine Vuk'uzenzele, which is produced by the Government Communication and Information System (GCIS). The text was scraped from PDFs obtained from the Vuk'uzenzele website; the PDFs are stored in the data/raw folder.

The datasets contain government magazine editions in 11 languages, namely:

| Language | Code | Language | Code |
|---|---|---|---|
| English | (eng) | Sepedi | (sep) |
| Afrikaans | (afr) | Setswana | (tsn) |
| isiNdebele | (nbl) | Siswati | (ssw) |
| isiXhosa | (xho) | Tshivenda | (ven) |
| isiZulu | (zul) | Xitsonga | (tso) |
| Sesotho | (nso) | | |
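As a quick sanity check on the language list, the sketch below enumerates every unordered pair of the 11 codes. It uses the codes exactly as they are written in this README; the names `LANGUAGES` and `language_pairs` are illustrative and not part of the repository. Eleven languages give 55 unordered pairs, which is the same as the number of rows in the aligned-pairs table in this README.

```python
from itertools import combinations

# Language codes copied from the table in this README.
# They are the repository's own labels, used here as-is.
LANGUAGES = {
    "eng": "English",
    "afr": "Afrikaans",
    "nbl": "isiNdebele",
    "xho": "isiXhosa",
    "zul": "isiZulu",
    "nso": "Sesotho",
    "sep": "Sepedi",
    "tsn": "Setswana",
    "ssw": "Siswati",
    "ven": "Tshivenda",
    "tso": "Xitsonga",
}


def language_pairs(codes):
    """Return every unordered pair of codes, with each pair in alphabetical order."""
    return list(combinations(sorted(codes), 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))  # number of unordered pairs for 11 languages
```

Sorting the codes first means each pair comes out as (smaller, larger) alphabetically, for example `("afr", "eng")`.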

Number of Aligned Pairs with Cosine Similarity Score >= 0.65

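| src_lang | trg_lang | num_aligned_pairs |
|---|---|---|
| ssw | xho | 2202 |
| ssw | zul | 2183 |
| xho | zul | 2102 |
| nso | xho | 2081 |
| nso | tso | 2071 |
| ssw | tso | 2034 |
| nso | ssw | 2021 |
| tsn | tso | 2020 |
| tsn | xho | 2009 |
| tso | xho | 2009 |
| nso | tsn | 2002 |
| ssw | tsn | 1987 |
| tso | zul | 1957 |
| nso | zul | 1953 |
| tsn | zul | 1933 |
| eng | zul | 1923 |
| eng | tso | 1923 |
| eng | nso | 1867 |
| eng | ssw | 1821 |
| afr | xho | 1816 |
| eng | xho | 1801 |
| nbl | sep | 1795 |
| sep | ven | 1794 |
| afr | ssw | 1783 |
| eng | tsn | 1772 |
| afr | zul | 1769 |
| afr | nso | 1746 |
| nbl | ven | 1699 |
| afr | eng | 1661 |
| afr | tsn | 1631 |
| afr | tso | 1617 |
| afr | sep | 551 |
| afr | ven | 498 |
| afr | nbl | 491 |
| nso | sep | 410 |
| nso | ven | 352 |
| sep | tso | 326 |
| sep | tsn | 319 |
| tso | ven | 307 |
| sep | ssw | 305 |
| sep | xho | 300 |
| ssw | ven | 290 |
| tsn | ven | 285 |
| nbl | ssw | 282 |
| nbl | nso | 266 |
| ven | xho | 260 |
| eng | sep | 258 |
| nbl | xho | 250 |
| sep | zul | 249 |
| nbl | tso | 238 |
| eng | ven | 234 |
| nbl | tsn | 230 |
| nbl | zul | 226 |
| ven | zul | 225 |
| eng | nbl | 184 |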
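The counts above include only sentence pairs whose cosine similarity score is at least 0.65. The following is a minimal sketch of that kind of threshold filter, not the repository's actual alignment code. It assumes each candidate pair already comes with two sentence embedding vectors (in the real pipeline these would come from a sentence encoder); the function names and the toy vectors are hypothetical.

```python
import math

THRESHOLD = 0.65  # cut-off quoted in the table heading above


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_aligned(candidates, threshold=THRESHOLD):
    """Keep candidate pairs whose embedding similarity is at or above the threshold.

    `candidates` is an iterable of (src_sentence, trg_sentence, src_vec, trg_vec);
    this tuple layout is an assumption made for the example.
    """
    kept = []
    for src, trg, src_vec, trg_vec in candidates:
        score = cosine_similarity(src_vec, trg_vec)
        if score >= threshold:
            kept.append((src, trg, score))
    return kept


# Toy vectors standing in for real sentence embeddings.
toy_candidates = [
    ("sentence A", "sentence A'", [1.0, 0.0], [1.0, 0.0]),  # identical direction
    ("sentence B", "sentence B'", [1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
aligned = keep_aligned(toy_candidates)
```

With the toy input, only the first pair survives the filter, because orthogonal vectors have a similarity of zero.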

The dataset is present in several forms in the repo. In general it is split by edition, e.g. 2020-01-ed1.
The data directory is organised as follows:

./data
├── external                # Data external to this repo
├── interim                 # Intermediate data (appears to be a stage between raw and processed)
├── processed               # The data from scraping the raw pdfs
├── raw                     # The raw pdfs of the Vuk'uzenzele magazine
├── sentence_align_output   # The output (csv) of the sentence alignment with LASER language encoders
└── simple_align_output     # The output (csv) of a simple one to one sentence alignment

The dataset is split by edition in the data/processed folder.
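Edition identifiers such as 2020-01-ed1 can be parsed into year, month, and edition number, which makes it easy to sort or filter editions. The sketch below assumes that each edition appears as an entry (file or folder) in data/processed whose name, without any file extension, is the edition identifier. That layout is an assumption, and `parse_edition` and `list_editions` are hypothetical helper names.

```python
import re
from pathlib import Path

# Matches identifiers like "2020-01-ed1": four-digit year, two-digit month, edition number.
EDITION_RE = re.compile(r"^(\d{4})-(\d{2})-ed(\d+)$")


def parse_edition(name):
    """Return (year, month, edition_number) for an edition identifier, or None if it does not match."""
    match = EDITION_RE.match(name)
    if match is None:
        return None
    year, month, number = match.groups()
    return int(year), int(month), int(number)


def list_editions(processed_dir):
    """Return the sorted, de-duplicated editions found directly under `processed_dir`.

    Assumes entry names (minus any extension) are edition identifiers;
    returns an empty list if the directory does not exist.
    """
    root = Path(processed_dir)
    if not root.is_dir():
        return []
    editions = set()
    for entry in root.iterdir():
        parsed = parse_edition(entry.stem)
        if parsed is not None:
            editions.add(parsed)
    return sorted(editions)


print(parse_edition("2020-01-ed1"))
```

Calling `list_editions("data/processed")` from the repository root would then give the editions in chronological order, provided the folder follows the assumed naming.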

Disclaimer

This dataset contains machine-readable data extracted from PDF documents, from https://www.vukuzenzele.gov.za/, provided by the Government Communication and Information System (GCIS). While efforts were made to ensure the accuracy and completeness of this data, there may be errors or discrepancies between the original publications and this dataset. No warranties, guarantees or representations are given in relation to the information contained in the dataset. The members of the Data Science for Societal Impact Research Group bear no responsibility and/or liability for any such errors or discrepancies in this dataset. The Government Communication and Information System (GCIS) bears no responsibility and/or liability for any such errors or discrepancies in this dataset. It is recommended that users verify all information contained herein before making decisions based upon this information.

Authors

  • Vukosi Marivate - @vukosi
  • Andani Madodonga
  • Daniel Njini
  • Richard Lastrucci
  • Isheanesu Dzingirai
  • Jenalea Rajab

Citation

Paper

Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora

@inproceedings{lastrucci-etal-2023-preparing,
    title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
    author = "Richard Lastrucci and Isheanesu Dzingirai and Jenalea Rajab and Andani Madodonga and Matimba Shingange and Daniel Njini and Vukosi Marivate",
    booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.rail-1.3",
    pages = "18--25"
}

Dataset

Vukosi Marivate, Andani Madodonga, Daniel Njini, Richard Lastrucci, Isheanesu Dzingirai, Jenalea Rajab. The Vuk'uzenzele South African Multilingual Corpus, 2023

@dataset{marivate_vukosi_2023_7598540,
    author = {Marivate, Vukosi and Njini, Daniel and Madodonga, Andani and Lastrucci, Richard and Dzingirai, Isheanesu and Rajab, Jenalea},
    title = {The Vuk'uzenzele South African Multilingual Corpus},
    month = feb,
    year = 2023,
    publisher = {Zenodo},
    doi = {10.5281/zenodo.7598539},
    url = {https://doi.org/10.5281/zenodo.7598539}
}

Licences
