A curated list of awesome resources for the Kurdish language, including tools, libraries, datasets, and works in the computer science field to help the academic advancement of the language.
Kurdish is an Indo-European language spoken by the Kurdish people. This repository aims to gather significant resources related to the Kurdish language in the fields of computer science, linguistics, and technology.
- Language Resources
- Natural Language Processing
- Academic Research
- Programming Resources
- Communities and Organizations
- Contributing
- License
- Glosbe - Glosbe can be used as a platform for translating Kurdish (Kurmanji) into various languages.
- Ferheng Kurdi - Ferheng Kurdi can be used as a platform for translating Kurdish (Kurmanji) into various languages.
- Ferhengco - Ferheng.co is a Kurdish (Kurmanji) to Turkish dictionary.
- Roj Dictionary - Roj Dictionary is an English to Kurdish (Sorani) Dictionary.
- Data.krd - DataKrd is dedicated to making Kurdish datasets accessible to all.
- Kurdish Songs - More than 5,000 Kurdish (Kurmanji) songs with their artists.
- Kurdish (Kurmanji) Proverbs - Short Kurdish (Kurmanji) proverbs that can be used for testing purposes.
- Kurdish Words - More than 60,000 Kurdish (Kurmanji) words collected from different sources.
- Sorani Tweet Sent Dataset (STSD) - 24,668 labeled tweets in Kurdish (Sorani).
- Kurdish News Dataset Headlines (KNDH) through Multiclass Classification - Kurdish Sorani texts labeled automatically for NLP.
- Medical Sentiment Analysis Dataset for Kurdish Short Text over Social Media - Sorani Kurdish texts labeled as medical and non-medical.
- A Dataset for the Classification of Different Kurdish Dialects - A 6,000-sample dataset for Kurdish dialect recognition.
- The Dialects of Kurdish - A database of Kurdish dialects.
- A Dataset for the Classification of Different Kurdish Dialects - A dataset of 6,000 one-second audio samples for Kurdish dialect recognition.
- Language: Northern Kurdish (Kurmanji) - The Northern Kurdish (Kurmanji) DoReCo dataset contains annotated grammatical texts.
- kurdish-turkish-bianet-magazine - Kurdish magazine dataset.
- Vejinbooks-Poem-Dataset - A dataset of 1154 Central Kurdish poems with meter and form tags extracted from vejinbooks.com.
- kurdish-twitter-data - Kurdish Twitter data repository for the Kurmanji and Sorani dialects.
- Kurdish News Summarization Dataset - The KNSD contains 130,000 Kurdish news articles and headlines for news summarization.
- Mendeley Database - Database
- 50Languages - A free website used for learning Kurdish.
- Character Convertor - A Kurdish language library for converting characters and digits in Persian, English, and Arabic to Kurdish and vice versa.
- KurdishHunspell - A morphological analyzer and spell checker for Kurdish in Hunspell.
- kurdinusLibrary - Kurdînûs is a set of pure JavaScript tools for Kurdish-language texts.
- Kurdish number to words - Converts numbers (including decimals) into words for the Central Kurdish language. It also converts numbers into words for currency amounts.
- Kurdish-BLARK - This project consists of a set of basic tools developed in Python 2.7 as part of the Kurdish BLARK project and a corpus for the Kurmanji and Sorani dialects of Kurdish. The tools include a transliterator, tokenizer, stemmer, word-level translator using a bidialectal dictionary, proper names recognizer, and utilities for building and sorting dictionaries.
- KurdishTokenization - A Tokenization System for the Kurdish Language (Sorani & Kurmanji dialects).
- kurdish-llama - This is an attempt to fine-tune the Llama model released by Meta for Central Kurdish. The initial model was then fine-tuned on a set of instructions provided by Stanford's Alpaca project.
- Kurdish Language Processing Toolkit - The Kurdish Language Processing Toolkit (KLPT) is a natural language processing (NLP) toolkit in Python for the Kurdish language. The current version comes with four core modules, namely preprocess, stem, transliterate, and tokenize, and addresses basic language processing tasks such as text preprocessing, stemming, tokenization, spell-checking, and morphological analysis for the Sorani and Kurmanji dialects of Kurdish.
- kurdi - Various Kurdish-related work done by Kurdish developers.
- kurdish_news - Kurdish News sources.
- AI2001_Category-Linguistics-SC-Kurdish - linguistic:Kurdish category for AI2001, containing Kurdish language linguistic datasets.
(This section is intentionally left empty.)
(This section is intentionally left empty.)
(This section is intentionally left empty.)
- A Tokenization System for the Kurdish Language - This study proposes a lexicon and morphological analyzer-based approach for tokenizing the Sorani and Kurmanji dialects of Kurdish. On the annotated dataset developed for the study, the approach outperforms unsupervised methods.
- The Kurdish Language Corpus: State of the Art - This paper reviews Kurdish language corpora, highlighting challenges like scarce resources and lack of unified orthography, and emphasizes the need for annotated corpora to support machine-readable text and intelligent applications.
- KLPT – Kurdish Language Processing Toolkit - This paper introduces a Kurdish language processing toolkit to address the lack of basic tools for this under-resourced language, which includes components like tokenization, stemming, and lemmatization. The toolkit is extendable by future developers and is publicly available.
- KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis - This paper enhances Kurdish sentiment analysis using BERT, achieving better accuracy than traditional models.
- Iraqi Legal Gpt - This paper introduces a small AI chatbot for offline legal information in Iraqi law with 80% accuracy.
- Kurdish News Dataset Headlines (KNDH) through multiclass classification - KNDH is a dataset of 50,000 Kurdish news headlines across five categories for text classification.
- A Kurdish Sorani Twitter dataset for language modelling - This paper presents a Kurdish sentiment analysis dataset of 24,668 labeled tweets.
- Medical dataset classification for Kurdish short text over social media - The MKD contains 6,756 Facebook comments classified into medical and non-medical categories.
- Dataset for the recognition of Kurdish sound dialects - This paper presents a Kurdish dialect recognition dataset for improving speech recognition systems.
- KLPT – Kurdish Language Processing Toolkit - Toolkit article and video.
- Flutter Kurdish Localization - Localization support for Central Kurdish (Sorani; Kurdish: سۆرانی, Soranî).
(This section is intentionally left empty.)
Contributions are welcome! Please feel free to submit a pull request or open an issue to add new resources or suggest improvements.