Somali Corpus

The Somali Corpus, also known as Kaydka Af Soomaaliga (KAF), is a digital collection of texts in Somali, a language spoken in Greater Somalia, Ethiopia, and Kenya. Jama Musse Jama started it in 2016 as part of his doctoral dissertation, with an initial 3 million words of Somali literature and language material. The corpus currently contains over 7 million words, drawn mainly from literature, poetry, songs, news, essays, and political speeches. This makes it one of the most extensive African-language corpora in its range of text types and an important addition to the online materials available for under-resourced languages. The words in the corpus are tagged with part-of-speech categories. The corpus can be used to derive frequency lists of Somali words, and it also serves as the basis for an online Somali spell checker.
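As an illustration of how a frequency list can be derived from corpus text, the following is a minimal Python sketch. It is not the method used by the Somali Corpus itself: the function name `frequency_list`, the regex-based tokenization, and the short sample string are all assumptions made for the example. The tokenizer keeps a word-internal straight apostrophe so that forms such as la'aan count as a single word.

```python
import re
from collections import Counter

# Hypothetical helper: one letter run, optionally continued by
# apostrophe + letters (so "la'aan" stays a single token).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def frequency_list(text, top_n=None):
    """Return (word, count) pairs sorted from most to least frequent."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return Counter(tokens).most_common(top_n)


# Illustrative placeholder text, not taken from the corpus.
sample = "Aqoon la'aan waa iftiin la'aan. Aqoon waa iftiin, waa nolol."

freq = frequency_list(sample)
for word, count in freq:
    print(f"{word}\t{count}")
```

On a real corpus the same counting step would be applied to the full text, and the part-of-speech tags could be counted alongside each word if the tagged data is available.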
Other Somali language corpora
* Bangiga Af Soomaaliga, 79.7 million tokens (as of Oct 2024), at the Swedish Language Bank, University of Gothenburg.
* Somali Web Corpus, 18.9 million tokens (as of Oct 2024), at NLP Center, Brno University, Czech Republic, in cooperation with Oslo and Addis Ababa.