Towards English - Swahili Machine Translation

De Pauw, G; Wagacha, PW; de Schryver, Gilles-Maurice

dc.contributor.author	De Pauw, G
dc.contributor.author	Wagacha, PW
dc.contributor.author	de Schryver, Gilles-Maurice
dc.date.accessioned	2013-06-21T09:40:00Z
dc.date.available	2013-06-21T09:40:00Z
dc.date.issued	2011
dc.identifier.uri	http://mt-archive.info/MTMRL-2011-DePauw.pdf
dc.identifier.uri	http://erepository.uonbi.ac.ke:8080/xmlui/handle/123456789/37360
dc.description.abstract	Even though the Bantu language of Swahili is spoken by more than fifty million people in East and Central Africa, it is surprisingly resource-scarce from a language technological point of view, an unfortunate situation that holds for most, if not all languages on the continent. The increasing amount of digitally available, vernacular data has prompted researchers to investigate the applicability of corpus-based approaches to African language technology. In this vein, the SAWA corpus project attempts to collect and deploy a parallel corpus English - Swahili, not only for the straightforward purpose of developing a machine translation system, but also to investigate the possibility of projection of annotation into a resource-scarce, African language. Compiling a balanced and expansive parallel corpus English - Swahili is a rather daunting task. While monolingual Swahili data is abundantly available on the Internet, sourcing parallel texts is cumbersome. Even countries that have both English and Swahili as their official languages, such as Tanzania, Kenya and Uganda, do not tend to translate and/or publish all government documents bilingually. One therefore opportunistically collects whatever can be found in the public domain. At this point in the data collection phase, that means that the 2.2 million word parallel corpus is biased towards religious material, such as Bible and Quran translations. Nevertheless, the more interesting, secular part of the SAWA corpus (420k words) is steadily increasing, thanks to the inclusion of bilingual investment reports, manually translated movie subtitles, political documents and material kindly donated by local translators to the SAWA project. Each text in the SAWA corpus is automatically part-of-speech tagged and lemmatized, using the TreeTagger for the English part (Schmid, 1994) and the systems described in De Pauw et al. (2006) and De Pauw and de Schryver (2008) for Swahili.
These extra annotation layers allow us to perform more accurate automatic word alignment on the basis of factored data	en
dc.language.iso	en	en
dc.title	Towards English - Swahili Machine Translation	en
dc.type	Article	en
local.publisher	School of Computing & Informatics, University of Nairobi	en
