dc.contributor.author: De Pauw, G
dc.contributor.author: Wagacha, PW
dc.contributor.author: de Schryver, Gilles-Maurice
dc.date.accessioned: 2013-06-21T09:40:00Z
dc.date.available: 2013-06-21T09:40:00Z
dc.date.issued: 2011
dc.identifier.uri: http://mt-archive.info/MTMRL-2011-DePauw.pdf
dc.identifier.uri: http://erepository.uonbi.ac.ke:8080/xmlui/handle/123456789/37360
dc.description.abstract: Even though the Bantu language of Swahili is spoken by more than fifty million people in East and Central Africa, it is surprisingly resource-scarce from a language technological point of view, an unfortunate situation that holds for most, if not all languages on the continent. The increasing amount of digitally available, vernacular data has prompted researchers to investigate the applicability of corpus-based approaches to African language technology. In this vein, the SAWA corpus project attempts to collect and deploy a parallel corpus English - Swahili, not only for the straightforward purpose of developing a machine translation system, but also to investigate the possibility of projection of annotation into a resource-scarce, African language. Compiling a balanced and expansive parallel corpus English - Swahili is a rather daunting task. While monolingual Swahili data is abundantly available on the Internet, sourcing parallel texts is cumbersome. Even countries that have both English and Swahili as their official languages, such as Tanzania, Kenya and Uganda, do not tend to translate and/or publish all government documents bilingually. One therefore opportunistically collects whatever can be found in the public domain. At this point in the data collection phase, that means that the 2.2 million word parallel corpus is biased towards religious material, such as Bible and Quran translations. Nevertheless, the more interesting, secular part of the SAWA corpus (420k words) is steadily increasing, thanks to the inclusion of bilingual investment reports, manually translated movie subtitles, political documents and material kindly donated by local translators to the SAWA project. Each text in the SAWA corpus is automatically part-of-speech tagged and lemmatized, using the TreeTagger for the English part (Schmid, 1994) and the systems described in De Pauw et al. (2006) and De Pauw and de Schryver (2008) for Swahili. These extra annotation layers allow us to perform more accurate automatic word alignment on the basis of factored data [en]
dc.language.iso: en [en]
dc.title: Towards English - Swahili Machine Translation [en]
dc.type: Article [en]
local.publisher: School of Computing & Informatics, University of Nairobi [en]

