Towards English - Swahili Machine Translation

Date
2011
Author
De Pauw, G
Wagacha, PW
de Schryver, Gilles-Maurice
Type
Article
Language
en
Abstract
Even though the Bantu language of Swahili is spoken by more than fifty million people in East and Central Africa, it is surprisingly resource-scarce from a language technological point of view, an unfortunate situation that holds for most, if not all, languages on the continent. The increasing amount of digitally available vernacular data has prompted researchers to investigate the applicability of corpus-based approaches to African language technology. In this vein, the SAWA corpus project attempts to collect and deploy an English - Swahili parallel corpus, not only for the straightforward purpose of developing a machine translation system, but also to investigate the possibility of projecting annotation into a resource-scarce African language.
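
As an illustration of what projecting annotation across a parallel corpus can look like, the following is a minimal sketch, assuming that word alignment links between an English sentence and its Swahili translation are already available. The function name `project_tags`, the example sentence, its tags and its alignment are all hypothetical and are not taken from the SAWA corpus or from the authors' system.

```python
from typing import List, Optional, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_len: int,
) -> List[Optional[str]]:
    """Copy each source tag onto the target token it is aligned with.

    `alignment` holds (source_index, target_index) pairs. Target tokens
    without any alignment link keep the value None. If several source
    tokens align with one target token, the first link wins; this is a
    simplifying assumption of the sketch.
    """
    projected: List[Optional[str]] = [None] * target_len
    for src_i, tgt_j in alignment:
        if projected[tgt_j] is None:
            projected[tgt_j] = source_tags[src_i]
    return projected


# Hypothetical example: English "the children read", Swahili "watoto wanasoma".
source_tags = ["DT", "NNS", "VBP"]
alignment = [(1, 0), (2, 1)]  # children -> watoto, read -> wanasoma
projected = project_tags(source_tags, alignment, target_len=2)
print(projected)
```

In this sketch the English determiner has no Swahili counterpart, so its tag is simply not projected.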
Compiling a balanced and expansive English - Swahili parallel corpus is a rather daunting task. While monolingual Swahili data is abundantly available on the Internet, sourcing parallel texts is cumbersome. Even countries that have both English and Swahili as their official languages, such as Tanzania, Kenya and Uganda, do not tend to translate and/or publish all government documents bilingually. One therefore opportunistically collects whatever can be found in the public domain.
At this point in the data collection phase, this means that the 2.2 million word parallel corpus is biased towards religious material, such as Bible and Quran translations. Nevertheless, the more interesting, secular part of the SAWA corpus (420k words) is steadily increasing, thanks to the inclusion of bilingual investment reports, manually translated movie subtitles, political documents and material kindly donated by local translators to the SAWA project.
Each text in the SAWA corpus is automatically part-of-speech tagged and lemmatized, using the TreeTagger for the English part (Schmid, 1994) and the systems described in De Pauw et al. (2006) and De Pauw and de Schryver (2008) for Swahili. These extra annotation layers allow us to perform more accurate automatic word alignment on the basis of factored data.
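
To make the notion of factored data concrete, here is a minimal sketch, assuming a common convention in which each token carries its surface form, lemma and part-of-speech tag joined by a separator. The helper `to_factored`, the `|` separator and the example annotations are assumptions made for illustration; the abstract does not specify the format actually used in the SAWA project.

```python
from typing import List, Tuple

# One token = (surface form, lemma, part-of-speech tag).
Token = Tuple[str, str, str]


def to_factored(tokens: List[Token], sep: str = "|") -> str:
    """Join the factors of each token with `sep`, and tokens with spaces."""
    return " ".join(sep.join(factors) for factors in tokens)


# Hypothetical annotations, not taken from the SAWA corpus.
english: List[Token] = [("children", "child", "NNS"), ("read", "read", "VVP")]
swahili: List[Token] = [("watoto", "mtoto", "N"), ("wanasoma", "soma", "V")]

english_line = to_factored(english)
swahili_line = to_factored(swahili)
print(english_line)
print(swahili_line)
```

With such a representation, an aligner can be run on the lemma or tag factor instead of the raw surface form, which may reduce data sparsity for a morphologically rich language such as Swahili.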
URI
http://mt-archive.info/MTMRL-2011-DePauw.pdf
http://erepository.uonbi.ac.ke:8080/xmlui/handle/123456789/37360
Publisher
School of Computing & Informatics, University of Nairobi,