
    Towards English - Swahili Machine Translation

    Date
    2011
    Author
    De Pauw, G
    Wagacha, PW
    de Schryver, Gilles-Maurice
    Type
    Article
    Language
    en

    Abstract
    Even though the Bantu language of Swahili is spoken by more than fifty million people in East and Central Africa, it is surprisingly resource-scarce from a language technological point of view, an unfortunate situation that holds for most, if not all, languages on the continent. The increasing amount of digitally available, vernacular data has prompted researchers to investigate the applicability of corpus-based approaches to African language technology. In this vein, the SAWA corpus project attempts to collect and deploy a parallel corpus English - Swahili, not only for the straightforward purpose of developing a machine translation system, but also to investigate the possibility of projection of annotation into a resource-scarce, African language. Compiling a balanced and expansive parallel corpus English - Swahili is a rather daunting task. While monolingual Swahili data is abundantly available on the Internet, sourcing parallel texts is cumbersome. Even countries that have both English and Swahili as their official languages, such as Tanzania, Kenya and Uganda, do not tend to translate and/or publish all government documents bilingually. One therefore opportunistically collects whatever can be found in the public domain. At this point in the data collection phase, that means that the 2.2 million word parallel corpus is biased towards religious material, such as Bible and Quran translations. Nevertheless, the more interesting, secular part of the SAWA corpus (420k words) is steadily increasing, thanks to the inclusion of bilingual investment reports, manually translated movie subtitles, political documents and material kindly donated by local translators to the SAWA project. Each text in the SAWA corpus is automatically part-of-speech tagged and lemmatized, using the TreeTagger for the English part (Schmid, 1994) and the systems described in De Pauw et al. (2006) and De Pauw and de Schryver (2008) for Swahili. These extra annotation layers allow us to perform more accurate automatic word alignment on the basis of factored data
    URI
    http://mt-archive.info/MTMRL-2011-DePauw.pdf
    http://erepository.uonbi.ac.ke:8080/xmlui/handle/123456789/37360
    Publisher
    School of Computing & Informatics, University of Nairobi
    Collections
    • Faculty of Science & Technology (FST) [4253]

    Copyright © 2022 
    University of Nairobi Library
    Contact Us | Send Feedback

     

     
