Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili

De Pauw, G; Wagacha, PW; de Schryver, Gilles-Maurice

dc.contributor.author	De Pauw, G
dc.contributor.author	Wagacha, PW
dc.contributor.author	de Schryver, Gilles-Maurice
dc.date.accessioned	2013-06-21T09:33:37Z
dc.date.available	2013-06-21T09:33:37Z
dc.date.issued	2011
dc.identifier.citation	September 2011, Volume 45, Issue 3, pp 331-344	en
dc.identifier.issn	1574-0218
dc.identifier.uri	http://erepository.uonbi.ac.ke:8080/xmlui/handle/123456789/37349
dc.description.abstract	Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the sawa corpus, a two-million-word parallel corpus English—Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence and word-aligned and morphosyntactic information was added to both the English and Swahili portion of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English—Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.	en
dc.language.iso	en	en
dc.publisher	Springer	en
dc.title	Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili	en
dc.type	Article	en

This item appears in the following Collection(s)

Faculty of Science & Technology (FST) [4283]
