Word Sense Disambiguation of Swahili: Extending Swahili Language Technology with Machine Learning
This thesis addresses the problem of word sense disambiguation within the context of Swahili-English machine translation. In this setup, the goal of disambiguation is to choose the correct translation of an ambiguous Swahili noun in context. A corpus-based approach to disambiguation is taken, where machine learning techniques are applied to a corpus of Swahili to acquire disambiguation information automatically. In particular, the Self-Organizing Map algorithm is used to obtain a semantic categorization of Swahili nouns from data. The resulting classes form the basis of a class-based solution, where disambiguation is recast as a classification problem. The thesis exploits these semantic classes to automatically obtain annotated training data, addressing a key problem facing supervised word sense disambiguation. The semantic and linguistic characteristics of these classes are modelled as Bayesian belief networks, using the Bayesian Modelling Toolbox. Disambiguation is achieved via probabilistic inference. The thesis develops a disambiguation solution that does not have extensive resource requirements, but rather capitalizes on freely available lexical and computational resources for English as a source of additional disambiguation information. A semantic tagger for Swahili is created by altering the configuration of the Bayesian classifiers. The disambiguation solution is tested on a subset of unambiguous nouns and a manually created gold standard of sixteen ambiguous nouns, using standard performance evaluation metrics.
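The class-based idea described in the abstract, where choosing a translation is recast as classifying the noun's context into a semantic class, can be illustrated with a minimal sketch. This is not the thesis's method: the thesis models the classes as Bayesian belief networks built with the Bayesian Modelling Toolbox, whereas the sketch below substitutes a much simpler naive Bayes classifier with add-one smoothing. The class labels, the training examples, the `TRANSLATION_CLASS` table, and all function names are hypothetical and invented for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (semantic class, context words).
# The thesis obtains labelled data automatically; these examples are invented.
TRAINING = [
    ("NATURAL_PLACE", ["maji", "vuka", "samaki"]),   # water, cross, fish
    ("NATURAL_PLACE", ["maji", "ogelea"]),           # water, swim
    ("ARTIFACT", ["kitanda", "lala"]),               # bed, sleep
    ("ARTIFACT", ["lala", "shuka", "kitanda"]),      # sleep, sheet, bed
]

# Hypothetical mapping: each English translation of an ambiguous noun
# is associated with one semantic class.
TRANSLATION_CLASS = {"mto": {"river": "NATURAL_PLACE", "pillow": "ARTIFACT"}}


def train(examples):
    """Count class frequencies and per-class context-word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in examples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def log_posterior(label, context, model):
    """Unnormalized log P(class | context) under naive Bayes, add-one smoothing."""
    class_counts, word_counts, vocab = model
    score = math.log(class_counts[label] / sum(class_counts.values()))
    denom = sum(word_counts[label].values()) + len(vocab)
    for word in context:
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / denom)
    return score


def disambiguate(noun, context, model):
    """Pick the translation whose semantic class best explains the context."""
    candidates = TRANSLATION_CLASS[noun]
    return max(candidates, key=lambda t: log_posterior(candidates[t], context, model))


MODEL = train(TRAINING)
print(disambiguate("mto", ["maji", "vuka"], MODEL))
print(disambiguate("mto", ["lala", "kitanda"], MODEL))
```

Because the classifier is trained on semantic classes rather than on individual word senses, the same model could in principle be reused for any ambiguous noun whose translations fall into different classes, which is the appeal of the class-based formulation.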