Paper title
Analysis and representation of Igbo text document for a text-based system
Paper authors
Abstract
Advances in Information Technology (IT) have helped bring Nigeria's three major languages into text-based applications such as text mining, information retrieval and natural language processing. This paper focuses on the Igbo language, which uses compounding as a common type of word formation and has a large vocabulary of compound words. Collocation, word ordering and compounding play a major role in Igbo. The ambiguity of these compound words makes Igbo text documents very difficult to represent, because it cannot be addressed by the most common and standard approach to text representation, the Bag-of-Words (BOW) model, which ignores word order and the relations between words. This is a cause for concern and calls for an improved model that captures these features. This paper presents an analysis of Igbo text documents that considers the compounding nature of the language, and describes their representation with a word-based n-gram model to properly prepare them for any text-based application. The results show that bigram and trigram text representation models provide more semantic information and address the issues of compounding, word ordering and collocation, which are the major language peculiarities of Igbo. They are likely to give better performance when used in any Igbo text-based system.
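To make the contrast between a bag-of-words and a word-based n-gram representation concrete, the following is a minimal sketch rather than the authors' implementation. The function names `word_ngrams` and `represent`, the whitespace tokenizer, and the sample sentence are all assumptions introduced for illustration. The sample contains the Igbo compound "ụlọ akwụkwọ" (school), whose two words are kept together by a bigram but counted separately by a unigram (BOW) model.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams (as tuples) from a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def represent(text, n):
    """Count word n-grams in a text.

    Assumption: lowercasing plus whitespace splitting is used as a
    stand-in tokenizer; a real Igbo pipeline would likely need more
    careful normalization of tone marks and punctuation.
    """
    tokens = text.lower().split()
    return Counter(word_ngrams(tokens, n))


# Illustrative sample sentence (not taken from the paper's data).
sample = "Ọ na-aga ụlọ akwụkwọ"

unigrams = represent(sample, 1)  # bag-of-words view: word order is lost
bigrams = represent(sample, 2)   # adjacent word pairs are preserved
trigrams = represent(sample, 3)  # adjacent word triples are preserved

print(unigrams)
print(bigrams)
print(trigrams)
```

With n = 1 the representation reduces to bag-of-words counts, so the two halves of a compound become unrelated features. With n = 2 or n = 3 each feature records a short run of adjacent words, which is how the word-based n-gram model described above retains word order and collocation.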