Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning

Paper title

Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning

Authors

Naeem, Sameea; Rahman, Arif ur; Haider, Syed Mujtaba; Mughal, Abdul Basit

Abstract

Finding similarities between news articles written in different languages is a challenging problem in Natural Language Processing (NLP). Because it is difficult for users to find similar news articles in a language other than their native one, there is a need for an automatic, Machine Learning based system that finds the similarity between two inter-language news articles. In this article, we propose a Machine Learning model combined with English-Urdu word transliteration that indicates whether an English news article is similar to an Urdu news article. Existing approaches to finding similarities have a major drawback when the archives contain articles in low-resourced languages such as Urdu alongside English news articles. We use a lexicon to link Urdu and English news articles. Because Urdu language processing applications such as machine translation and text-to-speech cannot handle English text at the same time, this research proposes a technique for finding similarities between English and Urdu news articles based on transliteration.
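
The abstract does not give implementation details, so the following is only a minimal sketch of how a transliteration lexicon could link the two languages. It assumes a tiny hand-made lexicon and a simple overlap score; the names `LEXICON`, `english_tokens`, `transliterated_tokens`, and `overlap_score` are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical toy lexicon (an assumption, not the paper's data):
# Urdu token -> its English transliteration.
LEXICON = {
    "پاکستان": "pakistan",
    "کرکٹ": "cricket",
    "ٹیم": "team",
    "میچ": "match",
}


def english_tokens(text):
    """Lowercase an English text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def transliterated_tokens(urdu_text):
    """Map whitespace-separated Urdu tokens to English via the lexicon.

    Tokens that are missing from the lexicon are ignored.
    """
    return {LEXICON[tok] for tok in urdu_text.split() if tok in LEXICON}


def overlap_score(english_text, urdu_text):
    """Fraction of transliterated Urdu tokens that also occur in the English text.

    Returns 0.0 when no Urdu token can be transliterated.
    """
    urdu_side = transliterated_tokens(urdu_text)
    if not urdu_side:
        return 0.0
    english_side = english_tokens(english_text)
    return len(urdu_side & english_side) / len(urdu_side)


if __name__ == "__main__":
    english_article = "Pakistan cricket team wins the match"
    urdu_article = "پاکستان کرکٹ ٹیم نے میچ جیت لیا"
    print(overlap_score(english_article, urdu_article))
```

A score like this could plausibly serve as one input feature to a classifier that labels an article pair as similar or not similar. How the paper's model actually combines transliteration with Machine Learning is not stated in the abstract.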
