Paper Title

Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List

Authors

Venugopal-Wairagade, Gayatri; Saini, Jatinderkumar R.; Pramod, Dhanya

Abstract

This paper complements the contributions of researchers working toward the inclusion of non-English languages in natural language processing studies. Two novel Hindi language resources have been created and released publicly. The first is a corpus of nearly a thousand pre-processed fictional and nonfictional texts spanning over a hundred years. The second is an exhaustive list of stop lemmas created from two kinds of sources: 12 corpora across multiple domains, comprising over 13 million words, from which more than 200,000 lemmas were generated; and 11 publicly available stop word lists, comprising over 1000 words, from which nearly 400 unique lemmas were generated. This research emphasizes stop lemmas over stop words because stop word lists contain various, but not all, morphological forms of a word, whereas a stop lemma list contains only the root form, from which variations can be derived if required. Stop lemmas were also observed to be more consistent across multiple sources than stop words. To generate the stop lemma list, the parts of speech of the lemmas were investigated as a criterion but rejected, since no significant correlation was found between the rank of a word in the frequency list and its part of speech. The stop lemma list was assessed using a comparative method. A formal evaluation method is suggested as future work.
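
To make the stop word versus stop lemma distinction concrete, the following is a minimal Python sketch, not the authors' pipeline. The lookup table `TOY_LEMMAS` and the helper names `lemmatize`, `merge_stop_lists`, and `top_lemmas` are hypothetical and introduced only for illustration. The lemma assignments are toy assumptions; a real Hindi lemmatizer may choose different root forms.

```python
from collections import Counter

# Hypothetical toy lemma lookup (an assumption for illustration only;
# it does not reproduce the lemmatizer used in the paper).
TOY_LEMMAS = {
    "का": "का", "के": "का", "की": "का",
    "था": "था", "थे": "था", "थी": "था",
}


def lemmatize(word):
    """Return the toy lemma for a word, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)


def merge_stop_lists(stop_lists):
    """Union several stop word lists and reduce them to unique lemmas."""
    words = set().union(*stop_lists)
    return {lemmatize(w) for w in words}


def top_lemmas(tokens, k):
    """Rank lemmas by corpus frequency and return the k most frequent."""
    counts = Counter(lemmatize(t) for t in tokens)
    return [lemma for lemma, _ in counts.most_common(k)]


# Two stop word lists that each hold different inflected forms.
list_a = ["का", "के", "था"]
list_b = ["की", "थी", "थे", "और"]

# Seven distinct surface forms collapse to three lemmas.
stop_lemmas = merge_stop_lists([list_a, list_b])
```

In this toy case the two word lists share no entries, yet they agree on two of the three lemmas. This illustrates, under the stated assumptions, why lemma lists from multiple sources can be more consistent than word lists.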
