Detecting New Word Meanings: A Comparison of Word Embedding Models in Spanish

Paper title

Detecting New Word Meanings: A Comparison of Word Embedding Models in Spanish

Authors

Torres-Rivera, Andrés; Torres-Moreno, Juan-Manuel

Abstract

Semantic neologisms (SN) are words that acquire a new meaning while maintaining their form. Given the nature of this kind of neologism, the task of identifying these new meanings is currently performed manually by specialists at observatories of neology. To detect SN semi-automatically, we developed a system that combines the following strategies: topic modeling, keyword extraction, and word sense disambiguation. The role of topic modeling is to detect the themes treated in the input text. Themes within a text give clues about the particular meaning of the words used; for example, viral has one meaning in the context of computer science (CS) and another when talking about health. To extract keywords, we used TextRank with POS tag filtering. With this method, we can obtain relevant words that are already part of the Spanish lexicon. We use a deep learning model to determine whether a given keyword could have a new meaning. Embeddings that differ from all the known meanings (or topics) indicate that a word might be a valid SN candidate. In this study, we examine the following word embedding models: Word2Vec, Sense2Vec, and FastText. The models were trained with equivalent parameters using the Spanish Wikipedia as the corpus. We then used a list of words and their concordances (obtained from our database of neologisms) to show the different embeddings that each model yields. Finally, we compare these outcomes with the concordances of each word to show how we can determine whether a word could be a valid SN candidate.
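
The abstract's central test (an embedding that differs from all known meanings marks a possible SN candidate) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the cosine-similarity measure, the 0.5 threshold, and the toy three-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a trained Word2Vec, Sense2Vec, or FastText model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_u * norm_v)


def is_sn_candidate(context_vec, known_sense_vecs, threshold=0.5):
    """Return (flag, best_similarity) for a keyword usage.

    The usage is flagged when its embedding is dissimilar to every
    known sense. `threshold` is a hypothetical cutoff, not a value
    taken from the paper.
    """
    best = max((cosine(context_vec, s) for s in known_sense_vecs), default=0.0)
    return best < threshold, best


# Toy vectors invented for illustration only (assumption).
known_senses = {
    "health": [0.9, 0.1, 0.0],
    "computing": [0.1, 0.9, 0.0],
}
usage_health = [0.8, 0.2, 0.1]  # close to the known "health" sense
usage_new = [0.0, 0.1, 0.95]    # far from both known senses

flag_health, sim_health = is_sn_candidate(usage_health, list(known_senses.values()))
flag_new, sim_new = is_sn_candidate(usage_new, list(known_senses.values()))
print("health usage flagged:", flag_health)
print("new usage flagged:", flag_new)
```

Taking the maximum similarity over the known senses means a usage is flagged only when it is far from every one of them, which matches the "different from all the known meanings" wording. A flagged word is still only a candidate that a specialist would need to review.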
