Paper Title

Dirichlet-Smoothed Word Embeddings for Low-Resource Settings

Authors

Jakob Jungmaier, Nora Kassner, Benjamin Roth

Abstract

Nowadays, classical count-based word embeddings using positive pointwise mutual information (PPMI) weighted co-occurrence matrices have been widely superseded by machine-learning-based methods like word2vec and GloVe. But these methods are usually applied using very large amounts of text data. In many cases, however, there is not much text data available, for example for specific domains or low-resource languages. This paper revisits PPMI by adding Dirichlet smoothing to correct its bias towards rare words. We evaluate on standard word similarity data sets and compare to word2vec and the recent state of the art for low-resource settings: Positive and Unlabeled (PU) Learning for word embeddings. The proposed method outperforms PU-Learning for low-resource settings and obtains competitive results for Maltese and Luxembourgish.
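
To make the core idea concrete, below is a minimal sketch of PPMI with additive (Dirichlet-prior) smoothing in Python. It assumes the smoothing amounts to adding a constant pseudo-count (here `epsilon`) to every cell of the co-occurrence count matrix before estimating probabilities, which dampens the inflated PPMI scores of rare word-context pairs. The function name, the toy matrix, and the value of `epsilon` are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def ppmi_dirichlet(C, epsilon=0.1):
    """PPMI from a word-context co-occurrence count matrix C.

    Additive (Dirichlet-prior) smoothing: a pseudo-count epsilon is
    added to every cell so that rare events are no longer
    over-weighted by the PMI estimate. (epsilon is an illustrative
    choice, not the paper's tuned value.)
    """
    C = C.astype(float) + epsilon                  # smoothed counts
    total = C.sum()
    p_wc = C / total                               # joint probabilities P(w, c)
    p_w = C.sum(axis=1, keepdims=True) / total     # word marginals P(w)
    p_c = C.sum(axis=0, keepdims=True) / total     # context marginals P(c)
    pmi = np.log(p_wc / (p_w * p_c))               # pointwise mutual information
    return np.maximum(pmi, 0.0)                    # positive PMI

# Toy example: 3 words x 4 contexts.
C = np.array([[10, 0, 3, 1],
              [0, 1, 0, 0],
              [5, 2, 8, 0]])
M = ppmi_dirichlet(C, epsilon=0.1)
```

Word vectors are then typically obtained from the resulting PPMI matrix by low-rank factorization, e.g. truncated SVD, as is standard for count-based embeddings.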
