Paper Title
Exploring Swedish & English fastText Embeddings for NER with the Transformer
Paper Authors
Paper Abstract
In this paper, our main contributions are showing that embeddings from relatively smaller corpora can outperform those from larger corpora, and making a new Swedish analogy test set publicly available. Several factors play important roles in achieving good network performance on natural language processing (NLP) downstream tasks: dataset size, the right hyper-parameters, and well-trained embeddings. We show that, with the right set of hyper-parameters, good network performance can be reached even on smaller datasets. We evaluate the embeddings at both the intrinsic and extrinsic levels: they are deployed with the Transformer on a named entity recognition (NER) task, and significance tests are conducted. This is done for both Swedish and English. Compared to the recently released Common Crawl versions, we obtain better performance in both languages on the downstream task with smaller training data, and character n-grams appear useful for Swedish, a morphologically rich language.
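Since the abstract centers on fastText embeddings with character n-grams and an intrinsic analogy evaluation, the following is a minimal sketch (not the authors' code) of how such embeddings could be trained and scored with gensim. The corpus path, analogy-set path, and hyper-parameter values are illustrative assumptions.

```python
# Sketch: train fastText embeddings with character n-grams and run an
# intrinsic analogy evaluation, using gensim. Paths and hyper-parameter
# values below are hypothetical, not the paper's exact settings.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Stream a tokenized corpus, one sentence per line (path is hypothetical).
sentences = LineSentence("sv_corpus.txt")

# Skip-gram fastText with subword (character n-gram) information,
# which the abstract suggests is useful for morphologically rich Swedish.
model = FastText(
    sentences,
    vector_size=300,   # embedding dimensionality
    window=5,          # context window size
    min_count=5,       # ignore rare tokens
    sg=1,              # skip-gram architecture
    min_n=3, max_n=6,  # character n-gram length range
    epochs=5,
)

# Intrinsic evaluation on an analogy test set in the standard
# questions-words format (e.g. the Swedish analogy set the paper releases).
score, sections = model.wv.evaluate_word_analogies("swedish_analogies.txt")
print(f"Analogy accuracy: {score:.3f}")
```

Note that in gensim, setting `max_n` to 0 (or below `min_n`) disables character n-grams entirely, which is one way to isolate the contribution of subword information in a comparison like the paper's.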
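The abstract does not state which significance test was used; as an illustration only, a paired two-sided t-test over NER F1 scores from repeated runs could look like the following. All scores below are hypothetical.

```python
# Sketch: compare two embedding variants on NER with a paired t-test.
# The F1 scores are invented placeholders for repeated training runs.
from scipy import stats

f1_smaller_corpus = [0.842, 0.838, 0.845, 0.840, 0.843]  # hypothetical
f1_common_crawl   = [0.831, 0.829, 0.834, 0.827, 0.832]  # hypothetical

t_stat, p_value = stats.ttest_rel(f1_smaller_corpus, f1_common_crawl)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```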