Paper Title
Unified Multi-Criteria Chinese Word Segmentation with BERT
Paper Authors
Paper Abstract
Multi-Criteria Chinese Word Segmentation (MCCWS) aims to find word boundaries in a Chinese sentence composed of continuous characters when multiple segmentation criteria exist. The unified framework has been widely used in MCCWS and has shown its effectiveness. In addition, the pre-trained BERT language model has also been introduced into the MCCWS task within a multi-task learning framework. In this paper, we combine the strengths of the unified framework and the pre-trained language model, and propose a unified MCCWS model based on BERT. Moreover, we augment the unified BERT-based MCCWS model with bigram features and an auxiliary criterion classification task. Experiments on eight datasets with diverse criteria demonstrate that our methods achieve new state-of-the-art results for MCCWS.
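To make the described architecture concrete, the following is a minimal sketch (not the authors' released code) of a unified BERT-based MCCWS model: a shared BERT encoder reads a sentence with a criterion marker token prepended, a token-level head predicts segmentation tags, and an auxiliary head classifies the criterion from the pooled sentence representation. The model name "bert-base-chinese", the marker tokens "[PKU]" and "[MSR]", the B/M/E/S tag scheme, and the omission of the bigram-feature fusion are simplifying assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

NUM_TAGS = 4        # B, M, E, S segmentation labels (assumed tag scheme)
NUM_CRITERIA = 8    # one label per segmentation criterion / dataset

class UnifiedMCCWSModel(nn.Module):
    """Shared BERT encoder with a segmentation head and an auxiliary criterion head."""

    def __init__(self, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.tag_head = nn.Linear(hidden, NUM_TAGS)            # per-character segmentation tags
        self.criterion_head = nn.Linear(hidden, NUM_CRITERIA)  # auxiliary criterion classifier

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        tag_logits = self.tag_head(out.last_hidden_state)          # (batch, seq_len, NUM_TAGS)
        criterion_logits = self.criterion_head(out.pooler_output)  # (batch, NUM_CRITERIA)
        return tag_logits, criterion_logits

# Usage sketch: register one marker token per criterion so the shared encoder
# knows which segmentation standard to follow (a simplification; the paper's
# exact criterion-encoding mechanism may differ).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
tokenizer.add_special_tokens({"additional_special_tokens": ["[PKU]", "[MSR]"]})
model = UnifiedMCCWSModel()
model.bert.resize_token_embeddings(len(tokenizer))

batch = tokenizer(["[PKU] 今天天气很好"], return_tensors="pt")
tag_logits, criterion_logits = model(batch["input_ids"], batch["attention_mask"])
```

In such a multi-task setup, the segmentation tagging loss and the auxiliary criterion classification loss would be combined (e.g. as a weighted sum) during training, so that the shared encoder learns criterion-aware representations.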