Paper Title

Unified Multi-Criteria Chinese Word Segmentation with BERT

Paper Authors

Zhen Ke, Liang Shi, Erli Meng, Bin Wang, Xipeng Qiu, Xuanjing Huang

Paper Abstract

Multi-Criteria Chinese Word Segmentation (MCCWS) aims at finding word boundaries in a Chinese sentence composed of continuous characters while multiple segmentation criteria exist. The unified framework has been widely used in MCCWS and has shown its effectiveness. Besides, the pre-trained BERT language model has also been introduced into the MCCWS task in a multi-task learning framework. In this paper, we combine the superiority of the unified framework and the pre-trained language model, and propose a unified MCCWS model based on BERT. Moreover, we augment the unified BERT-based MCCWS model with bigram features and an auxiliary criterion classification task. Experiments on eight datasets with diverse criteria demonstrate that our method achieves new state-of-the-art results for MCCWS.
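The abstract describes the architecture only at a high level; the sketch below is one possible reading of it, not the authors' released implementation. It pairs a shared BERT encoder with a per-character segmentation tagging head and an auxiliary criterion-classification head, and fuses an assumed bigram embedding table into the encoder output. All module names, the B/M/E/S tag set, and hyper-parameters here are illustrative assumptions; in unified MCCWS models the target criterion is often signaled by a special token prepended to the input sentence, a mechanism omitted from this sketch for brevity.

```python
# Minimal sketch of a unified BERT-based MCCWS model (illustrative only):
# shared BERT encoder + assumed bigram embeddings + segmentation tagging head
# + auxiliary criterion-classification head over the [CLS] position.
import torch
import torch.nn as nn
from transformers import BertModel

class UnifiedMCCWS(nn.Module):
    def __init__(self, num_tags=4, num_criteria=8,
                 bigram_vocab_size=100_000, bigram_dim=64):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        hidden = self.bert.config.hidden_size
        # Hypothetical bigram feature embeddings, fused with the BERT outputs.
        self.bigram_emb = nn.Embedding(bigram_vocab_size, bigram_dim)
        self.fuse = nn.Linear(hidden + bigram_dim, hidden)
        self.tag_head = nn.Linear(hidden, num_tags)            # B/M/E/S per character
        self.criterion_head = nn.Linear(hidden, num_criteria)  # auxiliary criterion task

    def forward(self, input_ids, attention_mask, bigram_ids):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state                               # (batch, seq, hidden)
        h = torch.tanh(self.fuse(
            torch.cat([h, self.bigram_emb(bigram_ids)], dim=-1)))
        tag_logits = self.tag_head(h)                           # segmentation tag scores
        criterion_logits = self.criterion_head(h[:, 0])         # [CLS]-based classification
        return tag_logits, criterion_logits
```

Training would combine a tagging loss over `tag_logits` with an auxiliary classification loss over `criterion_logits`, one dataset criterion per example; the loss weighting and decoding details are left out here.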
