Paper Title


Structure-Level Knowledge Distillation For Multilingual Sequence Labeling

Authors

Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Fei Huang, Kewei Tu

Abstract

Multilingual sequence labeling is the task of predicting label sequences for multiple languages with a single unified model. Compared with relying on multiple monolingual models, a multilingual model has the benefits of a smaller model size, easier online serving, and better generalizability to low-resource languages. However, current multilingual models still significantly underperform individual monolingual models due to model capacity limitations. In this paper, we propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models (teachers) into the unified multilingual model (student). We propose two novel knowledge distillation (KD) methods based on structure-level information: the first approximately minimizes the distance between the student's and the teachers' structure-level probability distributions, and the second aggregates the structure-level knowledge into local distributions and minimizes the distance between the corresponding local probability distributions. Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and the teacher models.
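To make the second idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of distilling teacher knowledge that has been aggregated into local, per-token label distributions: the student's local distributions are trained to match the teacher's per-token marginals (e.g. CRF marginals from the forward-backward algorithm, assumed precomputed here). Tensor names, shapes, and the `kd_weight` hyperparameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_distillation_loss(student_logits, teacher_marginals, mask):
    """KL(teacher || student) averaged over non-padding tokens.

    student_logits:    (batch, seq_len, num_labels) raw scores from the multilingual student
    teacher_marginals: (batch, seq_len, num_labels) per-token label marginals from a teacher
    mask:              (batch, seq_len) 1 for real tokens, 0 for padding
    """
    log_q = F.log_softmax(student_logits, dim=-1)  # student local distributions
    # KL(p || q) = sum_y p(y) * (log p(y) - log q(y)), computed per token
    kl = (teacher_marginals * (torch.log(teacher_marginals + 1e-12) - log_q)).sum(-1)
    return (kl * mask).sum() / mask.sum()

# Example usage with random tensors standing in for model outputs.
batch, seq_len, num_labels = 2, 5, 9
student_logits = torch.randn(batch, seq_len, num_labels, requires_grad=True)
teacher_marginals = torch.softmax(torch.randn(batch, seq_len, num_labels), dim=-1)
mask = torch.ones(batch, seq_len)

kd_weight = 1.0  # in practice combined with the usual supervised (e.g. CRF) training loss
loss = kd_weight * local_distillation_loss(student_logits, teacher_marginals, mask)
loss.backward()
```

In this sketch the KD term would be added to the student's regular supervised objective; the first method described in the abstract instead matches distributions over whole label sequences rather than per-token marginals.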
