Paper Title

A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery

Paper Authors

Santosh Kesiraju, Sangeet Sagar, Ondřej Glembek, Lukáš Burget, Ján Černocký, Suryakanth V. Gangashetty

Paper Abstract

In this paper, we present a Bayesian multilingual document model for learning language-independent document embeddings. The model is an extension of BaySMM [Kesiraju et al 2020] to the multilingual scenario. It learns to represent the document embeddings in the form of Gaussian distributions, thereby encoding the uncertainty in its covariance. We propagate the learned uncertainties through linear classifiers that benefit zero-shot cross-lingual topic identification. Our experiments on 17 languages show that the proposed multilingual Bayesian document model performs competitively, when compared to other systems based on large-scale neural networks (LASER, XLM-R, mUSE) on 8 high-resource languages, and outperforms these systems on 9 mid-resource languages. We revisit cross-lingual topic identification in zero-shot settings by taking a deeper dive into current datasets, baseline systems and the languages covered. We identify shortcomings in the existing evaluation protocol (MLDoc dataset), and propose a robust alternative scheme, while also extending the cross-lingual experimental setup to 17 languages. Finally, we consolidate the observations from all our experiments, and discuss points that can potentially benefit the future research works in applications relying on cross-lingual transfers.
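The abstract describes documents being embedded as Gaussian distributions, with the covariance encoding uncertainty that is then propagated through linear classifiers. The sketch below is only a minimal illustration of that general idea, assuming diagonal-covariance embeddings and simple Monte Carlo averaging; the dimensions, weights, and sampling scheme are illustrative placeholders, not the paper's actual BaySMM training or inference procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not taken from the paper).
emb_dim, n_classes, n_samples = 128, 4, 64

# A document embedding represented as a Gaussian: a mean vector plus a
# diagonal covariance that encodes how uncertain the embedding is.
mu = rng.normal(size=emb_dim)                   # posterior mean
log_var = rng.normal(scale=0.1, size=emb_dim)   # log of diagonal covariance

# A linear (multinomial logistic) classifier; weights here are random
# placeholders standing in for one trained on labeled documents.
W = rng.normal(size=(n_classes, emb_dim))
b = np.zeros(n_classes)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Propagate the uncertainty by Monte Carlo: draw embeddings from the
# Gaussian posterior, classify each draw, and average the resulting
# class probabilities instead of classifying only the mean.
std = np.exp(0.5 * log_var)
draws = mu + std * rng.normal(size=(n_samples, emb_dim))
probs = softmax(draws @ W.T + b).mean(axis=0)

print("expected class probabilities:", probs)
print("point-estimate (mean only):  ", softmax(mu @ W.T + b))
```

Averaging over draws means that a high-variance (low-confidence) embedding yields a flatter class posterior than classifying its mean alone, which is one simple way the learned uncertainty can influence downstream topic decisions.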
