IndicXNLI: Evaluating Multilingual Inference for Indian Languages

Paper Title

IndicXNLI: Evaluating Multilingual Inference for Indian Languages

Authors

Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan

Abstract

While Indic NLP has made rapid advances recently in terms of the availability of corpora and pre-trained models, benchmark datasets on standard NLU tasks are limited. To this end, we introduce IndicXNLI, an NLI dataset for 11 Indic languages. It has been created by high-quality machine translation of the original English XNLI dataset and our analysis attests to the quality of IndicXNLI. By finetuning different pre-trained LMs on this IndicXNLI, we analyze various cross-lingual transfer techniques with respect to the impact of the choice of language models, languages, multi-linguality, mix-language input, etc. These experiments provide us with useful insights into the behaviour of pre-trained models for a diverse set of languages.
