Paper Title
New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles
Paper Authors
Paper Abstract
Large-scale, high-quality corpora are necessary for evaluating machine reading comprehension (MRC) models on a low-resource language such as Vietnamese. MRC for the health domain also offers great potential for practical applications; however, there is still very little MRC research in this domain. This paper presents ViNewsQA, a new Vietnamese corpus for evaluating healthcare reading comprehension models. The corpus comprises 22,057 human-generated question-answer pairs. Crowd-workers create the questions and their answers based on a collection of over 4,416 online Vietnamese healthcare news articles, where the answers are spans extracted from the corresponding articles. In particular, we develop a process for creating a corpus for Vietnamese machine reading comprehension. Comprehensive evaluations demonstrate that our corpus requires abilities beyond simple reasoning such as word matching, and demands difficult reasoning based on single-or-multiple-sentence information. We conduct experiments with different types of MRC methods to establish first baseline performances for comparison with further models. We also measure human performance on the corpus and compare it with several powerful neural network-based and transfer learning-based models. Our experiments show that the best machine model is ALBERT, which achieves an exact match (EM) score of 65.26% and an F1-score of 84.89% on our corpus. The significant gaps between human performance and the best-performing model (14.53% in EM and 10.90% in F1-score) on the test set of our corpus indicate that improvements on ViNewsQA can be explored in future work. Our corpus is publicly available on our website for research purposes, to encourage the research community to make these improvements.
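The abstract reports exact match (EM) and F1 scores for span-extraction answers. The sketch below illustrates how these two metrics are commonly computed for a single prediction, assuming a SQuAD-style token-overlap definition. The abstract does not specify the evaluation script, so the lowercasing and whitespace tokenization in `normalize`, and the function names themselves, are assumptions made for illustration; the actual evaluation may normalize or segment Vietnamese text differently.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, simplified normalization)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted span and a reference span."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction that recovers three of the four reference tokens and adds none of its own gets an EM of 0 but a partial F1 credit, which is why F1 scores on a corpus are typically higher than EM scores. Corpus-level scores are usually the average of these per-question values, often taking the maximum over multiple reference answers when several are available.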