Paper Title

Evaluating the Knowledge Dependency of Questions

Paper Authors

Hyeongdon Moon, Yoonseok Yang, Jamin Shin, Hangyeol Yu, Seunghyun Lee, Myeongho Jeong, Juneyoung Park, Minsam Kim, Seungtaek Choi

Paper Abstract

The automatic generation of Multiple Choice Questions (MCQs) has the potential to significantly reduce the time educators spend on student assessment. However, existing evaluation metrics for MCQ generation, such as BLEU, ROUGE, and METEOR, focus on the n-gram similarity of the generated MCQ to the gold sample in the dataset and disregard its educational value: they fail to evaluate the MCQ's ability to assess the student's knowledge of the corresponding target fact. To tackle this issue, we propose a novel automatic evaluation metric, coined Knowledge Dependent Answerability (KDA), which measures the MCQ's answerability given knowledge of the target fact. Specifically, we first show how to measure KDA based on student responses from a human survey. Then, we propose two automatic evaluation metrics, KDA_disc and KDA_cont, that approximate KDA by leveraging pre-trained language models to imitate students' problem-solving behavior. Through our human studies, we show that KDA_disc and KDA_cont have strong correlations with both (1) KDA and (2) usability in an actual classroom setting, labeled by experts. Furthermore, when combined with n-gram based similarity metrics, KDA_disc and KDA_cont are shown to have strong predictive power for various expert-labeled MCQ quality measures.
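As a concrete reading of the abstract, the sketch below illustrates how KDA might be estimated from survey responses and how a KDA_disc-style score could be approximated with simulated LM "students". This is a minimal sketch under assumed data structures, not the authors' implementation; helper names such as `lm_students` and the `shown_target_fact` field are hypothetical.

```python
# Illustrative sketch of the KDA idea described in the abstract.
# All data structures and helper names here are hypothetical assumptions.
from typing import Callable, List

def kda_from_survey(responses: List[dict], correct_choice: int) -> float:
    """Estimate KDA from human survey data: among students who were shown
    the target fact ("knowledgeable" students), the fraction who picked
    the correct option."""
    knowledgeable = [r for r in responses if r["shown_target_fact"]]
    if not knowledgeable:
        return 0.0
    solved = sum(r["chosen_option"] == correct_choice for r in knowledgeable)
    return solved / len(knowledgeable)

def kda_disc(question: str, options: List[str], correct: int,
             target_fact: str,
             lm_students: List[Callable[[str, List[str]], int]]) -> float:
    """KDA_disc-style approximation: each pre-trained LM "student" receives
    the target fact plus the question and returns the index of its chosen
    option; the score is the fraction of simulated students answering
    correctly."""
    prompt = f"Fact: {target_fact}\nQuestion: {question}"
    solved = sum(student(prompt, options) == correct for student in lm_students)
    return solved / len(lm_students)
```

In this reading, a KDA_cont-style variant would replace the discrete correct/incorrect outcome with each model's continuous probability of the correct option before averaging.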
