Measuring Causal Effects of Data Statistics on Language Model's "Factual" Predictions

Paper title

Measuring Causal Effects of Data Statistics on Language Model's "Factual" Predictions

Authors

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, Yoav Goldberg

Abstract

Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone. Addressing the problem of extracting factual knowledge from pretrained language models (PLMs), we focus on simple data statistics such as co-occurrence counts and show that these statistics do influence the predictions of PLMs, suggesting that such models rely on shallow heuristics. Our causal framework and our results demonstrate the importance of studying datasets and the benefits of causality for understanding NLP models.
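
The abstract names co-occurrence counts as an example of a simple data statistic. The sketch below illustrates what such a count could look like; it is not the authors' code. The function name `cooccurrence_counts`, the toy corpus, and the choice to count sentence-level co-occurrence of single-token entity names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentences, subjects, objects):
    """Count the sentences in which a subject and an object both appear.

    Assumption: entities are single whitespace-delimited tokens and
    matching is case-insensitive. Real corpora would need entity linking.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        present_subjects = [s for s in subjects if s.lower() in tokens]
        present_objects = [o for o in objects if o.lower() in tokens]
        # Each (subject, object) pair is counted once per sentence.
        for pair in product(present_subjects, present_objects):
            counts[pair] += 1
    return counts


# Toy corpus, invented for this example.
corpus = [
    "Dante was born in Florence",
    "Dante left Florence in 1302",
    "Dante died in Ravenna",
]
counts = cooccurrence_counts(corpus, ["Dante"], ["Florence", "Ravenna"])
```

In this toy corpus "Dante" co-occurs with "Florence" more often than with "Ravenna". A model relying on such counts alone might prefer "Florence" for any query about Dante, which is the kind of shallow heuristic the abstract refers to.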
