Paper Title


NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks

Paper Authors

Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, Ashwin Kalyan

Paper Abstract


Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle, failing to perform the underlying mathematical reasoning when it appears in a slightly different scenario. Drawing inspiration from GLUE, which was proposed in the context of natural language understanding, we propose NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks that at their core require simple arithmetic understanding. We show that this benchmark is far from being solved by neural models, with even state-of-the-art large-scale language models performing significantly worse than humans (lower by 46.4%). Further, NumGLUE promotes sharing knowledge across tasks, especially those with limited training data, as evidenced by the superior performance (average gain of 3.4% on each task) when a model is jointly trained on all the tasks as opposed to task-specific modeling. Finally, we hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language, a first step towards being able to perform more complex mathematical reasoning.
