NLPStatTest: A Toolkit for Comparing NLP System Performance

Paper Title

NLPStatTest: A Toolkit for Comparing NLP System Performance

Authors

Haotian Zhu, Denise Mak, Jesse Gioannini, Fei Xia

Abstract

Statistical significance testing centered on p-values is commonly used to compare NLP system performance, but p-values alone are insufficient because statistical significance differs from practical significance. The latter can be measured by estimating effect size. In this paper, we propose a three-stage procedure for comparing NLP system performance and provide a toolkit, NLPStatTest, that automates the process. Users can upload NLP system evaluation scores and the toolkit will analyze these scores, run appropriate significance tests, estimate effect size, and conduct power analysis to estimate Type II error. The toolkit provides a convenient and systematic way to compare NLP system performance that goes beyond statistical significance testing.
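
The three quantities named in the abstract (a significance test, an effect size, and a power estimate) can be illustrated with a minimal sketch. This is not NLPStatTest's code or API: the function name `compare_systems`, the choice of a two-sided z-test on paired score differences, Cohen's d as the effect size, and the normal approximation used for power are all assumptions made for illustration, and the approximation is only reasonable for fairly large samples.

```python
import math
from statistics import NormalDist, mean, stdev


def compare_systems(scores_a, scores_b, alpha=0.05):
    """Hypothetical helper: compare two systems on paired evaluation scores.

    Assumes the paired differences are roughly normal and uses a normal
    approximation for both the test and the power estimate.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired (same length)")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation of the differences
    if sd_diff == 0:
        raise ValueError("differences have zero variance")

    std_normal = NormalDist()

    # Significance test: two-sided z-test on the mean paired difference.
    z = mean_diff / (sd_diff / math.sqrt(n))
    p_value = 2 * (1 - std_normal.cdf(abs(z)))

    # Effect size: Cohen's d computed on the paired differences.
    cohens_d = mean_diff / sd_diff

    # Power: probability of rejecting the null at level alpha if the true
    # standardized effect equals the observed d (a post hoc estimate).
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(cohens_d) * math.sqrt(n)
    power = (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

    return {
        "p_value": p_value,
        "cohens_d": cohens_d,
        "power": power,
        "type_ii_error": 1 - power,  # estimated probability of a Type II error
    }
```

Reporting the effect size and the Type II error estimate alongside the p-value reflects the abstract's point that statistical significance alone does not show whether a difference between systems is practically meaningful.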
