Paper Title
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources
Paper Authors
Paper Abstract
While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including input text and label sources and tools used to build them, and what they study, tasks they address and motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.