Paper Title

CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation

Paper Authors

Mark Diaz, Ian D. Kivlichan, Rachel Rosen, Dylan K. Baker, Razvan Amironesei, Vinodkumar Prabhakaran, Emily Denton

Paper Abstract

Human annotated data plays a crucial role in machine learning (ML) research and development. However, the ethical considerations around the processes and decisions that go into dataset annotation have not received nearly enough attention. In this paper, we survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation. We synthesize these insights, and lay out the challenges in this space along two layers: (1) who the annotators are, and how the annotators' lived experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms, and what that relationship affords them. Finally, we introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset release and maintenance.
