Paper Title

Smallset Timelines: A Visual Representation of Data Preprocessing Decisions

Authors

Lydia R. Lucchesi, Petra M. Kuhnert, Jenny L. Davis, Lexing Xie

Abstract

Data preprocessing is a crucial stage in the data analysis pipeline, with both technical and social aspects to consider. Yet, the attention it receives is often lacking in research practice and dissemination. We present the Smallset Timeline, a visualisation to help reflect on and communicate data preprocessing decisions. A "Smallset" is a small selection of rows from the original dataset containing instances of dataset alterations. The Timeline is comprised of Smallset snapshots representing different points in the preprocessing stage and captions to describe the alterations visualised at each point. Edits, additions, and deletions to the dataset are highlighted with colour. We develop the R software package, smallsets, that can create Smallset Timelines from R and Python data preprocessing scripts. Constructing the figure asks practitioners to reflect on and revise decisions as necessary, while sharing it aims to make the process accessible to a diverse range of audiences. We present two case studies to illustrate use of the Smallset Timeline for visualising preprocessing decisions. Case studies include software defect data and income survey benchmark data, in which preprocessing affects levels of data loss and group fairness in prediction tasks, respectively. We envision Smallset Timelines as a go-to data provenance tool, enabling better documentation and communication of preprocessing tasks at large.
