Paper Title


Are pathologist-defined labels reproducible? Comparison of the TUPAC16 mitotic figure dataset with an alternative set of labels

Paper Authors

Bertram, Christof A., Veta, Mitko, Marzahl, Christian, Stathonikos, Nikolas, Maier, Andreas, Klopfleisch, Robert, Aubreville, Marc

Paper Abstract


Pathologist-defined labels are the gold standard for histopathological datasets, despite well-known limitations in consistency for some tasks. To date, several datasets on mitotic figures are available and have been used to develop promising deep learning-based algorithms. In order to assess the robustness of those algorithms and the reproducibility of their methods, it is necessary to test them on several independent datasets. The influence of the differing labeling methods of these available datasets is currently unknown. To tackle this, we present an alternative set of labels for the images of the auxiliary mitosis dataset of the TUPAC16 challenge. In addition to manual mitotic figure screening, we used a novel, algorithm-aided labeling process that allowed us to minimize the risk of missing rare mitotic figures in the images. All potential mitotic figures were independently assessed by two pathologists. The novel, publicly available set of labels contains 1,999 mitotic figures (+28.80%) and additionally includes 10,483 labels of cells with high similarity to mitotic figures (hard examples). Using a standard deep learning object detection architecture, we found a significant difference in F_1 scores between the original label set (0.549) and the new alternative label set (0.735). The models trained on the alternative set showed higher overall confidence values, suggesting higher overall label consistency. The findings of the present study show that pathologist-defined labels may vary significantly, resulting in notable differences in model performance. Comparisons of deep learning-based algorithms between independent datasets with different labeling methods should therefore be made with caution.
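The reported F_1 gap between label sets (0.549 vs. 0.735) depends on how predicted detections are matched to ground-truth annotations. A minimal sketch of detection-level F_1 scoring, assuming greedy center-distance matching with a hypothetical 25-pixel threshold (the paper's exact matching criterion is not given in the abstract):

```python
import math

def detection_f1(preds, gts, max_dist=25.0):
    """Compute detection F_1: each predicted center is greedily matched
    to the nearest unmatched ground-truth center within max_dist pixels
    (threshold is an illustrative assumption). Matched pairs are true
    positives; leftover predictions are false positives, leftover
    ground truths are false negatives."""
    unmatched = list(gts)
    tp = 0
    for px, py in preds:
        best_i, best_d = None, max_dist
        for i, (gx, gy) in enumerate(unmatched):
            d = math.hypot(px - gx, py - gy)
            if d <= best_d:
                best_i, best_d = i, d
        if best_i is not None:
            unmatched.pop(best_i)  # one-to-one matching
            tp += 1
    fp = len(preds) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, one correct detection, one spurious detection, and one missed annotation give precision = recall = 0.5 and F_1 = 0.5. Note that with this kind of metric, adding previously missing annotations (as in the alternative label set) changes F_1 for the same model output, which is part of why label consistency matters for cross-dataset comparisons.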
