论文标题
多个数据集(包括个人信息)的非可读性识别数据协作分析
Non-readily identifiable data collaboration analysis for multiple datasets including personal information
论文作者
论文摘要
多源数据融合,共同分析了多个数据源以获得改进的信息,具有相当大的研究关注。对于多个医疗机构的数据集,数据机密性和跨机构沟通至关重要。在这种情况下,数据协作(DC)分析通过共享维度降低的中间表示,而无需迭代跨机构沟通可能是合适的。在分析包括个人信息在内的数据时,共享数据的可识别性至关重要。在这项研究中,研究了DC分析的可识别性。结果表明,共享的中间表示很容易识别为原始数据以进行监督学习。然后,这项研究提出了一个非可读性识别的直流分析,仅共享多个医学数据集(包括个人信息)的非可读数据。所提出的方法基于随机样本置换,可解释的DC分析的概念以及无法重建的功能的使用来解决可识别性问题。在医学数据集的数值实验中,所提出的方法表现出非可读性可识别性,同时保持了常规DC分析的高识别性能。对于医院的数据集,提出的方法在仅使用本地数据集的本地分析的识别性能方面提高了9个百分点。
Multi-source data fusion, in which multiple data sources are jointly analyzed to obtain improved information, has considerable research attention. For the datasets of multiple medical institutions, data confidentiality and cross-institutional communication are critical. In such cases, data collaboration (DC) analysis by sharing dimensionality-reduced intermediate representations without iterative cross-institutional communications may be appropriate. Identifiability of the shared data is essential when analyzing data including personal information. In this study, the identifiability of the DC analysis is investigated. The results reveals that the shared intermediate representations are readily identifiable to the original data for supervised learning. This study then proposes a non-readily identifiable DC analysis only sharing non-readily identifiable data for multiple medical datasets including personal information. The proposed method solves identifiability concerns based on a random sample permutation, the concept of interpretable DC analysis, and usage of functions that cannot be reconstructed. In numerical experiments on medical datasets, the proposed method exhibits a non-readily identifiability while maintaining a high recognition performance of the conventional DC analysis. For a hospital dataset, the proposed method exhibits a nine percentage point improvement regarding the recognition performance over the local analysis that uses only local dataset.