Paper Title

Multi-Dimensional Randomized Response

Authors

Josep Domingo-Ferrer, Jordi Soria-Comas

Abstract

In our data world, a host of not necessarily trusted controllers gather data on individual subjects. To preserve her privacy and, more generally, her informational self-determination, the individual has to be empowered by giving her agency over her own data. Maximum agency is afforded by local anonymization, which allows each individual to anonymize her own data before handing them to the data controller. Randomized response (RR) is a local anonymization approach able to yield multi-dimensional full sets of anonymized microdata that are valid for exploratory analysis and machine learning. This is so because an unbiased estimate of the distribution of the true data of individuals can be obtained from their pooled randomized data. Furthermore, RR offers rigorous privacy guarantees. The main weakness of RR is the curse of dimensionality when applied to several attributes: as the number of attributes grows, the accuracy of the estimated true data distribution quickly degrades. We propose several complementary approaches to mitigate the dimensionality problem. First, we present two basic protocols, separate RR on each attribute and joint RR for all attributes, and discuss their limitations. Then we introduce an algorithm to form clusters of attributes so that attributes in different clusters can be viewed as independent and joint RR can be performed within each cluster. After that, we introduce an adjustment algorithm for the randomized data set that repairs some of the accuracy loss due to assuming independence between attributes when using RR separately on each attribute, or due to assuming independence between clusters in cluster-wise RR. We also present empirical work to illustrate the proposed methods.
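To make the abstract's central claim concrete, the following is a minimal sketch of randomized response on a single categorical attribute, together with the unbiased estimate of the true distribution recovered from the pooled reports. The randomization design (report the true value with probability p, otherwise a uniform draw from all categories) and the names `randomize`, `estimate_distribution`, and `joint_domain_size` are assumptions chosen for illustration. They are not taken from the paper, whose protocols may use a different randomization matrix.

```python
import random
from collections import Counter


def randomize(value, categories, p, rng):
    """Assumed RR design: keep the true value with probability p,
    otherwise report a uniform draw from all categories."""
    if rng.random() < p:
        return value
    return rng.choice(categories)


def estimate_distribution(reports, categories, p):
    """Unbiased estimate of the true distribution from pooled reports.

    Under the assumed design, the expected reported frequency of category j
    is lam_j = p * pi_j + (1 - p) / k, so pi_j = (lam_j - (1 - p) / k) / p.
    Individual estimates can fall slightly below 0 for rare categories.
    """
    n = len(reports)
    k = len(categories)
    counts = Counter(reports)
    return {c: (counts[c] / n - (1 - p) / k) / p for c in categories}


def joint_domain_size(cardinalities):
    """Number of categories joint RR must handle: the product of the
    attribute cardinalities, which grows quickly with the attribute count."""
    size = 1
    for c in cardinalities:
        size *= c
    return size


# Hypothetical data: 10,000 individuals with true shares 0.6 / 0.3 / 0.1.
rng = random.Random(0)
categories = ["A", "B", "C"]
true_values = ["A"] * 6000 + ["B"] * 3000 + ["C"] * 1000

# Each individual randomizes locally before handing data to the controller.
reports = [randomize(v, categories, 0.7, rng) for v in true_values]

# The controller only sees the reports, yet can estimate the true shares.
print(estimate_distribution(reports, categories, 0.7))

# Joint RR over four attributes with 3, 4, 5 and 2 categories.
print(joint_domain_size([3, 4, 5, 2]))
```

The last call hints at the dimensionality problem the abstract describes: with a fixed number of respondents, joint RR spreads the reports over a product-sized domain, so each cell's estimate becomes noisier as attributes are added.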
