Paper title
Identification Risks Evaluation of Partially Synthetic Data with the $\texttt{IdentificationRiskCalculation}$ R Package
Paper authors
Paper abstract
We extend a general approach to evaluating the identification risk of synthesized variables in partially synthetic data. For multiple continuous synthesized variables, we introduce a radius $r$ into the construction of the identification risk probability of each target record, and illustrate its use with worked examples. We create the $\texttt{IdentificationRiskCalculation}$ R package to help researchers and data disseminators perform these identification risk evaluations. We demonstrate our methods through the R package with applications to a data sample from the Consumer Expenditure Surveys, and discuss how risk and data utility are affected by 1) the choice of radius $r$, 2) the choice of synthesized variables, and 3) the choice of the number of synthetic datasets. We give recommendations for statistical agencies on synthesizing continuous variables and evaluating their identification risk.