Paper Title

Multiple Imputation and Synthetic Data Generation with the R package NPBayesImputeCat

Authors

Hu, Jingchen; Akande, Olanrewaju; Wang, Quanli

Abstract

In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents' privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data, and ii) create synthetic data for statistical disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet process mixture of products of multinomial distributions model used in the package, and illustrate various uses of the package using data samples from the American Community Survey (ACS). We also compare results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package.
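For readers unfamiliar with the model named in the abstract, the following is a sketch of a truncated Dirichlet process mixture of products of multinomial distributions, written in the form commonly used in the literature. The notation is an assumption introduced here for illustration and does not come from this listing: $n$ records, $p$ categorical variables, $d_j$ levels for variable $j$, $K$ mixture components (the truncation level), and hyperparameters $a_\alpha$, $b_\alpha$. The package's exact prior settings may differ.

```latex
\begin{align*}
X_{ij} \mid z_i, \theta &\sim \mathrm{Categorical}\bigl(\theta^{(j)}_{z_i 1}, \dots, \theta^{(j)}_{z_i d_j}\bigr),
  && i = 1, \dots, n,\; j = 1, \dots, p, \\
z_i \mid \pi &\sim \mathrm{Categorical}(\pi_1, \dots, \pi_K),
  && i = 1, \dots, n, \\
\pi_k &= V_k \prod_{l < k} (1 - V_l),
  && k = 1, \dots, K, \\
V_k \mid \alpha &\sim \mathrm{Beta}(1, \alpha),
  && k = 1, \dots, K - 1, \qquad V_K = 1, \\
\alpha &\sim \mathrm{Gamma}(a_\alpha, b_\alpha), \\
\bigl(\theta^{(j)}_{k1}, \dots, \theta^{(j)}_{k d_j}\bigr) &\sim \mathrm{Dirichlet}(1, \dots, 1),
  && k = 1, \dots, K,\; j = 1, \dots, p.
\end{align*}
```

Within a mixture component the variables are independent, so dependence among them arises only through the shared latent class $z_i$. Under a model of this kind, missing entries can be imputed, and synthetic records generated, by drawing from the categorical distributions conditional on posterior draws of $z_i$ and $\theta$.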
