Paper Title

Behind the Mask: Demographic bias in name detection for PII masking

Authors

Courtney Mansfield, Amandalynne Paullada, Kristen Howell

Abstract

Many datasets contain personally identifiable information, or PII, which poses privacy risks to individuals. PII masking is commonly used to redact personal information such as names, addresses, and phone numbers from text data. Most modern PII masking pipelines involve machine learning algorithms. However, these systems may vary in performance, such that individuals from particular demographic groups bear a higher risk for having their personal information exposed. In this paper, we evaluate the performance of three off-the-shelf PII masking systems on name detection and redaction. We generate data using names and templates from the customer service domain. We find that an open-source RoBERTa-based system shows fewer disparities than the commercial models we test. However, all systems demonstrate significant differences in error rate based on demographics. In particular, the highest error rates occurred for names associated with Black and Asian/Pacific Islander individuals.
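The evaluation described in the abstract (fill templates with names, run a masking system, compare error rates across groups) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the templates, the placeholder names, the group labels `group_a` and `group_b`, and `toy_masker`, which is only a stand-in and not any of the three systems the paper evaluates.

```python
from collections import defaultdict

# Hypothetical customer-service templates; not the paper's actual templates.
TEMPLATES = [
    "Hello, my name is {name} and I need help with my order.",
    "This is {name} calling about a billing issue.",
]

# Placeholder names and group labels, for illustration only.
NAMES_BY_GROUP = {
    "group_a": ["Alice", "Robert"],
    "group_b": ["Zyx", "Qwerty"],
}


def toy_masker(text, known_names):
    """Stand-in for a PII masking system: redacts only names it knows."""
    for name in known_names:
        text = text.replace(name, "[NAME]")
    return text


def miss_rate_by_group(masker, templates, names_by_group):
    """Fraction of generated sentences in which the name survives masking."""
    misses = defaultdict(int)
    totals = defaultdict(int)
    for group, names in names_by_group.items():
        for name in names:
            for template in templates:
                masked = masker(template.format(name=name))
                totals[group] += 1
                if name in masked:  # the name leaked through unmasked
                    misses[group] += 1
    return {group: misses[group] / totals[group] for group in totals}


rates = miss_rate_by_group(
    lambda text: toy_masker(text, known_names={"Alice", "Robert", "Zyx"}),
    TEMPLATES,
    NAMES_BY_GROUP,
)
print(rates)
```

Because the toy masker does not know one of the `group_b` names, that group ends up with a higher miss rate. A real study would also need a statistical test before calling such a gap significant.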
