Paper title
Is Your Model Sensitive? SPeDaC: A New Benchmark for Detecting and Classifying Sensitive Personal Data
Paper authors
Paper abstract
In recent years, there has been an exponential growth of applications, including dialogue systems, that handle sensitive personal information. This has brought to light the extremely important issue of personal data protection in virtual environments. Sensitive Information Detection (SID) has been approached across different domains and languages in the literature. However, for the personal data domain, the absence of a shared benchmark or of an available labeled resource makes comparison with the state of the art difficult. We introduce and release SPeDaC, a new annotated resource for the identification of sensitive personal data categories in the English language. SPeDaC enables the evaluation of computational models on three different SID subtasks with increasing levels of complexity. SPeDaC 1 concerns binary classification: a model has to detect whether a sentence contains sensitive information or not. In SPeDaC 2, we collected labeled sentences using 5 categories that relate to macro-domains of personal information. In SPeDaC 3, the labeling is fine-grained (61 personal data categories). We conduct an extensive evaluation of the resource using different state-of-the-art classifiers. The results show that SPeDaC is challenging, particularly with regard to fine-grained classification. The transformer models achieve the best results (accuracy: RoBERTa on SPeDaC 1 = 98.20%, DeBERTa on SPeDaC 2 = 95.81% and SPeDaC 3 = 77.63%).