Paper title
Domain-based Latent Personal Analysis and its use for impersonation detection in social media
Paper authors
Paper abstract
Zipf's law defines an inverse proportion between a word's rank in a given corpus and its frequency in it, roughly dividing the vocabulary into frequent words and infrequent ones. Here, we stipulate that within a domain an author's signature can be derived from, in loose terms, the author's missing popular words and frequently used infrequent words. We devise a method, termed Latent Personal Analysis (LPA), for finding domain-based attributes for entities in a domain: their distance from the domain and their signature, which determines how they differ most from the domain. We identify the most suitable distance metric for the method among several candidates and construct the distances and personal signatures for authors, the domain's entities. The signature consists of both over-used terms (compared to the average) and missing popular terms. We validate the correctness and power of the signatures in identifying users and set existence conditions. We then show uses for the method in explainable authorship attribution: we define algorithms that utilize LPA to identify two types of impersonation in social media: (1) authors with sockpuppet (multiple) accounts; (2) front-user accounts, operated by several authors. We validate the algorithms and employ them over a large-scale dataset obtained from a social media site with over 4000 users. We corroborate these results using temporal rate analysis. LPA can further be used to devise personal attributes in a wide range of scientific domains in which the constituents have a long-tail distribution of elements.
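The two attributes described in the abstract, an entity's distance from the domain and its signature of over-used and missing terms, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the distance metric that was selected, so the sketch assumes the Jensen-Shannon divergence as a stand-in, and it assumes the domain is modeled as the term-by-term average of the authors' relative frequencies. All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def relative_frequencies(tokens):
    """Map each term to its share of one author's tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def domain_profile(author_profiles):
    """Average the per-author profiles term by term (assumed domain model)."""
    vocabulary = set().union(*author_profiles)
    n = len(author_profiles)
    return {t: sum(p.get(t, 0.0) for p in author_profiles) / n for t in vocabulary}


def signed_terms(author, domain):
    """Per-term Jensen-Shannon contributions (assumed metric), signed by direction.

    Positive: the author over-uses the term relative to the domain.
    Negative: the term is under-used or missing for this author.
    """
    signed = {}
    for term, q in domain.items():
        p = author.get(term, 0.0)
        m = (p + q) / 2
        contribution = 0.0
        if p > 0:
            contribution += 0.5 * p * log2(p / m)
        if q > 0:
            contribution += 0.5 * q * log2(q / m)
        signed[term] = contribution if p >= q else -contribution
    return signed


def distance_and_signature(author, domain, top_k=3):
    """Return the total divergence and the top_k terms that drive it."""
    signed = signed_terms(author, domain)
    distance = sum(abs(v) for v in signed.values())
    ranked = sorted(signed.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return distance, ranked[:top_k]


# Toy data (hypothetical): two authors with one short text each.
posts = {
    "alice": "the cat sat on the mat".split(),
    "bob": "the dog sat on the log".split(),
}
profiles = {name: relative_frequencies(toks) for name, toks in posts.items()}
domain = domain_profile(list(profiles.values()))
distance, signature = distance_and_signature(profiles["alice"], domain)
signed = signed_terms(profiles["alice"], domain)
```

In this toy domain, the terms that alice never uses but bob does carry a negative sign (missing terms), while the terms only alice uses carry a positive sign (over-used terms). Terms that both authors use at the same rate contribute nothing to the distance.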