基于巧合的统计意义的两分数据集的分层聚类

论文标题

基于巧合的统计意义的两分数据集的分层聚类

Hierarchical clustering of bipartite data sets based on the statistical significance of coincidences

论文作者

Tamarit, Ignacio, Pereda, María, Cuesta, José A.

论文摘要

当某些“实体”与“功能”相关时，它们共享它们可以适应两部分网络表示。在民意调查中，植物 - 托管生态社区，科学论文的共同制作，客户和购买或答案只是一些例子。分析网络中此类实体的聚类是一个有用的工具，该工具在许多领域（例如Internet技术，推荐系统或疾病检测）中具有应用程序。最广泛地应用于在两部分网络中找到簇的算法是模块化优化的变体。在这里，我们基于实体之间的差异性提供了层次聚类算法，该实体量化了两个实体共享的特征仅仅是偶然的概率。当应用于一组N实体时，算法性能为$ O（n^2）$，其结果是展示这些实体连接的树状图。通过引入“敏感性”度量，我们可以为聚类和量化其质量提供“最佳”选择。树状图揭示了进一步有用的结构信息 - 例如簇中的子簇或不适合任何群集的节点的存在。我们通过首先将其应用于一组合成网络，然后将其应用于示例的选择来说明该算法。我们还说明了如何将我们的算法转换为一模式网络的有效替代方案，并表明它至少与标准的基于模块化的算法相同，具有较高的数值性能。我们提供了从Github自由访问Python中算法的实现。

When some 'entities' are related by the 'features' they share they are amenable to a bipartite network representation. Plant-pollinator ecological communities, co-authorship of scientific papers, customers and purchases, or answers in a poll, are but a few examples. Analyzing clustering of such entities in the network is a useful tool with applications in many fields, like internet technology, recommender systems, or detection of diseases. The algorithms most widely applied to find clusters in bipartite networks are variants of modularity optimization. Here we provide an hierarchical clustering algorithm based on a dissimilarity between entities that quantifies the probability that the features shared by two entities is due to mere chance. The algorithm performance is $O(n^2)$ when applied to a set of n entities, and its outcome is a dendrogram exhibiting the connections of those entities. Through the introduction of a 'susceptibility' measure we can provide an 'optimal' choice for the clustering as well as quantify its quality. The dendrogram reveals further useful structural information though -- like the existence of sub-clusters within clusters or of nodes that do not fit in any cluster. We illustrate the algorithm by applying it first to a set of synthetic networks, and then to a selection of examples. We also illustrate how to transform our algorithm into a valid alternative for one-mode networks as well, and show that it performs at least as well as the standard, modularity-based algorithms -- with a higher numerical performance. We provide an implementation of the algorithm in Python freely accessible from GitHub.

下载PDF全文

下载文献需遵守相关版权规定

论文标题