Paper title
Link Prediction Using Supervised Machine Learning based on Aggregated and Topological Features
Authors
Abstract
Link prediction is an important task in social network analysis. A social network has many characteristics (features) that can be used for link prediction. In this paper, we evaluate the effectiveness of aggregated features and topological features in link prediction using supervised learning. Aggregated features are aggregation functions of the attributes of the nodes. Topological features describe the topology, or structure, of a social network and its underlying graph. We evaluate the effectiveness of these features by measuring the performance of different supervised machine learning methods. Specifically, we select five well-known supervised methods: J48 decision tree, multi-layer perceptron (MLP), support vector machine (SVM), logistic regression, and Naive Bayes (NB). We measure the performance of these five methods with different sets of features from the DBLP dataset. Our results indicate that the combination of aggregated and topological features yields the best performance. For evaluation, we use accuracy, area under the ROC curve (AUC), and F-measure. Our selected features can be used to analyze almost any social network, because they capture important characteristics of the network's underlying graph. The significance of our work is that the selected features can be very effective in the analysis of big social networks, where we usually deal with big data sets containing millions or billions of instances. Using fewer, but more effective, features can help in analyzing such networks.
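The distinction between the two feature families can be made concrete with a minimal sketch. The abstract does not list the specific features the paper uses, so everything below is an assumption chosen for illustration: the toy co-authorship graph, the `paper_count` attribute, and the choice of common neighbors and the Jaccard coefficient as topological features and a sum of paper counts as an aggregated feature. Feature vectors of this kind would then be passed to a supervised classifier such as those named above.

```python
from itertools import combinations

# Hypothetical toy co-authorship graph: author -> set of co-authors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

# Hypothetical node attribute: number of papers per author.
paper_count = {"a": 5, "b": 12, "c": 3, "d": 1}


def pair_features(u, v):
    """Return a feature dict for the candidate link (u, v)."""
    nu, nv = graph[u], graph[v]
    common = nu & nv
    union = nu | nv
    return {
        # Topological features: depend only on the graph structure.
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Aggregated feature: an aggregation (here, a sum) of node attributes.
        "sum_of_papers": paper_count[u] + paper_count[v],
    }


# Candidate pairs are node pairs that are not yet linked.
candidates = [
    (u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]
]
features = {pair: pair_features(*pair) for pair in candidates}

for pair, feats in features.items():
    print(pair, feats)
```

In a supervised setting, each candidate pair would also carry a label indicating whether the link appears in a later snapshot of the network; the labels and the train/test split are omitted here because the abstract does not describe them.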