Paper Title
A Large-Scale Database for Graph Representation Learning
Paper Authors
Paper Abstract
With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify three components critical to advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all of these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of malicious software function call graphs. MalNet contains over 1.2 million graphs, averaging over 15k nodes and 35k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 39x larger graphs on average, and 63x more classes. We provide a detailed analysis of MalNet, discussing its properties and provenance, along with an evaluation of state-of-the-art machine learning and graph neural network techniques. The unprecedented scale and diversity of MalNet offer exciting opportunities to advance the frontiers of graph representation learning, enabling new discoveries and research into imbalanced classification, explainability, and the impact of class hardness. The database is publicly available at www.mal-net.org.
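The comparison with REDDIT-12K can be sanity-checked using only the figures quoted in the abstract. The sketch below is a back-of-the-envelope calculation, not part of the paper: it divides each MalNet figure by its stated ratio to estimate the baseline size those ratios imply. The names `MALNET`, `RATIOS`, and `implied_baseline` are hypothetical, and the MalNet inputs are the rounded lower bounds from the abstract ("over 1.2 million", "over 15k"), so the results are approximate.

```python
# Figures quoted in the abstract; the first two are rounded lower bounds.
MALNET = {"graphs": 1_200_000, "avg_nodes": 15_000, "classes": 696}

# Ratios the abstract reports for MalNet relative to REDDIT-12K.
RATIOS = {"graphs": 105, "avg_nodes": 39, "classes": 63}


def implied_baseline(malnet, ratios):
    """Estimate the baseline figures by dividing each MalNet value by its ratio."""
    return {key: malnet[key] / ratios[key] for key in ratios}


baseline = implied_baseline(MALNET, RATIOS)
for key, value in baseline.items():
    print(f"{key}: ~{value:,.0f}")
```

Under these assumptions the ratios imply a baseline of roughly 11,400 graphs, about 385 nodes per graph, and about 11 classes, which is consistent with a database named for its approximately 12K graphs.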