Rethinking Efficiency and Redundancy in Training Large-scale Graphs

Paper Title

Rethinking Efficiency and Redundancy in Training Large-scale Graphs

Authors

Xin Liu, Xunbin Xiong, Mingyu Yan, Runzhen Xue, Shirui Pan, Xiaochun Ye, Dongrui Fan

Abstract

Large-scale graphs are ubiquitous in real-world scenarios, and Graph Neural Networks (GNNs) can be trained on them to generate representations for downstream tasks. Given the abundant information and complex topology of a large-scale graph, we argue that such graphs contain redundancy that degrades training efficiency. Limited model scalability severely restricts the efficiency of training on large-scale graphs with vanilla GNNs. Sampling-based training methods have advanced recently, but sampling-based GNNs generally overlook the redundancy issue, and training these models on large-scale graphs still takes an intolerably long time. We therefore propose to drop redundancy and improve the efficiency of training on large-scale graphs with GNNs by rethinking the inherent characteristics of a graph. In this paper, we propose a once-for-all method, termed DropReef, to drop the redundancy in large-scale graphs. Specifically, we first conduct preliminary experiments to explore potential redundancy in large-scale graphs. Next, we present a metric that quantifies the neighbor heterophily of every node in a graph. Based on both experimental and theoretical analysis, we identify the redundancy in a large-scale graph as nodes with high neighbor heterophily and a large number of neighbors. We then propose DropReef to detect and drop this redundancy once and for all, reducing training time without sacrificing model accuracy. To demonstrate the effectiveness of DropReef, we apply it to recent state-of-the-art sampling-based GNNs for training on large-scale graphs, chosen for their high accuracy. With DropReef applied, the training efficiency of these models improves substantially. DropReef is highly compatible and runs offline, so it can benefit both current and future state-of-the-art sampling-based GNNs.
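
The abstract does not define the neighbor heterophily metric or the selection rule, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes heterophily is the fraction of a node's neighbors whose label differs from the node's own label, and that a node counts as redundant when it exceeds both a heterophily threshold and a degree threshold. The function names, the label-based definition, and both threshold values are hypothetical.

```python
from collections import defaultdict


def neighbor_heterophily(edges, labels):
    """Return (scores, adjacency) for an undirected graph.

    ASSUMPTION: heterophily of a node is the fraction of its neighbors
    whose label differs from its own. The paper's metric may differ.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    scores = {}
    for node in labels:
        nbrs = adj[node]
        if not nbrs:
            scores[node] = 0.0  # isolated node: no neighbors to disagree with
            continue
        differing = sum(1 for n in nbrs if labels[n] != labels[node])
        scores[node] = differing / len(nbrs)
    return scores, adj


def drop_redundant_nodes(edges, labels, het_threshold=0.5, degree_threshold=3):
    """Drop nodes with high heterophily AND many neighbors, once, offline.

    ASSUMPTION: both thresholds are illustrative values, not from the paper.
    Returns the set of dropped nodes and the remaining edge list.
    """
    scores, adj = neighbor_heterophily(edges, labels)
    dropped = {
        n for n in labels
        if scores[n] > het_threshold and len(adj[n]) >= degree_threshold
    }
    kept_edges = [(u, v) for u, v in edges if u not in dropped and v not in dropped]
    return dropped, kept_edges


# Toy graph: node 0 has three neighbors, all with a different label.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
labels = {0: "a", 1: "b", 2: "b", 3: "b", 4: "b"}
dropped, kept_edges = drop_redundant_nodes(edges, labels)
print(dropped, kept_edges)
```

Because the pruning runs once before training, the reduced edge list could be handed to any sampling-based trainer unchanged, which is the sense in which the abstract describes the method as offline and compatible.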

Paper Title