Paper Title

Synthetic Over-sampling for Imbalanced Node Classification with Graph Neural Networks

Authors

Tianxiang Zhao, Xiang Zhang, Suhang Wang

Abstract

In recent years, graph neural networks (GNNs) have achieved state-of-the-art performance in node classification. However, most existing GNNs suffer from the graph imbalance problem. In many real-world scenarios, node classes are imbalanced, with a few majority classes making up most of the graph. The message propagation mechanism in GNNs further amplifies the dominance of those majority classes, resulting in sub-optimal classification performance. In this work, we seek to address this problem by generating pseudo instances of minority classes to balance the training data, extending previous over-sampling-based techniques. This task is non-trivial, because those techniques are designed under the assumption that instances are independent, and neglecting relational information complicates the over-sampling process. Furthermore, node classification is typically performed in a semi-supervised setting with only a few labeled nodes, which provides insufficient supervision for generating minority instances, and low-quality generated nodes would harm the trained classifier. We address these difficulties by synthesizing new nodes in a constructed embedding space that encodes both node attributes and topology information. In addition, an edge generator is trained simultaneously to model the graph structure and provide relations for the new samples. To further improve data efficiency, we also explore synthesizing mixed "in-between" nodes so that nodes from the majority classes can be used in the over-sampling process. Experiments on real-world datasets validate the effectiveness of the proposed framework.
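The two core steps named in the abstract, synthesizing a minority node in an embedding space and giving it edges, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes SMOTE-style interpolation toward the nearest same-class neighbor, and it substitutes a fixed sigmoid of the inner product for the trained edge generator. The names `nearest_same_class`, `synthesize_node`, and `predict_edges`, the 0.5 threshold, and the toy embeddings are all hypothetical.

```python
import math
import random


def nearest_same_class(idx, embeddings, labels):
    """Return the index of the closest other node that shares idx's label."""
    best, best_dist = None, math.inf
    for j, (emb, lab) in enumerate(zip(embeddings, labels)):
        if j == idx or lab != labels[idx]:
            continue
        dist = math.dist(embeddings[idx], emb)  # Euclidean distance
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def synthesize_node(idx, embeddings, labels, rng):
    """Interpolate between a minority node and its nearest same-class neighbor."""
    nn = nearest_same_class(idx, embeddings, labels)
    if nn is None:  # no same-class neighbor: fall back to a copy
        return list(embeddings[idx])
    delta = rng.random()  # interpolation weight in [0, 1)
    return [a + delta * (b - a) for a, b in zip(embeddings[idx], embeddings[nn])]


def predict_edges(new_emb, embeddings, threshold=0.5):
    """Assumed stand-in for a trained edge generator: link the new node to
    every existing node whose sigmoid(inner product) exceeds the threshold."""
    edges = []
    for j, emb in enumerate(embeddings):
        score = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(new_emb, emb))))
        if score > threshold:
            edges.append(j)
    return edges


# Toy embedding space: three majority nodes (label 0), two minority nodes (label 1).
embeddings = [[-1.0, -1.0], [-1.2, -0.9], [-0.8, -1.1], [1.0, 1.0], [1.2, 0.8]]
labels = [0, 0, 0, 1, 1]

rng = random.Random(0)
new_node = synthesize_node(3, embeddings, labels, rng)
new_edges = predict_edges(new_node, embeddings)
print(new_node, new_edges)
```

In this toy setup the synthetic node lies on the segment between the two minority embeddings, so it connects only to minority nodes. In the framework described above, the embedding space and the edge generator are learned jointly with the classifier rather than fixed as they are here.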
