Paper Title
STAD: Self-Training with Ambiguous Data for Low-Resource Relation Extraction
Paper Authors
Paper Abstract
We present a simple yet effective self-training approach, named STAD, for low-resource relation extraction. The approach first classifies the auto-annotated instances into two groups, confident instances and uncertain instances, according to the probabilities predicted by a teacher model. In contrast to most previous studies, which use only the confident instances for self-training, we also make use of the uncertain instances. To this end, we propose a method to identify ambiguous but useful instances among the uncertain ones, and then divide the relations into a candidate-label set and a negative-label set for each ambiguous instance. Next, we propose a set-negative training method on the negative-label sets of the ambiguous instances and a positive training method for the confident instances. Finally, a joint-training method is proposed to build the final relation extraction system on all the data. Experimental results on two widely used datasets, SemEval2010 Task-8 and Re-TACRED, under low-resource settings demonstrate that this new self-training approach achieves significant and consistent improvements over several competitive self-training systems. Code is publicly available at https://github.com/jjyunlp/STAD
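The partitioning and set-negative training described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the thresholds (`confident_threshold`, `candidate_mass`) and the rule for building the candidate-label set (smallest top-ranked set covering a fixed probability mass) are assumptions for the sake of the example.

```python
import math

def split_instances(probs, confident_threshold=0.9, candidate_mass=0.9):
    """Split teacher-annotated instances into confident and ambiguous groups.

    probs: list of dicts mapping relation label -> teacher probability.
    Returns (confident, ambiguous); each ambiguous entry carries a
    candidate-label set and a negative-label set.
    Thresholds here are illustrative, not the paper's values.
    """
    confident, ambiguous = [], []
    for i, p in enumerate(probs):
        top_label, top_prob = max(p.items(), key=lambda kv: kv[1])
        if top_prob >= confident_threshold:
            # high-confidence prediction: used for positive training
            confident.append((i, top_label))
        else:
            # candidate set: smallest set of top-ranked labels whose
            # cumulative probability reaches candidate_mass
            ranked = sorted(p.items(), key=lambda kv: -kv[1])
            candidates, mass = [], 0.0
            for label, prob in ranked:
                candidates.append(label)
                mass += prob
                if mass >= candidate_mass:
                    break
            negatives = [lab for lab in p if lab not in candidates]
            ambiguous.append((i, candidates, negatives))
    return confident, ambiguous

def set_negative_loss(p, negative_labels):
    """Set-negative training objective (sketch): push down the predicted
    probability of every label in the negative-label set,
    i.e. -sum_y log(1 - p(y)) over the negative labels."""
    return -sum(math.log(1.0 - p[y]) for y in negative_labels)
```

In a full system these losses would be computed from the student model's logits and combined with the standard cross-entropy on the confident instances for joint training; the dictionary-based form above simply makes the grouping logic explicit.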