Paper Title
Domain-aware Self-supervised Pre-training for Label-Efficient Meme Analysis
Paper Authors
Paper Abstract
Existing self-supervised learning strategies are constrained to either a limited set of objectives or generic downstream tasks that predominantly target uni-modal applications. This has isolated progress for imperative multi-modal applications that are diverse in terms of complexity and domain affinity, such as meme analysis. Here, we introduce two self-supervised pre-training methods, namely Ext-PIE-Net and MM-SimCLR, that (i) employ off-the-shelf multi-modal hate-speech data during pre-training and (ii) perform self-supervised learning by jointly incorporating multiple specialized pretext tasks, effectively catering to the complex multi-modal representation learning that meme analysis requires. We experiment with different self-supervision strategies, including potential variants that could help learn rich cross-modal representations, and evaluate them using popular linear probing on the Hateful Memes task. The proposed solutions compete strongly with the fully supervised baseline via label-efficient training, while distinctly outperforming it on all three tasks of the Memotion challenge with performance gains of 0.18%, 23.64%, and 0.93%, respectively. Further, we demonstrate the generalizability of the proposed solutions by reporting competitive performance on the HarMeme task. Finally, we empirically establish the quality of the learned representations by analyzing task-specific learning with fewer labeled training samples, and argue that the complexity of the self-supervision strategy and that of the downstream task at hand are correlated. Our efforts highlight the need for better multi-modal self-supervision methods involving specialized pretext tasks for efficient fine-tuning and generalizable performance.
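The abstract does not spell out the MM-SimCLR objective, but SimCLR-style methods are typically built around an NT-Xent contrastive loss; a minimal PyTorch sketch of a cross-modal variant is below, assuming paired image and text embeddings for each meme and a symmetric image-to-text / text-to-image formulation. The function name and pairing scheme are illustrative assumptions, not the paper's verbatim formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_nt_xent(image_emb, text_emb, temperature=0.5):
    """Hypothetical cross-modal NT-Xent loss for paired meme data.

    image_emb, text_emb: (batch, dim) projections from the two
    modality encoders; matching rows describe the same meme, so
    the diagonal of the similarity matrix holds the positive pairs.
    """
    # L2-normalize so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)

    # (batch, batch) temperature-scaled similarity matrix.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric loss: image->text and text->image retrieval.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

With this objective, every other meme in the batch serves as a negative for a given image-text pair, which is what drives the encoders toward the rich cross-modal representations the abstract targets.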
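The abstract also evaluates via linear probing; the sketch below shows the standard protocol (a frozen pre-trained encoder with a single trainable linear head), under the assumption that the paper follows this conventional setup. The class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Standard linear-probing head: the pre-trained encoder is
    frozen and only a single linear classifier is trained."""

    def __init__(self, encoder, feature_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # freeze pre-trained weights
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)  # fixed representations
        return self.head(feats)     # only this layer is updated
```

Because only the head is optimized, downstream accuracy under this protocol directly reflects the quality of the self-supervised representations rather than any fine-tuning capacity.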