Paper Title
A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation
Paper Authors
Paper Abstract
Multi-modal neural machine translation (NMT) aims to translate source sentences paired with images into a target language. However, dominant multi-modal NMT models do not fully exploit the fine-grained semantic correspondences between semantic units of different modalities, which have the potential to refine multi-modal representation learning. To address this issue, we propose a novel graph-based multi-modal fusion encoder for NMT. Specifically, we first represent the input sentence and image with a unified multi-modal graph, which captures various semantic relationships between multi-modal semantic units (words and visual objects). We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations. Finally, these representations provide an attention-based context vector for the decoder. We evaluate the proposed encoder on the Multi30K dataset. Experimental results and in-depth analysis show the superiority of our multi-modal NMT model.
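To make the described architecture concrete, below is a minimal, hypothetical sketch of one graph-based multi-modal fusion layer. It is not the authors' implementation: it assumes word nodes and visual-object nodes have already been embedded into a shared d-dimensional space, encodes the multi-modal graph as a binary adjacency matrix, and collapses the paper's separate intra-modal and inter-modal fusion steps into a single graph-masked attention followed by a feed-forward sub-layer. All class and variable names are illustrative.

```python
# Minimal sketch (not the authors' code) of a graph-based multi-modal fusion layer.
# Assumes word and visual-object nodes share one embedding space and that `adj`
# holds the multi-modal graph edges (intra- and inter-modal) as 0/1 entries.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphMultiModalFusionLayer(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Query/key/value projections shared by all nodes (words + visual objects).
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Position-wise feed-forward sub-layer, as in Transformer-style encoders.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """nodes: (batch, n_nodes, d_model) word + visual-object states.
        adj:   (batch, n_nodes, n_nodes) with 1 where a graph edge exists."""
        scores = self.q(nodes) @ self.k(nodes).transpose(-2, -1)
        scores = scores / nodes.size(-1) ** 0.5
        # Restrict each node's attention to its graph neighbours, so semantic
        # interaction only flows along the edges of the multi-modal graph.
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        nodes = self.norm1(nodes + attn @ self.v(nodes))
        return self.norm2(nodes + self.ffn(nodes))


# Toy usage: 6 word nodes and 3 visual-object nodes, fully connected here for
# simplicity; stacking several such layers yields the node representations
# that an attention-based decoder would consume.
if __name__ == "__main__":
    batch, n_words, n_objs, d = 2, 6, 3, 512
    nodes = torch.randn(batch, n_words + n_objs, d)
    adj = torch.ones(batch, n_words + n_objs, n_words + n_objs)
    out = GraphMultiModalFusionLayer(d)(nodes, adj)
    print(out.shape)  # torch.Size([2, 9, 512])
```

The key design choice this sketch illustrates is that fusion is driven by the graph structure itself: masking the attention scores with the adjacency matrix means a word node can only exchange information with the words and visual objects it is explicitly linked to, rather than attending over the full image-sentence pair.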