Paper Title
A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation
Paper Authors
Paper Abstract
Multi-modal neural machine translation (NMT) aims to translate source sentences paired with images into a target language. However, dominant multi-modal NMT models do not fully exploit the fine-grained semantic correspondences between semantic units of different modalities, which have the potential to refine multi-modal representation learning. To address this issue, we propose a novel graph-based multi-modal fusion encoder for NMT. Specifically, we first represent the input sentence and image with a unified multi-modal graph, which captures various semantic relationships between multi-modal semantic units (words and visual objects). We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations. Finally, these representations provide an attention-based context vector for the decoder. We evaluate the proposed encoder on the Multi30K dataset. Experimental results and in-depth analysis show the superiority of our multi-modal NMT model.
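To make the described architecture concrete, below is a minimal, hypothetical sketch of one graph-based multi-modal fusion layer. It is not the authors' implementation: it assumes word nodes and visual-object nodes have already been embedded into a shared d-dimensional space, encodes the multi-modal graph as a binary adjacency matrix, and collapses the paper's separate intra-modal and inter-modal fusion steps into a single graph-masked attention followed by a feed-forward sub-layer. All class and variable names are illustrative.

```python
# Minimal sketch (not the authors' code) of a graph-based multi-modal fusion layer.
# Assumes word and visual-object nodes share one embedding space and that `adj`
# holds the multi-modal graph edges (intra- and inter-modal) as 0/1 entries.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphMultiModalFusionLayer(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Query/key/value projections shared by all nodes (words + visual objects).
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Position-wise feed-forward sub-layer, as in Transformer-style encoders.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """nodes: (batch, n_nodes, d_model) word + visual-object states.
        adj:   (batch, n_nodes, n_nodes) with 1 where a graph edge exists."""
        scores = self.q(nodes) @ self.k(nodes).transpose(-2, -1)
        scores = scores / nodes.size(-1) ** 0.5
        # Restrict each node's attention to its graph neighbours, so semantic
        # interaction only flows along the edges of the multi-modal graph.
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        nodes = self.norm1(nodes + attn @ self.v(nodes))
        return self.norm2(nodes + self.ffn(nodes))


# Toy usage: 6 word nodes and 3 visual-object nodes, fully connected here for
# simplicity; stacking several such layers yields the node representations
# that an attention-based decoder would consume.
if __name__ == "__main__":
    batch, n_words, n_objs, d = 2, 6, 3, 512
    nodes = torch.randn(batch, n_words + n_objs, d)
    adj = torch.ones(batch, n_words + n_objs, n_words + n_objs)
    out = GraphMultiModalFusionLayer(d)(nodes, adj)
    print(out.shape)  # torch.Size([2, 9, 512])
```

The key design choice this sketch illustrates is that fusion is driven by the graph structure itself: masking the attention scores with the adjacency matrix means a word node can only exchange information with the words and visual objects it is explicitly linked to, rather than attending over the full image-sentence pair.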