Paper Title

Consistent Multiple Sequence Decoding

Paper Authors

Bicheng Xu, Leonid Sigal

Paper Abstract


Sequence decoding is one of the core components of most visual-lingual models. However, typical neural decoders, when faced with decoding multiple, possibly correlated, sequences of tokens, resort to simple independent decoding schemes. In this paper, we introduce a consistent multiple sequence decoding architecture, which, while relatively simple, is general and allows for consistent and simultaneous decoding of an arbitrary number of sequences. Our formulation utilizes a consistency fusion mechanism, implemented using message passing in a Graph Neural Network (GNN), to aggregate context from related decoders. This context is then utilized as a secondary input, in addition to the previously generated output, to make a prediction at a given step of decoding. Self-attention, in the GNN, is used to modulate the fusion mechanism locally at each node and each step in the decoding process. We show the efficacy of our consistent multiple sequence decoder on the task of dense relational image captioning and illustrate state-of-the-art performance (+5.2% in mAP) on the task. More importantly, we illustrate that the decoded sentences for the same regions are more consistent (an improvement of 9.5%), while maintaining diversity across images and regions.
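The fusion mechanism the abstract describes can be illustrated with a minimal sketch: at each decoding step, every decoder's hidden state attends over the hidden states of all related decoders, and the resulting aggregated context is returned as a secondary input for that decoder's next prediction. This is a hypothetical NumPy illustration of the idea, not the paper's implementation; the function names and the random weight matrices `Wq`, `Wk`, `Wv` are our own assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistency_fusion(h, Wq, Wk, Wv):
    """One message-passing round over the hidden states of N related decoders.

    h: (N, d) array, hidden state of each sequence decoder at the current step.
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical parameters).
    Returns an (N, d) array of aggregated context vectors, one per decoder,
    to be used as a secondary input alongside the previously generated token.
    """
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    # Self-attention weights modulate how much each node fuses from the others.
    attn = softmax(q @ k.T / np.sqrt(h.shape[1]))  # (N, N) fusion weights
    return attn @ v

# Toy usage: three related decoders with hidden size 8.
rng = np.random.default_rng(0)
N, d = 3, 8
h = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
ctx = consistency_fusion(h, Wq, Wk, Wv)  # (3, 8) context, one row per decoder
```

In an actual decoder this round would be repeated at every time step, with `ctx[i]` concatenated to decoder `i`'s input before predicting the next token.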
