Paper Title
Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching
Paper Authors
Paper Abstract
Existing image-text matching approaches typically infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image. However, they ignore the connections between semantically related objects, which may collectively determine whether the image corresponds to a text. To address this problem, we propose a Dual Path Recurrent Neural Network (DP-RNN), which processes images and sentences symmetrically with recurrent neural networks (RNNs). In particular, given an input image-text pair, our model reorders the image objects based on the positions of their most related words in the text. Just as hidden features are extracted from word embeddings, the model leverages an RNN to extract high-level object features from the reordered object inputs. We validate that these high-level object features contain useful joint information about semantically related objects, which benefits the retrieval task. To compute the image-text similarity, we incorporate a Multi-attention Cross Matching Model into DP-RNN. It aggregates the affinities between objects and words through cross-modality guided attention and self-attention. Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset. Extensive experiments demonstrate the effectiveness of our model.
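The reorder-then-RNN step is the abstract's central mechanism, so a brief sketch may help. The snippet below is a minimal PyTorch illustration, assuming cosine similarity as the object-word affinity and a single-layer GRU over the reordered object sequence; all names, dimensions, and these design choices are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the reorder-then-RNN idea from the abstract.
# Assumptions (not from the paper): cosine affinity, a single GRU layer,
# and 1024-d features; the real DP-RNN may differ in all of these.
import torch
import torch.nn.functional as F


def reorder_objects(obj_feats, word_feats):
    """Reorder image-object features by the position of each object's
    most related word in the sentence.

    obj_feats:  (n_objects, d) region features, e.g. from a detector
    word_feats: (n_words, d)   word features from the text path
    returns:    (n_objects, d) objects sorted by matched word position
    """
    # Cosine affinity between every object and every word.
    sim = F.normalize(obj_feats, dim=-1) @ F.normalize(word_feats, dim=-1).T
    # For each object, the position of its most related word.
    best_word_pos = sim.argmax(dim=1)
    # Sort objects so their order mirrors the sentence's word order.
    order = best_word_pos.argsort()
    return obj_feats[order]


class VisualRNN(torch.nn.Module):
    """Runs an RNN over the reordered object sequence, mirroring how
    hidden features are extracted from word embeddings on the text path."""

    def __init__(self, d=1024):
        super().__init__()
        self.rnn = torch.nn.GRU(d, d, batch_first=True)

    def forward(self, obj_feats, word_feats):
        ordered = reorder_objects(obj_feats, word_feats)
        # (1, n_objects, d) -> high-level object features with joint context.
        out, _ = self.rnn(ordered.unsqueeze(0))
        return out.squeeze(0)


# Toy usage: 5 detected objects, an 8-word sentence, 1024-d features.
objs = torch.randn(5, 1024)
words = torch.randn(8, 1024)
feats = VisualRNN()(objs, words)
print(feats.shape)  # torch.Size([5, 1024])
```

Sorting objects by the position of their most related word makes the visual sequence mirror the sentence's word order, which is what lets the recurrent pass accumulate joint context across semantically related objects before the cross-matching stage aggregates object-word affinities.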