Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Paper Title

Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Authors

Chengzhi Lin, Ancong Wu, Junwei Liang, Jun Zhang, Wenhang Ge, Wei-Shi Zheng, Chunhua Shen

Abstract

Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information, while the query text describes only part of that information. Thus, a video can correspond to multiple different text descriptions and queries. We call this phenomenon the "Video-Text Correspondence Ambiguity" problem. Current techniques mostly concentrate on mining local or multi-level alignment between the contents of a video and a text (e.g., object to entity and action to verb). These methods struggle to alleviate the video-text correspondence ambiguity because they describe a video using a single feature, which must be matched with multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptive aggregation of video token features. Given a query text, the similarity is determined by the most similar prototype to find the correspondence in the video, which is termed text-adaptive matching. To learn diverse prototypes for representing the rich information in videos, we propose a variance loss to encourage different prototypes to attend to different contents of the video. Our method outperforms state-of-the-art methods on four public video retrieval datasets.
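
The text-adaptive matching step described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the scoring function and takes the prototypes as given; the function names `cosine` and `text_adaptive_similarity` and the toy vectors are hypothetical, and the paper's actual prototype aggregation and variance loss are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_adaptive_similarity(text_feature, video_prototypes):
    """Score a video against a query by its most similar prototype.

    Returns the best score and the index of the prototype that produced it.
    """
    scores = [cosine(text_feature, p) for p in video_prototypes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


# Toy data: one video described by two prototypes, and one text query.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
query = [0.0, 2.0]
score, index = text_adaptive_similarity(query, prototypes)
```

Because only the best-matching prototype contributes to the score, different queries about the same video can each align with a different prototype, which is the behavior the abstract attributes to text-adaptive matching.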
