Paper Title
Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence
Paper Authors
Paper Abstract
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances due to the difficulty of matching visual dynamics in videos to textual features in sentences. A single space is not enough to accommodate the variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We produce a final similarity between instances by fusing the similarities measured in each embedding space with a weighted sum, where the weights are determined from the query sentence; this lets us flexibly emphasize a particular embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive with state-of-the-art methods. These experimental results demonstrate the effectiveness of the proposed multiple embedding approach compared to existing methods.
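The abstract describes the core mechanism without architectural details, so the following is only a minimal PyTorch sketch of that mechanism as stated: project each video-sentence pair into several separate embedding spaces, measure a similarity in each, and fuse the per-space similarities with a weighted sum whose weights are predicted from the sentence. All names, dimensions, and design choices here (MultiSpaceSimilarity, the linear projections, weight_head, cosine similarity, the softmax over weights) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceSimilarity(nn.Module):
    """Sketch: K separate embedding spaces, cosine similarity in each,
    fused by a weighted sum conditioned on the sentence feature."""

    def __init__(self, video_dim: int, sent_dim: int, embed_dim: int, num_spaces: int):
        super().__init__()
        # One projection pair per embedding space (hypothetical parameterization).
        self.video_proj = nn.ModuleList(
            nn.Linear(video_dim, embed_dim) for _ in range(num_spaces))
        self.sent_proj = nn.ModuleList(
            nn.Linear(sent_dim, embed_dim) for _ in range(num_spaces))
        # Fusion weights are predicted from the sentence, as the abstract states.
        self.weight_head = nn.Linear(sent_dim, num_spaces)

    def forward(self, video_feat: torch.Tensor, sent_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (B, video_dim), sent_feat: (B, sent_dim)
        sims = []
        for v_proj, s_proj in zip(self.video_proj, self.sent_proj):
            v = F.normalize(v_proj(video_feat), dim=-1)
            s = F.normalize(s_proj(sent_feat), dim=-1)
            sims.append((v * s).sum(dim=-1))  # cosine similarity in this space
        sims = torch.stack(sims, dim=-1)       # (B, K) per-space similarities
        # Softmax keeps the sentence-conditioned weights positive and summing to 1,
        # so one space can be flexibly emphasized per query.
        w = torch.softmax(self.weight_head(sent_feat), dim=-1)  # (B, K)
        return (w * sims).sum(dim=-1)          # fused similarity, shape (B,)

# Usage with made-up feature sizes:
model = MultiSpaceSimilarity(video_dim=2048, sent_dim=768, embed_dim=512, num_spaces=4)
score = model(torch.randn(8, 2048), torch.randn(8, 768))  # (8,) fused similarities
```

A softmax fusion is one plausible reading of the "weighted sum strategy"; the paper may instead use unnormalized or learned-per-dataset weights, which would change only the weight_head line.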