Title


HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding

Authors

Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Wenqiao Zhang, Jiaxu Miao, Shiliang Pu, Fei Wu

Abstract


Video Object Grounding (VOG) is the problem of associating spatial object regions in a video with a descriptive natural language query. This is a challenging vision-language task that requires constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, so as to localize the specific objects accurately. In this paper, we tackle this task with a novel framework called HiErarchical spatio-tempoRal reasOning (HERO) with contrastive action correspondence. We study the VOG task from two aspects that prior works overlooked: (1) Contrastive Action Correspondence-aware Retrieval. Noticing that fine-grained video semantics (e.g., multiple actions) are not fully aligned with the annotated language query (e.g., a single action), we first introduce weakly-supervised contrastive learning that classifies video frames as action-consistent or action-independent based on the video-caption action semantic correspondence. Such a design builds fine-grained cross-modal correspondence for more accurate subsequent VOG. (2) Hierarchical Spatio-temporal Modeling Improvement. While transformer-based VOG models show their potential in modeling sequential modalities (i.e., video and caption), existing evidence also indicates that transformers suffer from insensitivity to spatio-temporal locality. Motivated by this, we carefully design hierarchical reasoning layers to decouple fully connected multi-head attention and remove redundant interfering correlations. Furthermore, our proposed pyramid and shifted alignment mechanisms effectively improve cross-modal information utilization across neighboring spatial regions and temporal frames. We conduct extensive experiments showing that HERO outperforms existing techniques, achieving significant improvements on two benchmark datasets.
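The abstract does not give the paper's exact loss formulation, but the idea of weakly-supervised contrastive learning over frames can be illustrated with a generic InfoNCE-style sketch: each frame embedding is scored against the query's action embedding, and the loss pulls action-consistent frames toward the action while pushing the remaining (action-independent) frames away. All names below (`frame_action_scores`, `contrastive_loss`, the temperature `tau`) are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def frame_action_scores(frame_embs, action_emb, tau=0.1):
    # Softmax-normalized similarity of each frame to the query action;
    # high scores mark frames as action-consistent, low as action-independent.
    sims = [cosine(f, action_emb) / tau for f in frame_embs]
    m = max(sims)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

def contrastive_loss(frame_embs, action_emb, consistent_idx, tau=0.1):
    # InfoNCE-style objective: maximize the probability mass assigned to
    # the (weakly labeled) action-consistent frames; the softmax denominator
    # implicitly pushes action-independent frames away.
    probs = frame_action_scores(frame_embs, action_emb, tau)
    return -sum(math.log(probs[i]) for i in consistent_idx) / len(consistent_idx)
```

In this toy form, a frame whose embedding aligns with the action embedding receives a high score and a low loss, while treating a mismatched frame as "consistent" yields a high loss; in the paper's setting the consistency labels come from video-caption action correspondence rather than manual annotation.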
