Multi-Attention Network for Compressed Video Referring Object Segmentation

Paper title

Multi-Attention Network for Compressed Video Referring Object Segmentation

Authors

Weidong Chen, Dexiang Hong, Yuankai Qi, Zhenjun Han, Shuhui Wang, Laiyun Qing, Qingming Huang, Guorong Li

Abstract

Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation, which increases computation and storage requirements and ultimately slows down inference. This may hamper their application in real-world scenarios with limited computing resources, such as autonomous cars and drones. To alleviate this problem, in this paper we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the video referring object segmentation task itself, obtaining discriminative representations from compressed video is also rather challenging. To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities, i.e., I-frame, Motion Vector and Residual. The query-based cross-modal Transformer first models the correlation between linguistic and visual modalities, and then the fused multi-modality features are used to guide object queries to generate a content-aware dynamic kernel and to predict final segmentation masks. Different from previous works, we propose to learn just one kernel, which removes the complicated post mask-matching procedure of existing methods. Extensive experimental results on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods proposed for processing RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
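The abstract does not spell out how the single dynamic kernel is applied, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the kernel is a vector of per-channel weights (hypothetically derived from one object query) that acts as a 1x1 convolution over a C x H x W feature map, followed by a sigmoid and a 0.5 threshold. The function name `dynamic_kernel_mask`, the toy feature values, and the threshold are all illustrative assumptions.

```python
import math


def dynamic_kernel_mask(features, kernel, bias=0.0, threshold=0.5):
    """Apply one 1x1 dynamic kernel to a C x H x W feature map.

    features: list of C channels, each an H x W list of floats.
    kernel:   list of C weights (assumed to come from a single object query).
    Returns an H x W binary mask as a list of lists of 0/1.
    """
    channels = len(features)
    if channels != len(kernel):
        raise ValueError("kernel length must equal the number of channels")
    height, width = len(features[0]), len(features[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # 1x1 convolution: weighted sum over channels at this pixel.
            logit = bias + sum(kernel[c] * features[c][y][x] for c in range(channels))
            prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask


# Toy 2-channel, 2x2 feature map and a hypothetical kernel.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [0.0, 0.0]],
]
kernel = [2.0, -1.0]
mask = dynamic_kernel_mask(features, kernel)
print(mask)
```

Because only one kernel is produced in this sketch, only one mask comes out, so there is no set of candidate masks to match against the referred object afterwards. That is the property the abstract attributes to learning a single kernel.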
