Human-centric Spatio-Temporal Video Grounding via the Combination of Mutual Matching Network and TubeDETR

Paper title

Human-centric Spatio-Temporal Video Grounding via the Combination of Mutual Matching Network and TubeDETR

Authors

Fan Yu, Zhixiang Zhao, Yuchen Wang, Yi Xu, Tongwei Ren, Gangshan Wu

Abstract

In this technical report, we present our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution is built on TubeDETR and the Mutual Matching Network (MMN). Specifically, TubeDETR uses a video-text encoder and a space-time decoder to predict the start time, the end time, and the tube of the target person. MMN detects persons in images, links them into tubes, extracts features of the person tubes and of the text description, and predicts the similarities between them to choose the most likely person tube as the grounding result. Finally, our solution refines the results by combining the spatial localization of MMN with the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieved third place.
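The final combination step can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation: the person tube with the highest text similarity (the MMN side) is kept, and it is then trimmed to the start and end frames predicted by the temporal model (the TubeDETR side). The function names `select_tube` and `combine`, the box format, and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def select_tube(similarities: List[float]) -> int:
    """Return the index of the person tube most similar to the text."""
    return max(range(len(similarities)), key=lambda i: similarities[i])


def combine(tube: Dict[int, Box], start: int, end: int) -> Dict[int, Box]:
    """Keep the boxes of the chosen tube whose frame index lies in [start, end]."""
    return {f: box for f, box in sorted(tube.items()) if start <= f <= end}


# Toy data (assumed): two person tubes, each mapping frame index -> box.
tubes = [
    {0: (10, 10, 50, 90), 1: (12, 10, 52, 90), 2: (14, 10, 54, 90), 3: (16, 10, 56, 90)},
    {1: (100, 20, 140, 95), 2: (101, 20, 141, 95)},
]
similarities = [0.81, 0.35]      # assumed tube-to-text similarity scores
start_frame, end_frame = 1, 2    # assumed temporal prediction

best = select_tube(similarities)
result = combine(tubes[best], start_frame, end_frame)
print(result)
```

In this sketch the spatial answer (which boxes) comes only from the selected tube, and the temporal answer (which frames) comes only from the predicted interval, mirroring the division of labor the abstract describes.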
