Weak Supervision and Referring Attention for Temporal-Textual Association Learning

Paper Title

Weak Supervision and Referring Attention for Temporal-Textual Association Learning

Authors

Zhiyuan Fang, Shu Kong, Zhe Wang, Charless Fowlkes, Yezhou Yang

Abstract

A system that captures the association between video frames and textual queries offers great potential for better video analysis. However, training such a system in a fully supervised way inevitably demands a meticulously curated video dataset with temporal-textual annotations. We therefore provide a weakly supervised alternative, built on our proposed Referring Attention mechanism, for learning temporal-textual association (dubbed WSRA). The weak supervision is simply a textual expression (e.g., a short phrase or sentence) at the video level, indicating that the video contains relevant frames. The referring attention is our designed mechanism, which acts as a scoring function for temporally grounding the given queries over frames. It consists of multiple novel losses and sampling strategies for better training. The principle of our designed mechanism is to fully exploit 1) the weak supervision, by considering informative and discriminative cues from intra-video segments anchored with the textual query, 2) multiple queries compared to the single video, and 3) cross-video visual similarities. We validate WSRA through extensive experiments on temporal grounding by language, demonstrating that it notably outperforms state-of-the-art weakly supervised methods.
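The abstract describes the referring attention only as a scoring function over frames, so the following is a minimal illustrative sketch and not the paper's actual architecture or losses. It assumes precomputed frame and query feature vectors, and it assumes cosine similarity with a softmax over time as the scoring rule. The function names `referring_scores` and `video_level_score` and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def referring_scores(frame_feats, query_feat, temperature=0.1):
    """Return one attention weight per frame (softmax over time).

    Assumption: similarity is cosine; the paper may use a learned scorer.
    """
    sims = [cosine(f, query_feat) / temperature for f in frame_feats]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_score(frame_feats, query_feat, temperature=0.1):
    """Pool per-frame similarities into a single video-level score.

    A video-level score is what a video-level (weak) label can supervise;
    the per-frame weights then serve as the temporal grounding.
    """
    weights = referring_scores(frame_feats, query_feat, temperature)
    return sum(w * cosine(f, query_feat) for w, f in zip(weights, frame_feats))


# Toy example: three 2-D frame features and one query feature.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.0, 1.0]
weights = referring_scores(frames, query)
best = max(range(len(weights)), key=weights.__getitem__)  # most relevant frame index
```

In this toy setup only the pooled video-level score would receive supervision, while the per-frame weights are read out afterwards to localize the query in time.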
