论文标题
自适应时空注意的多元化视频字幕
Diverse Video Captioning by Adaptive Spatio-temporal Attention
论文作者
论文摘要
为了为视频产生适当的标题,推理需要确定相关的概念并注意它们之间的空间关系以及剪辑中的时间发展。我们的端到端编码器视频字幕框架结合了两个基于变压器的体系结构,这是一种用于单个关节时空视频分析的改编变压器,以及用于高级文本生成的基于自我注意力的解码器。此外,我们引入了一种自适应框架选择方案,以减少所需的传入帧数,同时在训练两个变压器时保持相关内容。此外,我们通过汇总每个样本的所有基础真相标题来估计与视频字幕相关的语义概念。我们的方法在MSVD以及大规模的MSR-VTT和VATEX基准数据集上实现了最新的结果,并考虑了多个自然语言产生(NLG)指标。对多样性得分的其他评估突出了我们生成的标题结构的表现力和多样性。
To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures, an adapted transformer for a single joint spatio-temporal video analysis as well as a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme to reduce the number of required incoming frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD, as well as on the large-scale MSR-VTT and the VATEX benchmark datasets considering multiple Natural Language Generation (NLG) metrics. Additional evaluations on diversity scores highlight the expressiveness and diversity in the structure of our generated captions.