Paper Title

Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition

Authors

Fangtao Li, Wenzhe Wang, Zihe Liu, Haoran Wang, Chenghao Yan, Bin Wu

Abstract

Video-based person recognition is challenging due to occluded and blurred persons and variations in shooting angle. Previous research has largely focused on person recognition in still images, ignoring the similarity and continuity between video frames. To tackle these challenges, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition, which aggregates face features and combines them with multi-modal information to identify persons in videos. For frame aggregation, we propose a novel trainable layer based on NetVLAD, named AttentionVLAD, which takes an arbitrary number of features as input and computes a fixed-length aggregated feature based on feature quality. We show that introducing an attention mechanism into NetVLAD effectively decreases the impact of low-quality frames. For the multi-modal information in videos, we propose a Multi-Layer Multi-Modal Attention (MLMA) module to learn the correlation among modalities by adaptively updating the Gram matrix. Experimental results on the iQIYI-VID-2019 dataset show that our framework outperforms other state-of-the-art methods.
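The abstract only describes the two components at a high level; the PyTorch sketches below illustrate one plausible reading of each. All class and parameter names here are hypothetical illustrations, not the authors' implementation. The first sketch is a frame-aggregation layer in the spirit of AttentionVLAD: NetVLAD-style soft-assignment aggregation with an extra softmax attention over frames, so low-quality frames contribute less to the fixed-length output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVLADSketch(nn.Module):
    # Minimal sketch of NetVLAD-style aggregation with a learned per-frame
    # attention weight; a plausible reading of AttentionVLAD, not the paper's code.
    def __init__(self, dim: int, num_clusters: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft cluster assignment
        self.quality = nn.Linear(dim, 1)            # frame-quality score (assumed form)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) -- N frame-level face features; N may vary per video
        soft = F.softmax(self.assign(x), dim=-1)    # (B, N, K) cluster weights
        attn = F.softmax(self.quality(x), dim=1)    # (B, N, 1) attention over frames
        soft = soft * attn                          # down-weight low-quality frames
        resid = x.unsqueeze(2) - self.centers       # (B, N, K, D) residuals to centers
        vlad = (soft.unsqueeze(-1) * resid).sum(1)  # (B, K, D) aggregated residuals
        vlad = F.normalize(vlad, dim=-1)            # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1) # fixed-length (B, K*D) descriptor
```

The second sketch is a single Gram-matrix attention layer in the spirit of MLMA: pairwise correlations between modality features form a Gram matrix, which is normalized and used to re-weight the modalities. The paper stacks multiple such layers; only one is shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GramAttentionSketch(nn.Module):
    # Toy single layer of Gram-matrix attention across modalities; a stand-in
    # for MLMA, assuming every modality is already projected to dimension D.
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # learnable update feeding the Gram matrix

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, M, D) -- M modality features (e.g. face, body, audio)
        h = self.proj(feats)
        gram = torch.bmm(h, h.transpose(1, 2))             # (B, M, M) correlations
        attn = F.softmax(gram / h.size(-1) ** 0.5, dim=-1) # row-normalized Gram matrix
        return feats + torch.bmm(attn, feats)              # residual re-weighting
```

In this reading, the aggregated face descriptor from the first module would be projected to the common dimension D and treated as one of the M modality rows fed to the second; how the paper actually fuses the two is not specified in the abstract.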
