Paper Title

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network

Paper Authors

Jin Chen, Huihui Song, Kaihua Zhang, Bo Liu, Qingshan Liu

Paper Abstract

Due to the variety of motions across different frames, learning an effective spatiotemporal representation for accurate video saliency prediction (VSP) is highly challenging. To address this issue, we develop an effective spatiotemporal feature alignment network tailored to VSP, which mainly consists of two key sub-networks: a multi-scale deformable convolutional alignment network (MDAN) and a bidirectional convolutional Long Short-Term Memory (Bi-ConvLSTM) network. The MDAN learns to align the features of neighboring frames to those of the reference frame in a coarse-to-fine manner, which handles various motions well. Specifically, the MDAN adopts a pyramidal feature hierarchy that first leverages deformable convolution (Dconv) to align the lower-resolution features across frames, and then aggregates the aligned features to align the higher-resolution features, progressively enhancing the features from top to bottom. The output of the MDAN is then fed into the Bi-ConvLSTM for further enhancement, which captures useful long-term temporal information along both forward and backward timing directions to effectively guide the prediction of attention-orientation shifts under complex scene transformations. Finally, the enhanced features are decoded to generate the predicted saliency map. The proposed model is trained end-to-end without any intricate post-processing. Extensive evaluations on four VSP benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods. The source code and all results will be released.
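
The coarse-to-fine alignment described in the abstract can be pictured with a short sketch. Below is a minimal PyTorch illustration of pyramid-guided deformable alignment, not the authors' released code: the channel sizes, the additive offset fusion across levels, and the module names (`PyramidAlign`, `offset_convs`) are assumptions made for the example, with `torchvision.ops.DeformConv2d` standing in for the paper's Dconv layers.

```python
# A minimal sketch of coarse-to-fine deformable alignment in the spirit of
# the MDAN. Channel sizes, the additive offset fusion, and all module names
# are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d


class PyramidAlign(nn.Module):
    """Aligns a neighboring frame's feature pyramid to the reference frame's,
    proceeding from the coarsest pyramid level to the finest."""

    def __init__(self, channels=64, k=3, levels=3):
        super().__init__()
        # Per level: a conv that predicts Dconv offsets from the concatenated
        # reference/neighbor features, and the deformable conv itself.
        self.offset_convs = nn.ModuleList(
            nn.Conv2d(2 * channels, 2 * k * k, k, padding=k // 2)
            for _ in range(levels))
        self.dconvs = nn.ModuleList(
            DeformConv2d(channels, channels, k, padding=k // 2)
            for _ in range(levels))

    def forward(self, ref_pyr, nbr_pyr):
        """ref_pyr / nbr_pyr: lists of (N, C, H, W) features, index 0 = finest."""
        offset_prev, aligned = None, None
        for lvl in reversed(range(len(ref_pyr))):  # coarse -> fine
            ref, nbr = ref_pyr[lvl], nbr_pyr[lvl]
            offset = self.offset_convs[lvl](torch.cat([ref, nbr], dim=1))
            if offset_prev is not None:
                # Upsampled coarse offsets guide the finer level; offsets are
                # measured in pixels, so they double when resolution doubles.
                offset = offset + 2.0 * F.interpolate(
                    offset_prev, scale_factor=2, mode='bilinear',
                    align_corners=False)
            aligned = self.dconvs[lvl](nbr, offset)
            offset_prev = offset
        return aligned  # finest-level neighbor features, aligned to the reference
```

For a 3-level pyramid of 64-channel features at, say, 56x56, 28x28, and 14x14, `PyramidAlign()(ref_pyr, nbr_pyr)` returns an (N, 64, 56, 56) tensor of neighbor features warped toward the reference frame; the same module would be applied to each neighboring frame before temporal aggregation.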

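The bidirectional temporal aggregation can likewise be sketched with a standard ConvLSTM cell run over the aligned features in both timing directions. This is a generic Bi-ConvLSTM under the usual gate formulation; the hidden size and the concatenation of the two directions' outputs are assumptions for illustration, not details taken from the paper.

```python
# A minimal Bi-ConvLSTM sketch for aggregating long-term temporal context
# over aligned per-frame features. Gate layout follows the standard ConvLSTM
# formulation; hidden size and direction fusion are assumptions.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One conv produces all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class BiConvLSTM(nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.fwd = ConvLSTMCell(in_ch, hid_ch)
        self.bwd = ConvLSTMCell(in_ch, hid_ch)

    def run(self, cell, seq):
        n, _, _, h, w = seq.shape  # seq: (N, T, C, H, W)
        state = (seq.new_zeros(n, cell.hid_ch, h, w),
                 seq.new_zeros(n, cell.hid_ch, h, w))
        outs = []
        for t in range(seq.shape[1]):
            state = cell(seq[:, t], state)
            outs.append(state[0])  # collect the hidden state per frame
        return torch.stack(outs, dim=1)

    def forward(self, seq):
        fwd = self.run(self.fwd, seq)
        bwd = self.run(self.bwd, seq.flip(1)).flip(1)  # reverse time, restore order
        return torch.cat([fwd, bwd], dim=2)  # per-frame features, both directions
```

For example, `BiConvLSTM(64, 32)` applied to an (N, T, 64, H, W) sequence of aligned features yields an (N, T, 64, H, W) output: 32 channels from each timing direction, concatenated per frame and ready to be decoded into a saliency map.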