Paper Title
VRT: A Video Restoration Transformer
Paper Authors
Paper Abstract
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Unlike single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this by exploiting a sliding window strategy or a recurrent architecture, which is either restricted to frame-by-frame restoration or lacks long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. In addition, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms the state-of-the-art methods by large margins ($\textbf{up to 2.16 dB}$) on fourteen benchmark datasets.
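Since the abstract only describes TMSA verbally, the following is a minimal, hypothetical PyTorch sketch of the shifted clip partition and mutual attention it mentions, not the authors' implementation: it assumes 2-frame clips and omits the QKV projections, multi-head logic, normalization and MLP of the real model, and the names `mutual_attention` and `tmsa_layer` are invented for illustration.

```python
# Minimal sketch of the temporal shift idea behind TMSA (assumptions:
# 2-frame clips; projections, multi-head attention and the full
# self-attention branch of the real model are omitted for brevity).
import torch

def mutual_attention(q_feat, kv_feat):
    """Cross-frame attention: queries come from one frame, keys/values from
    the other, so the attention map implicitly captures motion between them.
    q_feat, kv_feat: (N, C) token features of two frames."""
    scale = q_feat.shape[-1] ** -0.5
    attn = torch.softmax((q_feat @ kv_feat.transpose(-2, -1)) * scale, dim=-1)
    return attn @ kv_feat  # each query token aggregates features from kv_feat

def tmsa_layer(x, shift):
    """x: (T, N, C) features of T frames (T even) with N tokens each.
    With shift=True the clip partition is offset by one frame (wrap-around),
    analogous to Swin-style shifted windows but along the temporal axis."""
    if shift:
        x = torch.roll(x, shifts=1, dims=0)
    out = torch.empty_like(x)
    for t in range(0, x.shape[0], 2):            # non-overlapping 2-frame clips
        a, b = x[t], x[t + 1]
        out[t] = a + mutual_attention(a, b)      # frame a attends to frame b
        out[t + 1] = b + mutual_attention(b, a)  # and vice versa
    if shift:
        out = torch.roll(out, shifts=-1, dims=0)
    return out

frames = torch.randn(8, 64, 32)   # toy input: 8 frames, 64 tokens, 32 channels
y = tmsa_layer(frames, shift=False)
y = tmsa_layer(y, shift=True)     # shifted layer lets information cross clips
```

Alternating plain and shifted layers in this way lets two-frame clips exchange information over long temporal ranges without ever attending over the whole sequence at once, which is how the paper reconciles long-range modelling with tractable attention cost.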