H-VFI: Hierarchical Frame Interpolation for Videos with Large Motions

Paper title

H-VFI: Hierarchical Frame Interpolation for Videos with Large Motions

Authors

Li, Changlin, Wu, Guangyang, Sun, Yanan, Tao, Xin, Tang, Chi-Keung, Tai, Yu-Wing

Abstract

Capitalizing on the rapid development of neural networks, recent video frame interpolation (VFI) methods have achieved notable improvements. However, they still fall short for real-world videos containing large motions. Complex deformation and/or occlusion caused by large motions make it an extremely difficult problem in video frame interpolation. In this paper, we propose a simple yet effective solution, H-VFI, to deal with large motions in video frame interpolation. H-VFI contributes a hierarchical video interpolation transformer (HVIT) to learn a deformable kernel in a coarse-to-fine strategy across multiple scales. The learnt deformable kernel is then utilized in convolving the input frames for predicting the interpolated frame. Starting from the smallest scale, H-VFI updates the deformable kernel by a residual in succession based on former predicted kernels, intermediate interpolated results and hierarchical features from the transformer. Bias and masks to refine the final outputs are then predicted by a transformer block based on interpolated results. The advantage of such a progressive approximation is that the large motion frame interpolation problem can be decomposed into several relatively simpler sub-tasks, which enables a very accurate prediction in the final results. Another noteworthy contribution of our paper is a large-scale high-quality dataset, YouTube200K, which contains videos depicting a great variety of scenarios captured at high resolution and high frame rate. Extensive experiments on multiple frame interpolation benchmarks validate that H-VFI outperforms existing state-of-the-art methods, especially for videos with large motions.
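
The coarse-to-fine residual update described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `upsample2x` and `coarse_to_fine` are hypothetical, the per-scale residuals stand in for what a network such as HVIT would predict, and the factor-of-two scale step, the nearest-neighbour upsampling, and the doubling of pixel offsets between scales are all assumptions made for illustration. Only the offset part of a deformable kernel is modelled.

```python
import numpy as np


def upsample2x(field):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array."""
    return field.repeat(2, axis=0).repeat(2, axis=1)


def coarse_to_fine(residuals):
    """Accumulate per-scale residual offsets, coarsest scale first.

    residuals[i] has shape (H * 2**i, W * 2**i, 2) and is a placeholder
    for the residual a network would predict at scale i (assumption).
    """
    offsets = residuals[0]
    for res in residuals[1:]:
        # Assumption: offsets are measured in pixels, so they double
        # when the resolution doubles, before the residual is added.
        offsets = 2.0 * upsample2x(offsets) + res
    return offsets


# Three scales: 2x2, 4x4 and 8x8 offset fields with constant residuals.
residuals = [
    np.ones((2, 2, 2)),
    np.zeros((4, 4, 2)),
    np.full((8, 8, 2), 0.5),
]
final_offsets = coarse_to_fine(residuals)
```

Each scale only has to predict a small correction to the upsampled estimate from the previous scale, which is the sense in which a large motion is decomposed into simpler sub-tasks.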
