Paper Title

H-VFI: Hierarchical Frame Interpolation for Videos with Large Motions

Authors

Changlin Li, Guangyang Wu, Yanan Sun, Xin Tao, Chi-Keung Tang, Yu-Wing Tai

Abstract

Capitalizing on the rapid development of neural networks, recent video frame interpolation (VFI) methods have achieved notable improvements. However, they still fall short on real-world videos containing large motions, where the resulting complex deformation and/or occlusion make frame interpolation extremely difficult. In this paper, we propose a simple yet effective solution, H-VFI, to deal with large motions in video frame interpolation. H-VFI contributes a hierarchical video interpolation transformer (HVIT) that learns a deformable kernel in a coarse-to-fine manner across multiple scales. The learned deformable kernel is then used to convolve the input frames and predict the interpolated frame. Starting from the smallest scale, H-VFI successively updates the deformable kernel by a residual, based on the previously predicted kernels, the intermediate interpolated results, and the hierarchical features from the transformer. A transformer block then predicts bias and masks from the interpolated results to refine the final output. The advantage of such progressive approximation is that the large-motion frame interpolation problem is decomposed into several relatively simpler sub-tasks, which enables a very accurate prediction in the final results. Another noteworthy contribution of our paper is a large-scale, high-quality dataset, YouTube200K, which contains videos depicting a great variety of scenarios captured at high resolution and high frame rate. Extensive experiments on multiple frame interpolation benchmarks validate that H-VFI outperforms existing state-of-the-art methods, especially for videos with large motions.
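
The coarse-to-fine residual idea described in the abstract can be illustrated with a toy example. The sketch below is not H-VFI: it replaces the learned deformable kernel and transformer with a brute-force search for a single 1D shift, purely to show how a large motion can be recovered as a sequence of small residual corrections. All function names are hypothetical, and the factor-of-two pyramid (with the estimate doubled at each finer scale) is an assumption made for illustration, not a detail taken from the paper.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def shift(signal, d):
    """Move every sample d positions to the right, filling gaps with zeros."""
    n = len(signal)
    return [signal[i - d] if 0 <= i - d < n else 0.0 for i in range(n)]


def error(a, b):
    """Sum of squared differences between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coarse_to_fine_shift(frame0, frame1, levels=4, radius=1):
    """Estimate the shift from frame0 to frame1, coarsest scale first.

    At each scale only a small residual in [-radius, radius] is searched.
    Returns the final shift and the estimate reached at every scale.
    """
    # Pyramids: index 0 is full resolution, the last entry is the coarsest.
    pyr0, pyr1 = [frame0], [frame1]
    for _ in range(levels - 1):
        pyr0.append(downsample(pyr0[-1]))
        pyr1.append(downsample(pyr1[-1]))

    d = 0
    history = []
    for level in range(levels - 1, -1, -1):
        d *= 2  # carry the coarser estimate to the finer grid (assumed 2x pyramid)
        residual = min(
            range(-radius, radius + 1),
            key=lambda r: error(shift(pyr0[level], d + r), pyr1[level]),
        )
        d += residual  # residual update of the previous estimate
        history.append(d)
    return d, history


# A 4-sample "object" that moves 12 positions between the two frames.
frame0 = [0.0] * 64
for i in range(16, 20):
    frame0[i] = 1.0
frame1 = shift(frame0, 12)

print(coarse_to_fine_shift(frame0, frame1, levels=4, radius=1))
# prints (12, [1, 3, 6, 12])
```

With four scales, a search radius of one sample per scale is enough to recover a 12-sample motion, whereas the same radius at a single scale (`levels=1`) cannot reach it. This mirrors, in a much simplified form, the decomposition into simpler sub-tasks that the abstract attributes to progressive approximation.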
