Paper Title

Latent Video Transformer

Authors

Rakhimov, Ruslan, Volkhonskiy, Denis, Artemov, Alexey, Zorin, Denis, Burnaev, Evgeny

Abstract

The video generation task can be formulated as a prediction of future video frames given some past frames. Recent generative models for videos face the problem of high computational requirements. Some models require up to 512 Tensor Processing Units for parallel training. In this work, we address this problem via modeling the dynamics in a latent space. After the transformation of frames into the latent space, our model predicts latent representation for the next frames in an autoregressive manner. We demonstrate the performance of our approach on BAIR Robot Pushing and Kinetics-600 datasets. The approach tends to reduce requirements to 8 Graphical Processing Units for training the models while maintaining comparable generation quality.
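The pipeline the abstract describes — quantize frames into discrete latent codes, then generate the next frame's codes autoregressively and decode them back to pixels — can be sketched as a toy in NumPy. Everything below is an illustrative stand-in, not the authors' implementation: the paper uses a learned VQ-VAE encoder and a transformer prior, whereas this sketch uses a random codebook and a placeholder uniform prior.

```python
import numpy as np

# Toy sketch of a latent-space video pipeline (illustrative only):
# encode a frame to discrete codes, sample the next latent frame
# token-by-token, decode codes back to a frame.

rng = np.random.default_rng(0)

def encode(frame, codebook):
    """Quantize each pixel vector to its nearest codebook entry (VQ-style)."""
    # frame: (H, W, C), codebook: (K, C) -> integer codes of shape (H, W)
    dist = ((frame[:, :, None, :] - codebook[None, None]) ** 2).sum(-1)
    return dist.argmin(-1)

def decode(codes, codebook):
    """Map integer codes back to codebook vectors: (H, W) -> (H, W, C)."""
    return codebook[codes]

def generate_codes(prior_logits, seq_len, num_codes):
    """Sample a latent frame one token at a time (autoregressive generation)."""
    tokens = []
    for _ in range(seq_len):
        logits = prior_logits(tokens)   # condition on previously sampled tokens
        p = np.exp(logits - logits.max())
        p /= p.sum()
        tokens.append(rng.choice(num_codes, p=p))
    return np.array(tokens)

H, W, C, K = 8, 8, 3, 4
codebook = rng.normal(size=(K, C))

# Encode an observed frame, then sample the "next" latent frame with a
# placeholder uniform prior standing in for the learned transformer.
frame = rng.normal(size=(H, W, C))
codes = encode(frame, codebook)
next_codes = generate_codes(lambda toks: np.zeros(K), H * W, K).reshape(H, W)
next_frame = decode(next_codes, codebook)
```

The point of the sketch is the cost argument from the abstract: once frames are compressed to a small grid of discrete codes, the expensive autoregressive model operates over H×W tokens per frame rather than raw pixels, which is what lets training fit on far fewer accelerators.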
