Paper Title

Make-A-Story: Visual Memory Conditioned Consistent Story Generation

Authors

Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal

Abstract

There has been a recent explosion of impressive generative models that can produce high-quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditioning sentences that contain unambiguous descriptions of scenes and the main actors in them. Therefore, employing such models for the more complex task of story visualization, where references and co-references naturally occur and one must reason about when to maintain consistency of actors and backgrounds across frames/scenes, and when not to, based on story progression, remains a challenge. In this work, we address the aforementioned challenges and propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed. To validate the effectiveness of our approach, we extend the MUGEN dataset and introduce additional characters, backgrounds, and referencing in multi-sentence storylines. Our experiments on story generation with the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.
