Paper Title

VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks

Paper Authors

Soyeon Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon

Paper Abstract

Text-to-image multimodal tasks, generating or retrieving an image from a given text description, are extremely challenging because raw text descriptions contain quite limited information for fully describing a visually realistic image. We propose VICTR, a new visual contextual text representation for text-to-image multimodal tasks, which captures rich visual semantic information about objects from the text input. First, we use the text description as the initial input and conduct dependency parsing to extract the syntactic structure and analyse semantic aspects, including object quantities, in order to extract a scene graph. Then, we train the objects, attributes, and relations extracted from the scene graph, together with the corresponding geometric relation information, using Graph Convolutional Networks, generating a text representation that integrates textual and visual semantic information. This representation is aggregated with word-level and sentence-level embeddings to produce both visual contextual word and sentence representations. For evaluation, we attached VICTR to state-of-the-art models in text-to-image generation. VICTR is easily added to existing models and improves them in both quantitative and qualitative aspects.
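The abstract outlines a three-step pipeline: dependency parsing to extract a scene graph, a Graph Convolutional Network over that graph, and aggregation with word/sentence embeddings. Below is a minimal sketch of those steps, assuming spaCy for parsing and PyTorch for the GCN; the triple-extraction rules, the single GCN layer, and the mean-pooling aggregation are simplified illustrations of the idea, not the authors' implementation.

```python
# Sketch of a VICTR-style pipeline: parse -> scene graph -> GCN -> pooled vector.
# Assumes spaCy's "en_core_web_sm" model is installed; all rules are illustrative.
import spacy
import torch
import torch.nn as nn

nlp = spacy.load("en_core_web_sm")

def extract_scene_graph(text):
    """Extract (subject, relation, object) triples and adjective attributes
    from the dependency parse of a caption."""
    doc = nlp(text)
    triples, attributes = [], []
    for token in doc:
        # Verb with a nominal subject and direct object -> relation triple.
        if token.pos_ == "VERB":
            subj = [c for c in token.children if c.dep_ == "nsubj"]
            obj = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            if subj and obj:
                triples.append((subj[0].lemma_, token.lemma_, obj[0].lemma_))
        # Adjectival modifier -> (object, attribute) pair.
        if token.dep_ == "amod":
            attributes.append((token.head.lemma_, token.lemma_))
    return triples, attributes

class GCNLayer(nn.Module):
    """One graph-convolution layer: average neighbour features through a
    degree-normalised adjacency matrix, then apply a linear map and ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (n, in_dim) node features; adj: (n, n) adjacency with self-loops.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear(adj @ x / deg))

if __name__ == "__main__":
    triples, attrs = extract_scene_graph("A small bird eats a red apple.")
    print(triples, attrs)  # e.g. [('bird', 'eat', 'apple')], depending on the parse

    # Build a toy graph over the scene-graph nodes and run one GCN layer.
    nodes = sorted({n for t in triples for n in (t[0], t[2])} |
                   {n for pair in attrs for n in pair})
    n = max(len(nodes), 1)
    adj = torch.eye(n)  # self-loops; relation/attribute edges added below
    idx = {w: i for i, w in enumerate(nodes)}
    for s, _, o in triples:
        adj[idx[s], idx[o]] = adj[idx[o], idx[s]] = 1.0
    for head, mod in attrs:
        adj[idx[head], idx[mod]] = adj[idx[mod], idx[head]] = 1.0
    feats = torch.randn(n, 32)  # stand-in for learned node embeddings
    node_repr = GCNLayer(32, 32)(feats, adj)
    # Mean-pool node outputs into a sentence-level visual-contextual vector,
    # which could then be concatenated with ordinary word/sentence embeddings.
    sent_repr = node_repr.mean(dim=0)
    print(sent_repr.shape)  # torch.Size([32])
```

In this sketch the scene graph is undirected and unlabeled for simplicity; the paper additionally distinguishes objects, attributes, relations, and geometric relation information, which would correspond to typed nodes and edges in the graph above.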
