Paper Title
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Paper Authors
Paper Abstract
Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation, however, remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e., conditioning on both text and images? Are multimodal models simply visually adapted language models, or do they combine modalities and reason over them jointly? We investigate these questions in the context of self-rationalization (jointly generating task labels/answers and free-text explanations) on three tasks: (i) visual question answering in VQA-X, (ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment in e-SNLI-VE. We show that recent unimodal advances, CLIP image representations and scaling of language models, do not consistently improve self-rationalization in multimodal tasks. We find that no single model type works universally best across tasks, datasets, and finetuning data sizes. Our findings motivate the need for novel general-backbone approaches that move text generation from images and text beyond image captioning.
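To make the setup concrete, below is a minimal sketch (not the authors' architecture) of the kind of model the abstract describes: a pretrained language model conditioned on a CLIP image representation and a textual prompt, generating a self-rationalization output of the form "<answer> because <explanation>". The checkpoint names, the single-token visual prefix, and the linear projection (which would need finetuning before producing meaningful output) are illustrative assumptions.

```python
# Sketch: self-rationalization by prepending a projected CLIP image feature
# to a GPT-2 language model's input embeddings. Illustrative only; the paper's
# actual backbones and fusion strategy may differ.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Map the pooled CLIP feature into the LM embedding space and use it as a
# single "visual prefix" token (untrained here; it would be learned in practice).
project = torch.nn.Linear(vision.config.hidden_size, lm.config.n_embd)

def self_rationalize(image: Image.Image, question: str, max_new_tokens: int = 30) -> str:
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        image_feat = vision(pixel_values=pixel_values).pooler_output   # (1, 768)
        visual_prefix = project(image_feat).unsqueeze(1)                # (1, 1, 768)

        # Prompt mirrors the self-rationalization target format:
        # the model should continue with "<answer> because <explanation>".
        prompt = f"question: {question} answer:"
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        text_embeds = lm.transformer.wte(input_ids)                     # (1, T, 768)
        embeds = torch.cat([visual_prefix, text_embeds], dim=1)

        # Explicit greedy decoding loop for clarity.
        generated = []
        for _ in range(max_new_tokens):
            logits = lm(inputs_embeds=embeds).logits[:, -1, :]
            next_id = logits.argmax(dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
            generated.append(next_id.item())
            next_embed = lm.transformer.wte(next_id).unsqueeze(1)
            embeds = torch.cat([embeds, next_embed], dim=1)

    return tokenizer.decode(generated)
```

For a VQA-X-style example, `self_rationalize(image, "What is the man doing?")` would, after finetuning the projection and language model on (image, question, answer, explanation) tuples, be expected to produce something like "surfing because he is standing on a board riding a wave".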