Schrödinger's Bat: Diffusion Models Sometimes Generate Polysemous Words in Superposition

Paper title

Schrödinger's Bat: Diffusion Models Sometimes Generate Polysemous Words in Superposition

Authors

Jennifer C. White, Ryan Cotterell

Abstract

Recent work has shown that despite their impressive capabilities, text-to-image diffusion models such as DALL-E 2 (Ramesh et al., 2022) can display strange behaviours when a prompt contains a word with multiple possible meanings, often generating images containing both senses of the word (Rassin et al., 2022). In this work we seek to put forward a possible explanation of this phenomenon. Using the similar Stable Diffusion model (Rombach et al., 2022), we first show that when given an input that is the sum of encodings of two distinct words, the model can produce an image containing both concepts represented in the sum. We then demonstrate that the CLIP encoder used to encode prompts (Radford et al., 2021) encodes polysemous words as a superposition of meanings, and that using linear algebraic techniques we can edit these representations to influence the senses represented in the generated images. Combining these two findings, we suggest that the homonym duplication phenomenon described by Rassin et al. (2022) is caused by diffusion models producing images representing both of the meanings that are present in superposition in the encoding of a polysemous word.
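The two ideas in the abstract, an encoding that is a sum of two meanings and a linear edit that suppresses one of them, can be illustrated with a toy sketch. The abstract does not say which linear algebraic technique the authors use, so the orthogonal projection below is only an assumed stand-in, and the 3-dimensional "sense" vectors and all function names are hypothetical; none of this involves real CLIP embeddings or Stable Diffusion.

```python
import math

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def remove_direction(x, d):
    """Project x onto the subspace orthogonal to d (assumed edit, not the paper's)."""
    scale = dot(x, d) / dot(d, d)
    return [a - scale * b for a, b in zip(x, d)]

# Hypothetical toy vectors standing in for the two senses of "bat".
animal_sense = [1.0, 0.0, 0.0]
sports_sense = [0.0, 1.0, 0.0]

# A "superposed" encoding: the sum of the two sense vectors.
bat = add(animal_sense, sports_sense)

# Suppress the sports sense by projecting out its direction.
edited = remove_direction(bat, sports_sense)

print("bat vs animal:", cosine(bat, animal_sense))
print("bat vs sports:", cosine(bat, sports_sense))
print("edited vs animal:", cosine(edited, animal_sense))
print("edited . sports:", dot(edited, sports_sense))
```

In this toy setting the summed vector is equally similar to both senses, and after the projection it has no remaining component along the removed sense. Whether real prompt encodings behave this cleanly is exactly the empirical question the paper investigates.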
