Paper Title
AudioGen: Textually Guided Audio Generation
Paper Authors
Paper Abstract
We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people speaking simultaneously). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AudioGen outperforms on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuations, both conditionally and unconditionally. Samples: https://felixkreuk.github.io/audiogen
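
To make the mixing augmentation concrete, here is a minimal sketch in Python. It assumes raw waveforms as 1-D NumPy float arrays; the SNR range and the way captions are joined are illustrative assumptions, not the paper's exact recipe:

import numpy as np

def mix_samples(wav_a, wav_b, caption_a, caption_b, rng=None):
    """Mix two audio samples at a random SNR and join their captions.

    Hypothetical sketch of the augmentation: summing two scaled
    waveforms forces the model to attend to both described sources.
    """
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(-5.0, 5.0)  # assumed SNR range
    n = min(len(wav_a), len(wav_b))  # trim to a common length
    wav_a, wav_b = wav_a[:n], wav_b[:n]
    # Scale wav_b so the pair matches the sampled SNR.
    power_a = np.mean(wav_a ** 2) + 1e-8
    power_b = np.mean(wav_b ** 2) + 1e-8
    scale = np.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    mixed = wav_a + scale * wav_b
    mixed /= max(1.0, np.abs(mixed).max())  # avoid clipping
    caption = f"{caption_a} and {caption_b}"  # assumed caption merge
    return mixed, caption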
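
Classifier-free guidance at sampling time can be sketched similarly. This shows only the generic formulation; model, the token/embedding arguments, and the guidance scale gamma are placeholders, not AudioGen's actual API:

import torch

@torch.no_grad()
def cfg_logits(model, tokens, text_emb, null_emb, gamma=3.0):
    # Classifier-free guidance: push the conditional prediction away
    # from the unconditional one to strengthen adherence to the text:
    #   logits = logits_uncond + gamma * (logits_cond - logits_uncond)
    logits_cond = model(tokens, text_emb)    # conditioned on the caption
    logits_uncond = model(tokens, null_emb)  # conditioned on a "null" text
    return logits_uncond + gamma * (logits_cond - logits_uncond)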
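
The multi-stream claim is essentially arithmetic over sequence length and bitrate; a toy calculation under assumed numbers (the codebook size and frame rates are illustrative, not the paper's settings):

import math

codebook_size = 1024  # assumed; bits per token = log2(1024) = 10
bits_per_token = math.log2(codebook_size)

# Single stream: one token per frame at an assumed 100 Hz frame rate.
single_len_per_sec = 100
single_bitrate = single_len_per_sec * bits_per_token  # 1000 bits/s

# Two streams: two tokens per frame at 50 Hz -- the same bitrate,
# but the autoregressive sequence is half as long.
multi_streams = 2
multi_len_per_sec = 50
multi_bitrate = multi_streams * multi_len_per_sec * bits_per_token  # 1000 bits/s

print(single_len_per_sec, multi_len_per_sec, single_bitrate, multi_bitrate)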