Paper Title
Length-Controllable Image Captioning
Paper Authors
Paper Abstract
The last decade has witnessed remarkable progress in the image captioning task; however, most existing methods cannot control their captions, \emph{e.g.}, choosing to describe the image either roughly or in detail. In this paper, we propose to use a simple length level embedding to endow them with this ability. Moreover, due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows. Thus, we further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity. We verify the merit of the proposed length level embedding on three models: two state-of-the-art (SOTA) autoregressive models with different types of decoder, as well as our proposed non-autoregressive model, to show its generalization ability. In the experiments, our length-controllable image captioning models not only achieve SOTA performance on the challenging MS COCO dataset but also generate length-controllable and diverse image captions. Specifically, our non-autoregressive model outperforms the autoregressive baselines in terms of controllability and diversity, and also significantly improves the decoding efficiency for long captions. Our code and models are released at \textcolor{magenta}{\texttt{https://github.com/bearcatt/LaBERT}}.
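The "simple length level embedding" described in the abstract can be pictured as a learned embedding, indexed by a discrete length bucket, that is added to the caption's token embeddings to condition generation on a desired length. The sketch below is our illustration of that idea, not the paper's implementation; the class name, bucket boundaries, and dimensions are all hypothetical.

```python
import torch
import torch.nn as nn

class LengthLevelEmbedding(nn.Module):
    """Hypothetical sketch: bucket caption lengths into discrete levels and
    add a learned per-level embedding to the token embeddings. The bucket
    boundaries and embedding size below are illustrative assumptions."""

    def __init__(self, level_bounds=(10, 14, 19, 25), embed_dim=768):
        super().__init__()
        self.level_bounds = level_bounds
        # One embedding per level, plus one for lengths beyond the last bound.
        self.level_embed = nn.Embedding(len(level_bounds) + 1, embed_dim)

    def level_of(self, length):
        # Map a caption length to its bucket index.
        for i, bound in enumerate(self.level_bounds):
            if length <= bound:
                return i
        return len(self.level_bounds)

    def forward(self, token_embeds, length):
        # token_embeds: (batch, seq_len, embed_dim)
        level = torch.tensor([self.level_of(length)])
        # (1, embed_dim) broadcasts over the sequence dimension.
        return token_embeds + self.level_embed(level)

# Usage: condition a 12-token caption on its length bucket.
tok = torch.zeros(1, 12, 768)
out = LengthLevelEmbedding()(tok, length=12)
print(out.shape)  # torch.Size([1, 12, 768])
```

At decoding time, the same mechanism lets the user pick a level up front to request a rough or detailed caption, which is the controllability the abstract refers to.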