Paper Title
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
Paper Authors
Paper Abstract
Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential for describing the content of images; these features are usually extracted by an object detector such as Faster R-CNN. However, region features have several drawbacks: they lack contextual information, carry the risk of inaccurate detection, and are computationally expensive to extract. The first two issues can be addressed by additionally using grid-based features; however, how best to extract and fuse the two feature types remains an open question. This paper proposes a Transformer-only neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), that effectively utilizes both visual features to generate better captions. GRIT replaces the CNN-based detector employed in previous methods with a DETR-based one, making it computationally faster. Moreover, its monolithic design, consisting only of Transformers, enables end-to-end training of the model. This innovative design and the integration of the dual visual features bring about a significant performance improvement. Experimental results on several image captioning benchmarks show that GRIT outperforms previous methods in both inference accuracy and speed.
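To make the dual-feature design concrete, below is a minimal PyTorch sketch of a caption decoder that cross-attends in parallel to grid features (from a backbone) and region features (from a DETR-style detector head) and fuses the two attention outputs by summation. All class names, dimensions, and the additive fusion are illustrative assumptions for this sketch, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class DualFeatureCaptionDecoder(nn.Module):
    """Caption decoder cross-attending to grid and region features in parallel.

    Hypothetical sketch of the dual-visual-feature idea; layer sizes and
    module names are illustrative only.
    """

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                # Masked self-attention over the partial caption.
                "self_attn": nn.MultiheadAttention(d_model, nhead, batch_first=True),
                # One cross-attention module per visual feature type.
                "grid_attn": nn.MultiheadAttention(d_model, nhead, batch_first=True),
                "region_attn": nn.MultiheadAttention(d_model, nhead, batch_first=True),
                "ffn": nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.ReLU(),
                    nn.Linear(4 * d_model, d_model),
                ),
                "norm1": nn.LayerNorm(d_model),
                "norm2": nn.LayerNorm(d_model),
                "norm3": nn.LayerNorm(d_model),
            })
            for _ in range(num_layers)
        ])
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, grid_feats, region_feats):
        # tokens:       (B, T)            caption tokens generated so far
        # grid_feats:   (B, H*W, d_model) contextual features from the backbone
        # region_feats: (B, N, d_model)   object queries from a DETR-style head
        x = self.embed(tokens)
        t = x.size(1)
        causal = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        for lyr in self.layers:
            h, _ = lyr["self_attn"](x, x, x, attn_mask=causal)
            x = lyr["norm1"](x + h)
            # Attend to each feature stream separately, then fuse by summation
            # (a simple fusion choice; the paper studies how best to fuse).
            g, _ = lyr["grid_attn"](x, grid_feats, grid_feats)
            r, _ = lyr["region_attn"](x, region_feats, region_feats)
            x = lyr["norm2"](x + g + r)
            x = lyr["norm3"](x + lyr["ffn"](x))
        return self.out(x)  # (B, T, vocab_size) next-token logits


# Quick shape check with random inputs.
decoder = DualFeatureCaptionDecoder(vocab_size=10000)
logits = decoder(
    torch.randint(0, 10000, (2, 12)),  # two partial captions of length 12
    torch.randn(2, 49, 512),           # 7x7 grid features
    torch.randn(2, 100, 512),          # 100 region queries
)
print(logits.shape)  # torch.Size([2, 12, 10000])
```

Parallel cross-attention with additive fusion is only one way to combine the two streams; concatenating grid and region features into a single key/value sequence for one cross-attention is an equally simple alternative.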