Paper Title
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Paper Authors
Paper Abstract
In the past few years, the emergence of vision-language pre-training (VLP) has brought cross-modal retrieval to a new era. However, due to its latency and computation demand, it is commonly challenging to apply VLP in a real-time online retrieval system. To alleviate this defect, this paper proposes \textbf{Hi}erarchical \textbf{V}ision-\textbf{L}anguage \textbf{P}re-Training (\textbf{HiVLP}) for fast Image-Text Retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective, which uses representations of different dimensions for coarse-to-fine ITR, i.e., low-dimensional representations for large-scale coarse retrieval and high-dimensional representations for small-scale fine retrieval. We evaluate our proposed HiVLP on two popular image-text retrieval benchmarks, i.e., Flickr30K and COCO. Extensive experiments demonstrate that our HiVLP not only has fast inference speed but also can be easily scaled to large-scale ITR scenarios. The detailed results show that HiVLP is $1{,}427\sim120{,}649\times$ faster than the fusion-based model UNITER and $2\sim5\times$ faster than the fastest embedding-based model LightningDOT in different candidate scenarios. It also achieves about +4.9 AR on COCO and +3.8 AR on Flickr30K over LightningDOT, and performs comparably to the state-of-the-art (SOTA) fusion-based model METER.
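To make the coarse-to-fine idea concrete, below is a minimal NumPy sketch of the two-stage retrieval the abstract describes: a cheap low-dimensional pass over the full gallery prunes candidates, and a high-dimensional pass re-ranks only the survivors. The function name `hierarchical_retrieval`, the embedding dimensions (64 and 768), and the dict-based layout are illustrative assumptions, not HiVLP's actual interface.

```python
import numpy as np

def hierarchical_retrieval(query_vecs, gallery_vecs, top_k=100):
    """Coarse-to-fine retrieval sketch in the spirit of HiVLP's
    hierarchical objective (names and shapes are hypothetical).

    query_vecs / gallery_vecs: dicts with 'low' and 'high'
    L2-normalized embeddings.
    """
    # Stage 1: coarse retrieval over the full gallery with
    # low-dimensional embeddings (fast dot products).
    coarse_scores = gallery_vecs["low"] @ query_vecs["low"]       # (N,)
    candidates = np.argpartition(-coarse_scores, top_k)[:top_k]   # top-K ids

    # Stage 2: fine retrieval only over the K candidates with
    # high-dimensional embeddings (more accurate, more expensive).
    fine_scores = gallery_vecs["high"][candidates] @ query_vecs["high"]
    order = np.argsort(-fine_scores)
    return candidates[order], fine_scores[order]

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
def normed(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

gallery = {"low": normed(rng.normal(size=(10_000, 64))),
           "high": normed(rng.normal(size=(10_000, 768)))}
query = {"low": normed(rng.normal(size=64)),
         "high": normed(rng.normal(size=768))}
ids, scores = hierarchical_retrieval(query, gallery, top_k=100)
```

The speedup reported in the abstract follows from this structure: the expensive high-dimensional scoring touches only the small candidate set, so its cost stays roughly constant as the gallery grows.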