Paper Title
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Paper Authors
Paper Abstract
In the past few years, the emergence of vision-language pre-training (VLP) has brought cross-modal retrieval to a new era. However, due to its latency and computation demand, it is commonly challenging to apply VLP in a real-time online retrieval system. To alleviate this defect, this paper proposes \textbf{Hi}erarchical \textbf{V}ision-\textbf{L}anguage \textbf{P}re-Training (\textbf{HiVLP}) for fast Image-Text Retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective, which uses representations of different dimensions for coarse-to-fine ITR, i.e., low-dimensional representations for large-scale coarse retrieval and high-dimensional representations for small-scale fine retrieval. We evaluate our proposed HiVLP on two popular image-text retrieval benchmarks, i.e., Flickr30K and COCO. Extensive experiments demonstrate that our HiVLP not only has fast inference speed but also can be easily scaled to large-scale ITR scenarios. The detailed results show that HiVLP is $1{,}427\sim120{,}649\times$ faster than the fusion-based model UNITER and $2\sim5\times$ faster than the fastest embedding-based model LightningDOT in different candidate scenarios. It also achieves about +4.9 AR on COCO and +3.8 AR on Flickr30K over LightningDOT, and performs comparably to the state-of-the-art (SOTA) fusion-based model METER.
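To make the coarse-to-fine idea concrete, below is a minimal NumPy sketch of the two-stage retrieval the abstract describes: a cheap low-dimensional pass over the full gallery prunes candidates, and a high-dimensional pass re-ranks only the survivors. The function name `hierarchical_retrieval`, the embedding dimensions (64 and 768), and the dict-based layout are illustrative assumptions, not HiVLP's actual interface.

```python
import numpy as np

def hierarchical_retrieval(query_vecs, gallery_vecs, top_k=100):
    """Coarse-to-fine retrieval sketch in the spirit of HiVLP's
    hierarchical objective (names and shapes are hypothetical).

    query_vecs / gallery_vecs: dicts with 'low' and 'high'
    L2-normalized embeddings.
    """
    # Stage 1: coarse retrieval over the full gallery with
    # low-dimensional embeddings (fast dot products).
    coarse_scores = gallery_vecs["low"] @ query_vecs["low"]       # (N,)
    candidates = np.argpartition(-coarse_scores, top_k)[:top_k]   # top-K ids

    # Stage 2: fine retrieval only over the K candidates with
    # high-dimensional embeddings (more accurate, more expensive).
    fine_scores = gallery_vecs["high"][candidates] @ query_vecs["high"]
    order = np.argsort(-fine_scores)
    return candidates[order], fine_scores[order]

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
def normed(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

gallery = {"low": normed(rng.normal(size=(10_000, 64))),
           "high": normed(rng.normal(size=(10_000, 768)))}
query = {"low": normed(rng.normal(size=64)),
         "high": normed(rng.normal(size=768))}
ids, scores = hierarchical_retrieval(query, gallery, top_k=100)
```

The speedup reported in the abstract follows from this structure: the expensive high-dimensional scoring touches only the small candidate set, so its cost stays roughly constant as the gallery grows.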