Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features

Paper Title

Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features

Authors

Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

Abstract

Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding. In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks such as image retrieval, fine-grained classification, and visual question answering. In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities. The novelty of the proposed model consists of the usage of a PHOC descriptor to construct a bag of textual words along with a Fisher Vector Encoding that captures the morphology of text. This approach provides a stronger multimodal representation for this task and, as our experiments demonstrate, achieves state-of-the-art results on two different tasks: fine-grained classification and image retrieval.
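As a rough illustration of the PHOC (Pyramidal Histogram of Characters) descriptor mentioned in the abstract, the sketch below builds a binary vector that records which characters occur in which portion of a word, at several levels of subdivision. It is a minimal sketch and not the authors' implementation: the function name `phoc`, the 36-symbol alphabet, and the pyramid levels (2, 3, 4, 5) are assumptions chosen for the example, and the bigram component that PHOC variants often append is omitted.

```python
import string

# Assumed settings for this sketch; the abstract does not specify them.
ALPHABET = string.ascii_lowercase + string.digits  # 36 symbols
LEVELS = (2, 3, 4, 5)


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Return a binary PHOC-style vector for `word`.

    For each level L the word is split into L equal regions, and each
    region gets one bit per alphabet symbol. A bit is set when at least
    half of a character's span falls inside that region.
    """
    chars = [c for c in word.lower() if c in alphabet]
    n = len(chars)
    vec = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(alphabet)
            # Work in integer units of 1 / (n * level) to avoid float error.
            r_start, r_end = region * n, (region + 1) * n
            for k, ch in enumerate(chars):
                c_start, c_end = k * level, (k + 1) * level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                # Character span has length `level` in these units.
                if 2 * overlap >= level:
                    bits[alphabet.index(ch)] = 1
            vec.extend(bits)
    return vec


if __name__ == "__main__":
    v = phoc("bakery")
    print(len(v), sum(v))
```

Because words with similar spelling set many of the same bits, vectors of this kind keep morphologically close strings near each other, which is the property the abstract relies on when it says the encoding captures the morphology of text.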
