Referring Expression Comprehension: A Survey of Methods and Datasets

Paper title

Referring Expression Comprehension: A Survey of Methods and Datasets

Authors

Yanyuan Qiao, Chaorui Deng, Qi Wu

Abstract

Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language. Unlike the object detection task, in which the queried object labels are predefined, the REC problem can only observe the queries at test time. It is thus more challenging than a conventional computer vision problem. This task has attracted a lot of attention from both the computer vision and natural language processing communities, and several lines of work have been proposed, from CNN-RNN models and modular networks to complex graph-based models. In this survey, we first examine the state of the art by comparing modern approaches to the problem. We classify methods by the mechanism they use to encode the visual and textual modalities. In particular, we examine the common approach of jointly embedding images and expressions into a common feature space. We also discuss modular architectures and graph-based models that interface with structured graph representations. In the second part of this survey, we review the datasets available for training and evaluating REC systems. We then group results according to datasets, backbone models, and settings so that they can be fairly compared. Finally, we discuss promising future directions for the field, in particular compositional referring expression comprehension, which requires longer reasoning chains to address.

Paper title