Paper title
Language and Visual Entity Relationship Graph for Agent Navigation
Paper authors
Paper abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment. To capture and utilize these relationships, we propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision, and the intra-modal relationships among visual entities. We propose a message passing algorithm for propagating information between language elements and visual entities in the graph, which we then combine to determine the next action to take. Experiments show that by taking advantage of the relationships we are able to improve over the state of the art. On the Room-to-Room (R2R) benchmark, our method achieves the new best performance on the test unseen split, with a success rate weighted by path length (SPL) of 52%. On the Room-for-Room (R4R) dataset, our method significantly improves the previous best from 13% to 34% on the success weighted by normalized dynamic time warping (SDTW). Code is available at: https://github.com/YicongHong/Entity-Graph-VLN.
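For readers unfamiliar with the SPL metric quoted in the abstract, the following is a minimal sketch of how it is commonly computed, assuming the usual definition: each successful episode contributes the ratio of the shortest-path length to the larger of the shortest-path length and the length of the path actually taken, and the result is averaged over all episodes. The function name `spl` and the episode values are hypothetical and chosen for illustration; they are not taken from the paper's code or results.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Assumed definition: mean over episodes of S_i * l_i / max(p_i, l_i),
    where S_i is the success flag, l_i the shortest-path length, and
    p_i the length of the path the agent actually took.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Failed episodes (and degenerate zero-length ones) contribute 0.
        if success and denom > 0:
            total += shortest / denom
    return total / len(successes)


# Three hypothetical episodes: an optimal success, a success that took
# twice the shortest path, and a failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
print(f"SPL = {score:.2f}")
```

Under this definition a detour lowers the score even when the agent reaches the goal, so SPL rewards efficient navigation as well as success.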