Paper Title
Target-Driven Structured Transformer Planner for Vision-Language Navigation
Authors
Abstract
Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, which, however, has rarely been studied before in literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves previous best methods' success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP .