Paper Title
Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning
Paper Authors
Paper Abstract
Vision and voice are two vital keys to agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information from visual observations to enhance a robot's understanding of its environment. We use single RGB images taken by a first-person-view monocular camera. We also apply a self-attention mechanism to keep the agent focused on key areas. Memory is important for the agent both to avoid repeating certain tasks unnecessarily and to adapt adequately to new scenes; we therefore employ meta-learning. We have experimented with various functional features extracted from visual observations. Comparative experiments demonstrate that our method outperforms state-of-the-art baselines.
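The abstract does not specify how the self-attention mechanism is applied to the visual features. A minimal sketch of one plausible reading follows: single-head self-attention re-weighting spatial features extracted from a single RGB frame, so the agent attends to key regions. All names (`VisualSelfAttention`, the 7x7x512 feature shape typical of a ResNet backbone) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VisualSelfAttention(nn.Module):
    """Hypothetical single-head self-attention over spatial CNN features."""
    def __init__(self, dim):
        super().__init__()
        # Learned projections for queries, keys, and values
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5  # scaled dot-product attention

    def forward(self, feats):
        # feats: (batch, num_regions, dim), e.g. a flattened conv feature map
        q, k, v = self.query(feats), self.key(feats), self.value(feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # features re-weighted toward salient regions

# Usage: features from a backbone applied to one first-person RGB frame
feats = torch.randn(1, 49, 512)          # assumed 7x7 spatial grid, 512-dim
attended = VisualSelfAttention(512)(feats)
print(attended.shape)                    # torch.Size([1, 49, 512])
```

The attended features could then be aggregated with the voice-command embedding and fed to the navigation policy; the exact fusion and the meta-learning procedure are left unspecified by the abstract.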