Paper Title
When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous
Paper Authors
Paper Abstract
We aim to enable robots to visually localize a target person through the aid of an additional sensing modality -- the target person's 3D inertial measurements. The need for such technology may arise when a robot is to meet a person in a crowd for the first time or when an autonomous vehicle must rendezvous with a rider amongst a crowd without knowing the appearance of the person in advance. A person's inertial information can be measured with a wearable device such as a smartphone and can be shared selectively with an autonomous system during the rendezvous. We propose a method to learn a visual-inertial feature space in which the motion of a person in video can be easily matched to the motion measured by a wearable inertial measurement unit (IMU). The transformation of the two modalities into the joint feature space is learned through the use of a contrastive loss which forces inertial motion features and video motion features generated by the same person to lie close in the joint feature space. To validate our approach, we compose a dataset of over 60,000 video segments of moving people along with wearable IMU data. Our experiments show that our proposed method is able to accurately localize a target person with 80.7% accuracy using only 5 seconds of IMU data and video.
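The sketch below illustrates the kind of training objective the abstract describes: two encoders map a short IMU sequence and the corresponding per-person video motion descriptor into a shared feature space, and a pairwise contrastive loss pulls same-person pairs together while pushing different-person pairs apart. This is not the authors' released code; the encoder architectures, input dimensions, and margin value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMUEncoder(nn.Module):
    """Encodes a (batch, time, 6) accelerometer + gyroscope sequence
    into an L2-normalized feature vector (architecture is assumed)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.gru = nn.GRU(input_size=6, hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, imu_seq):
        _, h = self.gru(imu_seq)                    # h: (1, batch, 128)
        return F.normalize(self.fc(h[-1]), dim=-1)  # (batch, feat_dim)

class VideoMotionEncoder(nn.Module):
    """Encodes a (batch, time, motion_dim) per-frame motion descriptor for
    one tracked person (e.g. keypoint displacements; assumed input)."""
    def __init__(self, motion_dim=34, feat_dim=128):
        super().__init__()
        self.gru = nn.GRU(input_size=motion_dim, hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, motion_seq):
        _, h = self.gru(motion_seq)
        return F.normalize(self.fc(h[-1]), dim=-1)

def contrastive_loss(f_imu, f_vid, same_person, margin=0.5):
    """Pairwise contrastive loss: same-person pairs (same_person == 1) are
    pulled together; different-person pairs are pushed at least `margin`
    apart in the joint feature space."""
    d = F.pairwise_distance(f_imu, f_vid)
    pos = same_person * d.pow(2)
    neg = (1.0 - same_person) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()
```

Under this formulation, rendezvous-time localization reduces to a nearest-neighbor query: the robot embeds the shared IMU stream once, embeds the motion of every person it tracks in view, and declares the person whose video-motion embedding lies closest to the IMU embedding as the target.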