Paper Title
Self-Supervised Visuo-Tactile Pretraining to Locate and Follow Garment Features
Paper Authors
Paper Abstract
Humans make extensive use of vision and touch as complementary senses, with vision providing global information about the scene and touch measuring local information during manipulation without suffering from occlusions. While prior work demonstrates the efficacy of tactile sensing for precise manipulation of deformables, it typically relies on supervised, human-labeled datasets. We propose Self-Supervised Visuo-Tactile Pretraining (SSVTP), a framework for learning multi-task visuo-tactile representations in a self-supervised manner through cross-modal supervision. We design a mechanism that enables a robot to autonomously collect precisely spatially-aligned visual and tactile image pairs, then train visual and tactile encoders to embed these pairs into a shared latent space using a cross-modal contrastive loss. We apply this latent space to downstream perception and control of deformable garments on flat surfaces, and evaluate the flexibility of the learned representations without fine-tuning on 5 tasks: feature classification, contact localization, anomaly detection, feature search from a visual query (e.g., garment feature localization under occlusion), and edge following along cloth edges. The pretrained representations achieve a 73-100% success rate on these 5 tasks.
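To make the pairing objective concrete, the following is a minimal sketch (not the authors' released code) of a symmetric cross-modal contrastive loss of the kind the abstract describes, assuming a CLIP/InfoNCE-style formulation in which the i-th visual embedding and the i-th tactile embedding form the positive pair; the encoder dimensions, batch size, and temperature value are illustrative assumptions.

```python
# Sketch of a cross-modal contrastive loss for aligned visual/tactile pairs.
# Assumes each batch contains N spatially-aligned pairs; pair (i, i) is the
# positive, all other combinations in the batch serve as negatives.
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(visual_emb: torch.Tensor,
                                 tactile_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """visual_emb, tactile_emb: (N, D) embeddings of N aligned image pairs."""
    v = F.normalize(visual_emb, dim=-1)   # unit-normalize each modality
    t = F.normalize(tactile_emb, dim=-1)
    logits = v @ t.T / temperature        # (N, N) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)  # pair i matches pair i
    # Symmetric objective: visual-to-tactile and tactile-to-visual retrieval.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2


if __name__ == "__main__":
    # Example: a batch of 32 hypothetical 128-dimensional embedding pairs.
    vis = torch.randn(32, 128)
    tac = torch.randn(32, 128)
    print(cross_modal_contrastive_loss(vis, tac))
```

Under such an objective, the shared latent space supports the downstream tasks above by nearest-neighbor matching between modalities (e.g., retrieving the tactile embedding closest to a visual query), which is consistent with the paper's claim of applying the representations without fine-tuning.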