Paper Title
Self-Supervised Moving Vehicle Detection from Audio-Visual Cues
Paper Authors
Paper Abstract
Robust detection of moving vehicles is a critical task for any autonomously operating outdoor robot or self-driving vehicle. Most modern approaches for solving this task rely on training image-based detectors using large-scale vehicle detection datasets such as nuScenes or the Waymo Open Dataset. Providing manual annotations is an expensive and laborious exercise that does not scale well in practice. To tackle this problem, we propose a self-supervised approach that leverages audio-visual cues to detect moving vehicles in videos. Our approach employs contrastive learning for localizing vehicles in images from corresponding pairs of images and recorded audio. In extensive experiments carried out with a real-world dataset, we demonstrate that our approach provides accurate detections of moving vehicles and does not require manual annotations. We furthermore show that our model can be used as a teacher to supervise an audio-only detection model. This student model is invariant to illumination changes and thus effectively bridges the domain gap inherent to models leveraging exclusively vision as the predominant modality.
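To make the contrastive objective concrete, the following is a minimal sketch of an InfoNCE-style audio-visual contrastive loss, assuming a batch of paired image and audio embeddings produced by two separate encoders. The function name, temperature value, and symmetric formulation are illustrative assumptions, not the paper's exact loss.

```python
# Hypothetical sketch of audio-visual contrastive learning (InfoNCE-style);
# the paper's actual architecture and loss formulation may differ.
import torch
import torch.nn.functional as F

def audio_visual_infonce(image_emb: torch.Tensor,
                         audio_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss pulling together embeddings of corresponding
    image/audio pairs and pushing apart mismatched pairs in the batch.

    image_emb, audio_emb: (batch, dim) embeddings from separate image
    and audio encoders (both hypothetical here).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    # Pairwise cosine similarities; diagonal entries are the true pairs.
    logits = image_emb @ audio_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over image-to-audio and audio-to-image matching.
    loss_i2a = F.cross_entropy(logits, targets)
    loss_a2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2a + loss_a2i)
```

In this setup, corresponding image-audio pairs along the diagonal act as positives while all mismatched pairs in the batch serve as negatives, which is what allows such a model to associate sound-emitting vehicles with image regions without manual labels.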
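The teacher-student step mentioned at the end of the abstract can be sketched similarly: a frozen audio-visual teacher produces pseudo-labels (for instance, per-image vehicle detection heatmaps) that an audio-only student learns to reproduce. The module interfaces, tensor shapes, and MSE loss below are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of audio-visual teacher supervising an audio-only
# student; interfaces and loss are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_pseudo_labels(teacher: torch.nn.Module,
                          images: torch.Tensor,
                          audio: torch.Tensor) -> torch.Tensor:
    """Run the frozen audio-visual teacher to obtain target heatmaps."""
    teacher.eval()
    return teacher(images, audio)

def student_distillation_loss(student: torch.nn.Module,
                              audio: torch.Tensor,
                              targets: torch.Tensor) -> torch.Tensor:
    """The audio-only student regresses the teacher's heatmaps, so at
    test time it detects vehicles from sound alone, independent of
    illumination conditions."""
    pred = student(audio)
    return F.mse_loss(pred, targets)
```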