Towards Robust Video Object Segmentation with Adaptive Object Calibration

Paper Title

Towards Robust Video Object Segmentation with Adaptive Object Calibration

Authors

Xiaohao Xu, Jinglu Wang, Xiang Ming, Yan Lu

Abstract

In the booming video era, video segmentation is attracting increasing research attention in the multimedia community. Semi-supervised video object segmentation (VOS) aims to segment objects in all target frames of a video, given annotated object masks of reference frames. Most existing methods build pixel-wise reference-target correlations and then perform pixel-wise tracking to obtain target masks. Because they neglect object-level cues, pixel-level approaches leave tracking vulnerable to perturbations and can even fail to discriminate among similar objects. Towards robust VOS, the key insight is to calibrate the representation and mask of each specific object so that they are expressive and discriminative. Accordingly, we propose a new deep network that adaptively constructs object representations and calibrates object masks to achieve stronger robustness. First, we construct the object representations by applying an adaptive object proxy (AOP) aggregation method, where the proxies represent arbitrary-shaped segments at multiple levels for reference. Then, prototype masks (proto-masks) are initially generated from the reference-target correlations based on AOP. Afterwards, these proto-masks are further calibrated through network modulation, conditioned on the object proxy representations. We consolidate this conditional mask calibration process in a progressive manner, where the object representations and proto-masks iteratively evolve to become more discriminative. Extensive experiments are conducted on the standard VOS benchmarks YouTube-VOS-18/19 and DAVIS-17. Our model achieves state-of-the-art performance among existing published works and also exhibits superior robustness against perturbations. Our project repo is at https://github.com/JerryX1110/Robust-Video-Object-Segmentation
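
The abstract's first two steps (building an object-level representation from the annotated reference frame, then generating an initial mask from reference-target correlation) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single mean-pooled proxy per region as a simplified stand-in for AOP aggregation, uses cosine similarity as the correlation, and omits the network-modulation calibration step entirely. The function names `object_proxy` and `proto_mask` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def object_proxy(ref_feats, ref_mask):
    """Average the reference features of the pixels selected by the mask.

    Assumption: one mean-pooled proxy per region, a stand-in for the
    paper's multi-level, arbitrary-shaped adaptive object proxies.
    """
    selected = [f for f, m in zip(ref_feats, ref_mask) if m]
    if not selected:
        raise ValueError("reference mask is empty")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def proto_mask(target_feats, fg_proxy, bg_proxy):
    """Label a target pixel 1 if it correlates more with the object proxy."""
    return [1 if cosine(f, fg_proxy) > cosine(f, bg_proxy) else 0
            for f in target_feats]

# Toy reference frame: 4 pixels with 2-D features, flattened to a list.
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_mask = [1, 1, 0, 0]  # annotated object mask of the reference frame

fg = object_proxy(ref_feats, ref_mask)                    # object proxy
bg = object_proxy(ref_feats, [1 - m for m in ref_mask])   # background proxy

# Toy target frame: 3 pixels to be segmented.
target_feats = [[0.8, 0.2], [0.2, 0.8], [1.0, 0.1]]
mask = proto_mask(target_feats, fg, bg)
print(mask)
```

In the paper's pipeline, a mask like this would only be a starting point: the abstract describes further calibrating it through network modulation conditioned on the proxies, repeated progressively.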
