Paper Title


Integrative Feature and Cost Aggregation with Transformers for Dense Correspondence

Authors

Sunghwan Hong, Seokju Cho, Seungryong Kim, Stephen Lin

Abstract


We present a novel architecture for dense correspondence. The current state-of-the-art methods are Transformer-based approaches that focus on either feature descriptors or cost volume aggregation. However, they generally aggregate one or the other but not both, even though joint aggregation would boost both by providing information that one has but the other lacks, i.e., the structural or semantic information of an image, or pixel-wise matching similarity. In this work, we propose a novel Transformer-based network that interleaves both forms of aggregation in a way that exploits their complementary information. Specifically, we design a self-attention layer that leverages the descriptor to disambiguate the noisy cost volume and that also utilizes the cost volume to aggregate features in a manner that promotes accurate matching. A subsequent cross-attention layer performs further aggregation conditioned on the descriptors of both images and aided by the aggregated outputs of earlier layers. We further boost the performance with hierarchical processing, in which coarser-level aggregations guide those at finer levels. We evaluate the effectiveness of the proposed method on dense matching tasks and achieve state-of-the-art performance on all the major benchmarks. Extensive ablation studies are also provided to validate our design choices.
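The core idea of the abstract — attending over feature descriptors and the cost volume jointly, so that attention weights are informed by both semantics and matching similarity — can be illustrated with a minimal sketch. This is NOT the paper's implementation: the token construction (concatenating each pixel's descriptor with its row of the cost volume) and the single-head, unprojected attention are simplifying assumptions made here purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(desc, cost):
    """Illustrative joint aggregation (not the paper's exact layer).

    desc: (N, D) per-pixel feature descriptors of the source image.
    cost: (N, M) matching costs of each source pixel against M target pixels.

    Each token concatenates a descriptor with its cost row, so the
    attention weights depend on both appearance and matching similarity;
    the same weights then aggregate descriptors and cost rows together.
    """
    tokens = np.concatenate([desc, cost], axis=-1)        # (N, D + M)
    scale = np.sqrt(tokens.shape[-1])
    attn = softmax(tokens @ tokens.T / scale, axis=-1)    # (N, N), rows sum to 1
    desc_out = attn @ desc                                # aggregated descriptors
    cost_out = attn @ cost                                # denoised cost rows
    return desc_out, cost_out
```

In the actual network, learned query/key/value projections, multi-head attention, a cross-attention stage conditioned on both images, and hierarchical coarse-to-fine processing replace this bare dot-product scheme.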
