Learning Spatial-Frequency Transformer for Visual Object Tracking

Paper Title

Learning Spatial-Frequency Transformer for Visual Object Tracking

Authors

Chuanming Tang, Xiao Wang, Yuanchao Bai, Zhe Wu, Jianlin Zhang, Yongmei Huang

Abstract

Recent trackers adopt the Transformer to complement or replace the widely used ResNet as their backbone network. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to match the Transformer's input format. We believe this operation ignores the spatial prior of the target object, which may lead to sub-optimal results. In addition, many works demonstrate that self-attention acts as a low-pass filter, regardless of the input features or the keys/queries. That is, it may suppress the high-frequency components of the input features and preserve or even amplify the low-frequency information. To handle these issues, we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. Specifically, the Gaussian spatial prior is generated using dual Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the Query and Key features in self-attention. The output is fed into a Softmax layer and then decomposed into two components: the direct signal and the high-frequency signal. The low-pass and high-pass branches are rescaled and combined to achieve an all-pass response, so the high-frequency features are well preserved across stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm, termed SFTransT. A SwinTransformer with cross-scale fusion is adopted as the backbone, and a multi-head cross-attention module is used to boost the interaction between search and template features. The output is fed into the tracking head for target localization. Extensive experiments on both short-term and long-term tracking benchmarks demonstrate the effectiveness of the proposed framework.
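
The attention step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions the abstract does not state: the Gaussian prior uses a fixed `sigma` instead of parameters predicted by the dual MLPs, the prior is added to the similarity matrix as a log-domain bias, the direct signal is taken to be the uniform averaging matrix, and `alpha` and `beta` are fixed scalars standing in for the rescaling weights. All function and parameter names are hypothetical.

```python
import numpy as np


def gaussian_prior(h, w, sigma):
    """Log-Gaussian bias between every pair of positions on an h x w grid.

    Assumption: fixed sigma; the paper generates the prior with dual MLPs.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)  # (N, 2)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)  # (N, N)
    return -d2 / (2.0 * sigma ** 2)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gpha_attention(x, wq, wk, wv, h, w, sigma=2.0, alpha=1.0, beta=2.0):
    """Single-head attention with a spatial prior and high-frequency emphasis.

    x: (h*w, c) flattened feature map; wq, wk, wv: (c, d) projections.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    n, d = q.shape
    sim = q @ k.T / np.sqrt(d)                        # Query-Key similarity
    attn = softmax(sim + gaussian_prior(h, w, sigma)) # prior injected before Softmax
    dc = np.full((n, n), 1.0 / n)                     # assumed direct (low-pass) part
    hf = attn - dc                                    # remaining high-frequency part
    # Rescale both branches and recombine; beta > alpha emphasizes high frequencies.
    return (alpha * dc + beta * hf) @ v
```

With `alpha = beta = 1` the two branches sum back to the ordinary attention matrix, so the sketch reduces to standard self-attention with a spatial bias; choosing `beta > alpha` amplifies the part of the attention output that deviates from the global average.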
