Paper Title
Symbiotic Attention with Privileged Information for Egocentric Action Recognition
Paper Authors
Paper Abstract
Egocentric video recognition is a natural testbed for diverse interaction reasoning. Due to the large action vocabulary in egocentric video datasets, recent studies usually adopt a two-branch structure for action recognition, i.e., one branch for verb classification and the other for noun classification. However, the correlation between the verb and noun branches has been largely ignored. Moreover, the two branches fail to exploit local features due to the absence of a position-aware attention mechanism. In this paper, we propose a novel Symbiotic Attention framework leveraging Privileged information (SAP) for egocentric video recognition. Finer position-aware object detection features can facilitate the understanding of an actor's interaction with objects. We introduce these features into action recognition and regard them as privileged information. Our framework enables mutual communication among the verb branch, the noun branch, and the privileged information. This communication process not only injects local details into global features but also exploits implicit guidance about the spatio-temporal position of an ongoing action. We introduce a novel symbiotic attention (SA) mechanism to enable effective communication. It first normalizes the detection-guided features on one branch to underline the action-relevant information from the other branch. SA adaptively enhances the interactions among the three sources. To further catalyze this communication, spatial relations are uncovered to select the most action-relevant information, identifying the most valuable and discriminative features for classification. We validate the effectiveness of SAP both quantitatively and qualitatively. Notably, it achieves state-of-the-art results on two large-scale egocentric video datasets.
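The communication the abstract describes can be illustrated with a minimal numerical sketch: one branch's global feature is refreshed by detection features that are first gated by the other branch and normalized, then attended over. This is a hypothetical simplification assuming elementwise gating and dot-product attention; the function and variable names (`symbiotic_attention`, `det_feats`, etc.) are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symbiotic_attention(branch_feat, other_feat, det_feats):
    """Sketch of one SA communication step (illustrative, not the paper's code).

    branch_feat: (d,)   global feature of the branch being updated (e.g. verb)
    other_feat:  (d,)   global feature of the other branch (e.g. noun)
    det_feats:   (n, d) position-aware object detection features
                        (the privileged information)
    """
    # Gate detection features by the other branch, then L2-normalize,
    # so regions relevant to the other branch are emphasized.
    guided = det_feats * other_feat
    guided = guided / (np.linalg.norm(guided, axis=1, keepdims=True) + 1e-8)
    # Attention weights: relevance of each detected region to this branch.
    weights = softmax(guided @ branch_feat)          # (n,)
    # Inject the attended local details into the global feature.
    local = weights @ det_feats                      # (d,)
    return branch_feat + local

rng = np.random.default_rng(0)
verb = rng.standard_normal(8)       # global verb-branch feature
noun = rng.standard_normal(8)       # global noun-branch feature
dets = rng.standard_normal((5, 8))  # 5 detection regions, dim 8
out = symbiotic_attention(verb, noun, dets)
print(out.shape)
```

The same step would run symmetrically with the roles of the verb and noun features swapped, which is what makes the communication mutual.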