A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset

Paper Title

A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset

Authors

Jiaxin Deng, Dong Shen, Haojie Pan, Xiangyu Wu, Ximan Liu, Gaofeng Meng, Fan Yang, Size Li, Ruiji Fu, Zhongyuan Wang

Abstract

Video understanding is an important task on short-video platforms, with wide application in video recommendation and classification. Most existing work on video understanding focuses only on information that appears within the video content itself, including video frames, audio, and text. However, introducing commonsense knowledge from an external Knowledge Graph (KG) dataset is essential for video understanding when referring to content that is less relevant to the video. Owing to the lack of video knowledge graph datasets, work that integrates video understanding and KGs is rare. In this paper, we propose a heterogeneous dataset that contains multi-modal video entities and rich commonsense relations. The dataset also provides multiple novel video inference tasks, such as the Video-Relation-Tag (VRT) and Video-Relation-Video (VRV) tasks. Furthermore, based on this dataset, we propose an end-to-end model that jointly optimizes the video understanding objective with knowledge graph embedding, which not only better injects factual knowledge into video understanding but also generates effective multi-modal entity embeddings for the KG. Comprehensive experiments indicate that combining video understanding embeddings with factual knowledge benefits content-based video retrieval performance. Moreover, it helps the model generate better knowledge graph embeddings, which outperform traditional KGE-based methods on the VRT and VRV tasks with at least 42.36% and 17.73% improvement in HITS@10, respectively.
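The abstract reports its link-prediction results in HITS@10. For readers unfamiliar with the metric, the following is a minimal sketch of how Hits@k is commonly computed for knowledge graph embedding evaluation: each test query (for example, a video and a relation with the tail to be predicted) yields a score for every candidate entity, the true entity is ranked among them, and Hits@k is the fraction of queries whose true entity lands in the top k. The function names, the higher-is-better score convention, and the pessimistic tie handling are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def rank_of_true(scores: Sequence[float], true_index: int) -> int:
    """Return the 1-based rank of the true candidate.

    Assumes a higher score means a more plausible candidate. Ties are
    resolved pessimistically: any other candidate with an equal score
    is counted as ranking ahead of the true one.
    """
    true_score = scores[true_index]
    ahead = sum(
        1
        for i, s in enumerate(scores)
        if i != true_index and s >= true_score
    )
    return ahead + 1


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of queries whose true entity is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five test queries (illustrative values only).
example_ranks = [1, 3, 12, 10, 25]
example_hits = hits_at_k(example_ranks, k=10)  # 3 of the 5 ranks are <= 10
```

Whether the reported percentages are absolute or relative gains in this metric is not stated in the abstract, so the sketch covers only the metric itself.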
