Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension

Paper Title

Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension

Authors

Zhang, Huibin; Zhang, Zhengkun; Zhang, Yao; Wang, Jun; Li, Yufan; Jiang, Ning; Wei, Xin; Yang, Zhenglu

Abstract

Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step. Comprehending PMDs and inducing their representations for downstream reasoning tasks is designated as Procedural MultiModal Machine Comprehension (M3C). In this study, we approach Procedural M3C at a fine-grained level, that of the entity, in contrast to existing explorations at the document or sentence level. We model each entity in both its temporal and cross-modal relations and propose a novel Temporal-Modal Entity Graph (TMEG). Specifically, the graph structure is formulated to capture textual and visual entities and trace their temporal-modal evolution. In addition, a graph aggregation module is introduced to conduct graph encoding and reasoning. Comprehensive experiments across three Procedural M3C tasks are conducted on the traditional dataset RecipeQA and our new dataset CraftQA, which can better evaluate the generalization of TMEG.
