Title
Image Captioning with Compositional Neural Module Networks
Authors
Abstract
In image captioning, where fluency is an important factor in evaluation, e.g., $n$-gram metrics, sequential models are commonly used; however, sequential models generally produce overgeneralized expressions that lack the details that may be present in an input image. Inspired by the idea of compositional neural module networks in the visual question answering task, we introduce a hierarchical framework for image captioning that explores both the compositionality and the sequentiality of natural language. Our algorithm learns to compose a detail-rich sentence by selectively attending to different modules corresponding to unique aspects of each object detected in an input image, so as to include specific descriptions such as counts and colors. In a set of experiments on the MSCOCO dataset, the proposed model outperforms a state-of-the-art model across multiple evaluation metrics and, more importantly, produces visually interpretable results. Furthermore, a breakdown of the subcategory $f$-scores of the SPICE metric and human evaluation on Amazon Mechanical Turk show that our compositional module networks effectively generate accurate and detailed captions.
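The core mechanism described above, selectively attending to different modules per detected object, can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the paper's actual architecture: the module names, dimensions, and linear maps are all illustrative assumptions. It shows the general pattern of soft module selection, where attention weights over specialized modules are computed from an object's feature vector and the module outputs are mixed accordingly, keeping the selection differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical toy "modules", each specializing in one aspect of an
# object (e.g., color, count, attribute); here each is a linear map.
d = 8
modules = [rng.standard_normal((d, d)) for _ in range(3)]
W_att = rng.standard_normal((3, d))  # attention scorer: one score per module

def compose_step(obj_feat):
    """Soft module selection: attend over modules, mix their outputs."""
    scores = W_att @ obj_feat              # (3,) unnormalized module scores
    weights = softmax(scores)              # differentiable module selection
    outputs = np.stack([M @ obj_feat for M in modules])  # (3, d)
    mixed = (weights[:, None] * outputs).sum(axis=0)     # (d,) mixed output
    return weights, mixed

obj_feat = rng.standard_normal(d)          # stand-in for a detected object's features
weights, mixed = compose_step(obj_feat)
```

In a full captioning model, `mixed` would feed a sequential decoder at each step, and `weights` would expose which module (e.g., color vs. count) drove each generated word, which is what makes the results visually interpretable.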