Paper Title
Fine-grained Image-to-Image Transformation towards Visual Recognition
Paper Authors
Paper Abstract
Existing image-to-image transformation approaches primarily focus on synthesizing visually pleasing data. Generating images with correct identity labels is challenging yet much less explored. It is even more challenging to deal with image transformation tasks involving large deformations in poses, viewpoints, or scales while preserving the identity, such as face rotation and object viewpoint morphing. In this paper, we aim to transform an image of a fine-grained category to synthesize new images that preserve the identity of the input image, thereby benefiting the subsequent fine-grained image recognition and few-shot learning tasks. The generated images, transformed with large geometric deformations, do not necessarily need to be of high visual quality but are required to retain as much identity information as possible. To this end, we adopt a model based on generative adversarial networks to disentangle the identity-related and identity-unrelated factors of an image. To preserve the fine-grained contextual details of the input image during the deformable transformation, a constrained nonalignment connection method is proposed to construct learnable highways between intermediate convolution blocks in the generator. Moreover, an adaptive identity modulation mechanism is proposed to transfer the identity information into the output image effectively. Extensive experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than state-of-the-art image-to-image transformation models and, as a result, significantly boosts visual recognition performance in fine-grained few-shot learning.
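The abstract names two architectural components: a constrained nonalignment connection (learnable highways between intermediate convolution blocks of the generator) and an adaptive identity modulation mechanism. The sketch below is a minimal PyTorch illustration of how such components could plausibly look, not the authors' actual implementation: the class names, dimensions, AdaIN-style normalization, and global dot-product attention are all assumptions inferred from the abstract, and the spatial constraint implied by "constrained" is omitted for brevity.

```python
# Hypothetical sketch of the two components described in the abstract.
# All names and design choices here are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveIdentityModulation(nn.Module):
    """AdaIN-style modulation (an assumption): an identity embedding
    predicts a per-channel scale and shift for a generator feature map,
    injecting identity information into the output."""

    def __init__(self, channels: int, id_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale = nn.Linear(id_dim, channels)
        self.to_shift = nn.Linear(id_dim, channels)

    def forward(self, feat: torch.Tensor, id_code: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); id_code: (B, id_dim)
        scale = self.to_scale(id_code).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(id_code).unsqueeze(-1).unsqueeze(-1)
        return self.norm(feat) * (1 + scale) + shift


class NonalignmentConnection(nn.Module):
    """A learnable 'highway' between an encoder block and a decoder block,
    sketched as dot-product attention from each decoder location to all
    encoder locations, so fine-grained details can transfer even when the
    two feature maps are not spatially aligned (e.g., after a large pose
    change). The paper's spatial constraint is not reproduced here."""

    def __init__(self, channels: int, key_dim: int = 64):
        super().__init__()
        self.query = nn.Conv2d(channels, key_dim, 1)
        self.key = nn.Conv2d(channels, key_dim, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gate = nn.Parameter(torch.zeros(1))  # mixing weight, starts at identity

    def forward(self, dec_feat: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = dec_feat.shape
        q = self.query(dec_feat).flatten(2).transpose(1, 2)  # (B, HW, K)
        k = self.key(enc_feat).flatten(2)                    # (B, K, H'W')
        v = self.value(enc_feat).flatten(2).transpose(1, 2)  # (B, H'W', C)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)
        warped = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return dec_feat + torch.tanh(self.gate) * warped


# Usage example with made-up shapes:
dec = torch.randn(2, 128, 16, 16)   # decoder feature map
enc = torch.randn(2, 128, 16, 16)   # encoder feature map (identity-unrelated path)
idc = torch.randn(2, 256)           # disentangled identity code
feat = NonalignmentConnection(128)(dec, enc)
feat = AdaptiveIdentityModulation(128, 256)(feat, idc)
```

The zero-initialized gate means each highway starts as a no-op and only contributes once training finds it useful, one plausible way to keep such connections from destabilizing the GAN early on.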