Paper Title


MAXIM: Multi-Axis MLP for Image Processing

Paper Authors

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

Abstract


Recent progress on Transformers and multi-layer perceptron (MLP) models provide new network architectural designs for computer vision tasks. Although these models proved to be effective in many vision tasks such as image recognition, there remain challenges in adapting them for low-level vision. The inflexibility to support high-resolution images and limitations of local attention are perhaps the main bottlenecks. In this work, we present a multi-axis MLP based architecture called MAXIM, that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature conditioning. Both these modules are exclusively based on MLPs, but also benefit from being both global and `fully-convolutional', two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement while requiring fewer or comparable numbers of parameters and FLOPs than competitive models. The source code and trained models will be available at \url{https://github.com/google-research/maxim}.
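The "multi-axis" idea in the abstract rests on processing the same feature map along two complementary axes: a block axis that groups neighboring pixels (local mixing) and a grid axis that groups pixels sampled evenly across the whole image (global, dilated mixing). The sketch below illustrates only this partitioning step with NumPy reshapes; the function names, the block/grid size, and the toy input are illustrative assumptions, not the authors' implementation (which additionally applies gated MLPs along each axis).

```python
import numpy as np

def block_partition(x, b):
    """Split (H, W, C) into non-overlapping b x b blocks (local axis).

    Mixing along axis 1 of the result mixes *neighboring* pixels.
    """
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, b * b, C)

def grid_partition(x, g):
    """Split (H, W, C) into a g x g grid of strided pixels (global axis).

    Each group in the result spans the full image, so mixing along
    axis 1 mixes *distant* pixels -- a dilated, global interaction.
    """
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

# Toy 8x8 single-channel image whose values encode pixel positions.
x = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
local_groups = block_partition(x, 4)   # 4 groups of 16 adjacent pixels
global_groups = grid_partition(x, 4)   # 4 groups of 16 image-spanning pixels
```

Because both partitions are plain reshapes over fixed-size groups, the per-group MLP cost stays linear in image size and the whole operation behaves like a convolution in the sense that it applies identically at any resolution, matching the "global yet fully-convolutional" property claimed in the abstract.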
