Paper Title
Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search
Paper Authors
Paper Abstract
Despite transformers' impressive accuracy, their computational cost is often prohibitive to use with limited computational resources. Most previous approaches to improve inference efficiency require a separate model for each possible computational budget. In this paper, we extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training. We train a transformer with LengthDrop, a structural variant of dropout, which stochastically determines a sequence length at each layer. We then conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget. Additionally, we significantly extend the applicability of PoWER-BERT beyond sequence-level classification into token-level classification with Drop-and-Restore process that drops word-vectors temporarily in intermediate layers and restores at the last layer if necessary. We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups, including span-based question answering and text classification. Code is available at https://github.com/clovaai/length-adaptive-transformer.
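The abstract describes LengthDrop only at a high level: at training time, a length configuration is sampled so that the sequence of word vectors can only shrink from layer to layer, and the least significant vectors are pruned as in PoWER-BERT. The snippet below is a minimal sketch of that idea under stated assumptions; the uniform sampling rule, the drop probability `p`, and the helper names `sample_length_config` and `prune_hidden_states` are illustrative choices, not the authors' reference implementation, and the per-token `scores` tensor stands in for PoWER-BERT-style attention significance.

```python
import torch

def sample_length_config(seq_len, num_layers, p=0.2):
    """Sample a per-layer sequence-length configuration (LengthDrop sketch).

    Assumption: at each layer the retained length is drawn uniformly from
    [(1 - p) * prev_len, prev_len], so the sequence monotonically shrinks
    as it moves up the stack.
    """
    lengths = []
    prev = seq_len
    for _ in range(num_layers):
        low = max(1, int((1 - p) * prev))
        prev = torch.randint(low, prev + 1, (1,)).item()
        lengths.append(prev)
    return lengths

def prune_hidden_states(hidden, scores, keep_len):
    """Keep the `keep_len` most significant word vectors per example.

    `hidden` has shape (batch, seq_len, dim); `scores` has shape
    (batch, seq_len) and stands in for an attention-based significance
    measure. Token order is preserved after pruning.
    """
    idx = scores.topk(keep_len, dim=-1).indices           # (batch, keep_len)
    idx, _ = idx.sort(dim=-1)                             # keep original order
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
```

At inference time, the same pruning step would be driven by a fixed length configuration found by the multi-objective evolutionary search rather than by random sampling; for token-level tasks, the dropped vectors would additionally be restored at the last layer (Drop-and-Restore), which this sketch omits.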