Paper Title
AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
Paper Authors
Paper Abstract
Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks, but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast-evolving models, considering serving performance, and optimizing for multiple objectives. To solve these problems, we propose AutoDistill, an end-to-end model distillation framework that integrates model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. We use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on the target hardware. Experiments on TPUv4i show that AutoDistill finds seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT. On downstream NLP tasks in the GLUE benchmark, the model distilled by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT_BASE, DistilBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperforms BERT_BASE (109M), DistilBERT (67M), TinyBERT (67M), and MobileBERT (25.3M) on the average GLUE score. When evaluated on SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, reducing parameters by more than 62% while maintaining higher accuracy than DistilBERT, TinyBERT, and NAS-BERT.
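The abstract's central technical step is a multi-objective Neural Architecture Search driven by Bayesian Optimization that trades off prediction accuracy against serving latency on the target hardware. The sketch below is a minimal, illustrative stand-in for that idea, not the paper's implementation: it scalarizes the two objectives into a single score and uses a Gaussian-process surrogate with a UCB acquisition over a tiny hypothetical student search space. The candidate space, the proxy accuracy/latency functions, the weights, and the latency budget are all assumptions for illustration; the actual flow in the paper evaluates real distillation pre-training accuracy and measured latency on TPUv4i hardware.

```python
# Illustrative sketch only: a scalarized Bayesian-Optimization search over
# student architectures. Search space, proxy metrics, and acquisition rule
# are assumptions for illustration, not AutoDistill's actual implementation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Hypothetical discrete student search space: (layers, hidden size, heads).
CANDIDATES = np.array([[l, h, a]
                       for l in (4, 6, 8, 12)
                       for h in (256, 384, 512, 768)
                       for a in (4, 8, 12)], dtype=float)

def proxy_accuracy(arch):
    # Placeholder: the real flow would run distillation pre-training and eval.
    l, h, a = arch
    return 0.70 + 0.02 * np.log(l) + 0.03 * np.log(h / 256) - 0.001 * a

def proxy_latency_ms(arch):
    # Placeholder: the real flow would measure latency on the target hardware.
    l, h, a = arch
    return 0.05 * l * (h / 256) ** 2 + 0.1 * a

def scalarize(acc, lat, lat_budget_ms=6.0):
    # Weighted objective: reward accuracy, penalize exceeding a latency budget.
    return acc - 0.05 * max(0.0, lat - lat_budget_ms)

# Seed the surrogate with a few random evaluations.
X = list(rng.choice(CANDIDATES, size=5, replace=False))
y = [scalarize(proxy_accuracy(x), proxy_latency_ms(x)) for x in X]

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
for _ in range(20):
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(CANDIDATES, return_std=True)
    ucb = mu + 1.0 * sigma                      # upper-confidence-bound acquisition
    evaluated = {tuple(x) for x in X}
    order = np.argsort(-ucb)                    # best acquisition first
    nxt = next(CANDIDATES[i] for i in order     # skip already-evaluated points
               if tuple(CANDIDATES[i]) not in evaluated)
    X.append(nxt)
    y.append(scalarize(proxy_accuracy(nxt), proxy_latency_ms(nxt)))

best = X[int(np.argmax(y))]
print("best (layers, hidden, heads):", best,
      "| proxy accuracy:", round(proxy_accuracy(best), 4),
      "| proxy latency (ms):", round(proxy_latency_ms(best), 2))
```

Note that the scalarization above collapses the two objectives into one score for simplicity; a genuinely multi-objective search, as described in the abstract, would instead maintain a Pareto front of architectures so that accuracy/latency trade-offs can be selected per deployment target.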