Paper Title
Task-Specific Expert Pruning for Sparse Mixture-of-Experts
Paper Authors
Paper Abstract
The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results due to its model capacity. However, with trillions of parameters, MoE is hard to deploy in cloud or mobile environments. MoE inference requires expert parallelism, which is not hardware-friendly and is communication-expensive. Especially for resource-limited downstream tasks, such a sparse structure has to sacrifice a great deal of computational efficiency for limited performance gains. In this work, we observe that most experts contribute very little to MoE fine-tuning and inference. We further propose a general method to progressively drop the non-professional experts for the target downstream task, which preserves the benefits of MoE while reducing the MoE model to a single-expert dense model. Our experiments reveal that the fine-tuned single-expert model preserves 99.3% of the benefits of MoE across six different types of tasks while enjoying a 2x inference speedup with zero communication cost.
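To make the pruning idea concrete, below is a minimal sketch (not the authors' released implementation) of how experts might be progressively dropped during task-specific fine-tuning: the router's selections are counted on the target task, and the least-selected experts are disabled step by step until a single expert remains, at which point the layer behaves like a dense FFN. The class and method names (PrunableMoELayer, drop_least_used, the usage buffer) and the top-1-routing, usage-count pruning criterion are illustrative assumptions, not details taken from the paper.

```python
# Sketch of task-specific expert pruning for an MoE FFN layer (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrunableMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        # Which experts are still active, and how often each has been selected.
        self.register_buffer("active", torch.ones(num_experts, dtype=torch.bool))
        self.register_buffer("usage", torch.zeros(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); top-1 routing restricted to active experts.
        logits = self.router(x).masked_fill(~self.active, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        top1 = probs.argmax(dim=-1)
        if self.training:
            # Track routing statistics on the target downstream task.
            self.usage += torch.bincount(top1, minlength=len(self.experts)).float()
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                out[mask] = probs[mask, idx].unsqueeze(-1) * expert(x[mask])
        return out

    def drop_least_used(self, num_to_drop: int = 1) -> None:
        """Progressively disable the least-selected active experts."""
        ranked = self.usage.masked_fill(~self.active, float("inf")).argsort()
        for idx in ranked[:num_to_drop]:
            if self.active.sum() > 1:  # always keep at least one expert
                self.active[idx] = False


# Usage: interleave fine-tuning steps with pruning steps until one expert remains;
# the surviving expert then acts as a single dense FFN at inference time.
layer = PrunableMoELayer(d_model=16, d_ff=32, num_experts=8)
layer.train()
layer(torch.randn(64, 16))   # collect routing statistics
layer.drop_least_used(4)     # drop the four least-used experts
layer.eval()
print(layer.active)
```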