Paper Title
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
Paper Authors
Paper Abstract
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms BERT on SuperGLUE, but trains and runs nearly twice as fast. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and hyperparameters. Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.
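To make the combination described in the abstract concrete, below is a minimal, illustrative sketch of an encoder block that pairs a parameter-free linear token-mixing sublayer with a sparsely gated top-1 MoE feed-forward sublayer. This is not the paper's exact architecture: the FNet-style Fourier mixing, top-1 routing, layer placement, and all dimensions and function names are assumptions chosen for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Standard layer normalization over the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def linear_mixing(x):
    """Parameter-free token mixing via a 2D Fourier transform (FNet-style),
    keeping only the real part; one possible choice of linear mixing."""
    return np.real(np.fft.fft2(x, axes=(-2, -1)))

def top1_moe_ffn(x, expert_w1, expert_w2, router_w):
    """Sparsely gated feed-forward: each token is routed to its top-1 expert."""
    logits = x @ router_w                            # [seq_len, num_experts]
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    expert_idx = probs.argmax(-1)                    # top-1 routing decision per token
    out = np.zeros_like(x)
    for e in range(expert_w1.shape[0]):
        mask = expert_idx == e
        if mask.any():
            h = np.maximum(x[mask] @ expert_w1[e], 0.0)       # expert FFN (ReLU)
            out[mask] = (h @ expert_w2[e]) * probs[mask, e:e + 1]
    return out

def sparse_mixer_block(x, params):
    """One encoder block: linear mixing sublayer + sparse MoE feed-forward,
    each wrapped with a residual connection and layer norm."""
    x = layer_norm(x + linear_mixing(x))
    x = layer_norm(x + top1_moe_ffn(x, params["w1"], params["w2"], params["router"]))
    return x

# Toy usage (hypothetical sizes): 8 tokens, model dim 16, 4 experts, hidden dim 32.
rng = np.random.default_rng(0)
params = {
    "w1": rng.normal(size=(4, 16, 32)) * 0.02,
    "w2": rng.normal(size=(4, 32, 16)) * 0.02,
    "router": rng.normal(size=(16, 4)) * 0.02,
}
tokens = rng.normal(size=(8, 16))
print(sparse_mixer_block(tokens, params).shape)  # (8, 16)
```

The sketch only illustrates why the combination is cheap: the mixing sublayer has no attention-style quadratic parameter cost, and each token activates a single expert, so per-token compute stays close to that of a dense FFN while total capacity scales with the number of experts.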