Paper Title

A Theoretical View on Sparsely Activated Networks

Paper Authors

Cenk Baykal, Nishanth Dikkala, Rina Panigrahy, Cyrus Rashtchian, Xin Wang

Paper Abstract

Deep and wide neural networks successfully fit very complex functions today, but dense models are starting to be prohibitively expensive for inference. To mitigate this, one promising direction is networks that activate a sparse subgraph of the network. The subgraph is chosen by a data-dependent routing function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture of Experts (MoE) paradigm in Switch Transformers). However, prior work is largely empirical, and while existing routing functions work well in practice, they do not lead to theoretical guarantees on approximation ability. We aim to provide a theoretical explanation for the power of sparse networks. As our first contribution, we present a formal model of data-dependent sparse networks that captures salient aspects of popular architectures. We then introduce a routing function based on locality sensitive hashing (LSH) that enables us to reason about how well sparse networks approximate target functions. After representing LSH-based sparse networks with our model, we prove that sparse networks can match the approximation power of dense networks on Lipschitz functions. Applying LSH on the input vectors means that the experts interpolate the target function in different subregions of the input space. To support our theory, we define various datasets based on Lipschitz target functions, and we show that sparse networks give a favorable trade-off between the number of active units and approximation quality.
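To make the routing idea concrete, here is a minimal Python sketch (not the authors' implementation) of LSH-based routing using random hyperplanes (SimHash): each input hashes to one of 2^k buckets, each bucket owns one small expert, and only that expert is activated. The class name, expert architecture, and all parameters below are hypothetical, chosen only to illustrate that nearby inputs land in the same bucket and are therefore handled by the same expert.

import numpy as np

rng = np.random.default_rng(0)

class LSHRoutedNetwork:
    """Illustrative sparsely activated network with SimHash routing."""

    def __init__(self, input_dim, hidden_dim, num_bits):
        self.num_bits = num_bits
        # Random hyperplanes defining the hash: the sign pattern of the
        # projections gives a num_bits-bit bucket index.
        self.hyperplanes = rng.standard_normal((num_bits, input_dim))
        # One tiny two-layer ReLU expert per bucket (2**num_bits experts).
        n_experts = 2 ** num_bits
        self.W1 = rng.standard_normal((n_experts, hidden_dim, input_dim)) / np.sqrt(input_dim)
        self.W2 = rng.standard_normal((n_experts, hidden_dim)) / np.sqrt(hidden_dim)

    def route(self, x):
        # Hash x to a bucket: bit i is 1 iff x lies on the positive side
        # of hyperplane i. Nearby points mostly share all bits, so they
        # are routed to the same expert (a subregion of input space).
        bits = (self.hyperplanes @ x > 0).astype(int)
        return int("".join(map(str, bits)), 2)

    def forward(self, x):
        # Only the routed expert is activated; all other experts stay
        # idle, so inference cost does not grow with the expert count.
        e = self.route(x)
        hidden = np.maximum(self.W1[e] @ x, 0.0)  # ReLU
        return self.W2[e] @ hidden

net = LSHRoutedNetwork(input_dim=8, hidden_dim=16, num_bits=4)
x = rng.standard_normal(8)
print("expert:", net.route(x), "output:", net.forward(x))

In this toy setup each expert effectively interpolates the target function on one hash bucket, which mirrors the paper's observation that applying LSH to the input vectors partitions the input space into subregions, one per expert.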
