Paper Title
Serverless Straggler Mitigation using Local Error-Correcting Codes
Authors
Abstract
Inexpensive cloud services, such as serverless computing, are often vulnerable to straggling nodes that increase the end-to-end latency of distributed computation. We propose and implement simple yet principled approaches to straggler mitigation for matrix multiplication in serverless systems, and evaluate them on several common applications from machine learning and high-performance computing. The proposed schemes are inspired by error-correcting codes and use serverless workers to encode and decode, in parallel, the data stored in the cloud. This creates a fully distributed computing framework that needs no master node for encoding or decoding, which removes the computation, communication, and storage bottlenecks at the master. On the theory side, we establish that our proposed scheme is asymptotically optimal in terms of decoding time and provide a lower bound on the number of stragglers it can tolerate with high probability. Through extensive experiments, we show that our scheme outperforms existing schemes, such as speculative execution and other coding-theoretic methods, by at least 25%.
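To make the idea of a locally decodable code for matrix multiplication concrete, the following is a minimal sketch rather than the paper's actual scheme. It assumes a single-parity code over small groups of row blocks of a matrix A: each group gets one extra parity block (the sum of its blocks), every block is multiplied by x by a separate worker, and one missing result per group is recovered from that group's results alone. The function names, the group size, and the simulated straggler pattern are all hypothetical.

```python
# Illustrative sketch only: a single-parity "local" code over row blocks of A.
# Group size, block layout, and all function names are assumptions for this
# example, not the scheme defined in the paper.

def matvec(block, x):
    """Multiply a block (list of rows) by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in block]

def add_blocks(blocks):
    """Element-wise sum of equally shaped blocks."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*blocks)]

def encode(blocks, group_size):
    """Split blocks into groups and append one parity block to each group."""
    groups = []
    for i in range(0, len(blocks), group_size):
        group = blocks[i:i + group_size]
        groups.append(group + [add_blocks(group)])
    return groups

def decode_group(results):
    """Recover at most one missing result (None) in a group from its parity.

    `results` holds the outputs for the data blocks, followed by the parity output.
    """
    missing = [i for i, r in enumerate(results) if r is None]
    if len(missing) > 1:
        raise ValueError("a single-parity group tolerates only one straggler")
    if missing and missing[0] != len(results) - 1:
        recovered = list(results[-1])  # start from the parity output
        for i, r in enumerate(results[:-1]):
            if i != missing[0]:
                recovered = [p - v for p, v in zip(recovered, r)]
        results[missing[0]] = recovered
    return results[:-1]  # drop the parity output, keep the data outputs

# Usage: four one-row blocks, groups of two, one simulated straggler per group.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
groups = encode([[row] for row in A], group_size=2)

outputs = []
for g, group in enumerate(groups):
    results = [matvec(block, x) for block in group]  # one "worker" per block
    results[g % 2] = None                            # pretend this worker straggled
    for r in decode_group(results):
        outputs.extend(r)

print(outputs)
```

Because each group is decoded from its own results only, decoding can itself be spread across workers instead of being funneled through a master node, which is the property the abstract emphasizes.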