Paper Title
Prediction of GPU Failures Under Deep Learning Workloads
Authors
Abstract
Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. GPU failures, however, are inevitable and have severe consequences for DL tasks: they disrupt distributed training jobs, crash inference services, and cause service level agreement violations. To mitigate the problems caused by GPU failures, we propose predicting failures with machine learning (ML) models. This paper is the first to study prediction models for GPU failures under large-scale production deep learning workloads. As a starting point, we evaluate classic prediction models and observe that their predictions are both inaccurate and unstable. To improve the precision and stability of predictions, we propose several techniques, including parallel and cascade model-ensemble mechanisms and a sliding training method. We evaluate the performance of these techniques on a four-month production dataset containing 350 million entries. The results show that our proposed techniques improve the prediction precision from 46.3% to 84.0%.
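The abstract names a sliding training method, a parallel model ensemble, and precision as the evaluation metric, but gives no implementation details. The sketch below illustrates one plausible reading of these ideas and is not the authors' method. The function names (`sliding_windows`, `unanimous_vote`, `precision`), the day-based window sizes, and the unanimous-agreement voting rule are all assumptions introduced here for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def sliding_windows(
    num_days: int, train_days: int, test_days: int = 1
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) day ranges; the training window slides forward
    so each model is fit only on the most recent `train_days` of data.
    (Assumed scheme, not taken from the paper.)"""
    start = 0
    while start + train_days + test_days <= num_days:
        train = range(start, start + train_days)
        test = range(start + train_days, start + train_days + test_days)
        yield train, test
        start += test_days


def unanimous_vote(predictions: Sequence[Sequence[bool]]) -> List[bool]:
    """Combine per-model predictions: flag a GPU as failing only when
    every model flags it. (Assumed combination rule for a parallel
    ensemble; it trades recall for precision.)"""
    return [all(votes) for votes in zip(*predictions)]


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    # Five days of data, three-day training window, one-day test window.
    for train, test in sliding_windows(5, 3):
        print(list(train), "->", list(test))

    # Two hypothetical models' predictions for three GPUs.
    combined = unanimous_vote([[True, True, False], [True, False, False]])
    print(combined, precision(combined, [True, False, True]))
```

Requiring agreement between models suppresses alarms raised by only one model, which is one way an ensemble could raise precision at the cost of missing some failures. Retraining on a sliding window keeps each model fit to recent data.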