Paper Title
A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters
Paper Authors
Paper Abstract
Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed in a distributed manner on specialized hardware such as graphics processing units (GPUs). In the academic field, researchers access this kind of resource through High Performance Computing (HPC) clusters. Such infrastructures make the training of these models difficult due to their multi-user nature and limited user permissions. In addition, different HPC clusters may have peculiarities that complicate the research cycle (e.g., library dependencies). In this paper we develop a workflow and methodology for the distributed training of deep learning models in HPC clusters that provides researchers with a series of novel advantages. It relies on udocker as the containerization tool and on Horovod as the library for distributing the models across multiple GPUs. udocker does not need any special permissions, allowing researchers to run the entire workflow without relying on any administrator. Horovod ensures the efficient distribution of the training independently of the deep learning framework used. Additionally, thanks to containerization and specific features of the workflow, it provides researchers with a cluster-agnostic way of running their models. The experiments carried out show that the workflow offers good scalability in the distributed training of the models and that it adapts easily to different clusters.
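To make the distribution pattern concrete, the following is a minimal sketch of Horovod-based data-parallel training using PyTorch. It is illustrative only, not the paper's actual training code: the model, data, and hyperparameters are placeholders.

```python
# Minimal sketch of Horovod data-parallel training (PyTorch backend).
# One process is launched per GPU; gradients are averaged across all
# processes by Horovod's ring-allreduce.
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()                                # initialize Horovod
torch.cuda.set_device(hvd.local_rank())   # pin each process to its local GPU

# Placeholder model and optimizer; learning rate is scaled by the number
# of workers, following the usual Horovod recipe.
model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradient averaging happens across all GPUs.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start all workers from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 10).cuda()        # placeholder batch
    y = torch.randn(32, 1).cuda()
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()                       # gradients are allreduced here
    optimizer.step()
```

Each GPU runs one copy of this script (e.g., launched with `horovodrun -np 4 python train.py`); in the workflow described in the paper, these processes would run inside udocker containers on the cluster's compute nodes.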