Paper Title


ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning

Authors

Sheng-Chun Kao, Geonhwa Jeong, Tushar Krishna

Abstract


DNN accelerators provide efficiency by leveraging reuse of activations/weights/outputs during the DNN computations to reduce data movement from DRAM to the chip. The reuse is captured by the accelerator's dataflow. While there has been significant prior work in exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory), given a dataflow, that can optimize for performance/energy while meeting platform constraints of area/power for the DNN(s) of interest is still relatively unexplored. The design space of choices for balancing compute and memory explodes combinatorially, as we show in this work (e.g., as large as O(10^72) choices for running MobileNet), making it infeasible to do manual tuning via exhaustive searches. It is also difficult to come up with a specific heuristic given that different DNNs and layer types exhibit different amounts of reuse. In this paper, we propose an autonomous strategy called ConfuciuX to find optimized HW resource assignments for a given model and dataflow style. ConfuciuX leverages a reinforcement learning method, REINFORCE, to guide the search process, leveraging a detailed HW performance cost model within the training loop to estimate rewards. We also augment the RL approach with a genetic algorithm for further fine-tuning. ConfuciuX demonstrates the highest sample-efficiency for training compared to other techniques such as Bayesian optimization, genetic algorithm, simulated annealing, and other RL methods. It converges to the optimized hardware configuration 4.7 to 24 times faster than alternate techniques.
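The core loop the abstract describes — a REINFORCE policy samples a discrete hardware resource assignment, a cost model scores it, and the reward updates the policy — can be sketched in a few lines. This is a minimal illustration under assumed details, not the paper's implementation: the design space (`PE_OPTIONS`), the `toy_cost_model`, and all hyperparameters are hypothetical stand-ins, whereas the real system searches per-layer compute/memory assignments and plugs a detailed HW performance/energy cost model into the training loop.

```python
import math
import random

# Hypothetical one-dimensional design space: number of processing elements (PEs).
# The paper's actual action space covers compute and buffer sizing per layer.
PE_OPTIONS = [16, 32, 64, 128]

def toy_cost_model(num_pes):
    # Stand-in for the detailed HW cost model used in the training loop.
    # By construction, 64 PEs is the sweet spot for this toy objective.
    return abs(num_pes - 64) / 64.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(iterations=800, lr=0.3, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(PE_OPTIONS)  # policy parameters (categorical)
    baseline = 0.0                    # running-mean reward baseline
    for _ in range(iterations):
        probs = softmax(logits)
        a = rng.choices(range(len(PE_OPTIONS)), weights=probs)[0]
        reward = -toy_cost_model(PE_OPTIONS[a])   # maximize negative cost
        baseline += 0.05 * (reward - baseline)
        adv = reward - baseline
        # REINFORCE update: d(log pi(a))/d(logit_j) = 1[j == a] - probs[j]
        for j in range(len(logits)):
            logits[j] += lr * adv * ((1.0 if j == a else 0.0) - probs[j])
    return logits

if __name__ == "__main__":
    logits = train()
    best = PE_OPTIONS[max(range(len(logits)), key=logits.__getitem__)]
    print(best)  # the policy mass should concentrate on 64 PEs
```

The GA fine-tuning stage the abstract mentions would then seed its initial population from configurations near the RL policy's mode and mutate/crossover from there; it is omitted here for brevity.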
