Paper Title
MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V cores
Paper Authors
Paper Abstract
Low-precision formats have recently driven major breakthroughs in neural network (NN) training and inference by reducing the memory footprint of NN models and improving the energy efficiency of the underlying hardware architectures. Narrow integer data types have been vastly investigated for NN inference and have successfully been pushed to the extreme of ternary and binary representations. In contrast, most training-oriented platforms use at least 16-bit floating-point (FP) formats. Lower-precision data types, such as 8-bit FP formats and mixed-precision techniques, have only recently been explored in hardware implementations. We present MiniFloat-NN, a RISC-V instruction set architecture extension for low-precision NN training, providing support for two 8-bit and two 16-bit FP formats and expanding operations. The extension includes sum-of-dot-product instructions that accumulate the result in a larger format, as well as three-term additions in two variations: expanding and non-expanding. We implement an ExSdotp unit to efficiently support both instruction types in hardware. The fused nature of the ExSdotp module prevents the precision losses generated by the non-associativity of two consecutive FP additions, while saving around 30% of the area and critical path compared to a cascade of two expanding fused multiply-add units. We replicate the ExSdotp module in a SIMD wrapper and integrate it into an open-source floating-point unit, which, coupled with an open-source RISC-V core, lays the foundation for future scalable architectures targeting low-precision and mixed-precision NN training. A cluster containing eight extended cores sharing a scratchpad memory, implemented in 12 nm FinFET technology, achieves up to 575 GFLOPS/W when computing FP8-to-FP16 GEMMs at 0.8 V, 1.26 GHz.
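To make the expanding sum-of-dot-product semantics concrete, below is a minimal emulation sketch in NumPy. It is not the paper's hardware or ISA definition: the function name `exsdotp` and its argument order are illustrative, FP16 inputs with an FP32 accumulator stand in for the FP8-to-FP16 case (NumPy has no native FP8 type), and the software emulation rounds each intermediate step, whereas the fused hardware unit described in the abstract rounds only once.

```python
import numpy as np

def exsdotp(a0, b0, a1, b1, c):
    """Emulate an expanding sum-of-dot-product: d = a0*b0 + a1*b1 + c.

    The multiplicands are narrow-format values (FP16 here), while the
    addend c and the result d are kept in the wider format (FP32 here).
    A fused hardware unit would perform a single rounding of the whole
    expression; this emulation rounds per operation, so results may
    differ in the last bit.
    """
    # Quantize the multiplicands to the narrow format.
    a0, b0, a1, b1 = (np.float16(x) for x in (a0, b0, a1, b1))
    # Expand to the wide format before multiplying and accumulating.
    return (np.float32(a0) * np.float32(b0)
            + np.float32(a1) * np.float32(b1)
            + np.float32(c))

# Example: two narrow-format products accumulated into a wide-format result.
print(exsdotp(0.1, 0.2, 0.3, 0.4, 1.0))
```

Accumulating in the wider format is what allows long GEMM reduction chains over 8-bit inputs to retain enough precision for training, which is the use case the cluster-level GFLOPS/W figure above refers to.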