Thesis Information

Chinese Title:

 卷积神经网络硬件加速模块的设计与实现 (Design and Implementation of a Hardware Acceleration Module for Convolutional Neural Networks)

Author:

 周润泽 (Zhou Runze)

Confidentiality Level:

 Public

Thesis Language:

 Chinese

Discipline Code:

 080714T

Discipline:

 Electronic Information Science and Technology

Student Type:

 Bachelor's candidate

Degree:

 Bachelor of Science

Degree Year:

 2022

University:

 Beijing Normal University

Campus:

 Beijing campus

School:

 School of Artificial Intelligence

Primary Supervisor:

 王建明 (Wang Jianming)

Primary Supervisor's Affiliation:

 School of Artificial Intelligence, Beijing Normal University

Submission Date:

 2022-05-27

Defense Date:

 2022-05-10

English Title:

 Design and Implementation of a Hardware Acceleration Module for Convolutional Neural Networks

Chinese Keywords:

 卷积神经网络 (convolutional neural network); FPGA; 硬件加速 (hardware acceleration); 流水线 (pipeline)

English Keywords:

 CNN; FPGA; hardware acceleration; pipeline

Chinese Abstract (translated):

Convolutional neural networks (CNNs) have developed rapidly thanks to their excellent performance on non-linear problems, but their high computational complexity restricts the scenarios in which they can be applied. Using Verilog HDL, this thesis proposes a design method that abstracts the layers of a CNN through parameterized design, improving code reusability. A CNN hardware acceleration module for handwritten digit recognition was designed and implemented: reusable CNN sub-modules were built with parameterized Verilog and validated on an FPGA, and the simulation and test results confirm the correctness of the design. The main contributions are: reducing the network's hardware resource consumption through fixed-point quantization and time-division multiplexing; raising data throughput with a pipeline and a double-buffered data structure; and designing a line-buffer convolution module with data reuse to keep the pipeline running efficiently. Experimental results show that the accelerator achieves a computational performance of 38.08 GOPS (giga operations per second); at a system clock frequency of 50 MHz, each image is recognized in 0.032 ms, recognition accuracy is above 97.5% in all tests, and the number of clock cycles required is 1/150 that of a comparable CPU. The tests further show that, with no significant loss in recognition accuracy, the ReLU activation function consumes 10% to 20% fewer resources than other non-linear activation functions, making ReLU the better choice of activation function for hardware implementation.
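The fixed-point quantization and ReLU observations in the abstract can be illustrated with a small software model. The sketch below is Python rather than the thesis's Verilog, and the Q7.8 format (16-bit signed, 8 fractional bits) is an assumed bit width chosen only for illustration; it shows why ReLU reduces to a sign check in fixed-point hardware:

```python
def to_fixed(x, frac_bits=8, total_bits=16):
    """Quantize a real value to signed fixed point (assumed Q7.8 format)."""
    scale = 1 << frac_bits
    q = int(round(x * scale))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, q))  # saturate on overflow

def fixed_mul(a, b, frac_bits=8):
    """Multiply two fixed-point values and rescale the product."""
    return (a * b) >> frac_bits

def relu_fixed(q):
    """ReLU in fixed point: only a sign check, no multiplier or lookup table."""
    return q if q > 0 else 0

a = to_fixed(1.5)    # 1.5 * 256 = 384
b = to_fixed(-0.25)  # -64
p = fixed_mul(a, b)  # -0.375 in Q7.8, i.e. -96
print(relu_fixed(p))  # 0: negative product is clamped
```

In hardware, `relu_fixed` needs only a comparator and a multiplexer, which is consistent with the abstract's finding that ReLU consumes markedly fewer resources than other non-linear activation functions.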

English Abstract:

Convolutional neural networks (CNNs) have developed rapidly due to their excellent performance on non-linear problems; however, their high computational complexity limits their application scenarios. This thesis proposes a parameterized design method, using Verilog HDL, that abstracts the layer structure of a CNN and improves code reusability. A CNN hardware acceleration module for handwritten digit recognition was designed and implemented: reusable CNN sub-modules were built with parameterized Verilog and verified on an FPGA. The main innovations of this thesis are: using fixed-point quantization and time-division multiplexing to reduce the network's hardware resource consumption; using a pipeline and a double-buffered data structure to improve data throughput; and designing a line-buffer convolution module with data reuse to keep the pipeline running efficiently. Experimental results show that the accelerator achieves a computational performance of 38.08 GOPS. At a system clock frequency of 50 MHz, it recognizes each image in 0.032 ms with an accuracy above 97.5%, requiring 1/150 the clock cycles of a comparable CPU. ReLU is more suitable as an activation function for hardware implementation, consuming 10% to 20% fewer resources than other non-linear activation functions with no significant loss in recognition accuracy.
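The line-buffer convolution with data reuse described in the abstract can be modeled in software. The sketch below is an illustrative Python model, not the thesis's Verilog; the 3×3 kernel and the streaming interface are assumptions. It shows how holding the two previous image rows in shift registers gives every incoming pixel a complete window, so the pipeline is never stalled waiting on memory:

```python
from collections import deque

def line_buffer_conv(pixels, width, kernel):
    """3x3 convolution over a row-major pixel stream using two line buffers.

    Each pixel enters exactly once; the two previous image rows live in
    shift registers (modeled as deques), so a full 3x3 window is ready
    every cycle -- the property that keeps a hardware pipeline fed.
    """
    row0 = deque([0] * width)   # row r-2
    row1 = deque([0] * width)   # row r-1
    window = [[0] * 3 for _ in range(3)]
    out = []
    for i, p in enumerate(pixels):
        # shift the 3x3 window left by one column
        for r in range(3):
            window[r][0], window[r][1] = window[r][1], window[r][2]
        # new rightmost column: heads of the line buffers + the fresh pixel
        window[0][2] = row0[0]
        window[1][2] = row1[0]
        window[2][2] = p
        # advance the line buffers by one pixel
        row0.popleft()
        row0.append(row1.popleft())
        row1.append(p)
        row_idx, col_idx = divmod(i, width)
        if row_idx >= 2 and col_idx >= 2:  # window fully inside the image
            out.append(sum(window[r][c] * kernel[r][c]
                           for r in range(3) for c in range(3)))
    return out
```

Each pixel is fetched from memory exactly once yet is reused in up to nine windows; that reuse is what allows the multiply-accumulate pipeline to produce one output per clock cycle.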

Total References:

 21

Total Figures:

 15

Total Tables:

 8

Catalog Number:

 本080714T/22002

Open Access Date:

 2023-05-27
