Chinese title: | Design and Implementation of a Hardware Acceleration Module for Convolutional Neural Networks |
Name: | |
Confidentiality level: | Public |
Thesis language: | Chinese |
Subject code: | 080714T |
Discipline: | |
Student type: | Bachelor |
Degree: | Bachelor of Science |
Degree year: | 2022 |
University: | Beijing Normal University |
Campus: | |
School: | |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2022-05-27 |
Defense date: | 2022-05-10 |
English title: | Design and Implementation of a Hardware Acceleration Module for Convolutional Neural Networks |
Chinese keywords: | |
English keywords: | CNN ; FPGA ; hardware acceleration ; pipeline |
Chinese abstract: |
Convolutional neural networks (CNNs) have developed rapidly thanks to their excellent performance on non-linear problems, but their high computational complexity limits their application scenarios. Using Verilog HDL, this thesis proposes a parameterized design method that abstracts the layers of a CNN to improve code reusability. A CNN hardware acceleration module for handwritten digit recognition was designed and implemented: reusable CNN sub-modules were built with Verilog parameterized design and verified by tests on an actual FPGA, and the simulation and test results confirm the correctness of the design. The main innovations of this thesis are: reducing the network's hardware resource consumption through fixed-point quantization and time-division multiplexing; improving the network's data throughput with a pipeline and a double-buffered data structure; and designing a line-buffer convolution module that reuses data to keep the pipeline running efficiently. Experimental results show that the accelerator reaches a computational performance of 38.08 GOPS (Giga Operations Per Second); at a system clock frequency of 50 MHz, each image is recognized in 0.032 ms with an accuracy above 97.5%, requiring 1/150 of the clock cycles of a comparable CPU. The tests further show that, with no significant loss in recognition rate, the ReLU activation function consumes 10% to 20% fewer resources than other non-linear activation functions, making ReLU better suited as the activation function for hardware implementation. |
English abstract: |
Convolutional neural networks (CNNs) have developed rapidly due to their excellent performance on non-linear problems, but their high computational complexity limits their application scenarios. This paper proposes a parameterized design method, using Verilog HDL, that abstracts the layer structure of a CNN to improve code reusability. A CNN hardware acceleration module for handwritten digit recognition was designed and implemented: reusable CNN sub-modules were built with Verilog parameterized design and verified on an FPGA. The main innovations of this paper are: using fixed-point quantization and time-division multiplexing to reduce the network's hardware resource consumption; using a pipeline and a double-buffered data structure to improve the network's data throughput; and designing a line-buffer convolution module that reuses data to keep the pipeline operating efficiently. Experimental results show that the accelerator achieves a computational performance of 38.08 GOPS. At a system clock frequency of 50 MHz, the module recognizes each image in 0.032 ms with an accuracy above 97.5%, requiring 1/150 of the clock cycles of a comparable CPU. ReLU is more suitable as an activation function for hardware implementation, consuming 10% to 20% fewer resources than other non-linear activation functions with no significant loss in recognition rate. |
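The abstract above names three hardware techniques: fixed-point quantization, a line-buffer convolution window, and ReLU activation. The following is a minimal software sketch of how these fit together, not the thesis's Verilog design; the 3×3 kernel size, the Q8 fixed-point format, and all function names are illustrative assumptions.

```python
# Illustrative software model (NOT the thesis's Verilog) of techniques from the
# abstract: a line-buffer sliding window for 3x3 convolution, fixed-point
# (integer) arithmetic, and ReLU. The Q8 format and kernel size are assumptions.
from collections import deque

FRAC_BITS = 8  # hypothetical Q8 format: weights scaled by 2**FRAC_BITS

def fixed(w):
    """Quantize a float weight to a fixed-point integer."""
    return int(round(w * (1 << FRAC_BITS)))

def conv3x3_line_buffer(image, kernel):
    """Stream `image` row by row, keeping at most three rows resident,
    the way a hardware line buffer feeds a 3x3 window each cycle."""
    w = len(image[0])
    k = [[fixed(v) for v in row] for row in kernel]
    rows = deque(maxlen=3)          # line buffers: only 3 rows stored at once
    out = []
    for row in image:
        rows.append(row)
        if len(rows) < 3:
            continue                # window not yet full vertically
        out_row = []
        for x in range(w - 2):      # slide the 3x3 window horizontally
            acc = 0
            for dy in range(3):
                for dx in range(3):
                    acc += rows[dy][x + dx] * k[dy][dx]
            # rescale from fixed point, then apply ReLU
            out_row.append(max(acc >> FRAC_BITS, 0))
        out.append(out_row)
    return out
```

The line buffer is why the pipeline stays busy: each input pixel is read once but reused by up to nine window positions, so the multiply-accumulate units never wait on redundant memory reads.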
Total references: | 21 |
Total figures: | 15 |
Total tables: | 8 |
Call number: | 本080714T/22002 |
Open-access date: | 2023-05-27 |