Thesis Information

Chinese Title:

 卷积神经网络硬件加速模块的设计与实现 (Design and Implementation of a Hardware Acceleration Module for Convolutional Neural Networks)

Author:

 周润泽 (Zhou Runze)

Confidentiality Level:

 Public

Thesis Language:

 Chinese

Discipline Code:

 080714T

Discipline:

 Electronic Information Science and Technology

Student Type:

 Bachelor's candidate

Degree:

 Bachelor of Science

Degree Year:

 2022

University:

 Beijing Normal University

Campus:

 Beijing campus

School:

 School of Artificial Intelligence

Primary Supervisor:

 王建明 (Wang Jianming)

Primary Supervisor's Affiliation:

 School of Artificial Intelligence, Beijing Normal University

Submission Date:

 2022-05-27

Defense Date:

 2022-05-10

English Title:

 Design and Implementation of a Hardware Acceleration Module for Convolutional Neural Networks

Chinese Keywords:

 卷积神经网络 (convolutional neural network); FPGA; 硬件加速 (hardware acceleration); 流水线 (pipeline)

English Keywords:

 CNN; FPGA; hardware acceleration; pipeline

Chinese Abstract (translated):

Convolutional neural networks (CNNs) have developed rapidly thanks to their excellent performance on non-linear problems, but their high computational complexity restricts the scenarios in which they can be applied. Using Verilog HDL, this thesis proposes a design method that abstracts the layers of a CNN through parameterized design, improving code reusability. A CNN hardware acceleration module for handwritten digit recognition was designed and implemented: reusable CNN sub-modules were built with parameterized Verilog and validated on an FPGA, and the simulation and test results confirm the correctness of the design. The main contributions are: reducing the network's hardware resource consumption through fixed-point quantization and time-division multiplexing; raising data throughput with a pipeline and a double-buffered data structure; and designing a line-buffer convolution module with data reuse to keep the pipeline running efficiently. Experimental results show that the accelerator achieves a computational performance of 38.08 GOPS (giga operations per second); at a system clock frequency of 50 MHz, each image is recognized in 0.032 ms, recognition accuracy is above 97.5% in all tests, and the number of clock cycles required is 1/150 that of a comparable CPU. The tests further show that, with no significant loss in recognition accuracy, the ReLU activation function consumes 10% to 20% fewer resources than other non-linear activation functions, making ReLU the better choice of activation function for hardware implementation.
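The fixed-point quantization and ReLU observations in the abstract can be illustrated with a small software model. The sketch below is Python rather than the thesis's Verilog, and the Q7.8 format (16-bit signed, 8 fractional bits) is an assumed bit width chosen only for illustration; it shows why ReLU reduces to a sign check in fixed-point hardware:

```python
def to_fixed(x, frac_bits=8, total_bits=16):
    """Quantize a real value to signed fixed point (assumed Q7.8 format)."""
    scale = 1 << frac_bits
    q = int(round(x * scale))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, q))  # saturate on overflow

def fixed_mul(a, b, frac_bits=8):
    """Multiply two fixed-point values and rescale the product."""
    return (a * b) >> frac_bits

def relu_fixed(q):
    """ReLU in fixed point: only a sign check, no multiplier or lookup table."""
    return q if q > 0 else 0

a = to_fixed(1.5)    # 1.5 * 256 = 384
b = to_fixed(-0.25)  # -64
p = fixed_mul(a, b)  # -0.375 in Q7.8, i.e. -96
print(relu_fixed(p))  # 0: negative product is clamped
```

In hardware, `relu_fixed` needs only a comparator and a multiplexer, which is consistent with the abstract's finding that ReLU consumes markedly fewer resources than other non-linear activation functions.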

English Abstract:

Convolutional neural networks (CNNs) have developed rapidly due to their excellent performance on non-linear problems; however, their high computational complexity limits their application scenarios. This thesis proposes a parameterized design method, using Verilog HDL, that abstracts the layer structure of a CNN and improves code reusability. A CNN hardware acceleration module for handwritten digit recognition was designed and implemented: reusable CNN sub-modules were built with parameterized Verilog and verified on an FPGA. The main innovations of this thesis are: using fixed-point quantization and time-division multiplexing to reduce the network's hardware resource consumption; using a pipeline and a double-buffered data structure to improve data throughput; and designing a line-buffer convolution module with data reuse to keep the pipeline running efficiently. Experimental results show that the accelerator achieves a computational performance of 38.08 GOPS. At a system clock frequency of 50 MHz, it recognizes each image in 0.032 ms with an accuracy above 97.5%, requiring 1/150 the clock cycles of a comparable CPU. ReLU is more suitable as an activation function for hardware implementation, consuming 10% to 20% fewer resources than other non-linear activation functions with no significant loss in recognition accuracy.
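The line-buffer convolution with data reuse described in the abstract can be modeled in software. The sketch below is an illustrative Python model, not the thesis's Verilog; the 3×3 kernel and the streaming interface are assumptions. It shows how holding the two previous image rows in shift registers gives every incoming pixel a complete window, so the pipeline is never stalled waiting on memory:

```python
from collections import deque

def line_buffer_conv(pixels, width, kernel):
    """3x3 convolution over a row-major pixel stream using two line buffers.

    Each pixel enters exactly once; the two previous image rows live in
    shift registers (modeled as deques), so a full 3x3 window is ready
    every cycle -- the property that keeps a hardware pipeline fed.
    """
    row0 = deque([0] * width)   # row r-2
    row1 = deque([0] * width)   # row r-1
    window = [[0] * 3 for _ in range(3)]
    out = []
    for i, p in enumerate(pixels):
        # shift the 3x3 window left by one column
        for r in range(3):
            window[r][0], window[r][1] = window[r][1], window[r][2]
        # new rightmost column: heads of the line buffers + the fresh pixel
        window[0][2] = row0[0]
        window[1][2] = row1[0]
        window[2][2] = p
        # advance the line buffers by one pixel
        row0.popleft()
        row0.append(row1.popleft())
        row1.append(p)
        row_idx, col_idx = divmod(i, width)
        if row_idx >= 2 and col_idx >= 2:  # window fully inside the image
            out.append(sum(window[r][c] * kernel[r][c]
                           for r in range(3) for c in range(3)))
    return out
```

Each pixel is fetched from memory exactly once yet is reused in up to nine windows; that reuse is what allows the multiply-accumulate pipeline to produce one output per clock cycle.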

Total References:

 21

Total Figures:

 15

Total Tables:

 8

Catalog Number:

 本080714T/22002

Open Access Date:

 2023-05-27
