Thesis Information

Title (Chinese):

 Introduction to the CUDA Parallel Computing Architecture and Performance Optimization

Name:

 周振兴 (Zhou Zhenxing)

Confidentiality Level:

 Public

Discipline Code:

 081202

Discipline/Major:

 Computer Software and Theory

Student Type:

 Master's

Degree:

 Master of Engineering

Degree Year:

 2009

Campus:

 Beijing Campus

School:

 School of Mathematical Sciences

Research Direction:

 Parallel Computing

First Supervisor:

 孙魁明 (Sun Kuiming)

First Supervisor's Institution:

 Beijing Normal University

Submission Date:

 2009-06-23

Defense Date:

 2009-06-05

Chinese Abstract:
In the field of parallel computing, long dominated by Intel, the high cost of building parallel computing centers has kept many smaller research groups from ever enjoying faster computation. The emergence of CUDA parallel computing makes low-cost, high-efficiency parallel computing centers possible. CUDA is a general-purpose parallel computing architecture introduced by NVIDIA; the underlying hardware it uses is the GPU microprocessor found on ordinary graphics cards. Modern general-purpose GPUs can exceed general-purpose CPUs in floating-point throughput by a factor of several hundred, and under the CUDA architecture the GPU can solve complex computational problems efficiently. This thesis first describes the basic CUDA architecture at a conceptual level [1], compares GPU processing power with that of the CPU, and explains in detail how to write and run parallel programs with CUDA. It then examines CUDA's underlying architecture and execution model from the perspectives of memory and thread scheduling [6], and from this derives two basic approaches to optimizing CUDA programs: memory-efficiency optimization and thread-scheduling optimization. Using CUDA memory correctly and designing a sensible thread layout have a significant impact on CUDA program performance. In Chapter 5, concrete programs are implemented to compare the floating-point performance of CUDA and a general-purpose CPU, and to show how to optimize a CUDA program as far as possible by optimizing memory usage and adjusting the thread structure.
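To make the launch model the abstract refers to concrete, the following is a minimal sketch (illustrative only, not code from the thesis) of a CUDA kernel launched with an explicit grid/block thread configuration; the kernel name vecAdd, the vector size, and the block size of 256 threads are assumptions chosen for this example.

// Minimal sketch: launching a CUDA kernel with an explicit grid/block
// configuration. Names and sizes are illustrative, not taken from the thesis.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Thread configuration: 256 threads per block, enough blocks to cover n.
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(da, db, dc, n);
    cudaDeviceSynchronize();

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}

The block size is typically chosen as a multiple of the warp size (32) so that no hardware threads within a warp sit idle, which is one aspect of the thread-scheduling optimization the thesis discusses.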
English Abstract:
In the field of parallel computing, long dominated by Intel, high-performance parallel computing centers cost a great deal of money, so many smaller research groups have never been able to enjoy faster computing. The emergence of CUDA parallel computing makes low-cost, high-efficiency parallel computing centers attainable. CUDA is a general-purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA graphics processing units (GPUs) to solve many complex computational problems in a fraction of the time required on a CPU. To program for the CUDA architecture, developers can use C, one of the most widely used high-level programming languages, and run their code with high performance on a CUDA-enabled processor. This thesis first describes the basic structure of CUDA at a conceptual level, compares GPU and CPU processing power, and gives a detailed description of how to write and run parallel programs with CUDA. It then examines CUDA's underlying architecture and execution model from the perspectives of memory and thread scheduling, which leads to two basic ways of optimizing CUDA programs: improving memory efficiency and improving thread scheduling. CUDA memory is divided into registers, shared memory, global memory, and constant memory. Registers and shared memory offer fast reads and writes, while global memory and constant memory have relatively high latency, global memory especially. Reasonable and effective use of these memory spaces has a great impact on CUDA performance, as does efficient thread scheduling. In Chapter 5, three groups of programs are designed to compare GPU and CPU performance and to show how thread scheduling affects performance, with advice on how to optimize memory usage and adjust the thread structure to get the most out of a CUDA program.
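As a companion to the memory-hierarchy description above, here is a minimal sketch (again illustrative, not from the thesis) of a block-wise sum reduction that stages data in fast on-chip shared memory so that each element is read from high-latency global memory only once; the kernel name blockSum and the sizes used are assumptions for this example.

// Sketch illustrating the memory hierarchy the abstract describes: each block
// stages its data in fast on-chip shared memory instead of repeatedly touching
// high-latency global memory. Illustrative only, not code from the thesis.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Block-wise sum reduction: global memory is read once per element;
// all intermediate work happens in shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];              // fast on-chip shared memory
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;      // single global-memory load
    __syncthreads();

    // Tree reduction entirely inside shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0]; // one global-memory store per block
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *hin = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) hin[i] = 1.0f;

    float *din, *dout;
    cudaMalloc((void **)&din, n * sizeof(float));
    cudaMalloc((void **)&dout, blocks * sizeof(float));
    cudaMemcpy(din, hin, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads>>>(din, dout, n);

    float *hout = (float *)malloc(blocks * sizeof(float));
    cudaMemcpy(hout, dout, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += hout[b];
    printf("sum = %.0f (expected %d)\n", total, n);

    cudaFree(din); cudaFree(dout); free(hin); free(hout);
    return 0;
}

Compared with accumulating directly in global memory, this pattern keeps intermediate traffic on-chip, which is the kind of memory-efficiency optimization the abstract names.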
Total References:

 10

Library Location:

 Thesis Reading Area (Main Building, South Wing, 3rd Floor, Sections B and C)

Call Number:

 硕081202/0921

Open Access Date:

 2009-06-23
