Thesis Information

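Chinese title:	Research on an Online Education Platform Based on Hadoop (基于Hadoop的在线教育平台研究)
Name:	Wang Xiaolei (王晓磊)
Discipline code:	085212
Discipline:	Software Engineering
Student type:	Master's
Degree:	Master of Engineering
Degree year:	2015
Campus:	Beijing campus
School:	School of Information Science and Technology
Research direction:	Applications of distributed systems
First supervisor:	Zhang Libao (张立保)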
First supervisor's affiliation:	Beijing Normal University, Information (北京师范大学信息)
Submission date:	2015-07-03
Defense date:	2015-05-23
Foreign-language title:	Distributed system architecture research and Optimization Based on Hadoop model
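Chinese abstract:	The twenty-first century is the age of data. Every second the Internet produces large volumes of electronic data, and since the arrival of Web 2.0 that data has become ever more varied: any online action may be converted into data and recorded. Faced with data at this scale, building a secure, reliable, and fast distributed data platform to process it is a serious challenge. This thesis first analyzes the design and composition of the GFS system, on whose principles Hadoop provides an open-source implementation. Hadoop's most basic components are HDFS, Map/Reduce (a distributed programming model), HBase (a distributed column-oriented database), and Zookeeper (a coordination system for large distributed systems). The thesis studies the design principles of these components, installs and configures them one by one, and starts them in coordination; once they start successfully, it runs tests on simulated gigabyte-scale data, combining the different algorithms of the different components. Based on this study and testing, and drawing on the characteristics of the Secondary Name Node (SNN), a separate physical machine was added as the SNN server to periodically merge the metadata files on the Namenode; this not only resolves the Namenode efficiency bottleneck in HDFS but also deals with the single-point-of-failure problem. Zookeeper provides the coordination service for the whole system and, while running, exchanges data with nearly every node. To improve Zookeeper's computational efficiency, a dedicated Zookeeper server was added to the Hadoop setup to address the CPU efficiency problem; in large-scale environments a Zookeeper cluster needs to be configured. Throughout installation, configuration, testing, and optimization, TB-scale data served as the basis for several different kinds of query tests, whose results were compared to analyze their relative strengths and weaknesses. Finally, the thesis summarizes the hidden risks in the design of the Hadoop system and proposes suggestions for further improvement; the optimization schemes have some reference and exploratory value for the development of Hadoop.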
Foreign-language abstract:	The 21st century is the era of data. At every moment the Internet produces large amounts of data, and since the start of the Web 2.0 era that data has grown more diverse: any online behavior may be converted into data and recorded. In the face of such a mass of data, building a safe, reliable, and fast distributed data platform to handle it is a severe test. This paper first analyzes the design and principles of the GFS system, on which Hadoop bases its open-source implementation. Hadoop's most basic components are HDFS (a distributed file system), Map/Reduce (a distributed programming model), HBase (a distributed, column-oriented, open-source database), and Zookeeper (a reliable coordination system for large distributed systems). The paper studies the design principles of these components, installs and configures them one by one, and starts them in coordination. After a successful start, it tests them on simulated gigabyte-scale data, combining the different algorithms of the different components. Building on these tests and on the features of the Secondary Name Node (SNN), a separate physical machine is added as the SNN server to periodically merge the metadata files on the Namenode, addressing both the single point of failure and the efficiency bottleneck of the HDFS Namenode. Zookeeper is responsible for coordinating the whole system and, while in service, exchanges data with almost all of the nodes. To improve Zookeeper's operational efficiency, a separate Zookeeper server is added to the Hadoop system to solve the CPU efficiency problem; in a large-scale environment, a Zookeeper cluster needs to be configured. Throughout installation, configuration, testing, and optimization, TB-scale data is used as the basis for query tests performed in several different ways, and the results are compared to analyze their pros and cons. Finally, the paper summarizes the existing problems in the design of the Hadoop system and puts forward suggestions for improvement; the optimization schemes have some reference value and exploratory significance for the development of Hadoop.
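Total references:	37
Author profile:	Research on the application of distributed technology to online education platforms
Call number:	硕430113/1516
Release date:	2015-07-03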
