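Chinese title: | 基于Hadoop平台的K-Means聚类算法的研究 (Research on the K-Means Clustering Algorithm Based on the Hadoop Platform) |
Name: | |
Discipline code: | 085208 |
Discipline and major: | |
Student type: | Master's |
Degree: | Master of Engineering |
Degree year: | 2015 |
Campus: | |
School: | |
Research direction: | Big data mining and cloud computing |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2015-06-14 |
Defense date: | 2015-05-24 |
English title: | RESEARCH OF K-MEANS CLUSTERING ALGORITHM BASED ON THE HADOOP PLATFORM |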
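Abstract (translated from Chinese): |
With the rapid development of computer technology, the volume of data being generated grows daily. More and more data can be obtained over the Internet, and these data contain rich and useful information and knowledge that we can exploit to analyze the current situation and future trends. Processing such massive data, however, requires a sufficiently powerful platform. Cloud computing emerged in this context: the data can be distributed across the “cloud”, that is, across clusters of inexpensive servers. Hadoop is one cloud computing platform and currently the most popular one; porting data mining algorithms that run on a single machine to Hadoop so that they can process massive data is a current research focus. Clustering is among the most commonly used data mining techniques: a clustering algorithm groups similar data into the same cluster according to some criterion. This thesis first introduces HDFS and MapReduce, the core frameworks of the Hadoop platform; HDFS is mainly responsible for storing massive data, and MapReduce for processing it. Second, it introduces K-Means, the most classic of the clustering algorithms, analyzes its shortcomings, and proposes an improvement: the Canopy algorithm is used as a preprocessing step before K-Means. The Canopy algorithm is then itself improved by applying the idea of the “minimum maximum principle”. Finally, the improved algorithm is implemented in MapReduce and verified experimentally. The results show that the MapReduce implementation improves the algorithm's efficiency, and that the improved algorithm increases accuracy and resolves the sensitivity of K-Means to noise points and outliers.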
|
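The Canopy-then-K-Means pipeline named in the abstract can be illustrated with a minimal single-machine sketch. The record does not give the thesis's exact procedure, so the code below rests on assumptions: `canopy_centers`, `kmeans`, and the threshold `t2` are hypothetical names, the max-min rule (pick the remaining candidate farthest from its nearest chosen center) is only one common reading of the “minimum maximum principle”, and the looser Canopy threshold T1 is omitted for brevity.

```python
import math

def canopy_centers(points, t2):
    """Canopy-style seeding (illustrative assumption, not the thesis's code).

    Starts from the first point, then repeatedly picks the remaining
    candidate whose distance to its nearest chosen center is largest.
    Candidates within t2 of a chosen center are dropped.
    """
    centers = [points[0]]
    candidates = [p for p in points[1:] if math.dist(p, centers[0]) > t2]
    while candidates:
        nxt = max(candidates, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
        candidates = [p for p in candidates if math.dist(p, nxt) > t2]
    return centers

def kmeans(points, centers, max_iter=100):
    """Plain K-Means starting from the given initial centers."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

In this sketch the number of seeds returned by `canopy_centers` fixes K, so K does not have to be chosen by hand, and isolated points tend to become their own small clusters rather than pulling other centers toward them.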
English abstract: |
With the rapid development of computer technology, the volume of data is growing ever faster. Large amounts of data can be obtained through the Internet, and much useful information and knowledge can be extracted from them to analyze the current situation and future trends. Handling such volumes, however, requires a sufficiently powerful platform. Cloud computing emerged in this context: massive data can be distributed on the cloud, that is, across clusters of low-cost servers. Hadoop is one of the cloud computing platforms and also the most popular. Combining Hadoop with data mining algorithms to process huge amounts of data has attracted many researchers and much attention. Clustering algorithms are among the most useful data mining algorithms; a clustering algorithm groups data according to a criterion such as distance. In this thesis, we first introduce HDFS and MapReduce, the most important components of the Hadoop platform: HDFS stores large amounts of data, and MapReduce processes them. Second, we introduce the K-Means clustering algorithm, which is among the most important data mining algorithms, and propose a method to improve it: the Canopy algorithm is run before K-Means. We then improve the Canopy algorithm using the “minimum maximum principle”. Finally, we implement the improved algorithm on Hadoop. Experimental results show that the improved algorithm raises both efficiency and accuracy and resolves the algorithm's sensitivity to noise and outliers.
|
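To make the idea of expressing K-Means in MapReduce more concrete, here is a rough sketch of one iteration written as a map step and a reduce step. It simulates the shuffle phase in memory with plain Python and does not use any Hadoop API; the function names `kmeans_map`, `kmeans_reduce`, and `kmeans_iteration` are hypothetical, and the thesis's actual job design is not described in this record.

```python
import math
from collections import defaultdict

def kmeans_map(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, (point, 1)

def kmeans_reduce(values):
    """Reduce step: average all points that share one center index."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)

def kmeans_iteration(points, centers):
    """One iteration: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)  # in-memory stand-in for the shuffle phase
    # A center that receives no points keeps its old position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]
```

A driver would call `kmeans_iteration` repeatedly until the centers stop moving. Emitting `(point, 1)` pairs rather than bare points leaves room for a combiner that pre-sums points on each node before the reduce step.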
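Total references: | 35 |
Author profile: | Analyzed the shortcomings of the K-Means algorithm and proposed an improvement: the Canopy algorithm is used as a preprocessing step before K-Means. The Canopy algorithm is then improved by applying the idea of the “minimum maximum principle”. Finally, the improved algorithm is implemented in MapReduce and verified experimentally. The results show that the MapReduce implementation improves the algorithm's efficiency, and that the improved algorithm increases accuracy and resolves the sensitivity of K-Means to noise points and outliers. |
Holding location: | Main Library B301 |
Call number: | 硕430109/1525 |
Open date: | 2015-06-14 |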