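Chinese title: | 面向网络舆情分析的文本聚类研究 |
Name: | |
Confidentiality level: | Public |
Discipline code: | 120502 |
Discipline / major: | |
Student type: | Master's |
Degree: | Master of Management |
Degree year: | 2010 |
Campus: | |
School: | |
Research area: | Information analysis and processing |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2010-06-22 |
Defense date: | 2010-05-30 |
Foreign-language title: | A STUDY OF THE TEXT CLUSTERING ORIENTED TO THE ONLINE PUBLIC OPINION TEXT ANALYSIS |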
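Chinese abstract: |
With advances in technology, the Internet has developed rapidly. The number of Internet users in China has grown at a striking pace, and China now has more Internet users than any other country. The Internet has become the "fourth medium" after newspapers and periodicals, radio, and television, and has had a profound influence on people's lives. The convenience of the network leads more and more users to express their views and voice public sentiment online, which has gradually given rise to online public opinion. Analyzing online public opinion is important for maintaining social stability and promoting national development. In monitoring and controlling online public opinion, topic detection and hot-spot detection are the key tasks, and the quality of text clustering affects the final public opinion detection results and, in turn, the government's final decisions. This thesis first describes the basic state of research on online public opinion, analyzes and summarizes the characteristics of online public opinion and how public opinion information forms and develops, and briefly introduces several existing online public opinion monitoring systems. It then analyzes, studies, and summarizes in detail the clustering techniques commonly used for Chinese text, lists foundational work on Chinese word segmentation and text feature representation, reviews existing results on Chinese text similarity and text feature dimensionality reduction, and describes the various methods for similarity computation and feature dimensionality reduction in detail. Building on this review of existing text clustering methods, the thesis analyzes the problems of k-means, a partitioning algorithm based on the vector space model, and proposes corresponding improvements. The improved algorithm does not require the number of clusters to be fixed in advance; instead, it determines the number of clusters automatically by aggregating texts according to their similarity, and it does not require initial cluster centers to be chosen beforehand. To verify the effectiveness of the algorithm, the thesis runs an experiment on 100 articles, clustering the text set with both the traditional k-means method and the improved method. The experimental results show that traditional k-means, even when given the correct number of clusters in advance, reaches only a local optimum if the initial cluster centers are poorly chosen, whereas the improved method, given a suitably chosen similarity threshold, can effectively determine the number of text clusters automatically and removes traditional k-means' dependence on the initial cluster centers.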
|
Foreign-language abstract: |
With the development of technology, the Internet has grown rapidly. The number of Internet users in China has increased at a striking rate, and China now has more Internet users than any other country. The Internet has become the "fourth medium" after the press, radio, and television, and has a profound impact on people's lives. The convenience of Internet communication encourages more and more users to express their views and opinions online, which gives rise to online public opinion. Analyzing online public opinion is of great significance for maintaining social stability and promoting national development, and the quality of text clustering affects the final results of public opinion detection and, in turn, the government's final decisions. First, this thesis explains the basic situation of online public opinion research, analyzes and summarizes the characteristics of online public opinion and its formation and development, and gives a brief introduction to some existing online public opinion monitoring systems. Second, it analyzes, studies, and summarizes common Chinese text clustering techniques; cites basic research on topics such as word segmentation and text feature representation; reviews existing results on Chinese text similarity and feature dimensionality reduction; and details the various methods of text similarity calculation and feature reduction. Based on a review of existing text clustering methods, the thesis addresses the problems of k-means, a partitioning algorithm based on the vector space model, and improves on it. The improved method does not require a predetermined number of categories; instead, it determines the number automatically by aggregating texts according to their similarity. To verify the effectiveness of the improved algorithm, the thesis runs an experiment on 100 articles, clustering the text set with both the traditional k-means method and the improved clustering method. The experimental results show that, if the initial cluster centers are chosen poorly, traditional k-means reaches only a local optimum even when the correct number of categories is given in advance. The improved method, provided an appropriate similarity threshold is selected, can effectively and automatically determine the number of text categories and eliminates the influence of the initial cluster centers on the clustering result.
|
Total references: | 50 |
Holdings number: | 硕120502/1009 |
Release date: | 2010-06-22 |
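
The improved clustering described in the abstract replaces a fixed cluster count with a similarity threshold. This record does not give the algorithm's details, so the following is only a minimal sketch of one common threshold-based scheme (single-pass clustering with cosine similarity over term-frequency vectors), not the thesis's actual method. The function names `cosine` and `threshold_cluster`, the sample tokens, and the threshold values are all assumptions made for illustration; real Chinese text would first need word segmentation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def threshold_cluster(docs, threshold):
    """Single-pass clustering of tokenized documents (hypothetical helper).

    docs: list of token lists. Returns a list of clusters, each a list of
    document indices. No cluster count or initial centers are supplied.
    """
    centroids = []  # summed term counts per cluster; cosine ignores scale
    clusters = []
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            # No existing cluster is similar enough: start a new one.
            centroids.append(Counter(vec))
            clusters.append([i])
        else:
            # Join the most similar cluster and update its centroid.
            centroids[best].update(vec)
            clusters[best].append(i)
    return clusters


# Assumed toy data: four already-tokenized documents on two topics.
docs = [
    ["earthquake", "rescue", "relief"],
    ["earthquake", "relief", "donation"],
    ["stock", "market", "index"],
    ["stock", "index", "fund"],
]
clusters = threshold_cluster(docs, 0.5)
```

With a threshold of 0.5 the four toy documents fall into two clusters, `[[0, 1], [2, 3]]`; raising the threshold to 0.9 leaves every document in its own cluster, which illustrates why the abstract stresses choosing the similarity threshold appropriately. A single-pass scheme like this one also depends on document order, which is a limitation of the sketch and says nothing about the thesis's method.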