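Chinese title:	基于潜在语义索引模型的文本聚类在《说文》词义研究中的应用 (Application of Text Clustering Based on the Latent Semantic Indexing Model to the Study of Word Meanings in the Shuowen)
Name:	Wang Chunhui (王春辉)
Confidentiality level:	Internal
Discipline code:	081203
Discipline:	Computer Application Technology
Student type:	Master's
Degree:	Master of Engineering
Degree year:	2007
Campus:	Beijing campus
School:	School of Information Science
Research area:	Chinese information processing
First supervisor:	Song Jihua (宋继华)
First supervisor's affiliation:	Beijing Normal University
Submission date:	2007-06-12
Defense date:	2007-06-10
Foreign-language title:	THE APPLICATION OF TEXT CLUSTERING BASED ON LATENT SEMANTIC INDEX MODEL FOR THE RESEARCH OF SHUOWENJIEZI’S LEXICAL MEANING
Chinese keywords:	text clustering; latent semantic indexing; Shuowen Jiezi; vector space model; weight; similarity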
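Chinese abstract:	︿This project lies at the intersection of information science and traditional Chinese linguistics and philology. It studies the application of text mining to the word-meaning system of the Shuowen, with the aim of using modern computing techniques to mine and explain the lexical-semantic regularities of Chinese characters contained in the Shuowen, and thereby to build a formalized Shuowen word-meaning system. On the one hand, it seeks to bring modern research methods into traditional Shuowen studies so that the discipline can be better inherited and developed; on the other hand, it opens a new application area for modern computing techniques, which are improved and extended in practice, promoting the development of information technology. The quality of text clustering depends to a large extent on the quality of the mathematical model into which the text is converted. Starting from the compositional conventions of the Shuowen, this thesis builds a word-meaning knowledge base by extracting the main gloss word (主训词) and the semantic differentia (义值差) from each Shuowen gloss. In completing word segmentation, it also obtains the position of each glossing word within the gloss pattern; this position information is a key factor in improving the quality of the mathematical model. The vector space model is a widely used and effective text representation model. Its advantage is fast processing of large text collections; its drawback is that it performs only simple feature-word matching. Building on the vector space model, this thesis introduces the latent semantic indexing model, which not only overcomes the mechanical matching of the vector space model but also achieves good dimensionality reduction, and which has in recent years been one of the commonly used dimensionality reduction methods in information retrieval and other areas of Chinese information processing. A large number of clustering algorithms exist in the literature. The choice of algorithm depends on the type of data and on the purpose and application of the clustering. If clustering results are used as a tool for description or exploration, several algorithms can be tried on the same data to discover latent regularities the data may reveal. This thesis aims to use modern text clustering techniques to mine the word-meaning system contained in the Shuowen, in support of teaching and research in Classical Chinese. It therefore applies several clustering algorithms and reports experimental data and analysis. The system design and experiments are implemented on top of the open-source data mining tool WEKA. The completed work has two parts. First, following the WEKA data interface standard, a mathematical modeling tool for Shuowen glosses was designed and implemented; it can generate vector space models and latent semantic indexing models, and the ARFF files it outputs can be read by WEKA. Second, the ClusterEvaluation class in the WEKA clustering module was improved: the granularity of cluster-result visualization was refined so that the objects contained in each cluster are listed, which facilitates further analysis of the clustering results; in addition, a common evaluation method for clustering results, the overall similarity method, was added.﹀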
Foreign-language abstract:	︿As a cross-disciplinary project combining information science with traditional Chinese linguistics and ideography, this thesis studies the application of text clustering to the lexical meaning system of ShuoWenJieZi (SWJZ), aiming to mine lexical regularities of Chinese characters and to establish a formal SWJZ lexical system with modern computer technology. It serves two purposes: on the one hand, it provides traditional Chinese linguistics and ideography with digital research methods so that SWJZ studies can be better inherited and developed; on the other hand, it opens up a new area of application for modern computer technology and promotes the development of information technology in practice. The results of text clustering depend to a great extent on the quality of the mathematical model into which the text is converted. In this thesis, a lexical knowledge base is constructed by extracting ZHUXUNCI words, YIZHICHA words and ZENGBUXUNSHI words according to the style of SWJZ. At the same time, the position of every word is obtained. This position information plays an important role in improving the quality of the mathematical model. The Vector Space Model (VSM) is one of the most effective text representation models. Its advantage is that it can process large collections of texts quickly; its drawback is that it relies on simple feature matching. Building on the VSM, Latent Semantic Indexing (LSI) is introduced. The LSI model not only overcomes the mechanical-matching defect but also achieves good dimensionality reduction. In recent years it has become a popular dimensionality reduction method in the field of Chinese information processing. Many clustering algorithms exist in the literature. The choice of algorithm depends on the type of data and on the purpose and application. If the clustering results are used for description or exploration, a variety of algorithms can be tried, which may reveal latent regularities in the data. This thesis seeks to use modern clustering methods to mine the lexical system contained in SWJZ, so several different algorithms are used, and experimental data and analysis are presented. The system and experiments are built on WEKA, an open-source data mining tool. The completed work has two aspects. First, in accordance with the WEKA data interface standard, a tool was designed and developed that converts SWJZ into a mathematical model such as VSM or LSI; the ARFF files it outputs can be recognized and processed by WEKA. Second, the ClusterEvaluation class in the WEKA clustering module was improved: it now outputs every object in each cluster to facilitate further analysis, and a common cluster evaluation method called overall similarity was added.﹀
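Total references:	49
Author biography:	Wang Chunhui (王春辉, 1981-), female, College of Information Science and Technology, Beijing Normal University. Projects participated in: 2005: National Social Science Fund project "Construction of a Parallel Corpus of Ancient and Modern Chinese" (grant no. 05BYY022); 2005: major project of the Ministry of Education Research Center for Folklore, Script, and Classics, "Research on an Open Support Environment for Digitizing Modern-Era Stele Inscriptions and Handwritten Documents"; 2006: key project of the same center, "Design and Development of a Digital 'Shuowen Studies' Teaching and Research System".
Call number:	硕081203/0715
Release date:	2007-06-12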
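
Illustrative note on the abstract: the pipeline it describes (tokenized glosses, vector space model, ARFF output for WEKA, cluster evaluation by overall similarity) can be sketched roughly as follows. This is a minimal sketch and not the thesis's tool: the function names, the toy English tokens, and the `term_i` attribute naming are all hypothetical, plain term frequency stands in for whatever weighting the thesis uses, and "overall similarity" is assumed here to mean the mean pairwise cosine similarity within a cluster, which the abstract does not define. The LSI step (a truncated singular value decomposition of the term matrix) is omitted.

```python
import math
from itertools import combinations


def build_vsm(docs):
    """Return (vocabulary, term-frequency vectors) for tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok in doc:
            vec[index[tok]] += 1.0  # raw term frequency (assumed weighting)
        vectors.append(vec)
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def overall_similarity(cluster):
    """Mean pairwise cosine similarity (assumed definition); 1.0 for a singleton."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def to_arff(relation, vocab, vectors):
    """Serialize the vectors as a dense ARFF document with numeric attributes."""
    lines = [f"@relation {relation}", ""]
    # Attributes are named term_0, term_1, ... to avoid quoting issues.
    lines += [f"@attribute term_{i} numeric" for i in range(len(vocab))]
    lines += ["", "@data"]
    lines += [",".join(f"{x:g}" for x in vec) for vec in vectors]
    return "\n".join(lines) + "\n"


# Toy tokenized "glosses" (invented for illustration, not taken from the Shuowen).
toy_docs = [["water", "flow"], ["water", "flow", "big"], ["tree", "leaf"]]
toy_vocab, toy_vectors = build_vsm(toy_docs)

if __name__ == "__main__":
    print(to_arff("shuowen_toy", toy_vocab, toy_vectors))
    print(overall_similarity(toy_vectors[:2]))
```

Under the assumed definition, a cluster whose members share most of their gloss words scores close to 1, while a cluster mixing unrelated glosses scores close to 0, which is what makes the measure usable for comparing the output of different clustering algorithms on the same data.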