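Chinese title: | 基于维基百科的语义相关度学习算法研究 |
Name: | |
Subject code: | 081203 |
Discipline: | |
Student type: | Master's |
Degree: | Master of Engineering |
Degree year: | 2014 |
Campus: | |
School: | |
Research direction: | Databases and data mining |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2014-05-30 |
Defense date: | 2014-05-29 |
Foreign-language title: | RESEARCH ON A LEARNING ALGORITHM OF SEMANTIC RELATEDNESS USING WIKIPEDIA |
Chinese abstract (translated): |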
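As a way of measuring how closely concepts or entities are related, semantic relatedness computation has attracted growing attention from researchers and plays an important role in applications such as information retrieval, text mining, and clustering and classification. In recent years, as Wikipedia has matured, it has become a valuable knowledge resource for semantic relatedness computation, and several Wikipedia-based methods have become well known. Most existing methods, however, use only certain specific kinds of information in Wikipedia, and in some cases they produce different results for the same pair of concepts or entities. To advance semantic relatedness computation and bring its results as close as possible to human annotations, it is important to study how to select appropriate semantic features and metrics. This thesis takes Wikipedia as its research setting. First, it summarizes the basic theory of semantic relatedness computation, briefly introduces the relevant concepts and background knowledge, and analyzes the problems of traditional methods. Second, it extracts the link structure and category system of entries as feature information and combines them with three different metrics to compute the raw semantic relatedness for each feature; the analysis shows that different features affect the results to different degrees. Third, it proposes a supervised learning algorithm that trains weights over the different features to improve the computation and to reduce, as far as possible, the effect of differences between features. The weighted results are then evaluated by their correlation with standard human-judgment datasets. Compared with the three methods introduced in the thesis, the proposed learning-based algorithm, together with the study of entry features, effectively improves the results, which demonstrates its feasibility. Finally, the thesis extracts the cross-language links between the Chinese and English Wikipedias, builds a cross-language mapping between entries, and applies the learning algorithm to the cross-language setting, obtaining good experimental results. In summary, the proposed learning-based method for computing semantic relatedness from Wikipedia takes into account the semantic properties of Wikipedia's complex structure and achieves higher accuracy than some existing methods; it can be used to compute the semantic relatedness of concepts or entities within a single language. To address the cross-language case, the method is extended to a multilingual environment, where it computes semantic relatedness between entries in different languages and adds to the research on semantic relatedness computation in the cross-language domain.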
|
Foreign-language abstract: |
Semantic relatedness computation, a way of measuring how closely concepts or entities are related, has attracted growing attention from researchers and plays an important role in information retrieval, text mining, and clustering and classification applications. In recent years, with the continuous development of Wikipedia, it has become a significant knowledge resource for semantic computing, and a series of Wikipedia-based methods for computing semantic relatedness have become well known. Each of these approaches uses a particular kind of information in Wikipedia, such as links, categories, or text, and each relies on an empirically designed metric to assess the relatedness between concepts or entities. In some cases, however, these approaches produce very different results for the same pair of objects. To advance semantic relatedness computation and bring its results as close as possible to human judgments, it is important to study how to select appropriate semantic features and metrics. This thesis takes Wikipedia as its research setting. Firstly, we summarize the basic theory of semantic relatedness computation, briefly introduce the relevant concepts and background knowledge, and analyze the advantages and disadvantages of conventional Wikipedia-based methods. Secondly, we extract the link structure and category taxonomy of entries as feature information and combine them with three different metrics to obtain the raw semantic relatedness for each feature. Our analysis shows that different features of entries affect the results of semantic relatedness computation to different degrees.
Thirdly, we propose a supervised learning algorithm that trains weights over the different features, which improves the computation and reduces the effects caused by differences between features. We then evaluate the weighted results by their correlation with common human-judgment data sets. Compared with the three methods described in this thesis, the proposed learning-based algorithm, together with the study of different features, effectively improves the performance of semantic relatedness computation, which demonstrates its feasibility. Finally, we extract the cross-lingual links between the Chinese and English Wikipedias, build a cross-language mapping between entries, and apply the learning-based approach to the cross-language setting, where it achieves good experimental results. In short, the proposed learning-based method for computing semantic relatedness not only takes into account the semantic properties of Wikipedia's complex structure but also obtains higher accuracy than some existing methods. It can be used to compute the semantic relatedness of concepts or entities within the same language. To address the cross-language case, we extend the method to a multi-language environment, where it computes the semantic relatedness between entries in different languages and enriches the research on semantic relatedness computation in the cross-lingual domain.
|
Total references: | 49 |
Author biography: | 1. The paper "Discovering Missing Semantic Relations between Entities in Wikipedia" was published at the ISWC 2013 international conference. 2. The paper "Learning to Compute Semantic Relatedness Using Knowledge from Wikipedia" was accepted by the APWeb 2014 conference. |
Call number: | 硕081203/1419 |
Release date: | 2014-05-30 |
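
Illustrative sketch (not taken from the thesis): the abstract describes computing raw relatedness scores from different Wikipedia features, learning weights over them with a supervised algorithm, and evaluating the weighted result by its correlation with human judgments. The record does not name the three metrics or the learner, so everything below is an assumption made for illustration: Jaccard overlap stands in for one raw metric, a least-squares linear model fitted by gradient descent stands in for the weight learning, and Spearman rank correlation stands in for the evaluation. All names (`jaccard`, `fit_weights`, `predict`, `spearman`) and all numbers are hypothetical toy values, not data or results from the thesis.

```python
from typing import List, Sequence, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Set overlap in [0, 1]; an assumed stand-in for one raw relatedness metric."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Weighted sum of raw scores; the last entry of w is a bias term."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w[-1]


def fit_weights(features: Sequence[Sequence[float]], targets: Sequence[float],
                lr: float = 0.5, epochs: int = 5000) -> List[float]:
    """Least-squares weights via batch gradient descent.

    Assumes a handful of features with scores in [0, 1]; the step size is
    not tuned for anything beyond that.
    """
    n, d = len(features), len(features[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, y in zip(features, targets):
            err = predict(w, x) - y
            for j in range(d):
                grad[j] += err * x[j]
            grad[d] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Toy data, invented for illustration: one row per entry pair, holding a
# link-based score and a category-based score, plus a made-up human rating.
features = [[0.9, 0.8], [0.7, 0.4], [0.2, 0.3], [0.1, 0.0], [0.5, 0.6]]
human = [0.86, 0.58, 0.24, 0.06, 0.54]

weights = fit_weights(features, human)
combined = [predict(weights, x) for x in features]
print("learned weights (link, category, bias):", weights)
print("Spearman correlation with toy ratings:", spearman(combined, human))
```

In practice each feature column would come from a metric applied to real link sets or category sets, for example `jaccard(outlinks_a, outlinks_b)` with hypothetical sets of linked page ids, and the correlation would be measured on held-out human-judgment pairs rather than on the training pairs used here.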