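Chinese title: | 现代汉语空间短语的自动识别 (Automatic Identification of Spatial Phrases in Modern Chinese) |
Name: | |
Discipline code: | 050102 |
Discipline and major: | |
Student type: | Master's |
Degree: | Master of Arts |
Degree year: | 2011 |
Campus: | |
School: | |
Research direction: | Chinese information processing |
First supervisor's name: | |
First supervisor's affiliation: | |
Submission date: | 2011-06-15 |
Defense date: | 2011-06-01 |
Foreign-language title: | Automatic Identification of Chinese Spatial Phrase |
Chinese abstract: |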
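HNC theory distinguishes three basic phrase types: temporal phrases, spatial phrases, and quantity phrases. Spatial phrases are an important one among them. Research on their automatic identification can facilitate the extraction and filtering of specific information, and can also serve as a reference for identifying other types of phrases.
As for method, this thesis identifies spatial phrases in text with a three-level scheme: pattern matching, part-of-speech identification, and judgment against the HNC knowledge base. First, candidate spatial phrases are matched in the text according to the composition patterns of spatial phrases. Second, the head word of each candidate is extracted. Third, the part of speech of the head word is used to judge whether it can serve as the head of a spatial phrase. Finally, if part of speech alone cannot settle the question, knowledge encoded in the HNC symbols is used to make the final decision.
The main work of this thesis covers the following aspects:
1. Building a corpus of spatial phrases. Spatial phrases were collected from the annotated 1998 People's Daily corpus, the Peking University corpus, and the text corpora of the Chinese Information Processing Resource Platform (中文信息处理资源平台), and were then annotated with information including the front boundary, the head word, and the rear boundary.
2. Analyzing the composition patterns of spatial phrases. Summarizing these patterns is the basis for automatic identification. Starting from the collected corpus and an analysis of each constituent of a spatial phrase, the thesis first summarizes the forms that spatial phrases take, and then derives their composition patterns from those forms.
3. Building the knowledge bases needed for identification. These include the front-boundary base, the rear-boundary base, and the word-list base, concept-category base, and HNC-symbol base of the HNC word knowledge base.
4. Designing and implementing the automatic identification program. The program is implemented in C#, with its classes and methods designed in an object-oriented way. The main modules are: word segmentation, loading the word lists, loading the text to be processed, preprocessing the loaded text, finding activation points, returning the positions of the front and rear boundaries and extracting the candidate string, extracting the head word of the spatial phrase, judging the head word with part-of-speech knowledge, and judging the head word with HNC knowledge.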
|
Foreign-language abstract: |
HNC theory has three basic phrase types: temporal phrases, spatial phrases, and quantity phrases. Spatial phrases are an important one among them. Research on the automatic identification of spatial phrases can facilitate the extraction and filtering of specific information, and can also serve as a reference for identifying other types of basic phrases. This thesis uses three methods in sequence to identify spatial phrases in text: pattern matching, part-of-speech identification, and the HNC knowledge base. First, candidate phrases are matched in the text according to the composition patterns of spatial phrases. Then the head word of each candidate is extracted. In the third step, the part of speech of the head word is used to determine whether it can serve as the head of a spatial phrase. If this cannot be determined for certain, HNC knowledge is used to make the final decision.
The main work of this thesis includes the following aspects:
1. Building a spatial phrase corpus. Texts were collected from the 1998 People's Daily corpus, the Peking University corpus, and the Chinese Information Processing Resource Platform, and were annotated with the front boundary, the head word, and the rear boundary.
2. Analyzing the patterns of spatial phrases. Summarizing these patterns is the basis for automatic identification. From the corpus, combined with an analysis of the elements of a spatial phrase, the thesis summarizes the forms of composition and then, on that basis, the patterns of spatial phrases.
3. Constructing the knowledge bases for automatic identification. These include the front-boundary base, the rear-boundary base, and other HNC-based knowledge bases.
4. Designing and implementing the automatic identification program. The program uses C# to design the related classes and methods.
The main modules include: word segmentation, loading the word lists, loading the text to be processed, preprocessing the loaded text, finding activation points, extracting the head word of the phrase, using part-of-speech knowledge to judge the head word, and using HNC knowledge to judge the head word.
|
Total number of references: | 18 |
Call number: | 硕050102/1129 |
Release date: | 2011-06-15 |
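
The three-level procedure described in the abstract (pattern matching, then a part-of-speech check on the head word, then a knowledge-base check) can be illustrated with a minimal sketch. This is not the thesis's program, which the abstract says is written in C#. It is a Python illustration under several assumptions: the input is already segmented and tagged in a `word/tag` style with Peking University style tags (`ns` place name, `s` place word, `f` localizer, `t` time word, `v` verb), the rear boundary acts as the activation point, and the head word is simply the token just before the rear boundary. The boundary lexicons and the `CONCEPT_IS_SPATIAL` table are tiny hypothetical stand-ins for the thesis's boundary bases and HNC knowledge base, and every function name is invented for illustration.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical miniature lexicons; the real boundary bases are not given in the record.
FRONT_BOUNDARIES = {"在", "从", "到"}
REAR_BOUNDARIES = {"上", "里", "中", "下", "内", "外"}

# Assumed part-of-speech rules: these tags decide the question on their own.
SPATIAL_POS = {"ns", "s"}      # place names and place words: accept
NON_SPATIAL_POS = {"t", "v"}   # time words and verbs: reject

# Hypothetical stand-in for an HNC concept lookup (word -> is it a spatial concept?).
CONCEPT_IS_SPATIAL = {"桌子": True, "学校": True, "会议": False}


def parse_tagged(text: str) -> List[Token]:
    """Split 'word/tag word/tag ...' into (word, tag) pairs."""
    tokens = []
    for item in text.split():
        word, tag = item.rsplit("/", 1)
        tokens.append((word, tag))
    return tokens


def find_candidates(tokens: List[Token], max_len: int = 6) -> Iterator[Tuple[int, int]]:
    """Level 1: treat each rear boundary as an activation point and look
    back at most max_len tokens for the nearest front boundary."""
    for end, (word, _) in enumerate(tokens):
        if word not in REAR_BOUNDARIES:
            continue
        for start in range(end - 1, max(end - max_len, 0) - 1, -1):
            if tokens[start][0] in FRONT_BOUNDARIES:
                if end - start >= 2:  # need at least one token between the boundaries
                    yield start, end
                break


def judge_head(head: Token) -> bool:
    """Levels 2 and 3: decide by part of speech if possible,
    otherwise fall back to the concept lookup."""
    word, pos = head
    if pos in SPATIAL_POS:
        return True
    if pos in NON_SPATIAL_POS:
        return False
    return CONCEPT_IS_SPATIAL.get(word, False)


def identify(tokens: List[Token]) -> List[str]:
    """Return the surface strings of candidates whose head word passes the checks."""
    phrases = []
    for start, end in find_candidates(tokens):
        head = tokens[end - 1]  # simplification: the token just before the rear boundary
        if judge_head(head):
            phrases.append("".join(word for word, _ in tokens[start:end + 1]))
    return phrases
```

With these toy lexicons, `identify(parse_tagged("他/r 把/p 书/n 放/v 在/p 桌子/n 上/f 。/w"))` returns `['在桌子上']`. The candidate `在/p 去年/t 中/f` is rejected at the part-of-speech level, and `在/p 会议/n 上/f` is rejected only at the lookup level. Ordering the checks from cheap to expensive mirrors the abstract's description, in which HNC knowledge is consulted only when part of speech cannot decide.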