Thesis information

Chinese title:

 面向信息处理的现代汉语名动词研究及标注    

Name:

 Zhou Jiaomei (周姣美)

Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 050102    

Discipline:

 Linguistics and Applied Linguistics

Student type:

 Master's student

Degree:

 Master of Arts

Degree type:

 Academic degree

Degree year:

 2022    

Campus:

 Beijing campus

School:

 School of International Chinese Language Education

Research direction:

 Computational linguistics

First supervisor:

 Liu Zhiying (刘智颖)

First supervisor's affiliation:

 College of Chinese Language and Culture, Beijing Normal University

Submission date:

 2022-06-14    

Defense date:

 2022-06-14    

English title:

 Research and annotation of modern Chinese nominal verbs for information processing    

Chinese keywords:

 名动词 ; 词语多能指数 ; 词性标注 ; 中文信息处理    

English keywords:

 Nominal verb ; Lexical polyfunctionality index ; Part-of-speech tagging ; Chinese information processing

Chinese abstract (English translation):

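Some verbs in Modern Chinese have the grammatical properties of nouns: dictionaries mark them only as verbs, yet they frequently perform the grammatical functions of nouns. Such words are called nominal verbs. Taking Modern Chinese nominal verbs as its object of study and building on linguistic research into them, this thesis constructs a corpus annotated for nominal verbs, introduces the Shannon-Wiener index to compute a polyfunctionality index for nominal verbs, and runs tagging experiments with a statistical part-of-speech tagging algorithm and a rule-based nominal verb tagging method, achieving automatic tagging of nominal verbs and providing a tagging scheme. The main work is as follows: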

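First, a nominal verb corpus is annotated and a list of Modern Chinese nominal verbs is constructed. The material is drawn from the State Language Commission's Modern Chinese General Balanced Corpus and comprises 600 texts across several genres, including prose, news, and official documents, totaling 1 million characters. It is annotated with the word class tag set of the 《信息处理用现代汉语词类标记规范》 (Specification for Modern Chinese Word Class Tags for Information Processing, revised draft), following functional criteria for nominal verbs. The result is an annotated corpus containing nominal verb tags, from which a list of 178 nominal verbs is built.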

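Second, the Shannon-Wiener index is introduced as a quantitative measure of how flexible the grammatical functions of Chinese nominal verbs are. It is defined as the lexical polyfunctionality index (H), and the results are applied to the automatic tagging of nominal verbs. According to the calculations, the mean polyfunctionality index of nominal verbs is 0.585, indicating fairly strong polyfunctionality. Nominal verbs that also belong to the adjective class are a more functionally flexible subgroup, with a mean polyfunctionality index of 0.825; in context they take on verb, noun, and adjective functions. The polyfunctionality index can inform both the part-of-speech tagging of nominal verbs and the revision of their part-of-speech labels in dictionaries.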

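Third, automatic tagging of nominal verbs is implemented, with separate experiments using a statistical part-of-speech tagging algorithm and a rule-based tagging method. The statistical algorithm is the widely used and generally effective Hidden Markov Model with the Viterbi algorithm (HMM+Viterbi); the rule-based method is an automatic nominal verb tagging method that combines dependency parsing with the polyfunctionality index (DP+H). On the same dataset, the HMM+Viterbi model performs poorly on nominal verb tags, with an accuracy of 50% in the open test and 42.86% in the closed test, whereas the dependency-based nominal verb recognition rules (DP rules) reach a nominal verb recognition accuracy of 92.86%. On a larger dataset, the DP rules achieve an overall tagging accuracy of 81%; adding the H-value rules raises the overall accuracy from 81% to 85%, a fairly good result overall.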

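This study builds a number of language resources: a list of Modern Chinese nominal verbs, a finely annotated nominal verb corpus, nominal verb polyfunctionality indices, a rule base of dependency-based nominal verb recognition rules, and a nominal verb tagging scheme, filling a gap in the field of nominal verb tagging. These resources can be shared and offer new data, ideas, and methods for research on Modern Chinese word classes. For Chinese information processing, the thesis provides not only a solution for tagging nominal verbs but also ideas and approaches for tagging and disambiguating other word classes.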

English abstract:

Some verbs in Modern Chinese have the syntactic properties of nouns: they are marked only as verbs in the dictionary but often show the syntactic functions of nouns. These words are called Chinese nominal verbs. In this paper, we take Modern Chinese nominal verbs as the research object. Based on linguistic research on Chinese nominal verbs, we annotate a nominal verb corpus, introduce the Shannon-Wiener index to calculate a polyfunctionality index for nominal verbs, and conduct annotation experiments with a statistical model-based part-of-speech tagging algorithm and a rule-based nominal verb tagging method, in order to realize the automatic annotation of nominal verbs and provide an annotation scheme. The main work of this research is as follows.

Firstly, we annotated the corpus and constructed a Modern Chinese nominal verb word list. The texts were selected from the Modern Chinese General Balanced Corpus of the State Language Commission: 600 texts in genres including prose, news, and official documents, with a total size of 1 million characters. They were annotated according to the part-of-speech tag set of the "Modern Chinese Word Marker Specifications for Information Processing" (revised version), following the functional criteria for nominal verbs. The result is an annotated corpus that includes a part-of-speech tag for nominal verbs, from which a nominal verb word list of 178 words was constructed.

Secondly, the Shannon-Wiener index was introduced as a quantitative index to measure the flexibility of the syntactic functions of Chinese nominal verbs. It was defined as the lexical polyfunctionality index (H), and the index values were applied to the automatic annotation of Chinese nominal verbs. According to the results, the average polyfunctionality index of nominal verbs is 0.585, which indicates strong polyfunctionality. Nominal verbs that also carry the adjective part-of-speech tag form a more flexible class, with an average polyfunctionality index of 0.825; in context they take on the syntactic functions of verbs, nouns, and adjectives. The lexical polyfunctionality index can provide a reference for the part-of-speech tagging of nominal verbs and for the revision of nominal verb part-of-speech labels in dictionaries.
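As a rough illustration of the index described above, the Shannon-Wiener index of a word can be computed from how often it occurs in each syntactic function: H = -Σ p_i ln p_i, where p_i is the share of occurrences in function i. The sketch below is not the thesis's implementation. The function name, the tag labels, the counts, and the choice of the natural logarithm are all assumptions made for illustration; the abstract does not state the log base or the inventory of functions used.

```python
import math


def polyfunctionality_index(function_counts):
    """Shannon-Wiener index H = -sum(p_i * ln p_i) over a word's observed functions.

    function_counts maps a function label (hypothetical labels such as "v", "n")
    to the number of corpus occurrences of the word in that function.
    The natural logarithm is an assumption; the abstract does not give the base.
    """
    total = sum(function_counts.values())
    if total == 0:
        raise ValueError("word has no observed occurrences")
    h = 0.0
    for count in function_counts.values():
        if count > 0:  # functions that never occur contribute nothing
            p = count / total
            h -= p * math.log(p)
    return h


# Hypothetical counts, not taken from the thesis corpus.
only_verb = {"v": 100}               # a single function: H is 0
verb_and_noun = {"v": 50, "n": 50}   # two equally frequent functions: H is ln 2, about 0.693
three_way = {"v": 40, "n": 40, "a": 20}

for name, counts in [("only_verb", only_verb),
                     ("verb_and_noun", verb_and_noun),
                     ("three_way", three_way)]:
    print(name, round(polyfunctionality_index(counts), 3))
```

Under these assumptions, H is zero for a word confined to one function and grows as occurrences spread more evenly over more functions, which fits the abstract's use of H as a measure of functional flexibility.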

Thirdly, to realize the automatic annotation of nominal verbs, a statistical model-based part-of-speech tagging algorithm and a rule-based part-of-speech tagging method were each used in nominal verb annotation experiments. The statistical algorithm is the widely used and effective Hidden Markov Model with the Viterbi algorithm (HMM+Viterbi), and the rule-based method is an automatic nominal verb annotation method that integrates dependency analysis rules and the polyfunctionality index (DP+H). On the same dataset, the HMM+Viterbi model performs poorly on nominal verb tags, with an accuracy of 50% in the open test and 42.86% in the closed test, while the DP+H method is highly accurate: an accuracy of 92.86% is achieved by applying only the dependency-based nominal verb recognition rules (DP rules). On a larger dataset, the total accuracy of the DP rules is 81%; adding the H-value rules increases the total accuracy from 81% to 85%, a good annotation result overall.
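For readers unfamiliar with the statistical baseline, the following is a minimal sketch of Viterbi decoding over a Hidden Markov Model for part-of-speech tagging. It is a generic textbook formulation, not the thesis's code. The tag labels ("v", "vn", "n"), the toy probabilities, and the probability floor for unseen events are all hypothetical; the abstract does not describe the model's parameters or smoothing.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under an HMM.

    start_p[t]     : probability that a sentence starts with tag t
    trans_p[p][t]  : probability of tag t following tag p
    emit_p[t][w]   : probability of tag t emitting word w
    Missing or zero probabilities are replaced by `floor` (an assumed, crude
    stand-in for smoothing) so that logarithms stay finite.
    """
    if not words:
        return []

    def logp(x):
        return math.log(max(x, floor))

    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: logp(start_p.get(t, 0.0)) + logp(emit_p[t].get(words[0], 0.0))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + logp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + logp(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model with made-up probabilities; "vn" stands for a nominal verb tag.
TAGS = ["v", "vn", "n"]
START = {"v": 0.6, "vn": 0.1, "n": 0.3}
TRANS = {
    "v": {"v": 0.1, "vn": 0.5, "n": 0.4},
    "vn": {"v": 0.2, "vn": 0.1, "n": 0.7},
    "n": {"v": 0.5, "vn": 0.2, "n": 0.3},
}
EMIT = {
    "v": {"进行": 0.5, "研究": 0.5},
    "vn": {"研究": 1.0},
    "n": {"问题": 1.0},
}

print(viterbi(["进行", "研究"], TAGS, START, TRANS, EMIT))  # ['v', 'vn']
print(viterbi(["研究", "问题"], TAGS, START, TRANS, EMIT))  # ['v', 'n']
```

In this toy model the same word, 研究, is tagged "vn" after 进行 and "v" before 问题, because the decoder relies only on the neighbouring tag and the emission probabilities. That limited context may be one reason a plain HMM struggles with nominal verbs, although the abstract itself does not analyse the cause.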

This study constructs a number of linguistic resources, including a Modern Chinese nominal verb word list, a finely annotated nominal verb corpus, nominal verb polyfunctionality indices, a dependency-based nominal verb recognition rule base, and a nominal verb annotation scheme, which together fill a gap in the field of nominal verb annotation. These language resources can be shared to provide new data, ideas, and methods for research on Modern Chinese word classes. For the field of Chinese information processing, this paper provides not only solutions for nominal verb annotation but also ideas and solutions for the annotation and disambiguation of other word classes.

Total number of references:

 49    

Library call number:

 硕050102/22014    

Open-access date:

 2023-06-14    
