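Chinese title: | 基于小学数学应用题的中文分词修正和词性标注修正的研究 |
Name: | |
Discipline code: | 081202 |
Discipline: | |
Student type: | Master's |
Degree: | Master of Engineering |
Degree year: | 2011 |
Campus: | |
School: | |
Research area: | Educational applications of computers |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2011-05-31 |
Defense date: | 2011-05-21 |
Foreign-language title: | The Correcting Method of Chinese segmentation and POS tagging for Arithmetic Word Problem of Primary School |
Chinese abstract: |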
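Chinese word segmentation and part-of-speech (POS) tagging are difficult problems in natural language understanding and foundational problems for Chinese intelligent tutoring systems for mathematics. Raising the accuracy of automatic segmentation and POS tagging, and correcting their output, provides usable input for automatically understanding and solving primary school arithmetic word problems. Starting from ICTCLAS segmentation output, this thesis tries several segmentation-correction and POS-correction methods and builds an automatic Chinese word segmentation correction system and a POS correction system for multi-category words, in order to provide segmentation better suited to primary school arithmetic word problem text and more reasonable POS tags.
Segmentation correction has two stages: error detection and error correction. For error detection, the thesis proposes a difference detection algorithm that automatically collects the set of segmentation difference fields. For error correction, a rule-based method first fixes segmentation errors with distinctive features: correction rules suited to primary school arithmetic word problem text were drawn up through manual analysis, a small purpose-built tool was used to choose an appropriate way of implementing each rule, and the user dictionary and the error-instance dictionary were obtained quickly. A statistical method then corrects new words and segmentation errors that show no clear regularity: rules were formulated that extend the selectable lower bound of the threshold, and a mutual information threshold of 0.000190 suited to primary school arithmetic word problems was set. The automatic segmentation correction system combines rules and statistics and performed well on primary school arithmetic word problem text.
Based on its analysis, the thesis focuses POS correction on content words that belong to more than one POS category (multi-category words); this work also has two stages. In the extraction stage, the thesis proposes an automatic discovery algorithm to obtain the set of multi-category words. In the correction stage, POS correction rules suited to primary school arithmetic word problem text were drawn up through manual analysis; when correcting POS tags with the Weka toolkit, the rules.DecisionTable algorithm was selected as the classifier by comparing accuracy; when correcting POS tags with a CRF toolkit, the feature template was modified, which raised correction accuracy. The multi-category word POS correction system combines rules with the CRF toolkit and clearly improves on the ICTCLAS POS tagging results.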
|
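The statistical step described in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's exact mutual information formula, so the code below assumes the common pointwise mutual information (PMI) over adjacent tokens. The function names `pmi_scores` and `merge_candidates` are hypothetical, and the `threshold` argument is left to the caller because the reported value 0.000190 is tied to the thesis's own definition and may not be comparable to PMI scores.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return PMI (log base 2) for every adjacent token pair.

    `sentences` is a list of already-segmented sentences (lists of tokens).
    Assumption: plain PMI; the thesis may define mutual information differently.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (left, right), count in bigrams.items():
        p_pair = count / n_bi
        p_left = unigrams[left] / n_uni
        p_right = unigrams[right] / n_uni
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


def merge_candidates(sentences, threshold):
    """Adjacent pairs whose PMI reaches the threshold: candidates for merging into one word."""
    return {pair for pair, score in pmi_scores(sentences).items() if score >= threshold}
```

Pairs returned by `merge_candidates` would be candidate new words that the base segmenter split apart; in a fuller system they would presumably still be checked against the rule set and dictionaries before being accepted.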
Foreign-language abstract: |
Chinese word segmentation and POS tagging are not only difficult problems in natural language understanding but also basic problems for Chinese intelligent tutoring systems for mathematics. Improving the accuracy of word segmentation and POS tagging, and automatically correcting their errors, strongly supports the realization of such systems. This thesis focuses on methods for correcting word segmentation and POS tagging errors, in order to provide segmentation better suited to primary school mathematics word problems.
Chinese word segmentation correction is divided into two stages: finding the differences and fixing them. In the finding stage, a difference detection algorithm is proposed to automatically obtain the set of differences. In the correcting stage, correction rules adapted to primary school mathematics word problems were developed through manual analysis, and the mutual information threshold was set to 0.000190 through experiments. The automatic correction system combines rules, mutual information, and instances, and achieved good results.
POS correction focuses on words that have two or more POS classes. The process is also divided into two stages. In the word extraction stage, an algorithm is proposed to automatically obtain the set of words that have two or more POS classes. In the correcting stage, correction rules adapted to primary school mathematics word problems were developed through manual analysis. In the Weka-based correction method, the “rules.DecisionTable” algorithm was chosen as the classification algorithm by comparing accuracy. In the CRF-based correction method, the feature template was modified, which raised accuracy.
The automatic correction system combines the rule-based method with the CRF-based method and significantly improved on the POS tagging results of ICTCLAS.
|
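The word extraction stage mentioned in the abstract can be pictured with a small sketch. The thesis's actual discovery algorithm is not described in this record, so the code assumes the simplest reading: a word counts as having two or more POS classes if it appears with at least two distinct tags in a tagged corpus. The function name `find_multi_pos_words`, the `min_tags` parameter, and the tag strings in the usage example are illustrative assumptions.

```python
from collections import defaultdict


def find_multi_pos_words(tagged_sentences, min_tags=2):
    """Map each word seen with at least `min_tags` distinct POS tags to its sorted tag list.

    `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs.
    Assumption: tags are compared as plain strings, with no merging of sub-tags.
    """
    tags_by_word = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tags_by_word[word].add(tag)
    return {
        word: sorted(tags)
        for word, tags in tags_by_word.items()
        if len(tags) >= min_tags
    }


# Usage: "花" appears once as a noun ("n") and once as a verb ("v").
corpus = [
    [("花", "n"), ("开", "v")],
    [("花", "v"), ("钱", "n")],
]
multi = find_multi_pos_words(corpus)
```

The resulting set would then be the input to a correction step (rules, or a classifier such as those the abstract names), which decides the right tag for each occurrence in context.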
Total references: | 80 |
Call number: | 硕081202/1117 |
Release date: | 2011-05-31 |