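Thesis information

Chinese title:

 中文繁简转换研究与系统实现 (Research on Traditional-Simplified Chinese Conversion and System Implementation)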

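Name:

 Feng Xia (冯霞)

Confidentiality level:

 Internal

Discipline code:

 050102

Discipline:

 Linguistics and Applied Linguistics

Student type:

 Master's

Degree:

 Master of Arts

Degree year:

 2008

Campus:

 Beijing campus

College:

 College of Chinese Language and Culture

Research area:

 Chinese Information Processing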

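First advisor:

 Zhu Xiaojian (朱小健)

First advisor's institution:

 Beijing Normal University

Second advisor:

 Miao Chuanjiang (苗传江)

Submission date:

 2008-06-18

Defense date:

 2008-06-07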

English title:

 RESEARCH ON THE CONVERSION BETWEEN TRADITIONAL AND SIMPLIFIED CHINESE CHARACTERS AND IMPLEMENTATION OF A CONVERSION SYSTEM    

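Chinese keywords (translated):

 traditional-simplified Chinese conversion ; asymmetric traditional-simplified characters ; ambiguity resolution ; Chinese information processing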

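Chinese abstract (translated):
Chinese characters in use today come in two forms, simplified and traditional. Simplified characters are used throughout mainland China, while traditional characters are used in Hong Kong, Macao, Taiwan, and most Chinese communities in other countries and regions. The differences between the two forms create a major barrier to communication across the mainland, Hong Kong, Macao, and Taiwan, and automatic conversion between traditional and simplified Chinese by means of Chinese information processing technology is urgently needed to remove it. Traditional and simplified characters do not correspond one-to-one, which makes conversion ambiguous; this is the key difficulty of traditional-simplified conversion. Conversion also raises problems of encoding, vocabulary, and grammar. Based on a comparative analysis of the character standards used in each region, this thesis identifies 126 groups of traditional and simplified characters that do not correspond one-to-one, and builds a sentence-level parallel traditional-simplified corpus of a certain scale for these characters. Guided by HNC (Hierarchical Network of Concepts) theory, it analyzes the differences between traditional and simplified Chinese, studies the basic problems and processing strategies of conversion, and develops an intelligent conversion system with high conversion accuracy and a self-learning capability. The research covers six main aspects: (1) drawing on a large body of authentic text and organized around HNC concept categories and concept combination structures, a rule-based, character-by-character analysis of the 126 groups of characters that do not correspond one-to-one; (2) an analysis of the scope of word-level processing in conversion, the patterns of lexical difference, and the corresponding processing strategies; (3) a discussion of how to build, and what to include in, the character knowledge base and rule base that serve the conversion of these characters; (4) the compilation of a parallel traditional-simplified proper noun library to handle differences in translated names; (5) an analysis of the application requirements for a conversion system and the product forms that could be built; (6) an implementation of a traditional-simplified conversion system in C#, following object-oriented design. The system takes some common user customization needs into account. This work helps make Chinese information resources shareable and is significant for promoting the exchange and development of linguistic, cultural, and economic information within the Chinese-character cultural sphere. Some problems in conversion are not studied or solved here, such as the diachronic conversion of some characters and the conversion of uncommon words.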
English abstract:
Chinese characters currently exist in two forms: simplified characters, used in mainland China, and traditional characters, used in Hong Kong, Macao, Taiwan, and most Chinese communities in other countries and regions. This difference is a significant barrier to communication between the mainland and those areas. Removing it urgently requires a system that converts between the two forms using Chinese information processing technology.

The primary difficulty of conversion is that traditional and simplified characters do not match one-to-one, so the mapping is ambiguous. Conversion also faces other problems, such as character encoding and differences in vocabulary and even syntax.

By comparing and analyzing the character standards of various regions, this thesis identifies 126 groups of simplified and traditional characters that do not match one-to-one, and builds a sentence-level parallel corpus for those characters. Guided by the Hierarchical Network of Concepts (HNC), and through analysis of the differences between traditional and simplified Chinese and study of the core issues of conversion and their solutions, it develops an intelligent conversion system with high conversion accuracy and a self-learning function.

The work covers the following aspects:
1. Processing each character in the 126 groups of traditional and simplified characters that do not match one-to-one, based on HNC concept categories and concept combination structures and on a large body of authentic text;
2. Analyzing the scope of word-level processing in conversion, the patterns of lexical difference, and the corresponding strategies;
3. Discussing the construction method and contents of the knowledge and rule libraries for converting those 126 groups of characters;
4. Establishing a parallel traditional-simplified proper noun library to handle differences in translated names;
5. Analyzing the application requirements for the conversion system and the product forms that could be built;
6. Implementing a conversion system in C#, following object-oriented design, that takes some common user customization needs into account.

These studies help make Chinese information resources shareable and are significant for the development and exchange of culture and economy among mainland China, Hong Kong, Macao, and Taiwan.

Some problems of conversion between traditional and simplified characters are not solved in this thesis, such as diachronic conversion and the conversion of uncommon words.
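
The one-to-many ambiguity that the abstract identifies as the key difficulty can be illustrated with a minimal Python sketch. It is not the thesis's method: the thesis describes HNC-guided rules and a C# implementation, whereas this sketch swaps in a simple longest-match word lookup with a default fallback. The tables and the function name `to_traditional` are hypothetical, and the sample characters are well-known ambiguous cases chosen for illustration, not entries confirmed to be among the thesis's 126 groups.

```python
# Illustrative data only (assumed, not taken from the thesis).
# Simplified characters with a single traditional counterpart.
ONE_TO_ONE = {"头": "頭", "条": "條", "现": "現"}

# Simplified characters with several traditional candidates.
# The first candidate is used as the default when no word rule applies.
ONE_TO_MANY = {
    "发": ["發", "髮"],  # to emit / hair
    "面": ["面", "麵"],  # face, side / noodles
    "后": ["後", "后"],  # after, behind / queen
}

# Word-level rules that pick the right candidate from context.
WORD_RULES = {
    "头发": "頭髮",
    "理发": "理髮",
    "面条": "麵條",
    "皇后": "皇后",
}

MAX_WORD_LEN = max(len(word) for word in WORD_RULES)


def to_traditional(text: str) -> str:
    """Convert simplified text, preferring the longest matching word rule."""
    out = []
    i = 0
    while i < len(text):
        # Try multi-character word rules first, longest candidate first.
        for n in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in WORD_RULES:
                out.append(WORD_RULES[chunk])
                i += n
                break
        else:
            # No word rule matched: fall back to per-character mapping.
            ch = text[i]
            if ch in ONE_TO_MANY:
                out.append(ONE_TO_MANY[ch][0])
            else:
                out.append(ONE_TO_ONE.get(ch, ch))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for sample in ["头发", "发现", "面条", "后面", "皇后"]:
        print(sample, "->", to_traditional(sample))
```

A character-level default alone would turn 头发 (hair) into 頭發; the word rule is what selects 髮. A table like this cannot cover cases where the surrounding word is itself ambiguous, which is the kind of problem the thesis's knowledge base and rule base are described as addressing.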
Total number of references:

 93    

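About the author:

 Studied for a master's degree at the Institute of Chinese Information Processing, Beijing Normal University, from 2006 to 2008, graduating one year early. Published 4 academic papers during this period.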

Call number:

 硕050102/0815    

Release date:

 2008-06-18    
