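Chinese title: | 中文文学作品中字和词的统计研究 (A Statistical Study of Characters and Words in Chinese Literary Works) |
Name: | |
Confidentiality level: | Public |
Discipline code: | 071101 |
Discipline: | |
Student type: | Master's |
Degree: | Master of Science |
Degree year: | 2010 |
Campus: | |
School: | |
Research direction: | Systems theory |
Primary advisor name: | |
Primary advisor affiliation: | |
Submission date: | 2010-06-28 |
Defense date: | 2010-05-27 |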
Foreign-language title: | STUDY AND COMPARISON OF STATISTICAL PROPERTIES OF CHINESE CHARACTERS AND WORDS |
Chinese abstract: |
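Language is humanity's most important tool of communication, the tangible carrier of human thought and the principal means of transmitting information, and the main form in which people accumulate and preserve knowledge. Scholars have long studied Western symbolic languages extensively, but statistical analysis of Chinese remains insufficient. From the perspective of complexity research, this thesis uses statistical analysis and computer simulation to examine the statistics of words and characters in corpora built from works of different periods and from different works by the same author. It collects corpora of Tang poetry, Song ci, Yuan qu, Ming and Qing novels, and modern popular literature, and analyzes character usage in literary works since the Tang dynasty. The results show that the character-frequency distributions of all periods follow a common functional form, and that over time the power-law character weakens while the exponential character strengthens. The thesis then studies the mechanism of text generation at the character level. Following the "meta-book" model proposed by Sebastian Bernhardsson, which resembles "neutral theory", it verifies two main properties of the Chinese character-frequency distribution: high-frequency characters are uniformly distributed, and the total number of distinct characters is limited. These match the conclusions Bernhardsson reached for English. Computer simulations on the widely read wuxia novels of Jin Yong and romance novels of Qiong Yao confirm that the text-generation mechanism of the meta-book model also applies to Chinese; that is, each author has his or her own characteristic meta-book when writing. Because classical Chinese mostly conveys meaning through single characters, multi-character words are uncommon in poetry and similar works, and it is often hard to judge whether a given expression should be split into characters. The word-frequency analysis is therefore based mainly on the two corpora of Jin Yong's wuxia novels and Qiong Yao's romance novels. Both corpora show power-law word-frequency distributions, and the relationship between vocabulary size and total word count follows Heaps' Law, i.e. a clear power-law relationship (linear on log-log axes). Finally, building on the work of Dr. Wang Dahui and Bernhardsson, we construct an extended meta-book model whose main feature is random selection of words from a continually growing lexicon. The simulation results agree very well with the measured statistics, and the model gives a fairly strong account of how words appear in a given novel and how their statistical properties evolve as the text grows.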
|
Foreign-language abstract: |
Language is humanity's most important tool of communication, the tangible carrier of human thought, the main channel of information transfer, and the main form in which knowledge is accumulated and preserved. Scholars have long done a great deal of research on Western symbolic languages, yet statistical analysis of Chinese remains incomplete. From the perspective of complexity research, this thesis uses statistical analysis and computer simulation to study the statistics of characters and words in corpora drawn from different periods and from different works by the same author. It collects corpora of Tang poetry, Song ci, Yuan qu, Ming and Qing fiction, and modern popular literature, and analyzes the character-frequency distributions of literature since the Tang dynasty. The results show that the character-frequency distributions of all periods follow a common functional form, with the power-law feature weakening and the exponential feature strengthening over time. The thesis then examines the text-generation mechanism at the character level. Following the meta-book model proposed by Sebastian Bernhardsson, which resembles "neutral theory" in ecology, it verifies two basic features of the character-frequency distribution: high-frequency characters are uniformly distributed, and the number of distinct characters is limited. This is the same conclusion Bernhardsson reached for English. We study the romance novels of Qiong Yao and six of the longer martial arts (wuxia) novels of Jin Yong, and find that the text-generation mechanism described by the meta-book model also applies to Chinese literature, meaning that each writer has his or her own meta-book when writing.
Classical Chinese poetry and other literary works mostly use the single character as the basic unit of meaning, unlike modern literature, and it is often hard to judge whether a word should be split into characters. This thesis therefore analyzes word-frequency distributions only for modern literature, focusing on the novels of Jin Yong and Qiong Yao. We find that the word-frequency distributions of both corpora obey a power law, and that the relationship between vocabulary size and total number of words follows Heaps' Law, i.e. a clear power-law relationship (linear on log-log axes). Finally, building on models by Dr. Wang Dahui and Bernhardsson, we establish an extended meta-book model whose main feature is random selection of words from a growing lexicon. The computer simulation results match the measured statistics well, indicating that the model gives a fairly strong account of word-level text generation.
|
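The Heaps' Law relationship mentioned in the abstract (vocabulary size V growing as a power of total word count N, V ≈ K·N^β) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the function names `vocabulary_growth`, `fit_heaps`, and `toy_text` are hypothetical, the fit is an ordinary least-squares line on log-log values, and `toy_text` is a simple stand-in generator (new tokens introduced with probability n^-0.5), not the extended meta-book model.

```python
import math
import random


def vocabulary_growth(tokens):
    """Return (n, v) pairs: after the first n tokens, v distinct tokens were seen."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth


def fit_heaps(growth):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta).

    Needs at least two points with different n.
    """
    xs = [math.log(n) for n, _ in growth]
    ys = [math.log(v) for _, v in growth]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(my - beta * mx)
    return k, beta


def toy_text(length, seed=0):
    """Hypothetical toy generator (NOT the thesis model).

    At step n a brand-new token appears with probability n ** -0.5;
    otherwise an earlier token is repeated, chosen uniformly from the text so far.
    """
    rng = random.Random(seed)
    tokens = []
    next_id = 0
    for n in range(1, length + 1):
        if not tokens or rng.random() < n ** -0.5:
            tokens.append(next_id)
            next_id += 1
        else:
            tokens.append(rng.choice(tokens))
    return tokens


if __name__ == "__main__":
    k, beta = fit_heaps(vocabulary_growth(toy_text(5000)))
    print(f"K = {k:.2f}, beta = {beta:.2f}")
```

For a real corpus, the token list would come from a word segmenter (for word-level statistics) or from the raw character sequence (for character-level statistics); a fitted β below 1 indicates sublinear vocabulary growth.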
Total number of references: | 33 |
Call number: | 硕071101/1008 |
Release date: | 2010-06-28 |