Chinese Title: | 面向汉语二语学习的句子偏误自动纠正研究 |
Author: | |
Confidentiality Level: | Public |
Thesis Language: | Chinese |
Discipline Code: | 050102 |
Discipline: | |
Student Type: | Doctoral |
Degree: | Doctor of Literature |
Degree Type: | |
Degree Year: | 2019 |
Campus: | |
School: | |
Research Area: | Chinese Information Processing |
Primary Supervisor: | |
Supervisor's Institution: | |
Submission Date: | 2019-06-20 |
Defense Date: | 2019-06-06 |
English Title: | Automatic Sentence Error Correction for Learning Chinese as a Second Language |
Chinese Keywords: | |
Chinese Abstract: |
In recent years, as more and more people study Chinese as a second language, computer-assisted instruction for Chinese L2 learners has drawn growing attention from researchers, and automatic correction of errors in Chinese sentences is one of its important tasks. Most existing studies, however, concentrate on building neural network models and pay little attention to the characteristics of Chinese itself or to the regularities of Chinese learner errors. Starting from an analysis of Chinese interlanguage corpora, this thesis therefore studies the automatic correction of character-level, word-level, and sentence-level errors in Chinese sentences. The main research contents are as follows:
First, a detailed statistical analysis of sentence errors was conducted on the HSK Dynamic Composition Corpus. The analysis shows that character-level and word-level errors occur with very high frequency (87.65%), that most of them involve the same set of characters and words, that the overlap rate among errors is high, and that the same errors recur many times. Starting from characters and words, this thesis therefore induces two dictionaries from the corpus: a context-free dictionary and a context-sensitive dictionary. The context-free dictionary corrects erroneous words directly without considering context; the context-sensitive dictionary provides a high-frequency candidate list for each target word, so that words can be corrected according to contextual information.
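The two-dictionary lookup described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: all dictionary entries are invented examples, and `lm_score` is an assumed language-model scorer used to choose among context-sensitive candidates.

```python
# Context-free dictionary: an erroneous form maps to one correction
# regardless of surrounding text (entries here are illustrative).
context_free = {"因该": "应该", "做业": "作业"}

# Context-sensitive dictionary: each target word carries a list of
# high-frequency candidate corrections; context decides which applies.
context_sensitive = {"在": ["再", "在"], "做": ["作", "做"]}

def correct_token(token, left, right, lm_score):
    """Correct one token with the two dictionaries.

    lm_score(candidate_sentence) -> float is a hypothetical scorer;
    higher means the candidate sentence is more fluent.
    """
    if token in context_free:
        return context_free[token]            # unconditional replacement
    if token in context_sensitive:
        candidates = context_sensitive[token]
        # Pick the candidate whose resulting sentence scores highest.
        return max(candidates, key=lambda c: lm_score(left + c + right))
    return token                              # no known error: keep as-is
```

A context-free entry fires immediately, while a context-sensitive entry defers to the scorer, matching the division of labor between the two dictionaries.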
Second, this thesis studies a multi-layer character and word representation that fuses semantic and word-order information. Contextual information, word-order information, and random vector permutation operations are used to learn the contextual and ordering properties of characters and words from a large-scale corpus and fuse them into the word vectors. To verify the effectiveness of these vectors, we learned character and word vectors on a small corpus and inspected the results, then used a word-ordering task to examine whether the vectors capture the word-order information of sentences well.
Third, research on sentence acceptability prediction was carried out. Following the English acceptability evaluation corpus, this thesis constructed a test corpus for Chinese acceptability prediction on an interlanguage corpus. Sentence acceptability research can be used to score generated sentences so that they come closer to the level acceptable to native speakers. This thesis finds that a high-order n-gram language model with a good smoothing algorithm, together with spelling errors and word frequency, can predict sentence acceptability well. It also studies the relationship between language-model perplexity and sentence acceptability, finding that even a language model whose perplexity on the test corpus is unimpressive may still predict sentence acceptability.
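The feature recipe above (smoothed n-gram LM score, spelling errors, word frequency) can be sketched as a feature extractor. All three inputs are assumed interfaces, not the thesis components, and whitespace tokenization is a simplification for the example; the resulting features would feed a simple regressor or classifier that outputs the acceptability score.

```python
import math

def acceptability_features(sentence, ngram_logprob, spelling_errors, freq):
    """Hypothetical feature vector for acceptability prediction.

    ngram_logprob(words) -> float: log-probability from a smoothed
        high-order n-gram language model (assumed interface).
    spelling_errors(sentence) -> int: count of character/spelling errors.
    freq: dict mapping word -> corpus frequency.
    """
    words = sentence.split()                         # simplification: space-split
    lp = ngram_logprob(words) / max(len(words), 1)   # length-normalized LM score
    errs = spelling_errors(sentence)                 # character-level error count
    min_freq = min(freq.get(w, 1) for w in words)    # frequency of the rarest word
    return [lp, errs, math.log(min_freq)]
```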
Fourth, this thesis builds a sentence error correction model that combines the regularities of Chinese learner errors with sentence acceptability. First, the target sentence is corrected by an improved Transformer model; to raise the model's precision and recall, error-free sentences and classification information are added to the correction model, enabling it to learn the regularities of Chinese sentences better. Second, character-level and word-level errors in the sentence are corrected using the context-free and context-sensitive dictionaries. Finally, the whole model ranks the output candidate sentences by sentence acceptability. Experiments show that this model, which fuses error regularities with sentence acceptability, corrects errors in Chinese sentences well.
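The three-stage design just described can be sketched as a pipeline. The three stage functions are assumed interfaces standing in for the thesis components, not actual code from the work.

```python
def correct_sentence(sentence, transformer_correct, dict_correct, acceptability):
    """Hypothetical three-stage correction pipeline.

    transformer_correct(s) -> list of candidate corrected sentences,
    dict_correct(s)        -> s with character/word errors fixed via the
                              context-free and context-sensitive dictionaries,
    acceptability(s)       -> float score, higher = more acceptable.
    """
    candidates = transformer_correct(sentence)          # stage 1: seq2seq correction
    candidates = [dict_correct(c) for c in candidates]  # stage 2: dictionary fixes
    return max(candidates, key=acceptability)           # stage 3: acceptability reranking
```

Keeping the stages as separate functions mirrors the description: the neural model proposes, the dictionaries repair residual character/word errors, and the acceptability model arbitrates.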
The main features and innovations of this study:
First, a sentence acceptability algorithm for error analysis is proposed. Existing sentence acceptability conversion methods are improved, and the acceptability model is studied from the angles of word compatibility, word order, and language models. The study finds that a high-order n-gram language model with a good smoothing algorithm, together with character-level errors and word frequency, predicts sentence acceptability well. On the existing GUG test set the model achieves a correlation of r = 0.522, five percentage points above the previous best result (0.472).
Second, a deep-learning-based automatic sentence error correction model is proposed. In this model, this thesis is the first in the automatic error correction task to introduce native Chinese corpora (literary texts) and to incorporate sentence classification information (acceptable or unacceptable) into Transformer training for error correction, whereas previous studies ignored the large contribution that native-speaker data make to sentence error correction. Experiments show that after the native corpus and classification information are introduced, the model's precision rises sharply, from 9.7% to 48%.
Third, a multi-feature automatic sentence error correction method is proposed, fusing a deep learning algorithm, the regularities of Chinese character and word errors, and the sentence acceptability model. On top of the deep learning algorithm, the error regularities are used to correct sentences further, and the acceptability model ranks the results. The results show that the error regularities further improve the precision and recall of the deep learning model, raising F1 from 29.97% to 31.86%, and that the acceptability model raises the binary classification accuracy of sentences (acceptable or unacceptable) from 87.9% to 90.1%.
|
English Abstract: |
In recent years, with more and more people learning Chinese as a second language, Chinese sentence error correction has attracted many researchers. However, many studies focus on training a good deep learning system rather than on Chinese itself and the regularities of Chinese sentence errors. Starting from a study of a Chinese interlanguage corpus, this thesis focuses on correcting character errors, word errors, and grammatical errors in essays written by foreign learners.
We first carry out a statistical analysis of the HSK dynamic composition corpus. We find that character-level and word-level errors occur with extremely high frequency, and that most of them cluster on the same set of similar characters and words. We therefore extract two vocabularies from the corpus for character and word correction: a context-free vocabulary and a context-related vocabulary. The context-free vocabulary can be applied to character errors directly, while the context-related vocabulary corrects certain word errors using their context in the sentence.
To learn a good word representation, this thesis combines semantic and syntactic information using context information, order information, circular convolution, and random permutation on a large-scale Chinese corpus. We first train small character and word embeddings on a small corpus to inspect their behavior, then evaluate the embeddings on a Chinese analogy reasoning task and a word-order task to check whether they capture semantic and order information.
Based on the HSK dynamic composition corpus, we build a CGUG test set, analogous to the English GUG data, for Chinese sentence acceptability. Studying sentence acceptability helps score sentences produced by a language generation system and brings them closer to human language. This thesis finds that a high-order n-gram language model with good smoothing algorithms, spelling errors, and word frequency can predict the acceptability of sentences well. It also studies the relationship between the perplexity of a language model and sentence acceptability, finding that even a language model with mediocre perplexity can still predict sentence acceptability well.
Finally, this thesis constructs a Chinese sentence error correction model that combines the regularities of Chinese errors with sentence acceptability through an encoder-decoder architecture. Before the character and word errors are corrected, the sentence is translated into a corrected sentence by an improved Transformer model. To improve the model's recall, correct sentences are added to the translation model, which helps it learn Chinese well. Sentence acceptability is then used to rank the model's outputs. Experiments show that this model corrects errors in Chinese sentences well.
The contributions and innovations of this study are mainly the following. First, the study proposes sentence acceptability for error analysis. After exploring sentence acceptability from the aspects of word compatibility, word-order information, and language models, we find that a high-order n-gram language model with good smoothing algorithms, spelling errors, and word frequency predicts the acceptability of sentences well. We obtain a correlation of 0.522 on the GUG test data, compared with the state-of-the-art result of 0.472.
Second, the study proposes a sentence error correction model based on deep learning. We are the first to introduce a correct native Chinese corpus and classification information (acceptable or unacceptable) into a Transformer model for the error correction task, factors that previous researchers had ignored. Experiments show that the native corpus and classification information contribute substantially to the model, improving precision from 9.7% to 48%.
Third, we explore a multi-strategy method for sentence error correction that combines deep learning, the regularities of Chinese errors, and sentence acceptability. After applying the Transformer model, we correct character and word errors using the error regularities and rank the candidate sentences by sentence acceptability. Experiments show that this method achieves a good result, improving the F1 score from 29.97% to 31.86%.
|
Number of References: | 88 |
Shelving Location: | Thesis Reading Area, Library (Main Library, South Wing, 3rd floor, Sections B-C) |
Call Number: | 博050102/19005 |
Open Access Date: | 2020-07-09 |