Thesis Information

Title (Chinese):

 基于深度学习的人工智能生成文本检测研究 (Research on AI-Generated Text Detection Based on Deep Learning)

Name:

 Liu Hao (刘昊)

Confidentiality level:

 Public

Thesis language:

 Chinese (chi)

Discipline code:

 071201

Discipline:

 Statistics

Student type:

 Bachelor's

Degree:

 Bachelor of Science

Degree year:

 2024

Campus:

 Beijing campus

School:

 School of Statistics

First supervisor:

 Du Yonghong (杜勇宏)

First supervisor's affiliation:

 School of Statistics

Submission date:

 2024-06-08

Defense date:

 2024-05-09

Title (English):

 Research on AI-Generated Text Detection Based on Deep Learning

Keywords (Chinese):

 Deep learning; Natural language processing; Pre-trained models

Keywords (English):

 Deep learning; Natural language processing; Pre-trained models

Abstract (Chinese):

As generative large language models continue to iterate and high-performance products are widely deployed, more and more people use them to assist with writing, improving both quality and efficiency. At the same time, some people use large language models to write essays and other texts with high originality requirements, which violates the principle of fairness and creates serious difficulties for evaluators. Because these products can generate text that closely resembles human writing, determining whether a given text was generated by artificial intelligence is difficult. To limit the negative impact of large language models on education, this study targets the argumentative-essay genre and uses deep learning models for AI-generated text detection, i.e., deciding whether an argumentative essay was generated by a large language model or written by a student.

In terms of data, the machine-written texts come from generative large language models such as ChatGPT and GPT-3.5-Turbo, while the human-written texts come from the PERSUADE 2.0 corpus and the Written English Corpus of Chinese Learners. In terms of models, the study experimented at three granularities, namely article, sentence, and word level, with a fine-tuned RoBERTa model; a stacking ensemble (Stacking) of a stochastic gradient descent classifier (SGDClassifier), Light Gradient Boosting Machine (LGBM), Extreme Gradient Boosting (XGBoost), and Random Forest; a bidirectional long short-term memory network (BiLSTM); a convolutional neural network (CNN); and a hybrid neural network incorporating an attention mechanism (BiLSTM-Attention-CNN). Combined with different vectorization schemes, this yields a total of 10 experimental setups. For evaluation, models were selected via cross-validation on the training set, and the final models were applied to an out-of-distribution test set to measure the area under the receiver operating characteristic curve (AUC) and the macro-averaged F1-score.

The study reaches the following conclusions. Across granularities, word-level models perform best and are therefore well suited to AI-generated text detection. Among them, the CNN has a short training time and reaches an F1-score of 0.9892 on out-of-distribution data; the hybrid neural network performs equally well but at a higher computational cost. At the sentence level, BiLSTM outperforms CNN. At the article level, the out-of-distribution performance of the RoBERTa-Stacking model is very close to that of the fine-tuned RoBERTa model, while its training time is only 2.89% of the latter's. The study's contributions are a RoBERTa-Stacking model that combines deep learning with ensemble learning, maintaining high performance at reduced computational cost, and the use of RoBERTa sentence embeddings over an article's sentences to obtain a text matrix, an approach applicable to harder text classification problems.

Abstract (English):

With the continuous iteration of generative large language models, high-performance products have been widely deployed, and more and more people use them to assist with writing various types of text, improving quality and efficiency. At the same time, some people use large language models to write articles with high originality requirements, which violates the principle of fairness and creates great difficulties for evaluators. Because large language model products can generate articles that are very similar to human-written ones, it is difficult to determine whether a text was generated by artificial intelligence. To avoid a significant negative impact of large language models on education, this study focuses on argumentative essays and uses deep learning models for AI-generated text detection, namely determining whether an argumentative essay was generated by a large language model or written by a student.

In terms of data, the machine-written texts in this study come from large language models such as ChatGPT and GPT-3.5-Turbo, while the human-written texts come from the PERSUADE 2.0 corpus and the Written English Corpus of Chinese Learners. In terms of models, this study experimented at three levels of granularity: article level, sentence level, and word level. A fine-tuned RoBERTa model; a stacking ensemble (Stacking) of a Random Forest (RF), a Stochastic Gradient Descent classifier (SGDClassifier), Light Gradient Boosting Machine (LGBM), and Extreme Gradient Boosting (XGBoost); a Bidirectional Long Short-Term Memory network (BiLSTM); a Convolutional Neural Network (CNN); and a hybrid neural network model incorporating an attention mechanism (BiLSTM-Attention-CNN) were combined with different vectorization schemes to design a total of 10 experimental setups. In terms of evaluation metrics, models were selected through cross-validation on the training dataset, and the final model was used to evaluate the area under the receiver operating characteristic curve (AUC) and the macro-averaged F1-score on an out-of-distribution test set.

This study concludes that, across the different levels of granularity, word-level models perform best, making them most suitable for detecting texts generated by artificial intelligence. Among them, the CNN model achieved an F1-score of 0.9892 on out-of-distribution data with a short training time; the hybrid neural network model performed equally well but at a higher computational cost. At the sentence level, the BiLSTM model outperformed the CNN model. At the article level, the out-of-distribution performance of the RoBERTa-Stacking model is very close to that of the fine-tuned RoBERTa model, while its training time is only 2.89% of the latter's. The innovation of this study lies in the RoBERTa-Stacking model, which integrates deep learning and ensemble learning, maintaining high performance while reducing computational cost. In addition, using the RoBERTa model to embed an article's sentences and obtain a text matrix can be applied to more difficult text classification problems.
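
The sentence-embedding idea mentioned above can be sketched as follows. This is a minimal illustration only: the `fake_sentence_embed` helper, the 768-dimensional size, and the 32-sentence cap are assumptions for demonstration, with a deterministic random vector standing in for a real RoBERTa sentence encoder.

```python
import numpy as np

EMB_DIM, MAX_SENTS = 768, 32  # assumed sizes, not specified in the abstract

def fake_sentence_embed(sentence: str) -> np.ndarray:
    """Hypothetical stand-in for a RoBERTa sentence encoder."""
    rng = np.random.default_rng(len(sentence))  # deterministic placeholder
    return rng.standard_normal(EMB_DIM)

def article_to_matrix(sentences: list[str]) -> np.ndarray:
    """Stack one embedding per sentence, zero-padding or truncating to
    MAX_SENTS rows, yielding the fixed-shape text matrix that a
    downstream CNN or BiLSTM classifier can consume."""
    mat = np.zeros((MAX_SENTS, EMB_DIM), dtype=np.float32)
    for i, sent in enumerate(sentences[:MAX_SENTS]):
        mat[i] = fake_sentence_embed(sent)
    return mat

matrix = article_to_matrix(["First sentence.", "Second one."])
print(matrix.shape)  # (32, 768)
```

Treating each row as one sentence vector is what lets article-level models reuse architectures originally built for word-level sequences.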

Total references:

 31

Call number:

 本071201/24047

Open access date:

 2025-06-09
