Thesis information

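Chinese title:	基于预训练语言模型与特征融合的中文恶意网页检测技术研究
Name:	Jiang Yanting (蒋彦廷)
Confidentiality level:	Public
Thesis language:	Chinese
Discipline code:	050102
Discipline:	Linguistics and Applied Linguistics
Student type:	Master's
Degree:	Master of Arts
Degree type:	Academic degree
Degree year:	2022
Campus:	Beijing campus
School:	School of International Chinese Language Education (国际中文教育学院)
Research area:	Natural language processing
First supervisor:	Hu Renfen (胡韧奋)
First supervisor's affiliation:	School of International Chinese Language Education, Beijing Normal University
Submission date:	2022-06-04
Defense date:	2022-05-27
English title:	Research on Chinese Malicious Webpages Detection Technology Based on Pre-trained Language Models and Feature Fusion
Chinese keywords (translated):	Pre-trained language model BERT ; Chinese malicious webpage detection dataset ; Transfer learning ; Feature contribution ranking ; Webpage natural language
English keywords:	Pre-trained language model "BERT" ; Chinese malicious webpages detection datasets ; Transfer learning ; Feature contribution ranking ; Webpages natural language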
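Chinese abstract (translated):	︿In recent years, the Chinese Internet has grown increasingly prosperous. By the end of December 2021, the number of Internet users in China had exceeded 1.032 billion, and the number of Chinese webpages had exceeded 335 billion. As the Internet has developed, however, many malicious webpages have also emerged. Research on the automatic detection of Chinese malicious webpages helps browsers find malicious pages efficiently, remove harmful information that endangers users' physical and mental health and social harmony and stability, protect users' personal privacy and lawful property from infringement, and safeguard cyberspace security. Considering that a webpage is not only identified by a URL (Uniform Resource Locator) but also contains rich natural language and code, this thesis proposes a Chinese malicious webpage detection method based on pre-trained language models and feature fusion, and explores it fairly systematically in terms of resource construction, feature representation, and model design. First, for resource construction, we collected and released a Chinese malicious webpage detection dataset consisting of external data and internal data: the former contains the URLs of malicious and benign webpages; the latter contains not only webpage URLs but also the pages' HTML and JavaScript files. On this basis, the thesis further analyzes the broad categories of malicious webpages and the legal and regulatory grounds for deeming them malicious. Second, to obtain effective webpage feature representations, we fine-tuned an English BERT model on the external-data URL classification task and further pre-trained a Chinese BERT model on webpage natural-language text, producing the new models "BERT-URL" and "BERT-web-Chinese", respectively. Experiments show that the pre-trained BERT model achieved 87.01% accuracy on the binary classification of external URLs, clearly outperforming baselines such as manual feature templates, word2vec, and fastText, and slightly outperforming the pre-trained models RoBERTa and DistilBERT. Third, for malicious webpage detection on the internal data, we first measured and analyzed the contribution of each manually designed feature using the random forest information gain method, and found that some features that earlier work designed for English webpages should not simply be copied over to Chinese webpages. We then fused features from the manual feature templates, BERT-URL, and BERT-web-Chinese to detect malicious webpages on the internal dataset; this method reached an F1 score of 79.84%, 7.37% higher than the feature-template-based method. The experiments also demonstrated the importance of webpage natural language to malicious webpage detection. Finally, the thesis analyzes the natural-language characteristics of malicious webpages and points out the shortcomings of the sensitive-word lists currently available online. This indirectly demonstrates the effectiveness of pre-trained language models for malicious webpage detection and extends the application scenarios of the BERT language model in cybersecurity and Chinese information processing.﹀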
English abstract:	︿ In recent years, the Chinese Internet has grown increasingly prosperous. By the end of December 2021, the number of Internet users in China had reached 1.032 billion, and the number of Chinese webpages had exceeded 335 billion. However, with the development of the Internet, a large number of malicious webpages have emerged. Improving Chinese malicious webpage detection technology can help web browsers find malicious pages efficiently, remove harmful information that may damage netizens' health and social harmony and stability, and protect netizens' privacy and lawful property from infringement, thereby defending cyberspace security. Considering that webpages include not only URLs (Uniform Resource Locators) but also rich natural language and code, this thesis proposes an integrated Chinese malicious webpage detection method based on pre-trained language models and feature fusion, and examines dataset construction, feature representation, and model design. First, for dataset construction, an open Chinese malicious webpage detection dataset was collected and released for the research community. It contains two parts: external data and internal data. The former consists of URLs of malicious and benign webpages, and the latter contains not only URLs but also HTML and JavaScript files. We analyzed the broad types of malicious webpages and the grounds for deeming them malicious under relevant laws and regulations. Second, to obtain a better representation of Chinese webpages, we fine-tuned BERT on the external URL classification task and further pre-trained BERT on webpage text, producing the new models "BERT-URL" and "BERT-web-Chinese", respectively. The external URL classification experiment showed that BERT reached an accuracy of 87.01%, clearly superior to the manually designed feature template, word2vec, and fastText models, and slightly superior to RoBERTa and DistilBERT.
Third, on the malicious webpage detection task over the internal data, we measured the contribution of every manually designed feature based on the information gain of the random forest algorithm, and found that some features designed for English pages cannot be copied directly to Chinese pages. We then integrated features from manual templates, BERT-URL, and BERT-web-Chinese, and the detection F1 score reached 79.84%, an increase of 7.37% over manually designed webpage features. This experiment also demonstrated the importance of webpage natural language for the malicious webpage detection task. Finally, we analyzed the natural-language characteristics of malicious webpages and pointed out the drawbacks of existing sensitive-vocabulary resources. This indirectly supports the usefulness of BERT for malicious webpage detection and extends the application of BERT to cyberspace security and Chinese information processing. ﹀
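The feature contribution ranking mentioned in the abstract can be illustrated with a small sketch. This is a simplified stand-in, not the thesis's procedure: it computes the entropy-based information gain of a single split on each discrete feature rather than training a random forest, and the feature names and toy data below are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the label subsets after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Return (feature name, information gain) pairs, highest gain first."""
    gains = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical binary features for six toy pages (1 = malicious, 0 = benign).
FEATURE_NAMES = ["has_ip_in_url", "has_hidden_iframe", "url_is_long"]
ROWS = [
    (1, 1, 0),
    (1, 0, 1),
    (1, 1, 1),
    (0, 0, 0),
    (0, 0, 1),
    (0, 1, 0),
]
LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    for name, gain in rank_features(ROWS, LABELS, FEATURE_NAMES):
        print(f"{name}: {gain:.4f}")
```

In this toy data the first feature separates the two classes perfectly, so it ranks first; a feature whose gain is close to zero on Chinese pages would be a candidate for the kind of feature that does not transfer from English pages.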
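Total references:	74
About the author:	Jiang Yanting is a master's student at the Institute of Chinese Information Processing. His main research interests are natural language processing and library and information work.
Call number:	硕050102/22008
Open-access date:	2023-06-04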
