Chinese title: | 基于预训练语言模型与特征融合的中文恶意网页检测技术研究 |
Name: | |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 050102 |
Discipline/major: | |
Student type: | Master's |
Degree: | Master of Arts |
Degree type: | |
Degree year: | 2022 |
Campus: | |
School: | |
Research area: | Natural language processing |
Primary advisor name: | |
Primary advisor affiliation: | |
Submission date: | 2022-06-04 |
Defense date: | 2022-05-27 |
Foreign-language title: | Research on Chinese Malicious Webpages Detection Technology Based on Pre-trained Language Models and Feature Fusion |
Chinese keywords: | 预训练语言模型 BERT ; 中文恶意网页检测数据集 ; 迁移学习 ; 特征贡献度排序 ; 网页自然语言 |
Foreign-language keywords: | Pre-trained language model "BERT" ; Chinese malicious webpages detection datasets ; Transfer learning ; Feature contribution ranking ; Webpages natural language |
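Chinese abstract (translated): |
In recent years, China's Internet has grown increasingly prosperous. By the end of December 2021, the number of Internet users in China had exceeded 1.032 billion, and the number of Chinese webpages had exceeded 335 billion. However, as the Internet has developed, many malicious webpages have also emerged. Research on automatic detection of Chinese malicious webpages helps browsers find malicious webpages efficiently, remove harmful information that endangers netizens' physical and mental health and social harmony and stability, protect netizens' personal privacy and lawful property from infringement, and safeguard cyberspace security. |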
Foreign-language abstract: |
In recent years, the Chinese Internet has grown increasingly prosperous. By the end of December 2021, the number of people in China with Internet access had risen to 1.032 billion, and the number of Chinese webpages had exceeded 335 billion. However, as the Internet has developed, a large number of malicious webpages have emerged. Improving Chinese malicious webpage detection technology can help web browsers find malicious pages efficiently, remove harmful information that may damage netizens' health and social harmony and stability, and protect netizens' privacy and lawful property from infringement, thereby safeguarding cyberspace security.
Considering that webpages contain not only URLs (Uniform Resource Locators) but also natural language text and code, this thesis proposes an integrated Chinese malicious webpage detection method based on pre-trained language models and feature fusion, and examines dataset construction, feature representation, and model design. First, for dataset construction, we collected and released an open Chinese malicious webpage detection dataset for the research community. It contains two parts: external data, consisting of URLs from malicious and benign webpages, and internal data, which contains not only URLs but also HTML and JavaScript files. We analyzed the broad types of malicious webpages and the evidence of their maliciousness with reference to relevant laws and regulations. Second, to obtain a better representation of Chinese webpages, we fine-tuned BERT on the external URL classification task and further pre-trained BERT on webpage text, producing the new models "BERT-URL" and "BERT-web-Chinese", respectively. In the external URL classification experiment, BERT reached an accuracy of 87.01%, clearly outperforming the manually designed feature template, word2vec, and fastText, and slightly outperforming RoBERTa and DistilBERT. Third, on the malicious webpage detection task using the internal data, we measured the contribution of every manually designed feature based on the Information Gain of the Random Forest algorithm, and found that some features designed for English pages cannot be transferred directly to Chinese pages. We then integrated features from the manual templates, BERT-URL, and BERT-web-Chinese; the detection F1 score reached 79.84%, an increase of 7.37% over manually designed webpage features. This experiment also demonstrated the importance of webpage natural language for the malicious webpage detection task.
Finally, we analyzed the natural language characteristics of malicious webpages and pointed out the drawbacks of existing sensitive-vocabulary resources. The work demonstrates the usefulness of BERT for malicious webpage detection and extends its application to cyberspace security and Chinese information processing. |
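The abstract mentions ranking manually designed features by information gain and then fusing several feature sources. The following is a minimal Python sketch of those two ideas, not the thesis's implementation. It computes the information gain of a single threshold split per feature rather than training a Random Forest, and it assumes fusion by plain concatenation, which the abstract does not specify. The feature names, thresholds, toy data, and embedding values are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting samples at `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def rank_features(rows, labels, names, thresholds):
    """Rank features by single-split information gain, highest first."""
    gains = {
        name: information_gain([row[i] for row in rows], labels, thresholds[i])
        for i, name in enumerate(names)
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


def fuse(*vectors):
    """Feature fusion by concatenation (an assumption, not the thesis's stated method)."""
    return [x for vec in vectors for x in vec]


# Hypothetical toy data: 1 = malicious, 0 = benign.
names = ["url_length", "script_tag_count"]
rows = [[80, 3], [75, 9], [20, 4], [25, 8]]
labels = [1, 1, 0, 0]

ranking = rank_features(rows, labels, names, thresholds=[50, 5])

# Concatenate manual features with two made-up embedding vectors that stand in
# for the outputs of a URL model and a webpage-text model.
fused = fuse(rows[0], [0.12, -0.40], [0.33])
```

In this toy data the `url_length` split separates the two classes perfectly while the `script_tag_count` split does not separate them at all, so `url_length` ranks first. A Random Forest would aggregate impurity reductions of this kind over many trees and many candidate splits.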
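Total number of references: | 74 |
Author biography: | Jiang Yanting (蒋彦廷), master's student at the Institute of Chinese Information Processing; main research interests: natural language processing, and library and information work |
Call number: | 硕050102/22008 |
Public release date: | 2023-06-04 |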