Thesis Information


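Chinese title:	网络新闻的特征分析及热度预测 (Feature Analysis and Heat Prediction of Online News)
Name:	Lin Xiaomin (林晓敏)
Confidentiality level:	Public
Thesis language:	Chinese
Discipline code:	025200
Discipline:	Applied Statistics
Student type:	Master's
Degree:	Master of Applied Statistics
Degree type:	Professional degree
Degree year:	2020
Campus:	Beijing campus
School:	School of Statistics / Institute of National Accounts
Research direction:	Applied Statistics
First supervisor:	Duan Xiaogang (段小刚)
First supervisor's affiliation:	School of Statistics, Beijing Normal University
Submission date:	2020-06-19
Defense date:	2020-05-29
English title:	THE FEATURE ANALYSIS AND HEAT PREDICTION OF NETNEWS
Chinese keywords (translated):	Online news features ; Convolutional neural network ; Machine learning algorithms ; Heat value prediction ; Ensemble learning
English keywords:	Network News Features ; Convolutional Neural Networks ; Machine Learning Algorithms ; Heat Prediction ; Integrated Learning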
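Chinese abstract (translated):	The spread of the Internet has changed how people get news. According to the 2019 Statistical Report on Internet Development in China, from December 2018 to June 2019 online news had about 686 million users, and nearly 80% of Internet users used online news applications. In the Internet era, news is more convenient to obtain, but readers also more easily waste time in the mass of news and consequently use online news applications less; it is therefore necessary to identify hot news on news sites filled with redundant information. In addition, as the number of Internet users grows, and especially with the development of self-media such as Weibo and WeChat official accounts, the Internet has gradually become a place where the public follows social issues and takes part in political discussion. Because the Internet carries large volumes of information that spread quickly, managing online public opinion demands more effort, and that management is part of social governance and bears on national security and social stability. It is therefore necessary to analyze the features of online news and understand its current state: only by identifying the features of online news and the hot-topic features that Internet users care about can public opinion management be made more targeted. The dataset comes from the author's internship employer. It aggregates news texts published in December 2019 by major sites, including Baijiahao, Tencent News, Sina News, NetEase News, Toutiao News, The Paper (澎湃新闻), People's Daily, CCTV News, and ifeng.com (凤凰网), for a total of 35,869 items. First, the data are cleaned, which covers filling in missing values and deduplication; the cleaning uses web crawling, TextRank-based keyword extraction, and the Jaccard similarity algorithm. Second, starting from the features of the news itself, category, entity, and burst features are analyzed, and analyses of time features and publishing-site features are added. Category features are extracted with a convolutional neural network, entity features with a rule- and dictionary-based method, and burst features with fuzzy matching. Finally, the number of related Weibo posts on the day a news item is published is introduced as its heat value and cross-tabulated against the five features; the distribution of heat under each feature is analyzed, which supports the feature selection and lays the groundwork for predicting heat values. After the heat value is quantified, three machine learning algorithms (k-nearest neighbors, naive Bayes, and decision tree) are used to study the effect of all features on heat and to predict it. The decision tree gives the best classification, with a recall of 73%; burst and entity features are the most important, and the model predicts high-heat and low-heat news comparatively well. To further improve prediction, an ensemble learning method based on gradient boosting decision trees is used, raising the final recall to 77%. The extracted features are somewhat effective at identifying high-heat and low-heat news, and the proposed approach to building the indicator system, which considers both the news itself and external factors when predicting heat, is feasible and reasonable.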
English abstract:	The spread of the Internet has changed the way people get news. The 2019 China Internet Development Statistics Report shows that from December 2018 to June 2019 the number of online news users was approximately 686 million, and nearly 80% of netizens used online news applications. Although the Internet has made it more convenient to obtain news, it has also made it easier for people to spend unnecessary time on the mass of news, which in turn reduces their use of online news applications. Therefore, it is necessary to find hot news on news sites that are full of redundant information. From another point of view, because the Internet transmits information quickly and in large volumes, the management of online public opinion requires more effort, and this management is a part of social management that is related to national security and social stability. With the increasing number of netizens, and especially the development of self-media such as Weibo and WeChat public accounts, the Internet has gradually developed into a place where people pay attention to society and participate in politics, and their influence on society cannot be ignored. Therefore, it is necessary to analyze the characteristics of online news and understand its current status. Only by knowing the characteristics of online news and the hot features that netizens care about can social management be more targeted. The data set of this thesis comes from the author's internship unit. The online news data are aggregated from the news of major sites, including Baijiahao, Tencent News, Sina News, NetEase News, Toutiao News, Pengpai News (The Paper), People's Daily, CCTV News and others, totaling 35,869 news items published on their homepages. First, the data are cleaned. The cleaning work includes filling in missing values and data deduplication, and it uses crawler technology, keyword extraction based on the TextRank algorithm, and the Jaccard similarity algorithm. Second, starting from the characteristics of the news itself, the thesis analyzes category, entity, and burst features; time and website features are added as well. Category feature extraction uses convolutional neural networks, entity feature extraction is based on rules and dictionaries, and burst feature extraction uses fuzzy matching. Finally, the number of related posts on Weibo is used as the heat value of a news item, and cross-statistics with the 5 features are analyzed. The distribution of news heat under each feature not only shows that the feature selection is reasonable but also lays the foundation for predicting online news heat. After the heat value is quantified, three machine learning algorithms (k-nearest neighbor, naive Bayes, and decision tree) are used to study the impact of all features on news heat and to predict it. The results show that the decision tree classifies best, with a recall rate of 73%; the burst and entity features have the highest importance, and the model predicts high and low heat well. To further improve the prediction, the thesis uses ensemble learning based on the gradient boosting decision tree algorithm to predict the heat value of news, and the recall rate of the model increases to 77%. The results show that the extracted feature variables have a certain effect on identifying news with high and low heat values, and that the proposed idea for constructing the index system, which comprehensively considers the news itself and external factors to predict news heat, is feasible and reasonable.
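Illustrative note (not part of the catalog record): the abstract names the Jaccard similarity algorithm as one tool used for deduplication but does not describe how it was applied. The sketch below shows one plausible form of that step, assuming each news item has already been reduced to a set of keywords (for example by a TextRank extractor, not shown here). The function names `jaccard` and `deduplicate` and the 0.8 threshold are hypothetical and are not taken from the thesis.

```python
from typing import Iterable, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(keyword_sets: Iterable[Set[str]], threshold: float = 0.8) -> List[int]:
    """Return the indices of items kept after dropping near-duplicates.

    An item is dropped when its Jaccard similarity to any already-kept item
    is at least `threshold` (a hypothetical cutoff, not from the thesis).
    """
    kept: List[int] = []
    kept_sets: List[Set[str]] = []
    for i, current in enumerate(keyword_sets):
        if all(jaccard(current, earlier) < threshold for earlier in kept_sets):
            kept.append(i)
            kept_sets.append(current)
    return kept


if __name__ == "__main__":
    docs = [
        {"weibo", "news", "heat", "prediction"},
        {"weibo", "news", "heat", "prediction", "model"},  # shares 4 of 5 keywords with the first
        {"sports", "final", "score"},
    ]
    print(deduplicate(docs))  # indices of the items that survive deduplication
```

This greedy pass compares each item with every item already kept, so it is quadratic in the number of items; for a corpus of tens of thousands of items a real pipeline would probably block or hash candidates first.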
Total references:	37
Call number:	硕025200/20026
Open access date:	2021-06-19
