Thesis information

Chinese title:

 基于机器学习方法下的用户留存率分析 (Analysis of User Retention Rate Based on Machine Learning Methods)

Name:

 Ma Zheng (马铮)

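Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 025200

Discipline:

 Applied Statistics

Student type:

 Master's

Degree:

 Master of Applied Statistics

Degree type:

 Professional degree

Degree year:

 2022

Campus:

 Zhuhai campus

School:

 School of Statistics

Research direction:

 Machine learning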

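First supervisor name:

 Tang Jun (唐军)

First supervisor affiliation:

 School of Statistics, Beijing Normal University

Submission date:

 2022-06-16

Defense date:

 2022-05-26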

English title:

 Analysis of user retention rate based on machine learning method    

Chinese keywords:

 Machine learning ; Retention rate analysis ; Logistic model ; Random forest model ; XGBoost model ; Stacking

English keywords:

 Machine learning ; Retention rate analysis ; Logistic model    

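Abstract (translated from the Chinese):
With the rapid development of the digital economy, banks face new difficulties alongside new opportunities. On the one hand, the customer base keeps growing, which is favorable for bank profitability; on the other hand, competition among banks is becoming increasingly fierce. Preventing customer churn and raising customer retention is therefore a business priority for banks. Analyzing bank customer retention effectively with machine learning models is thus of real significance for banks facing an increasingly competitive market. Many experts and scholars have already obtained results in predicting bank customer churn, but most of them train the prediction model with a single machine learning model, and the use and study of combined models remains scarce.
The data studied in this thesis is a bank customer dataset publicly available on the data competition platform Kaggle. In preprocessing, the categorical independent variables were encoded with both ordinal encoding and one-hot encoding, and the relationships between some independent variables and the dependent variable were then explored through charts. For predicting bank customer churn with structurally simple single models, the thesis tries in turn the Logistic regression model, the random forest model, and the XGBoost model. It first summarizes the theory behind each type of single model, then trains each of the three models on the ordinal-encoded training data and on the one-hot-encoded training data, uses the resulting models to predict churned customers on the test set, and finally judges the models with relevant evaluation metrics. All things considered, the XGBoost model is the best-performing single model and predicts churned customers most accurately.
The thesis also studies ensemble models built on the Logistic regression, random forest, and XGBoost models, establishing two ensemble models with the Stacking ensemble learning method and comparing their performance. In brief, the first ensemble model uses the single Logistic regression model and the single XGBoost model as primary classifiers and a Logistic regression model as the secondary classifier; the second uses the random forest model and the XGBoost model as primary classifiers, again with a Logistic regression model as the second-layer secondary classifier. Comparing the experimental results shows that the first, structurally simpler ensemble, when trained on one-hot-encoded data, predicts customer churn most accurately and outperforms all of the single models.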
English abstract:
With the rapid development of the digital economy, banks face both new opportunities and new challenges. On the one hand, the customer base continues to expand, which is favorable for bank profitability; on the other hand, competition between banks has become increasingly fierce. How to avoid customer churn and improve the retention rate of bank customers is a matter banks currently need to focus on. It is therefore of great significance for banks to predict the loss of credit card customers and improve customer retention through machine learning methods. Many experts and scholars have already achieved results in the field of predicting bank customer churn. However, most of the machine learning methods they use train a single prediction model, and the use and study of combined models is still scarce.
The research data of this paper is bank customer data released on the Kaggle data platform in November 2020. In data preprocessing, the categorical variables are encoded with both ordinal encoding and one-hot encoding, and the relationship between some variables and customer churn is then studied through data visualization. The Logistic regression model, the random forest model and the XGBoost model are used in turn as single models to predict bank customer churn. First, the relevant theory of each single model is summarized. Then the three single models are each trained on the ordinal-encoded training data and on the one-hot-encoded training data, and the trained models are used to predict customer churn on the test set. Finally, the models are judged by the relevant evaluation metrics. Overall, the XGBoost model is the best-performing single model and predicts churned customers most accurately.
This paper also establishes combined models based on the logistic regression model, the random forest model and the XGBoost model, constructing two combined models with different structures by using the stacking ensemble learning method. Specifically, the first combined model selects the logistic regression model and the XGBoost model as the primary classifiers of the first layer and a logistic regression model as the secondary classifier of the second layer; the second combined model selects the XGBoost model and the random forest model as the primary classifiers of the first layer and a logistic regression model as the secondary classifier of the second layer. The experimental comparison shows that the first, structurally simpler combined model, trained on one-hot-encoded data, predicts churn most accurately and exceeds the prediction performance of all single models.
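The two encodings named in the abstract (ordinal and one-hot) can be illustrated with a minimal sketch. The column `education`, its values, and the level ordering are hypothetical and are not taken from the thesis data; the helper functions are illustrative, not the author's implementation.

```python
# Hypothetical categorical column; values are illustrative only.
education = ["Graduate", "High School", "Graduate", "College"]

# Assumed ordering of the levels, lowest to highest.
levels = ["High School", "College", "Graduate"]


def ordinal_encode(values, order):
    """Map each category to its integer position in `order`."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


def one_hot_encode(values, categories):
    """Map each category to a 0/1 indicator row, one column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


ordinal = ordinal_encode(education, levels)
one_hot = one_hot_encode(education, levels)

print(ordinal)
print(one_hot)
```

Ordinal encoding keeps one column but imposes an order on the categories, whereas one-hot encoding adds one column per category and imposes no order. This difference is one plausible reason a model can behave differently under the two schemes.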
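Total number of references:

 21

About the author:

 Ma Zheng (马铮), Master of Applied Statistics, School of Statistics, Beijing Normal University

Holding location:

 Main Library, B301

Call number:

 硕0714Z2/22022Z

Open-access date:

 2023-06-16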
