Thesis Information

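Chinese title:	基于机器学习的个人信用评分模型的研究
Author:	Lu Zhongxin (陆忠信)
Confidentiality level:	Public
Thesis language:	chi
Discipline code:	025200
Discipline:	Applied Statistics
Student type:	Master's
Degree:	Master of Applied Statistics
Degree type:	Professional degree
Degree year:	2024
Campus:	Zhuhai campus
School:	School of Statistics
Research area:	Machine learning
Primary supervisor:	Tang Jun (唐军)
Primary supervisor's affiliation:	School of Statistics
Submission date:	2024-06-04
Defense date:	2024-05-22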
Foreign-language title:	RESEARCH ON PERSONAL CREDIT SCORING MODEL BASED ON MACHINE LEARNING
Chinese keywords:	金融风控 ; 个人信用评分 ; 集成学习 ; 不平衡数据 ; 随机欠采样 ; XGBoost ; Logistic 回归
Foreign-language keywords:	Financial risk control ; Personal credit scoring ; Ensemble learning ; Imbalanced data ; Random undersampling ; XGBoost ; Logistic regression
Chinese abstract:	经济的发展离不开信用，市场经济的本质就是信用经济。实践证明，诚实守信不是每个人与生俱来的品质，现实中存在很多不诚信行为。2013 年我国最高人民法院颁布《关于公布失信被执行人名单信息的若干规定》之后，截至 2024 年，失信被执行人数量已经超过 857 万人。而银行等金融机构如果将贷款放给失信被执行人毫无疑问会造成损失。大部分金融机构为了避免损失，会对申请贷款的客户的各项条件进行数据分析，以判断客户的违约风险以决定是否放贷。传统的个人信用评分模型建立在统计学基础之上，而机器学习的大火让个人信用模型有了进一步发展。本文针对发展个人信用评分模型的现实要求，查询了解国内外相关文献之后介绍了个人信用评分模型的发展历程，推导了多种机器学习算法原理以及多种评价指标。在实际多面对不平衡数据集时，介绍 SMOTE 采样和随机欠采样的方法以处理数据的不平衡性。实证上选用 Kaggle 网站上信贷逾期的公开数据集，以“是否违约”为响应变量构建二分类模型。首先对数据的缺失值和异常值进行了处理，然后针对数据的不平衡性使用随机欠采样的方式进行处理。对于预处理之后的数据利用 Python 编程语言，先后选择了 Logistic 回归、朴素 Bayes、梯度提升、随机森林和 XGBoost 进行模型构建。在多种评价指标的比较下验证了集成学习模型预测效果要比单一模型好，但是单一模型在个别指标上更优秀，要结合实际业务需求选择合适模型。并且发现不同集成学习算法特征重要度有较大差异，对应于实际工作时，不同模型所看重的指标有所不同。最后对银行等金融机构在个人信用模型构建方面提出现实可行的建议。
Foreign-language abstract:	Credit is indispensable to economic development, and a market economy is in essence a credit economy. Experience shows that honesty and trustworthiness are not innate in everyone, and dishonest behavior is common in practice. Since the Supreme People's Court of China issued the "Several Provisions on Publishing Information on the List of Dishonest Judgment Debtors" in 2013, the number of dishonest judgment debtors has exceeded 8.57 million as of 2024. Banks and other financial institutions will undoubtedly incur losses if they lend to such debtors. To avoid losses, most financial institutions analyze data on loan applicants to assess their default risk and decide whether to lend. Traditional personal credit scoring models are built on statistics, and the rapid rise of machine learning has allowed these models to develop further. Motivated by the practical need to develop personal credit scoring models, this thesis reviews the relevant literature from China and abroad, introduces the development history of personal credit scoring models, and derives the principles of several machine learning algorithms and several evaluation metrics. Because datasets encountered in practice are often imbalanced, SMOTE sampling and random undersampling are introduced as methods for handling data imbalance. Empirically, a public dataset on overdue credit from the Kaggle website is used to build a binary classification model with "whether the customer defaults" as the response variable. Missing values and outliers are handled first, and the imbalance is then addressed by random undersampling. On the preprocessed data, logistic regression, naive Bayes, gradient boosting, random forest, and XGBoost models are built in turn using Python. A comparison across multiple evaluation metrics verifies that ensemble learning models predict better than single models overall, although single models perform better on individual metrics, so the model should be chosen according to actual business requirements. Feature importance is also found to differ considerably across ensemble learning algorithms, which means that in practice different models emphasize different indicators. Finally, practical and feasible suggestions are offered to banks and other financial institutions on building personal credit models.
Total references:	36
About the author:	Student at Beijing Normal University
Holding location:	Main Library B301
Call number:	硕025200/24035Z
Open-access date:	2025-06-05
