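Chinese title: | 基于机器学习的个人信贷违约风险预测与模型解释 (Personal Credit Default Risk Prediction and Model Interpretation Based on Machine Learning) |
Name: | |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 025200 |
Discipline / major: | |
Student type: | Master's |
Degree: | Master of Applied Statistics |
Degree type: | |
Degree year: | 2024 |
Campus: | |
School: | |
Research direction: | Data Science and Management |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2024-06-15 |
Defense date: | 2024-05-25 |
Foreign-language title: | Personal Credit Default Risk Prediction Based on Machine Learning and Model Interpretation |
Chinese keywords: | Personal credit default risk prediction ; Machine learning ; Stacking model ; LIME algorithm |
Foreign-language keywords: | Personal Credit Default Risk Prediction ; Machine Learning ; Stacking Model ; LIME Algorithm |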
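Chinese abstract: |
In recent years, as the economy has developed and the public's demand for financial services has grown, more and more people have chosen credit loans, and personal credit business has kept expanding. As the credit market expands, however, potential financial risk grows with it. At the end of 2023, the non-performing loan balance of banking financial institutions stood at 3.95 trillion yuan, up 149.5 billion yuan from the beginning of the year, and the non-performing loan ratio was 1.62%. Accurately predicting personal credit default risk has therefore become an active research topic. Better prediction not only reduces the losses of banking financial institutions but also helps avoid the serious social harm that large-scale defaults would cause. This thesis studies 800,000 real loan records from a credit platform, published on Alibaba Cloud Tianchi. It first carries out a descriptive statistical analysis, exploring how the quantitative variables are distributed among non-defaulting and defaulting borrowers and how the default rate varies across the values of each qualitative variable. It then preprocesses the data: handling date fields, outliers, and missing values; one-hot encoding the qualitative variables; standardizing the quantitative variables; and applying undersampling to address class imbalance. Next, it builds personal credit default risk models: Logistic, XGBoost, CatBoost, and Random Forest models, plus a Stacking model that uses XGBoost, CatBoost, and Random Forest as base models and a Logistic model as the meta-model. When tuning the hyperparameters of the XGBoost model, grid search and Bayesian optimization were compared under five-fold cross-validation; the two gave similar results, but Bayesian optimization was faster, so Bayesian optimization was used to tune the other machine learning models. The models are then compared on the test set by accuracy, precision, recall, F1 score, and AUC. Finally, the model's feature importance and the LIME algorithm are used to give global and local explanations, respectively, of the Stacking model's predictions. The results show the following. First, among all the models built, the Stacking model predicts best: it ranks first on accuracy, precision, F1 score, and AUC, and also performs well on recall. XGBoost and CatBoost come next, followed by Random Forest and Logistic. Second, in the Stacking model the features loan grade, loan sub-grade, and loan issue date drive the predictions and rank as the three most influential features; annual income, total revolving credit balance, and debt-to-income ratio also play important roles. Third, the default rate rises with the loan grade: borrowers with the lowest grade, "A", have a default rate of 6.04%, while those with the highest grade, "G", have a default rate of 49.70%. |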
Foreign-language abstract: |
In recent years, with the development of the economy and the growing financial needs of the public, more and more people have chosen credit loans, and personal credit business has continued to expand. But while the credit market is expanding, so are potential financial risks. At the end of 2023, the balance of non-performing loans of banking financial institutions was 3.95 trillion yuan, an increase of 149.5 billion yuan from the beginning of the year, and the non-performing loan ratio was 1.62%. Therefore, how to accurately predict personal credit default risk has become an active research direction. Improving the prediction of personal credit default risk can not only reduce the losses of banking financial institutions, but also avoid the serious adverse effects that a large number of defaults would have on society. This paper studies 800,000 real loan records of a credit platform, published on Alibaba Cloud Tianchi. Firstly, a descriptive statistical analysis of the data is carried out to explore the distribution of the quantitative variables in the non-defaulting and defaulting populations, as well as the default rate for each value of the qualitative variables. Then the data are preprocessed, including the handling of date data, outliers, and missing values, one-hot encoding of the qualitative variables, standardization of the quantitative variables, and undersampling to deal with class imbalance. Next, personal credit default risk models are established, including Logistic, XGBoost, CatBoost, and Random Forest models, and a Stacking model with XGBoost, CatBoost, and Random Forest as the base models and the Logistic model as the meta-model. In optimizing the hyperparameters of the XGBoost model, grid search and Bayesian optimization were compared under five-fold cross-validation; the two methods performed similarly, but Bayesian optimization was faster. Therefore, Bayesian optimization was used for hyperparameter optimization of the other machine learning models. The prediction performance of the models was then compared according to their accuracy, precision, recall, F1 score, and AUC on the test set. Finally, the feature importance of the model and the LIME algorithm are used to explain the predictions of the Stacking model globally and locally, respectively. The research results show that: First, among all the established personal credit default risk models, the Stacking model predicts best. It is the best on accuracy, precision, F1 score, and AUC, and it also performs well on recall. Next are the XGBoost and CatBoost models, and finally the Random Forest and Logistic models. Second, in the Stacking model, the features loan grade, loan sub-grade, and loan issue date play the major role in the predictions; these three features rank in the top three in terms of influence on the model. In addition, the features annual income, total revolving credit balance, and debt-to-income ratio also play an important role in the model. Third, the default rate increases with the loan grade: the default rate of those with the lowest loan grade, "A", is 6.04%, and the default rate of those with the highest loan grade, "G", is 49.70%. |
Total number of references: | 37 |
Holding location: | Main Library B301 |
Call number: | 硕025200/24066Z |
Open-access date: | 2025-06-15 |
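
The abstract names two generic steps that are easy to illustrate in isolation: random undersampling to balance the default and non-default classes, and the test-set metrics accuracy, precision, recall, and F1 score. The following is a minimal sketch in standard-library Python on toy data. The function names `random_undersample` and `classification_metrics`, the toy labels, and the choice of balancing to exactly equal class sizes are all assumptions made for illustration; the record does not state how the thesis implemented either step.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Randomly keep as many rows of each class as the smallest class has.

    Hypothetical helper: the thesis's actual undersampling ratio and tool
    are not given in this record.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data (made up): 8 non-default (0) and 2 default (1) loans.
toy_rows = list(range(10))
toy_labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_undersample(toy_rows, toy_labels)
print(sorted(balanced_labels))

# Toy predictions (made up), with default (1) as the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

Undersampling is normally applied to the training split only, so that the test-set metrics reflect the original class balance; whether the thesis did this is not stated in the record.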