Thesis information

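Chinese title:	基于统计学习的个人信用评估模型的构建
Name:	Zhang Xuewen (张学文)
Confidentiality level:	Public
Thesis language:	Chinese
Discipline code:	025200
Discipline:	Applied Statistics
Student type:	Master's
Degree:	Master of Applied Statistics
Degree type:	Professional degree
Degree year:	2019
Campus:	Beijing campus
School:	School of Statistics / Institute of National Accounts
Research area:	Applied Statistics
First supervisor:	Tong Xingwei (童行伟)
First supervisor's affiliation:	School of Statistics, Beijing Normal University
Submission date:	2019-06-19
Defense date:	2019-05-11
English title:	CONSTRUCTION OF PERSONAL CREDIT EVALUATION MODEL BASED ON STATISTICAL MACHINE LEARNING
Keywords:	credit evaluation ; Lasso-logistic ; random forest ; XGBoost ; double sampling ; SMOTE algorithm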
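Abstract (translated from Chinese):	In recent years, spending ahead of income has become increasingly common. As the desire to consume rises, individuals' demand for funds grows, and the personal credit market has expanded rapidly. Consumers can apply to lenders for loans according to their consumption needs, and lenders assess each applicant's credit to decide whether to grant the loan. In this process the consumer obtains funds to cover a shortfall and the lender earns interest, so the arrangement should benefit both sides. However, because China's credit evaluation system is not yet well developed, credit risk and fraud risk have begun to grow, which seriously harms the interests of financial institutions and hinders the healthy development of the economy and society. Establishing an efficient and accurate personal credit evaluation mechanism is therefore very important. This thesis uses the historical business data and credit behavior data of a financial institution to build a personal credit evaluation model. The models are built mainly with Lasso-logistic regression, the bagging-based random forest algorithm, and the boosting-based XGBoost algorithm. The key steps in model building are data preprocessing (handling outliers, handling missing values, and normalizing the data) and feature selection (deleting features with too high a missing rate, deleting features that are too highly correlated, and keeping features of high importance). In building a classification model, an imbalance in the proportions of the target variable's classes is called dataset imbalance; when it occurs, the model's prediction accuracy on minority-class samples drops markedly, and in extreme cases the resulting classifier is useless. To address this, the thesis applies double sampling and SMOTE sampling to balance the dataset and compares the two methods. The study finds that SMOTE sampling outperforms double sampling for balancing the samples. Double sampling expands the positive class only by making simple copies of existing positive samples, which easily leads to overfitting, whereas SMOTE generates minority-class samples by interpolation and mitigates this problem to some extent. In terms of predictive performance, Lasso-logistic, as an improvement on logistic regression, outperforms the plain logistic model; the machine learning algorithms classify markedly better than the traditional regression-based classifier, and the boosting-based XGBoost model outperforms the bagging-based random forest model. Model building also shows that the number of loan accounts, the proportion of personal consumption loans, education level, and loan contract amount are very important features, and financial institutions should pay attention to collecting this kind of information when recording user behavior.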
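The feature screening outlined in the abstract (drop features with a high missing rate, then drop features that are highly correlated with one already kept) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the thesis's implementation: the thresholds `max_missing=0.5` and `max_corr=0.9`, the function names, and the toy feature names are all assumptions, since the record does not state the actual cut-offs or column names.

```python
from math import sqrt

def missing_rate(column):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in column) / len(column)

def pearson(xs, ys):
    """Pearson correlation over rows where both values are present.

    Assumes at least two complete rows and non-constant columns.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / sqrt(vx * vy)

def screen_features(table, max_missing=0.5, max_corr=0.9):
    """Keep features in order, skipping sparse or redundant ones.

    Both thresholds are illustrative assumptions.
    """
    kept = []
    for name, column in table.items():
        if missing_rate(column) > max_missing:
            continue  # too many missing values
        if any(abs(pearson(column, table[k])) > max_corr for k in kept):
            continue  # nearly duplicates a feature that was already kept
        kept.append(name)
    return kept

# Hypothetical toy data; the feature names are made up for illustration.
table = {
    "loan_accounts": [1, 2, 3, 4, 5, 6],
    "contract_amount": [10, 12, 9, 15, 11, 14],
    "loan_accounts_x2": [2, 4, 6, 8, 10, 12],  # perfectly correlated with loan_accounts
    "mostly_missing": [None, None, None, None, 1, 2],
}
kept = screen_features(table)
print(kept)  # ['loan_accounts', 'contract_amount']
```

The third step named in the abstract, ranking the remaining features by importance, depends on the fitted model and is not shown here.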
English abstract:	In recent years, spending ahead of income has become increasingly prevalent. As the desire to consume rises, individual consumers' demand for funds grows day by day, and the personal credit market has developed rapidly. Consumers can apply for loans from lending institutions based on their own consumption needs, and lending institutions decide whether to grant a loan by evaluating the borrower's credit. In this process, consumers obtain funds to cover a shortfall and lending institutions earn interest, so both sides benefit. However, because the credit evaluation system in China is still imperfect, credit risk and fraud risk have begun to grow, which not only seriously damages the interests of financial institutions but also hinders the healthy development of the economy and society. Establishing an efficient and accurate personal credit evaluation mechanism is therefore of great importance. This thesis uses the historical business data and credit behavior data of a financial institution to construct a personal credit evaluation model. Lasso-logistic regression, the bagging-based random forest algorithm, and the boosting-based XGBoost algorithm are used to build the models. The key steps in building the models are data preprocessing, mainly outlier handling, missing-value handling, and data normalization, and feature selection, carried out by deleting features with a high missing rate, deleting highly correlated features, and selecting features of high importance. In classification modeling, an imbalance in the class proportions of the target variable is called dataset imbalance. When it occurs, the prediction accuracy of the classification model on minority-class samples is significantly reduced, and in extreme cases the resulting model is ineffective. To address sample imbalance, this thesis uses double sampling and SMOTE sampling to balance the dataset and compares the advantages and disadvantages of the two methods. SMOTE sampling is found to be superior to double sampling for balancing the samples. Double sampling expands the positive class only by copying existing positive samples, which can easily lead to overfitting, whereas SMOTE sampling generates minority-class samples through interpolation and alleviates this problem to a certain extent. In terms of predictive performance, Lasso-logistic regression, as an improvement on logistic regression, outperforms the logistic model; the machine learning algorithms classify significantly better than the traditional regression-based classification model; and the boosting-based XGBoost model classifies better than the bagging-based random forest model. Model construction also shows that the number of loan accounts, the proportion of personal consumption loans, education level, and loan contract amount are very important features, and financial institutions should pay attention to collecting such information when recording user behavior.
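The contrast the abstract draws between copying minority samples and interpolating between them can be illustrated with a small sketch. It is written in plain Python under stated assumptions and is not the thesis's code: the function names, the neighbour count `k=2`, and the toy points are hypothetical, and `duplicate_oversample` covers only the copy-based expansion of the minority class that the abstract attributes to double sampling.

```python
import random

def duplicate_oversample(minority, n_new, rng):
    """Copy-based oversampling: each new point is an exact copy of an existing one."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_oversample(minority, n_new, k, rng):
    """SMOTE-style oversampling: interpolate between a minority point
    and one of its k nearest minority neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

rng = random.Random(0)
# Hypothetical minority-class points with two numeric features
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
copies = duplicate_oversample(minority, 6, rng)
synthetic = smote_oversample(minority, 6, k=2, rng=rng)
```

Every element of `copies` repeats an existing point, so a classifier sees the same minority examples several times. Each element of `synthetic` lies on the segment between two existing minority points, which adds variety without leaving the region the minority class already occupies.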
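Total references:	38
Author biography:	Zhang Xuewen received a bachelor's degree in Economic Statistics from the School of Statistics, Jiangxi University of Finance and Economics, in 2017 and graduated from the Department of Applied Statistics, School of Statistics, Beijing Normal University, in 2019.
Call number:	硕025200/19040
Open-access date:	2020-07-09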