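Chinese title:	基于逻辑回归与机器学习方法的车险反欺诈识别研究 (A Study on Auto Insurance Fraud Identification Based on Logistic Regression and Machine Learning Methods)
Name:	Li Botao (李博韬)
Confidentiality level:	Public
Thesis language:	Chinese
Discipline code:	025200
Discipline:	Applied Statistics
Student type:	Master's
Degree:	Master of Applied Statistics
Degree type:	Professional degree
Degree year:	2018
Campus:	Beijing campus
School:	School of Statistics / Institute of National Accounts
First advisor:	Chen Menggen (陈梦根)
First advisor's affiliation:	School of Statistics, Beijing Normal University
Submission date:	2018-06-18
Defense date:	2018-05-23
English title:	A STUDY ON AUTO INSURANCE FRAUD IDENTIFICATION BASED ON LOGISTIC REGRESSION AND MACHINE LEARNING
Keywords:	Logistic model; neural network model; support vector machine; SMOTE algorithm; boundary ambiguity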
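Chinese abstract:	As China's automobile market has grown, the auto insurance business has grown with it. At the same time, auto insurance fraud in China has become rampant as the market has expanded. The expense ratio of China's auto insurance industry is markedly high and the loss ratio is slightly low, but the comprehensive expense ratio is very high, so the industry's overall profit margin has stayed at a very low level. In China's insurance industry, property insurers price their products according to the market rather than according to models, which makes the expense ratio very hard to control; an excessively high loss ratio therefore puts insurers under heavy financial pressure [34]. Moreover, "experts estimate that in China the losses insurers suffer from auto insurance fraud amount to as much as 20%-30% of total auto insurance claim payments" [36], so building an effective auto insurance fraud identification model is imperative. Drawing on a rich dataset and using statistical regression models and machine learning methods, this thesis works from auto insurance claims data to help property insurers identify fraudulent claims. Specifically, the thesis first reviews the state of research on auto insurance anti-fraud at home and abroad and summarizes three categories of approach and one trend, which concern model selection and data structure optimization respectively. The three categories are: 1. fitting the data with a simple linear regression model; 2. fitting the data with generalized linear models; 3. making predictions with machine learning methods. The overall trend is a shift of attention from the choice of model to the optimization of the data structure. Building on earlier research, the thesis first carries out an exploratory data analysis to understand the structure of the data, finding a large number of missing values and an imbalanced response variable; variables with too many missing values and irrelevant variables are removed. A logistic regression is then fitted to the dataset. Finally, the model obtained by stepwise regression on the full model is interpreted, giving the following findings. Class-I repair shops and authorized repair stations are almost never suspected of fraud; the odds of suspected fraud for claims involving casualties are exp(1.66)=5.26 times those for claims without casualties; cases handled through the regular traffic-police procedure are much less likely to be suspected of fraud than cases handled through the simplified traffic-police procedure, by the insurer, or through other channels; cases in which the damage is on the underside of the vehicle are more likely to be suspected of fraud; and the odds of fraud for accidents with a first scene are exp(-1.35)=0.26 times those for accidents without one. For comparison with the statistical regression results, the thesis also fits the data with two machine learning methods, a neural network model and a support vector machine model. The comparison shows that the Logistic model classifies worst, the support vector machine falls in between, and the neural network model performs best, although the support vector machine has the lowest misclassification rate. The reason is that the support vector machine, a learning method designed for small samples, is effective on the claims dataset used in this thesis, while the theoretical properties of artificial neural networks suit them to learning models with complex internal rules, so their effect is more pronounced. The auto insurance claims dataset is a typical imbalanced dataset with few minority-class cases, so an oversampling method is adopted to address this. The comparison shows that the neural network model combined with the SMOTE oversampling algorithm comes close to the best achievable performance and can be regarded as a very good classifier, whereas the SVM classifier performs worse on data processed with SMOTE. The likely reasons are overfitting and the severe boundary blurring that can arise when SVM is used together with SMOTE.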
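The two odds ratios quoted in the abstract can be checked directly from the reported logistic-regression coefficients. The sketch below is only an illustration of that arithmetic: the dictionary keys, the function names, and the 5% baseline probability are assumptions made here, not values or code from the thesis.

```python
import math

# Coefficients quoted in the abstract (log-odds scale).
# The key names are hypothetical labels chosen for this sketch.
COEFFICIENTS = {
    "casualties_involved": 1.66,
    "first_scene_available": -1.35,
}


def odds_ratio(beta: float) -> float:
    """Convert a logistic-regression coefficient into an odds ratio."""
    return math.exp(beta)


def updated_probability(baseline_probability: float, beta: float) -> float:
    """Multiply the baseline odds by exp(beta) and convert back to a probability."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = baseline_odds * odds_ratio(beta)
    return new_odds / (1.0 + new_odds)


if __name__ == "__main__":
    for name, beta in COEFFICIENTS.items():
        print(f"{name}: odds ratio = {odds_ratio(beta):.2f}")
    # Hypothetical 5% baseline fraud probability, used only for illustration.
    print(f"probability with casualties: {updated_probability(0.05, 1.66):.3f}")
```

Note that an odds ratio multiplies the odds, not the probability, which is why `updated_probability` converts to odds and back.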
English abstract:	As China's auto market expands, the auto insurance business is growing rapidly as well. At the same time, auto insurance fraud has become rampant in the Chinese market. The expense ratio of China's auto insurance industry is markedly high; although the loss ratio is slightly low, the comprehensive expense ratio remains very high, which keeps the industry's overall profit margin at a very low level. In China's insurance industry, property insurers tend to price their products by the market rather than by models, which makes the expense ratio hard to control, so a high loss ratio puts insurers under financial stress [34]. Moreover, "according to expert estimates, losses from insurance fraud in China account for 20%-30% of insurers' total auto claim payments" [36]. It is therefore imperative to establish an effective auto insurance fraud identification model. Based on a rich dataset, this thesis uses statistical regression models and machine learning methods to help property insurers identify fraudulent cases from auto insurance claims data. Specifically, the thesis first reviews research on auto insurance fraud identification at home and abroad and then summarizes three categories of approach and one development trend, which concern model selection and data structure optimization respectively. The three categories are: 1. fitting the data with a simple linear regression model; 2. fitting the data with a generalized linear model; 3. making predictions by means of machine learning. The trend is a shift of focus from selecting a good model to optimizing the data structure. Building on earlier research, the thesis carries out an exploratory data analysis to understand the structure and characteristics of the data, and finds a large number of missing values and a heavily imbalanced response variable. It then removes irrelevant variables and variables with too many missing values, and fits a logistic regression to the data. Finally, the model obtained by stepwise regression on the full model is interpreted, and the following findings are obtained. There is almost no suspicion of fraud for first-class repair shops and authorized repair stations; when a claim involves casualties, the odds of suspected fraud are exp(1.66)=5.26 times those of claims without casualties; cases handled through the regular traffic-police procedure show a much lower level of suspected fraud than cases handled through the simplified traffic-police procedure, by the insurer, or through other channels; fraud is more likely to be suspected when the damage is on the underside of the vehicle; and the odds of fraud for accidents with a first scene are exp(-1.35)=0.26 times those for accidents without one. For comparison with the statistical regression results, the thesis also fits the data with two machine learning methods, a neural network model and a support vector machine model. The comparison shows that the Logistic model has the worst classification ability and the neural network model performs best, although the support vector machine model has the lowest error rate. Because the support vector machine is a learning method designed for small samples, it is quite effective on the auto insurance claims dataset used in this thesis, while the artificial neural network is good at learning problems with complex internal rules, so its effect is more pronounced. Since the auto insurance claims dataset is a typical imbalanced dataset, an oversampling method is adopted to address the problem. The comparison shows that the SMOTE-Artificial Neural Network model achieves the best performance and can be regarded as a very good classifier, while the performance of SMOTE-SVM decreases. The reason may be that combining SMOTE with SVM leads to overfitting and serious boundary ambiguity.
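The oversampling step mentioned in the abstract can be illustrated with a simplified version of the SMOTE idea: each synthetic minority sample is placed on the line segment between a real minority sample and one of its nearest minority-class neighbours. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function name `smote`, its parameters, and the toy data are hypothetical, and it handles numeric features only, whereas real claims data would also contain categorical variables.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Other minority samples, sorted by Euclidean distance to the base point.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


if __name__ == "__main__":
    # Toy minority class (hypothetical numeric features).
    fraud_cases = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(fraud_cases, n_synthetic=3, k=2):
        print(point)
```

Because synthetic points are interpolated between existing minority samples, they can fall close to majority-class samples near the class boundary, which is one way the boundary blurring described in the abstract could arise.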
Total number of references:	0
Call number:	硕025200/18004
Open access date:	2019-07-09
