Thesis information

Chinese title:

 重尾数据线性模型的稳健变量选择和FDR控制    

Author:

 米新皓    

Confidentiality level:

 Public

Thesis language:

 chi    

Discipline code:

 025200    

Discipline:

 Applied Statistics

Student type:

 Master's student

Degree:

 Master of Applied Statistics

Degree type:

 Professional degree

Degree year:

 2023    

Campus:

 Beijing campus

School:

 School of Statistics

Research direction:

 Variable selection

First supervisor:

 李高荣    

First supervisor's affiliation:

 School of Statistics

Submission date:

 2023-06-15    

Defense date:

 2023-05-12    

English title:

 Robust Variable Selection and FDR Control for Linear Models with Heavy Tailed Data    

Chinese keywords:

 变量选择 ; 错误发现率 ; 多重假设检验 ; Huber损失 ; Knockoff ; Lasso    

English keywords:

 Variable selection ; False discovery rate ; Multiple hypothesis testing ; Huber loss ; Knockoff ; Lasso    

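Chinese abstract (English translation):

In recent years, high-dimensional complex data have appeared frequently in many scientific and applied research fields, such as biomedicine, social science, remote sensing, and neuroimaging, and such data are often accompanied by numerous anomalous values and outliers. In this setting, identifying the covariates that truly affect the response variable is the first step of any subsequent statistical analysis. In statistics, proposing and developing new robust variable selection procedures has become one of the hot topics of current research. Commonly used robust variable selection methods include penalized quantile regression, penalized least absolute deviation regression, and penalized Huber regression. What these methods share is that they add a penalty term to the loss function to achieve variable selection. However, these traditional methods focus on selecting as many as possible of the variables that truly have a significant effect on the response, while neglecting to control the number of falsely identified noise variables.

The false discovery rate (FDR) has in recent years become the gold standard for controlling the number of falsely identified noise variables. Robust variable selection under FDR control, however, has not yet been studied sufficiently. The fixed-X knockoff method proposed in recent years can perform variable selection while controlling the FDR; its false discovery proportion is better than that of traditional variable selection methods, and it has advanced the study of the multiple hypothesis testing problem in variable selection. However, most existing methods are fixed-X knockoff methods based on the least squares loss, which cannot produce effective variable selection results when faced with heavy-tailed distributions or data containing outliers. The Huber loss function overcomes the shortcomings of the least squares loss: it has a continuous derivative and is also more robust when the data contain outliers.

This thesis proposes a knockoff method based on the Huber loss function, combining robust estimation with the knockoff method. It resolves the problem that variable selection with a penalized Huber loss ignores FDR control, and also the problem that the traditional knockoff method with a penalized least squares loss cannot produce effective variable selection results on heavy-tailed data. It thereby offers further research ideas on how to perform effective variable selection under FDR control when the data contain outliers or follow a heavy-tailed distribution. To examine the performance of the proposed method, the thesis uses Monte Carlo simulation to compare the Huber-loss-based knockoff method with other variable selection methods. The simulation results show that, on heavy-tailed data, the proposed Huber-loss-based knockoff and knockoff+ methods improve on the other variable selection methods: the false discovery proportion of the selected variables is effectively controlled, and the power is also higher than that of the other methods. The thesis also carries out an empirical study on data from a large supermarket in northern China, using the proposed variable selection method based on the Huber loss function and knockoffs to select, under FDR control, the types of goods that truly have a significant effect on the number of customers.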

English abstract:

In recent years, high-dimensional complex data have appeared frequently in many scientific and applied research fields, such as biomedicine, social science, remote sensing, and neuroimaging. Such data often contain many outliers. In this setting, identifying the covariates that truly affect the response variable is the first step in any subsequent statistical analysis. In statistics, proposing and developing new robust variable selection procedures has become one of the hot topics of current research. Commonly used robust variable selection methods include penalized quantile regression, penalized least absolute deviation regression, and penalized Huber regression. What these methods have in common is that they impose a penalty term on the loss function in order to select variables. However, these traditional variable selection methods focus on selecting as many as possible of the variables that truly have a significant impact on the response, while neglecting to control the number of falsely discovered noise variables.

In recent years, the false discovery rate (FDR) has become the gold standard for controlling the number of falsely identified noise variables. However, robust variable selection under FDR control has not yet been studied sufficiently. The recently proposed fixed-X knockoff method can select variables while controlling the FDR; its false discovery proportion is better than that of traditional variable selection methods, and it has advanced the treatment of variable selection as a multiple hypothesis testing problem. However, most existing methods are fixed-X knockoff methods based on the least squares loss, which cannot produce effective variable selection results for heavy-tailed distributions or data with outliers. The Huber loss function overcomes this shortcoming of the least squares loss: it has a continuous derivative and is more robust when the data contain outliers.
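
For readers unfamiliar with the Huber loss, the following is a minimal Python sketch of its standard textbook definition: quadratic for small residuals and linear for large ones, so that the derivative is the residual clipped to a bounded range. The function names `huber_loss` and `huber_grad` and the default threshold `delta=1.345` are illustrative assumptions; the abstract does not state the tuning constant used in the thesis.

```python
import numpy as np

def huber_loss(r, delta=1.345):
    """Standard Huber loss of residual(s) r; delta is an assumed tuning constant."""
    r = np.asarray(r, dtype=float)
    abs_r = np.abs(r)
    quadratic = 0.5 * r ** 2                   # used where |r| <= delta
    linear = delta * (abs_r - 0.5 * delta)     # used where |r| >  delta
    return np.where(abs_r <= delta, quadratic, linear)

def huber_grad(r, delta=1.345):
    """Derivative of the Huber loss: the residual clipped to [-delta, delta]."""
    r = np.asarray(r, dtype=float)
    return np.clip(r, -delta, delta)
```

Because the derivative is bounded by `delta`, a single large outlier contributes only a limited amount to the estimating equations, whereas under the least squares loss its contribution grows linearly with the residual.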

This thesis proposes a knockoff method based on the Huber loss function, combining robust estimation with the knockoff method. It addresses two problems: variable selection with a penalized Huber loss ignores FDR control, and the traditional knockoff method with a penalized least squares loss cannot obtain effective variable selection results on heavy-tailed data. It thereby offers further research ideas on how to perform effective variable selection under FDR control when the data contain outliers or follow a heavy-tailed distribution. To evaluate the proposed method, the thesis uses Monte Carlo simulation to compare the Huber-loss-based knockoff method with other variable selection methods. The simulation results show that, on heavy-tailed data, the proposed Huber-loss-based knockoff and knockoff+ methods improve on the other variable selection methods: the false discovery proportion (FDP) of the selected variables is effectively controlled, and the power is higher than that of the other methods. The thesis also presents an empirical study of data from a large supermarket in northern China, in which the proposed variable selection method based on the Huber loss function and knockoffs is used to select, under FDR control, the types of goods that have a significant impact on the number of customers.
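
The selection step shared by the knockoff and knockoff+ procedures can be sketched as follows. This is a minimal illustration of the standard knockoff filter thresholding rule, not the thesis's own code: it assumes a vector `W` of feature statistics has already been computed (in the thesis these would come from a Huber-loss-based fit on the original and knockoff variables, which is not shown here), and the names `knockoff_threshold` and `knockoff_select` are hypothetical.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, plus=True):
    """Data-dependent threshold of the knockoff (plus=False) or knockoff+ filter.

    W : statistics where a large positive value is evidence for a true signal
        and the sign is symmetric for noise variables (assumed given).
    q : target FDR level.
    """
    W = np.asarray(W, dtype=float)
    offset = 1.0 if plus else 0.0
    # Candidate thresholds are the distinct nonzero magnitudes of W, ascending.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Estimated false discovery proportion at threshold t.
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold meets the target level: select nothing

def knockoff_select(W, q=0.1, plus=True):
    """Indices of the variables selected at target FDR level q."""
    T = knockoff_threshold(W, q=q, plus=plus)
    return np.flatnonzero(np.asarray(W, dtype=float) >= T)
```

The only difference between the two variants is the extra 1 in the numerator for knockoff+, which makes the rule slightly more conservative.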

Total number of references:

 56    

Collection number:

 硕025200/23008    

Open-access date:

 2024-06-14    
