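Thesis information

Chinese title:

 高维错误设定模型下的双重稳健推断 (Double Robust Inference in High-Dimensional Misspecified Models)

Name:

 焦晨洁 (Jiao Chenjie)

Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 025200

Discipline:

 Applied Statistics

Student type:

 Master's

Degree:

 Master of Applied Statistics

Degree type:

 Professional degree

Degree year:

 2021

Campus:

 Beijing campus

School:

 School of Statistics / Institute of National Accounts

Research area:

 Applied Statistics

First supervisor:

 郭旭 (Guo Xu)

First supervisor's affiliation:

 School of Statistics, Beijing Normal University

Submission date:

 2021-06-19

Defense date:

 2021-05-28

Foreign-language title:

 DOUBLE ROBUST INFERENCE IN HIGH DIMENSIONAL MISSPECIFIED MODELS

Chinese keywords:

 Misspecified models ; Double robust inference ; High-dimensional data ; Numerical simulation

Foreign-language keywords:

 Misspecified Models ; Double robust inference ; High dimensional data ; Numerical simulation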

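Abstract (translated from Chinese):
 In the analysis of high-dimensional data, researchers impose assumptions on the data, such as zero mean, homoscedasticity, and no autocorrelation. These assumptions are sometimes idealized; when real data fail to satisfy them, the model is misspecified. How to keep estimation valid under wrong assumptions is therefore an essential research question. Many papers in recent years have studied this problem, but their results are relatively limited in scope. This thesis proposes a double robust inference method that addresses the problem effectively in heteroscedastic linear regression models, generalized linear regression models, nonlinear regression models, and single-index models.

Take the multiple linear regression model as an example. The response variable in the data is Y, and there are several related explanatory variables. The question of interest is whether one particular variable has a significant effect on Y. That variable is denoted X and the remaining variables are denoted Z, where X is a vector and Z is a matrix. The usual approach is to build a multiple linear regression model of Y on X and Z (the Y model) and then estimate the coefficient of X to judge whether the effect is significant. Building this model requires assuming that the Y model is homoscedastic. If that assumption is wrong, the commonly used t-test can easily become invalid, so the effect of X on Y cannot be judged accurately. If, however, a linear regression model of X on Z (the X model) holds, the double robust inference method can still assess the effect of X on Y validly.

The thesis first proposes the double robust test statistic and then proves theoretically that it converges asymptotically to the standard normal distribution. Extensive numerical simulations show that when either the X model or the Y model holds, the double robust inference method can validly judge whether X has a significant effect on Y. The method holds not only for low-dimensional heteroscedastic linear models but also extends to low-dimensional generalized linear models, nonlinear models, and single-index models.

In high-dimensional data, judging the effect of X on Y places requirements on the sparsity of the data. This part therefore takes the high-dimensional heteroscedastic linear regression model as an example to study sparsity. In that model, when either the X model or the Y model holds and the model that holds is sparse, the double robust inference method gives valid estimates whether or not the other model is sparse. On this basis, sparse generalized linear models, nonlinear models, and single-index models are discussed as extensions. The experiments also examine the effect of the correlation coefficient of the matrix variable Z on the estimation results; in most cases the effect is not obvious.

After the numerical simulations, the method is applied to gene expression data on riboflavin (vitamin B2) production by Bacillus subtilis. The data set contains 4088 variables, which are tested one by one and combined with the BH method. Screening at three error rates (0.05, 0.1, 0.15) yields 10, 365, and 3489 important variables, respectively. Focusing on these genes in riboflavin production helps improve its productivity.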

Foreign-language (English) abstract:
In the analysis of high-dimensional data, researchers make assumptions about the data, such as zero mean, homoscedasticity, and no autocorrelation. These assumptions are sometimes idealized, and when real data do not satisfy them the model is misspecified. It is therefore necessary to study how to keep estimation valid under wrong assumptions. Many recent papers have studied this problem, but their scope is relatively limited. This thesis proposes a double robust inference method that handles the problem effectively in heteroscedastic linear regression models, generalized linear regression models, nonlinear regression models, and single-index models.

Take the multiple linear regression model as an example. The response variable is Y, and there are several related explanatory variables. The question of interest is whether one particular variable has a significant effect on Y. Denote this variable by X and the remaining variables by Z, where X is a vector and Z is a matrix. The usual approach is to fit a multiple linear regression of Y on X and Z (the Y model), estimate the coefficient of X, and judge significance from that estimate. Fitting this model requires assuming that the Y model is homoscedastic. If that assumption is wrong, the commonly used t-test can easily become invalid, so the effect of X on Y cannot be judged accurately. However, if a linear regression model of X on Z (the X model) holds, the double robust inference method can still assess the effect of X on Y validly.
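
In symbols, the two working models in this paragraph might be written as below. The coefficient names (beta, gamma, theta) and error terms (epsilon, eta) are notation assumed here for illustration; the abstract itself does not fix any symbols.

```latex
% Y model: linear regression of Y on X and Z (errors may be heteroscedastic)
Y = X\beta + Z\gamma + \varepsilon, \qquad E[\varepsilon \mid X, Z] = 0

% X model: linear regression of X on Z
X = Z\theta + \eta, \qquad E[\eta \mid Z] = 0

% Hypothesis of interest: X has no effect on Y
H_0 : \beta = 0
```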

This thesis first proposes the double robust test statistic and then proves theoretically that it converges in distribution to the standard normal. Extensive numerical simulations show that when either the X model or the Y model holds, the double robust inference method can validly judge whether X has a significant effect on Y. The method holds not only for low-dimensional heteroscedastic linear models but also extends to low-dimensional generalized linear models, nonlinear models, and single-index models.
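
The abstract does not give the form of the statistic. As a rough illustration only, the sketch below uses a standard residual-product (partialling-out) statistic with a heteroscedasticity-robust normalization; it is a stand-in and not necessarily the thesis's statistic. The function names `double_robust_stat` and `simulate_rejection_rate`, the scalar X, and the whole simulation design (a Y model whose mean is nonlinear in Z, so it is misspecified, and a correctly specified linear X model) are assumptions made here.

```python
import numpy as np

def double_robust_stat(y, x, Z):
    """Residual-product statistic for H0: x has no effect on y (illustrative)."""
    Z1 = np.column_stack([np.ones(len(y)), Z])       # add an intercept column
    gamma, *_ = np.linalg.lstsq(Z1, y, rcond=None)   # working Y model under H0
    theta, *_ = np.linalg.lstsq(Z1, x, rcond=None)   # working X model
    e_y = y - Z1 @ gamma                             # residuals of y on Z
    e_x = x - Z1 @ theta                             # residuals of x on Z
    prod = e_x * e_y
    # Normalizing by the sum of squared products allows for heteroscedasticity.
    return prod.sum() / np.sqrt((prod ** 2).sum())

def simulate_rejection_rate(beta=0.0, n=200, p=3, reps=500, seed=0):
    """Share of simulated data sets in which |statistic| exceeds 1.96."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        Z = rng.normal(size=(n, p))
        x = Z @ np.ones(p) + rng.normal(size=n)      # X model holds (linear in Z)
        # Mean of y is nonlinear in Z, so a linear Y model is misspecified;
        # the noise scale depends on x, so the errors are heteroscedastic.
        y = beta * x + Z[:, 0] ** 2 + np.abs(x) * rng.normal(size=n)
        if abs(double_robust_stat(y, x, Z)) > 1.96:
            rejections += 1
    return rejections / reps
```

Under these assumptions, `simulate_rejection_rate(beta=0.0)` should land near the nominal 5% level, while `simulate_rejection_rate(beta=1.0)` should reject far more often.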

In high-dimensional data, judging the effect of X on Y places requirements on sparsity. This part therefore uses the high-dimensional heteroscedastic linear regression model as an example to study sparsity. In that model, when either the X model or the Y model holds and the model that holds is sparse, the double robust inference method gives valid estimates whether or not the other model is sparse. On this basis, sparse generalized linear models, nonlinear models, and single-index models are discussed as extensions. The experiments also examine how the correlation coefficient of the matrix variable Z affects the estimation results; in most cases its influence is not obvious.

After the numerical simulations, the method is applied to gene expression data on riboflavin (vitamin B2) production by Bacillus subtilis. The data set contains 4088 variables, which are tested one by one and combined with the BH method. Screening at three FDR levels (0.05, 0.1, 0.15) yields 10, 365, and 3489 important variables, respectively. Focusing on these genes in riboflavin production helps improve its productivity.
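
Assuming "BH" refers to the Benjamini-Hochberg step-up procedure, the screening step could be sketched as follows. The function name `benjamini_hochberg` and the toy p-values are hypothetical; the thesis's actual p-values and code are not given in this record.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # BH critical values i * alpha / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank meeting its threshold
        reject[order[:k + 1]] = True               # step-up: reject all smaller ranks too
    return reject

# Toy example with made-up p-values, screened at the three levels from the abstract.
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
for level in (0.05, 0.1, 0.15):
    selected = benjamini_hochberg(toy_pvalues, level)
    print(level, int(selected.sum()))
```

A larger FDR level can only keep or enlarge the selected set, which is consistent with the counts in the abstract growing as the level moves from 0.05 to 0.15.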

Total number of references:

 30

Call number:

 硕025200/21043

Release date:

 2022-06-19
