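Chinese title: | 回归问题中高维矩阵值协变量的变量选择 |
Name: | |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 025200 |
Discipline/major: | |
Student type: | Master's |
Degree: | Master of Applied Statistics |
Degree type: | |
Degree year: | 2020 |
Campus: | |
School: | |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2020-06-19 |
Defense date: | 2020-05-30 |
Foreign-language title: | Variable Selection of High-dimensional Matrix-valued Covariates in Regression |
Chinese keywords: | |
Foreign-language keywords: | Matrix-valued variable; Variable selection; Factor analysis; Correlation; Regression model |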
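Chinese abstract: |
High-dimensional data, in which the number of covariates p far exceeds the sample size n, are common in biology, economics, finance, and other fields. Although the variables are numerous, usually only a few are truly related to the response. Variable selection for high-dimensional data is therefore important for scientific research across fields. A serious obstacle to variable selection is correlation among the variables, which can lead to incorrect subset selection. When the covariates are not weakly correlated, standard variable selection methods struggle to select the correct model. For example, if the predictors are strongly correlated, LASSO tends to select an overly large model that includes unimportant variables. Matrix-valued variable selection based on Structured LASSO suffers from the same problem. Some work has addressed decorrelation in variable selection, for example using a low-dimensional latent representation of the model, applying the Puffer transformation to the design matrix, and the factor-analysis-based FAD. However, decorrelation has not yet been studied for matrix-valued data. The simplest way to handle matrix-valued data is to vectorize each matrix into a random vector and build a regression model on the vectors, but vectorization degrades estimation accuracy and ignores the structural information of the matrix. New variable selection methods are therefore needed when matrix-valued data are strongly correlated. To address correlation among covariates in variable selection for high-dimensional matrix-valued data, this thesis proposes a variable selection method called DSL. First, a new matrix factor analysis method, MVFA, is proposed, which decomposes the high-dimensional matrix-valued predictors into a low-dimensional common factor matrix and a high-dimensional specific factor matrix. The common factors capture the co-movement shared by the covariates, and the specific factors are the decorrelated matrix-valued variables. Then, the new variable selection method DSL is proposed. DSL has two steps: it first applies MVFA to obtain the specific factors and then performs selection on them. DSL can be used for variable selection in both matrix-valued linear regression and matrix-valued generalized linear regression.
Finally, simulation experiments and a real-data experiment are conducted. The simulations compare MVFA and DSL with existing methods and examine the factors that affect their performance. The results show that MVFA outperforms existing ordinary factor analysis methods in both loading matrix estimation and recovery of the original data, and that its performance improves as the dimension or sample size increases and deteriorates as the number of common factors increases. In matrix linear regression, DSL outperforms Structured LASSO in both variable selection and parameter estimation. As the correlation among predictors strengthens, the variable selection and parameter estimation performance of both methods declines, but DSL declines only slightly whereas Structured LASSO declines markedly. In matrix logistic regression, the variable selection and classification performance of DSL declines as the correlation strengthens; variable selection performance declines as the dimension increases, while classification performance improves. In the real-data experiment, DSL is applied to electroencephalogram (EEG) data to study the association between alcoholism and voltage patterns, selecting the important electrode sites and sampling time points that can be used to determine whether a subject is alcoholic, and classifying subjects accordingly. Five-fold cross-validation yields an optimal hyperparameter of 0.055, and 7 important electrode sites and 9 important time points are selected. |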
Foreign-language abstract: |
High-dimensional data, in which the number of covariates p is much larger than the sample size n, are very common in biology, economics, finance, and other fields. Although the number of variables is large, usually only a few important variables are related to the response variable. Variable selection for high-dimensional data is therefore important to scientific research in many fields. A serious problem that hinders variable selection is the correlation between variables, which may lead to wrong subset selection results. When the covariates are not weakly correlated, the usual variable selection methods have difficulty selecting the correct model. For example, when LASSO is used for variable selection and the variables are strongly correlated, it selects an overly large model that includes unimportant variables. The matrix-valued variable selection method based on Structured LASSO has the same problem. There have been some studies on the decorrelation problem in variable selection, such as low-dimensional latent representations of the model, the Puffer transformation of the design matrix, and FAD, which is based on factor analysis. However, there is no decorrelation study for matrix-valued data. The simplest way to deal with matrix-valued data is to vectorize each matrix into a random vector and then build a regression model on the vectors. However, vectorization reduces the accuracy of the estimation and ignores the structural information of the matrix. Therefore, when matrix-valued data are strongly correlated, new variable selection methods need to be explored. In this thesis, we propose a variable selection method, DSL, which selects variables for high-dimensional matrix-valued data with strong correlation between covariates. First, a new matrix factor analysis method, MVFA, is proposed, which transforms high-dimensional matrix-valued independent variables into a low-dimensional common factor matrix and a high-dimensional specific factor matrix.
The common factors capture the shared variation among the covariates, and the specific factors are the matrix-valued variables after decorrelation. Then, a new variable selection method, DSL, is proposed. DSL has two steps: it first uses MVFA to obtain the specific factors and then performs selection on the specific factors. Both matrix-valued linear regression and matrix-valued generalized linear regression can use DSL for variable selection. Finally, simulation experiments and a real-data experiment are conducted. The simulation experiments compare MVFA and DSL with existing methods and explore the factors that affect the performance of the methods. The results show that MVFA is better than the existing ordinary factor analysis methods in both loading matrix estimation and recovery of the original data. Its performance improves as the dimension or sample size increases and deteriorates as the number of common factors increases. In matrix linear regression, DSL is superior to Structured LASSO in both variable selection and parameter estimation. As the correlation between independent variables strengthens, the variable selection and parameter estimation performance of both methods declines, but the performance of DSL declines only slightly, while that of Structured LASSO declines markedly. In matrix logistic regression, the variable selection and classification performance of DSL declines as the correlation increases. Variable selection performance declines with increasing dimension, while classification performance improves with increasing dimension. In the real-data experiment, DSL is applied to EEG data to study the relationship between alcoholism and voltage patterns, to select the important electrode sites and sampling time points that can be used to judge whether a subject is alcoholic, and to classify subjects accordingly.
The optimal hyperparameter 0.055 was obtained through five-fold cross-validation, and 7 important electrode sites and 9 important time observation points were selected. |
Total number of references: | 21 |
Call number: | 硕025200/20047 |
Open-access date: | 2021-06-19 |
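
The split of matrix-valued covariates into common and specific factors, as described in the abstract, can be illustrated with a small simulation. The sketch below is not the thesis's MVFA estimator, whose details the abstract does not give. It assumes a bilinear factor model X_i = R F_i C^T + U_i and estimates R and C from the leading eigenvectors of averaged row and column scatter matrices, a PCA-style substitute chosen only for illustration. All function names, dimensions, and the simulated data are hypothetical.

```python
import numpy as np


def estimate_loadings(X, k_row, k_col):
    """Estimate orthonormal row/column loadings from X of shape (n, p, q).

    Assumption: leading eigenvectors of the averaged scatter matrices
    stand in for the loading estimates; this is not the thesis's MVFA.
    """
    n = X.shape[0]
    row_scatter = np.einsum("ipq,irq->pr", X, X) / n  # mean of X_i X_i^T
    col_scatter = np.einsum("ipq,ips->qs", X, X) / n  # mean of X_i^T X_i
    # eigh sorts eigenvalues in ascending order, so keep the last columns.
    R = np.linalg.eigh(row_scatter)[1][:, -k_row:]
    C = np.linalg.eigh(col_scatter)[1][:, -k_col:]
    return R, C


def split_common_specific(X, R, C):
    """Return common factors F_i = R^T X_i C and residuals U_i = X_i - R F_i C^T."""
    F = np.einsum("pk,ipq,ql->ikl", R, X, C)
    common = np.einsum("pk,ikl,ql->ipq", R, F, C)
    return F, X - common


def mean_abs_offdiag_corr(X):
    """Average absolute correlation between distinct entries of the matrices."""
    flat = X.reshape(X.shape[0], -1)
    corr = np.corrcoef(flat, rowvar=False)
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[mask]).mean())


def demo(seed=0, n=200, p=8, q=6, k=2):
    """Simulate correlated matrix covariates and compare correlation levels."""
    rng = np.random.default_rng(seed)
    R_true = rng.normal(size=(p, k))
    C_true = rng.normal(size=(q, k))
    F_true = 2.0 * rng.normal(size=(n, k, k))  # strong common factors
    noise = rng.normal(size=(n, p, q))         # independent specific part
    X = np.einsum("pk,ikl,ql->ipq", R_true, F_true, C_true) + noise
    X = X - X.mean(axis=0)                     # center across samples
    R, C = estimate_loadings(X, k, k)
    _, U = split_common_specific(X, R, C)
    return mean_abs_offdiag_corr(X), mean_abs_offdiag_corr(U)


if __name__ == "__main__":
    before, after = demo()
    print(f"mean |corr| of raw covariates:      {before:.3f}")
    print(f"mean |corr| of specific components: {after:.3f}")
```

In a two-step procedure of the kind the abstract outlines, the residual matrices would then be passed to a sparse matrix regression in place of the raw covariates. That second step is omitted here because the abstract does not specify its penalty or fitting algorithm.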