Chinese title: | 高维异质性和矩阵型数据降维 |
Name: | |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 0714Z2 |
Discipline/major: | |
Student type: | Doctoral |
Degree: | Doctor of Science |
Degree type: | |
Degree year: | 2021 |
Campus: | |
School: | |
Research direction: | Dimension reduction of high-dimensional data |
First advisor's name: | |
First advisor's affiliation: | |
Submission date: | 2021-06-15 |
Defense date: | 2021-05-14 |
English title: | Dimension Reduction of High Dimensional Heterogeneity and Matrix-valued Data |
Chinese keywords: | |
English keywords: | heterogeneous data; matrix-valued data; discretization; dimension reduction |
Chinese abstract: |
Dimension reduction methods for high-dimensional data play a vital role in many fields and have been a focus of recent statistical research. This thesis studies dimension reduction of high-dimensional data from three aspects: heterogeneous data, matrix-valued data, and discretization of continuous variables. Heavy-tailed distributions, outliers, and influential points are frequently encountered in high-dimensional data analysis, and in a general regression model the relationship between the response and the covariates is not simply linear; quantile regression models handle this family of problems well. With the rapid development of data-collection technology, the types of data collected are becoming increasingly diverse. Matrix-valued data have appeared widely in many fields in recent years, for example electroencephalogram (EEG) data and brain functional magnetic resonance imaging (fMRI) data in medicine, and such data have gradually become a focus of statistical research. The rows and columns of matrix-valued data usually carry specific meanings, yet few existing studies examine the row and column effects of matrix-valued data. This thesis studies statistical inference for the row and column effects of high-dimensional matrix-valued data. In addition, discretization of continuous variables is a very important tool in dimension reduction for high-dimensional data, but the bias it introduces has received little attention in the existing literature; this thesis investigates that problem in depth.

First, in high-dimensional quantile regression with a given quantile level τ ∈ (0, 1), the literature usually assumes that the parameter β_τ ∈ R^p is sparse and that the covariates are weakly correlated. The sparsity assumption, however, is hard to verify and may fail in practice, and in many applications the covariates are highly correlated, so that the eigenvalues of the covariance matrix decay rapidly, i.e., the eigenvalues are approximately sparse. Unlike existing methods that estimate all entries of β_τ simultaneously, this thesis focuses on estimating a single entry or a small subset of the entries of β_τ, and proposes a new estimation method: partial quantile regression (PQR) for high-dimensional data. It is found that the approximate sparsity of the eigenvalues is beneficial to the proposed method and greatly relaxes the sparsity assumption on β_τ. The statistical properties of the estimator are established, and simulations and real-data analysis confirm its effectiveness.

Second, statistical inference on the row and column effects of high-dimensional matrix-valued data is an appealing problem. To this end, the thesis introduces a new model and proposes estimators of the row and column effects. For a given row or column, hypothesis tests on its effect are studied and confidence intervals are constructed. The uniform convergence of the effect estimators is also proved. Finally, procedures for selecting important rows and columns are proposed and their theoretical properties are established. Simulations and a real-data example demonstrate the effectiveness of the proposed methods.

Third, discretizing a continuous variable by slicing is a common technique in statistical modeling, especially in dimension reduction for high-dimensional data. It is well known that discretization generally loses information, yet no previous work has systematically studied the resulting bias. This thesis proposes a framework for understanding and comparing the bias introduced by general slicing schemes, and establishes Poincaré-type inequalities for univariate discretization. These inequalities make the effect of discretization explicit: the bias is controlled by two factors, the distance between the distributions obtained with and without discretization, and the smoothness of the conditional-expectation function. Based on these results, the thesis compares the bias of several common slicing schemes and analyzes corresponding examples in variable selection. The general results are further extended to matrix-valued indices and to robust statistics, with examples in variable selection for each.

In summary, this thesis studies methods for dimension reduction of high-dimensional data. It first explores dimension reduction of high-dimensional heterogeneous data via quantile regression; it then, in the study of high-dimensional matrix-valued data, develops novel statistical inference for the row and column effects of a matrix and applies it to selecting important rows and columns; finally, it proposes a general form of the bias caused by discretizing continuous variables and applies it to variable selection. |
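As a minimal illustration of the quantile regression idea underlying the PQR approach (a toy sketch, not the thesis's PQR estimator), the following NumPy example shows that minimizing the empirical check (pinball) loss over a constant recovers the τ-th sample quantile; the simulated data, grid, and quantile level are illustrative assumptions.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile regression check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)
tau = 0.7

# Minimize the empirical check loss over a grid of candidate constants;
# the minimizer (approximately) equals the tau-th sample quantile.
grid = np.linspace(y.min(), y.max(), 2001)
losses = [check_loss(y - c, tau).mean() for c in grid]
best = grid[int(np.argmin(losses))]

print(best, np.quantile(y, tau))  # the two values nearly coincide
```

In a regression setting the constant is replaced by x'β_τ, and the same loss is minimized over β_τ.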
English abstract: |
Dimension reduction of high-dimensional data plays a significant role in many application fields and is a current research focus in statistics. This thesis studies dimension reduction of high-dimensional data from three aspects: heterogeneous data, matrix-valued data, and discretization of continuous variables. Heavy-tailed distributions, outliers, and influential points are often encountered in the study of high-dimensional data, and in a general regression model the relationship between the response and the covariates is not simply linear; quantile regression models can deal with these problems well. Due to the rapid development of data-collection technologies, the types of data collected are becoming more and more diverse. Matrix-valued data have appeared widely in many fields in recent years, such as EEG data and brain functional magnetic resonance imaging data, and have become one of the hotspots in statistical research. The rows and columns of matrix-valued data usually have special meanings, but few studies have examined the row and column effects of such data. This thesis studies inference on the row and column effects of high-dimensional matrix-valued data. Furthermore, discretization is a useful method for dimension reduction of high-dimensional data, but the bias caused by discretization has received little attention in the existing literature; we make an in-depth study of this problem.

First, in high-dimensional quantile regression, it is commonly assumed in the literature that the parameter β_τ ∈ R^p is sparse and that the predictors are weakly correlated, where τ ∈ (0, 1) is a given level. The sparsity assumption, however, is hard to verify in practice and can be violated, and the predictors can be highly correlated in many applications, so that the eigenvalues of the covariance matrix decrease rapidly, that is, the eigenvalues are approximately sparse. Different from the existing estimators in the literature that estimate all entries of β_τ simultaneously, we consider the problem of estimating a single entry or a small part of β_τ, proposing a new estimator called partial quantile regression (PQR). Interestingly, it is found that the approximate sparsity of the eigenvalues can be beneficial in our framework, greatly relaxing the sparsity assumption on β_τ. The statistical properties of the estimator are established. Simulation results and real-data analysis confirm the effectiveness of the estimator.

Second, an appealing problem is to make statistical inference on the effects of rows and columns of high-dimensional matrix-valued data. To this end, we introduce a new model and propose estimators of the effects of rows and columns. Hypothesis testing on the effect of a given row or column is considered, and confidence intervals are constructed. The uniform convergence of the effect estimators is also established. Moreover, procedures are proposed to select significant rows and columns, and theoretical properties are established. Simulation results and real-data analysis demonstrate the effectiveness of the proposed method.

Finally, transforming a continuous variable into a discrete one by slicing its range into intervals is a commonly used technique in statistical modeling, especially in dimension reduction. It is well known, however, that discretization leads to a loss of information; in particular, discretization results in bias, which has not been studied systematically. We propose a framework to understand and compare the bias caused by general slicing strategies. Poincaré-type inequalities are first established for univariate discretization, showing the effect of discretization clearly: the bias is controlled by two factors, the distance between the two distributions generated with and without discretization, and the smoothness of the conditional-expectation function. Based on these results, we compare the bias of several popular slicing schemes and analyze corresponding examples in feature screening. Furthermore, we extend these results to matrix-valued indices and to robust statistics, with examples in feature screening for each.

In conclusion, this thesis mainly studies methods for dimension reduction of high-dimensional data. First, we explore dimension reduction of high-dimensional heterogeneous data using quantile regression. Second, in the study of high-dimensional matrix-valued data, novel inference procedures for the row and column effects of a matrix are developed and applied to selecting important rows and columns. Finally, a general form of the bias caused by discretization is proposed and applied to variable selection. |
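To illustrate the discretization bias discussed above (a toy sketch, not the thesis's Poincaré-type bound), the NumPy example below measures the L2 distance between a smooth conditional mean E[Y|X] and its slice-wise constant approximation. Consistent with the stated result, the bias shrinks as the slicing gets finer, since the smoothness of the conditional mean limits how much is lost within each slice. The model m(x) = sin(2πx), the noise level, and the slice counts are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100_000)
m = np.sin(2 * np.pi * x)                  # smooth conditional mean E[Y | X = x]
y = m + 0.1 * rng.standard_normal(x.size)  # noisy responses

def slicing_bias(n_slices):
    """L2 distance between E[Y|X] and its slice-wise constant approximation."""
    edges = np.linspace(0.0, 1.0, n_slices + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_slices - 1)
    slice_means = np.array([y[idx == h].mean() for h in range(n_slices)])
    return np.sqrt(np.mean((m - slice_means[idx]) ** 2))

coarse, fine = slicing_bias(4), slicing_bias(64)
print(coarse, fine)  # the bias shrinks as the slicing gets finer
```

Equal-width slices are used here for simplicity; other schemes (e.g., equal-frequency slicing) fit the same comparison by changing how `edges` is chosen.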
Total number of references: | 99 |
Library location: | Dissertation reading area, library (main building, south section, 3rd floor, zones B and C) |
Call number: | 博0714Z2/21001 |
Open date: | 2022-06-15 |