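Chinese title: | 分布式样本逆协方差矩阵的估计效率及其应用 |
Name: | |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 071201 |
Major: | |
Student type: | Undergraduate |
Degree: | Bachelor of Science |
Degree year: | 2020 |
University: | Beijing Normal University |
Campus: | |
School: | |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2020-06-10 |
Defense date: | 2020-05-11 |
Foreign-language title: | Distributed Sample Inverse Covariance Matrix: Statistical Efficiency and Applications |
Chinese keywords: | |
Foreign-language keywords: | distributed computation ; covariance matrix ; relative efficiency ; random matrix ; deterministic equivalence |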
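Chinese abstract: |
Distributed statistical learning is important for handling large datasets. This thesis builds a communication-efficient, data-parallel computing framework based on one-step weighted averaging: the dataset is stored in blocks on multiple machines, each machine first computes the inverse covariance matrix of its local sample, and the results are then sent to a central processor and aggregated by a weighted average. Within this framework, the thesis studies how large the error of estimating the population inverse covariance matrix from distributed samples is compared with estimating it directly from the full sample, that is, the relative efficiency of the two estimators. Using the notion of deterministic equivalents from random matrix theory, it derives the limiting form of the relative efficiency in the high-dimensional regime where the sample dimension diverges and is comparable to the sample size. The asymptotic relative efficiency is highest when the dataset is split randomly and equally across machines and the local estimates are aggregated with equal weights, and it decreases as the number of machines grows. The thesis also finds that the distributed framework affects different problems differently: as the number of machines increases, the relative efficiency of OLS estimation declines slowly, that of the mean vector test declines quickly, that of discriminant analysis by voting stays essentially unchanged, and that of discriminant analysis by averaging improves. Finally, numerical simulations verify these theoretical results. |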
Foreign-language abstract: |
Distributed statistical learning from multiple data repositories with minimal communication is increasingly important when dealing with large-scale datasets. In this setup, datasets are partitioned across machines, which compute locally and communicate short messages. This paper studies a simple communication-efficient learning framework that first calculates the local sample inverse covariance matrix on each data subset and then uses a one-step weighted average to combine the local results, in order to achieve the best possible approximation to the global inverse covariance matrix given the whole dataset. How does this compare with computing on the full data? To answer this, the paper studies the relative efficiency of the distributed sample inverse covariance matrix with respect to the whole-sample inverse covariance matrix under this framework. It derives the asymptotic relative efficiency using the calculus of deterministic equivalents from random matrix theory in high dimensions, where the number of parameters is comparable to the training data size. The asymptotic relative efficiency is highest when the dataset is split randomly and equally across machines and the local estimates are combined with equal weights, and it decreases as the number of machines grows. The paper also finds that different problems are affected differently by the distributed framework. As the number of machines increases, the relative efficiency of OLS estimation decreases slowly, the relative efficiency of the hypothesis test for the mean vector decreases much more rapidly, the relative efficiency of discriminant analysis under the voting method is almost constant, and the relative efficiency of discriminant analysis under the averaging method improves. Finally, numerical simulations are used to verify the above theoretical results.
|
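Total references: | 23 |
About the author: | Undergraduate student, class of 2016, statistics major, School of Statistics, Beijing Normal University |
Number of figures: | 6 |
Number of tables: | 1 |
Call number: | 本071201/20041 |
Open-access date: | 2021-06-10 |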
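
The framework described in the abstract (local inverse sample covariance on each machine, then a one-step weighted average) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the thesis's own code: the function names are hypothetical, the data are standard Gaussian with identity population covariance, and relative efficiency is taken to be the ratio of squared Frobenius errors (full sample over distributed), which may differ from the definition used in the thesis.

```python
import numpy as np


def inverse_sample_covariance(x):
    """Inverse of the sample covariance of the rows of x (requires more rows than columns)."""
    return np.linalg.inv(np.cov(x, rowvar=False))


def distributed_inverse_covariance(x, k, weights=None):
    """One-step weighted average of the k local inverse sample covariances."""
    blocks = np.array_split(x, k)        # near-equal split across k "machines"
    if weights is None:
        weights = np.full(k, 1.0 / k)    # equal weights by default
    local = [inverse_sample_covariance(b) for b in blocks]
    return sum(w * m for w, m in zip(weights, local))


def simulate_relative_efficiency(n=4000, p=50, k=4, reps=20, seed=0):
    """Monte Carlo ratio of squared Frobenius errors: full sample / distributed.

    Assumed setting (not taken from the thesis): i.i.d. N(0, I_p) rows,
    so the true inverse covariance is the identity.
    """
    rng = np.random.default_rng(seed)
    truth = np.eye(p)
    err_full = 0.0
    err_dist = 0.0
    for _ in range(reps):
        x = rng.standard_normal((n, p))
        err_full += np.linalg.norm(inverse_sample_covariance(x) - truth, "fro") ** 2
        err_dist += np.linalg.norm(distributed_inverse_covariance(x, k) - truth, "fro") ** 2
    return err_full / err_dist


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, simulate_relative_efficiency(k=k))
```

In this toy setting a ratio below 1 means the distributed estimator has the larger error; with one machine the two estimators coincide, so the ratio is 1.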