Chinese title: | The Effect of Weight Scaling Properties on the Training Process of Residual Networks |
Name: | |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 070101 |
Major: | |
Student type: | Bachelor |
Degree: | Bachelor of Science |
Degree year: | 2023 |
Campus: | |
School: | |
Research direction: | Deep learning |
First supervisor's name: | |
First supervisor's affiliation: | |
Second supervisor's name: | |
Submission date: | 2023-05-22 |
Defense date: | 2023-05-08 |
English title: | The Effect of Weight Scaling Properties on the Training Process of Residual Networks |
Chinese keywords: | |
English keywords: | Weight scaling ; ResNet ; Vanishing and exploding gradient problem |
Chinese abstract: |
Deep learning has developed rapidly in recent years and has achieved great success in computer vision and natural language processing. To alleviate the vanishing and exploding gradient problems that arise as networks grow deeper, residual networks were proposed in 2015 and quickly became the basic structure underlying most subsequent network designs. However, if a residual network is trained without a suitable initialization method, training will still eventually fail because of exploding gradients. It is therefore necessary to analyze how information propagates through residual networks in order to find initialization strategies that effectively resolve these training difficulties. This thesis first analyzes the expectation and variance of the information in forward and backward propagation, and derives the range over which the residual branch may be scaled so that gradients do not explode during propagation. After appropriate scaling, the ratio of the variance of the propagated information to the variance of the initial information is a bounded quantity independent of the network depth, which effectively prevents exploding gradients. Given stable training, different scaling schemes lead to differences in the network's approximation speed and generalization ability; this thesis compares common scaling schemes in these two respects through experiments. For shallow networks, which are not prone to exploding gradients, the scaling schemes proposed to stabilize gradients are no longer effective, and new scaling schemes must be sought. |
English abstract: |
Deep learning has developed rapidly in recent years, with great success in computer vision and natural language processing. To alleviate the exploding and vanishing gradient problem that arises when deepening a network, residual networks were proposed in 2015 and quickly became the basic structure for most subsequent network designs. However, without a proper initialization method, training will eventually fail due to exploding gradients. It is therefore necessary to analyze how information propagates in residual networks in order to find an initialization strategy that resolves these training difficulties. By analyzing the expectation and variance of the information in forward and backward propagation, we obtain the feasible scaling range of the residual branches that prevents gradient explosions during propagation. After such scaling, the ratio of the propagated information variance to the initial information variance is a bounded quantity independent of the network depth, which effectively prevents exploding gradients. On the premise of stable training, networks with different scaling schemes may differ in approximation speed and generalization ability. In this paper, we experimentally compare common scaling schemes in these two respects. It should be noted that for shallow networks, which are not prone to the exploding gradient problem, the scaling schemes proposed for gradient stabilization are no longer effective, and alternative schemes must be found. |
Total number of references: | 23 |
Call number: | 本070101/23223 |
Open access date: | 2024-05-21 |
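A minimal NumPy sketch of the depth-independence claim in the abstracts above, assuming a toy linear residual chain x_{l+1} = x_l + alpha * W_l x_l with He-style initialization and the common 1/sqrt(L) scaling of the residual branch; the block form, the width, and the scaling factor are illustrative assumptions for this record, not the exact construction analyzed in the thesis.

```python
import numpy as np

def variance_ratio(depth, width=256, alpha=1.0, seed=0):
    """Return Var(x_L) / Var(x_0) for a toy residual chain x_{l+1} = x_l + alpha * W_l x_l."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    var0 = x.var()
    for _ in range(depth):
        # Residual branch: a random linear map with He-style 1/width weight variance,
        # so Var(W_l x) is roughly Var(x) at initialization.
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        x = x + alpha * (w @ x)
    return x.var() / var0

if __name__ == "__main__":
    for depth in (10, 100, 400):
        unscaled = variance_ratio(depth, alpha=1.0)
        scaled = variance_ratio(depth, alpha=1.0 / np.sqrt(depth))
        print(f"L={depth:4d}  unscaled ratio={unscaled:10.3e}  "
              f"1/sqrt(L)-scaled ratio={scaled:6.2f}")
```

With alpha = 1 the variance ratio grows roughly like 2^L, whereas with alpha = 1/sqrt(L) it stays near e for every depth, which is the kind of depth-independent bound on the propagated-to-initial variance ratio that the abstract describes.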