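Chinese title: | 基于机器学习的方法对北京地区强降水预报的研究 (Research on Heavy Precipitation Forecasting in the Beijing Area Based on Machine Learning Methods) |
Name: | |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 0705Z2 |
Discipline: | |
Student type: | Master's |
Degree: | Master of Science |
Degree type: | |
Degree year: | 2023 |
Campus: | |
School: | |
Research direction: | Machine learning · precipitation prediction |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2023-06-20 |
Defense date: | 2023-05-31 |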
Foreign-language title: | RESEARCH ON THE PREDICTION OF HEAVY PRECIPITATION IN BEIJING BASED ON MACHINE LEARNING METHODS |
Chinese keywords: | |
Foreign-language keywords: | Random forest ; XGBoost ; Heavy precipitation forecast ; Deviation correction ; Feature engineering |
Chinese abstract: |
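Numerical models forecast rainfall from the results of numerical computation. Because of nonlinearity, precipitation has poor predictability and is difficult to forecast; the precipitation output by a model often deviates from the observed precipitation and needs further correction. This thesis studies the use of machine learning methods to build a heavy precipitation forecast model for the Beijing area. Using numerical forecast products together with automatic weather station and conventional observation data, meteorological features were identified and extracted, and precipitation-related feature variables were constructed. Two machine learning methods, random forest and XGBoost, were used to assess the importance of these features and to select the factors of high importance. Precipitation-category classification models under different sampling schemes and precipitation regression models were built and evaluated. The main work is as follows: 1. A total of 36 feature variables (meteorological, temporal, and geospatial) were used as model inputs, with precipitation category as the output label. The 2018-2020 dataset was randomly split 75%/25% into the training set and test set, and the 2021 dataset was used as the validation set. The importance of the 36 feature variables was then evaluated and ranked: the random forest classification model selected 23 factors of high importance and the XGBoost classification model selected 19, and an interpretive analysis of the most important factors is given. 2. Precipitation classification models were built with random forest and XGBoost. To mitigate class imbalance, the data were processed by both over-sampling and under-sampling. For heavy precipitation, compared with the RMAPS-ST (short-term) model, the random forest classification model built with over-sampling raised the TS score by 43% and the hit rate (POD) by 5.97 times; the XGBoost classification model built with under-sampling raised the TS score by 91.3% and the POD by 5.6 times. Both show a good correction effect. 3. Precipitation regression models were built with random forest and XGBoost. For heavy precipitation, compared with the RMAPS-ST (short-term) model, both the random forest and the XGBoost regression models improved the POD somewhat, but the TS score did not improve noticeably. The reasons for the limited improvement are analyzed and explained, and directions for addressing it in future work are suggested. |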
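The TS score and POD cited in the abstract can be illustrated with a minimal sketch. It uses the standard contingency-table definitions, TS = hits / (hits + misses + false alarms) and POD = hits / (hits + misses). The 25 mm threshold, the sample values, and the function names are hypothetical and are not taken from the thesis.

```python
# Minimal sketch of the TS score and POD for a thresholded precipitation event.
# The threshold and the data are hypothetical illustration values.

def contingency_counts(forecast, observed, threshold):
    """Count hits, misses, and false alarms for events at or above threshold."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        forecast_event = f >= threshold
        observed_event = o >= threshold
        if forecast_event and observed_event:
            hits += 1
        elif observed_event:
            misses += 1
        elif forecast_event:
            false_alarms += 1
    return hits, misses, false_alarms


def threat_score(hits, misses, false_alarms):
    """TS = hits / (hits + misses + false alarms); NaN if no events at all."""
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")


def probability_of_detection(hits, misses):
    """POD = hits / (hits + misses); NaN if the event was never observed."""
    denom = hits + misses
    return hits / denom if denom else float("nan")


# Hypothetical precipitation amounts in mm.
forecast = [0.0, 12.0, 30.0, 5.0, 26.0, 40.0]
observed = [0.0, 28.0, 35.0, 27.0, 3.0, 50.0]
HEAVY_THRESHOLD_MM = 25.0  # assumed threshold, not from the thesis

h, m, fa = contingency_counts(forecast, observed, HEAVY_THRESHOLD_MM)
ts = threat_score(h, m, fa)
pod = probability_of_detection(h, m)
print(h, m, fa, ts, pod)
```

Because false alarms enter the TS denominator but not the POD denominator, a model can raise POD substantially while TS barely moves, which is consistent with the regression result reported in item 3.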
Foreign-language abstract: |
Numerical models predict rainfall from the results of numerical calculation. Because of nonlinearity, precipitation is poorly predictable and difficult to forecast, and the precipitation output by a model often deviates from the actual precipitation and needs further correction. This thesis studies the construction of a heavy precipitation forecast model for Beijing using machine learning methods. Based on numerical forecast products, automatic station data and conventional observations, meteorological features were identified and extracted, and precipitation-related feature variables were constructed. Two machine learning methods, random forest and XGBoost, were used to evaluate the importance of the precipitation features, and the factors of high importance were selected. Classification forecast models under different sampling methods and regression forecast models of precipitation were established and evaluated. The main work is as follows: 1. A total of 36 feature variables, comprising meteorological, time and geospatial variables, were taken as model inputs, and precipitation category was taken as the output label. The 2018-2020 dataset was randomly divided in proportions of 75% and 25% into the training set and test set, and the 2021 dataset was used as the validation set. The importance of the 36 feature variables was then evaluated and ranked. The random forest classification model selected 23 factors of high importance and the XGBoost classification model selected 19, and an explanatory analysis was made of the factors of high importance. 2. The random forest and XGBoost methods were used to establish precipitation classification models. To alleviate sample imbalance, two data processing methods were adopted: over-sampling and under-sampling. Compared with the RMAPS-ST (short-term) model on heavy precipitation, the random forest classification model established with over-sampling increased the TS score by 43% and the POD by 5.97 times. Compared with the RMAPS-ST (short-term) model, the XGBoost classification model established with under-sampling increased the TS score by 91.3% and the POD by 5.6 times. Both have a good correction effect. 3. Precipitation regression models were established with random forest and XGBoost. Compared with the RMAPS-ST (short-term) model on heavy precipitation, both the random forest and the XGBoost regression models improved the POD to a certain extent, but the TS score did not improve significantly. The thesis analyzes and explains the reasons for the small improvement and points out directions for future work. |
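The data handling described in items 1 and 2 can be sketched as follows. The abstract does not say which over-sampling or under-sampling algorithm was used, so this sketch assumes the simplest variants (random duplication of minority samples and random dropping of majority samples). The toy data, class labels, and function names are hypothetical.

```python
# Sketch of a 75%/25% random split followed by random over-/under-sampling.
# Assumption: plain random resampling; the thesis may have used another scheme.
import random
from collections import Counter


def split_train_test(samples, train_fraction=0.75, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def group_by_label(samples):
    """Map each class label to the list of (features, label) pairs carrying it."""
    groups = {}
    for features, label in samples:
        groups.setdefault(label, []).append((features, label))
    return groups


def random_oversample(samples, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_undersample(samples, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical toy data: 1 in 10 samples is labelled "heavy", the rest "none".
samples = [((i,), "heavy" if i % 10 == 0 else "none") for i in range(100)]
train, test = split_train_test(samples)
over = random_oversample(train)
under = random_undersample(train)
print(Counter(label for _, label in train))
print(Counter(label for _, label in over))
print(Counter(label for _, label in under))
```

Resampling is applied to the training part only, so the test part keeps the original class proportions and the scores computed on it are not distorted by the balancing step.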
Total number of references: | 85 |
Library call number: | 硕0705Z2/23030 |
Open-access date: | 2024-06-20 |