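Chinese title: | 基于时间序列分解与LSTM神经网络组合的近地面污染物预测研究 (Near-Surface Pollutant Prediction Based on Combining Time Series Decomposition with LSTM Neural Networks) |
Name: | |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 0705Z2 |
Discipline/major: | |
Student type: | Doctoral |
Degree: | Doctor of Science |
Degree type: | |
Degree year: | 2022 |
Campus: | |
School: | |
First supervisor name: | |
First supervisor affiliation: | |
Second supervisor name: | |
Submission date: | 2022-06-01 |
Defense date: | 2022-06-01 |
Foreign-language title: | A NOVEL HYBRID FRAMEWORK FOR AIR POLLUTANT CONCENTRATION FORECASTING USING TIME SERIES DECOMPOSITION METHODS AND LSTM NEURAL NETWORK |
Chinese keywords: | |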
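Chinese abstract: |
In recent years, air quality problems have become increasingly prominent and heavy pollution episodes more frequent, affecting production and daily life and harming human health. With respiratory diseases on the rise, research on regional pollutant forecasting is particularly important. Air quality prediction and analysis based on big data and artificial intelligence has become a research hotspot. Because air-quality time series are dynamic and nonlinear, a growing number of researchers are trying data-driven models to predict air quality. At present, most machine-learning-based prediction and assessment of air pollutants relies on feedforward neural networks, with algorithmic optimization built on top of them. However, traditional feedforward networks ignore the temporal characteristics of the data and handle sequentially correlated time series poorly; moreover, non-stationary, highly variable pollutant series increase the uncertainty of the prediction model and degrade its generalization and applicability. Second, existing studies usually build models for one or a few ground observation stations, which limits regional quantification and constrains prediction in areas with no stations or only sparse ones. The spatial effects of pollutants are also very important in gridded pollutant prediction, so whether other driving data can support quantitative spatiotemporal pollutant prediction in areas without observation stations is a question of concern. In summary, building a pollutant prediction model that learns efficiently, generalizes well, is widely applicable, and does not rely solely on ground observations for training would both supplement existing numerical prediction models and offer new ideas for research on AI-based pollution prediction. To address these scientific problems and research bottlenecks, this study focuses on the following aspects:
(1) Construction of pollutant datasets from satellite big data and WRF-CHEM model simulation: To address the limitations of station-driven prediction models, this thesis constructs pollutant datasets from satellite big data estimation and model simulation output, and divides them into regional pollutant series that serve as the driving data for the regional pollutant prediction models. Near-surface PM2.5 concentration is estimated from satellite big data using the MODIS MAIAC AOD product as the main parameter, with MEIC emission inventory data, ERA5 meteorological data, surface characterization data, and social production indicators as auxiliary data. Considering the spatiotemporal heterogeneity of PM2.5, the random forest algorithm is used to estimate near-surface PM2.5 concentration. To obtain full-coverage PM2.5 results, inverse distance weighting constrained by elevation difference is combined with ground stations for spatial interpolation, yielding a long regional time series of PM2.5 concentration. Validation shows an overall RMSE below 20 μg/m³, a mean bias (MB) between -10.0 and 10.0 μg/m³, and an index of agreement (IOA) above 0.9. A single satellite can hardly capture several pollutants at once, and sensor observation mode and temporal resolution restrict estimates to the overpass time (polar-orbiting satellites) or observation period (geostationary satellites), whereas atmospheric chemistry models can simulate multiple pollutants. This thesis therefore uses the WRF-CHEM numerical model to simulate PM2.5, O3, and NO2 concentrations and extracts long regional pollutant time series as the training dataset for the pollutant prediction models. Validation shows that the simulated daily-mean and hourly results for the three pollutants agree well with ground stations (IOA > 0.9), with generally low RMSE and MB (RMSE < 25 μg/m³, |MB| < 7.0 μg/m³).
(2) Construction of combined models of time series decomposition and the long short-term memory (LSTM) neural network: The LSTM neural network, which retains historical information, is combined with three decomposition methods, STL (Seasonal-trend decomposition procedure based on Loess), WT (Wavelet Transform), and EEMD (Ensemble Empirical Mode Decomposition), giving STL-LSTM, EEMD-LSTM, and WT-LSTM. The pollutant series is first decomposed in time; each decomposed component is then predicted three steps ahead together with ERA5 meteorological elements (temperature, humidity, and wind speed); finally, the component predictions are reconstructed into the pollutant forecast for the next three days. In single-component prediction, the STL remainder component and the first EEMD component are predicted poorly. To improve predictive ability, a secondary WT decomposition is applied to these poorly predicted components, giving the secondary-decomposition combined models EEMD-WT-LSTM and STL-WT-LSTM. With PM2.5 estimated from satellite big data as training data, the improved EEMD-WT-LSTM and STL-WT-LSTM models perform well: for the next-day forecast, R² reaches 0.97 and 0.95, RMSE (MAE) is only 6.76 (4.59) μg/m³ and 7.92 (5.47) μg/m³, and predictions with relative error within ±25% account for 87.95% and 86.15%, respectively; for the third day, overall R² exceeds 0.79 and RMSE (MAE) is below 17.0 (12.0) μg/m³.
(3) Further validation of the constructed models with the three pollutants output by the WRF-CHEM numerical model: WRF-CHEM is used to output three pollutants (PM2.5, O3, and NO2) and the meteorological elements temperature, humidity, and wind speed. Forecasts of daily-mean PM2.5, O3, and NO2 for the next three days further show that the overall performance of the five combined models ranks STL-WT-LSTM > EEMD-WT-LSTM > WT-LSTM > EEMD-LSTM > STL-LSTM. The STL-WT-LSTM secondary-decomposition model also shows strong generalization and robustness when forecasting PM2.5, O3, and NO2 over the next 24 hours. Although multi-step prediction error grows with the forecast horizon, multi-step prediction, unlike single-step prediction, does not have to predict each next time step in a loop from the previous predicted value, which improves prediction efficiency. Within a certain time range, multi-step prediction is also relatively stable overall, which reflects the advantage of the combined models.
(4) Spatiotemporal gridded prediction of near-surface pollutants based on PM2.5 output by the WRF-CHEM numerical model: Building on the work above and considering the spatial correlation of pollutants, the inverse distance weighting algorithm is added to the combined regional time series decomposition and LSTM models to produce gridded predictions of the regional spatiotemporal distribution of PM2.5. PM2.5 output by WRF-CHEM from November 1, 2018, to November 30, 2018, is used as the forecast case. Continuous forecasts of PM2.5 concentration for the next three days show that the STL-WT-LSTM-IDW and EEMD-WT-LSTM-IDW combined models inherit the advantages of the secondary-decomposition models, describe the temporal and spatial variation of PM2.5 better than the other combined models, and agree well with ground observations at individual grid cells. Combined with HYSPLIT backward trajectories, the STL-WT-LSTM-IDW forecast of the daily-mean spatiotemporal variation of PM2.5 concentration from November 12, 2018, to November 16, 2018, effectively reflects the process from PM2.5 accumulation to dissipation and captures the regional changes in high- and low-pollution areas well. |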
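
The validation statistics quoted in the abstract (RMSE, MB, IOA, and the share of predictions within ±25% relative error) can be illustrated with a minimal sketch. It assumes the standard textbook definitions, with IOA taken as Willmott's index of agreement; the function names and the sample values are hypothetical and are not taken from the thesis.

```python
import math

def rmse(obs, pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mean_bias(obs, pred):
    """Mean bias (MB): average of prediction minus observation."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)

def index_of_agreement(obs, pred):
    """Willmott's index of agreement (assumed definition of IOA)."""
    obs_mean = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - obs_mean) + abs(o - obs_mean)) ** 2
              for o, p in zip(obs, pred))
    return 1.0 - num / den

def share_within(obs, pred, tol=0.25):
    """Fraction of predictions whose relative error is within +/- tol."""
    hits = sum(1 for o, p in zip(obs, pred) if abs(p - o) <= tol * abs(o))
    return hits / len(obs)

# Hypothetical PM2.5 values in ug/m3, for illustration only.
observed = [35.0, 50.0, 80.0, 120.0]
predicted = [40.0, 45.0, 85.0, 110.0]

print(rmse(observed, predicted))
print(mean_bias(observed, predicted))
print(index_of_agreement(observed, predicted))
print(share_within(observed, predicted))
```

A negative MB means the predictions are lower than the observations on average, and an IOA of 1 indicates perfect agreement.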
Foreign-language abstract: |
In recent years, air quality problems have become increasingly prominent and heavy pollution weather more frequent, which not only affects production and daily life but also damages human health. With respiratory diseases on the rise, research on regional pollutant prediction is very important. The prediction and analysis of air quality based on big data and artificial intelligence has become a research hotspot. Since time series data related to air quality are dynamic and nonlinear, more and more researchers are trying to use data-driven models to predict air quality. At present, the prediction and evaluation of air pollutants with machine learning algorithms mostly rely on feedforward neural networks, with algorithmic optimization on that basis. However, traditional feedforward neural networks ignore the temporal characteristics of the data and struggle with time series that have sequential dependence. Moreover, non-stationary, highly variable pollutant time series increase the uncertainty of the prediction model and weaken its generalization and applicability. Second, existing research usually builds models for one or a few ground observation stations, which limits regional quantification; in areas with no or sparse stations, the prediction model is constrained. The spatial effect of pollutants is also very important in grid-based pollutant prediction. Therefore, whether other driving data can be used to quantitatively predict the spatiotemporal distribution of pollutants in areas without observation stations is also a question of concern. In summary, constructing a pollutant prediction model with high learning efficiency, strong generalization, and wide applicability that does not rely solely on ground observations for training not only supplements existing numerical prediction models but also provides new ideas for the study of pollution prediction based on artificial intelligence. Given the above scientific problems and research bottlenecks, this study focuses on the following aspects:
(1) Construction of pollutant datasets based on satellite big data and WRF-CHEM model simulation: Given the limitations of station-driven prediction models, this paper constructs pollutant datasets based on satellite big data estimation and atmospheric chemistry model output and divides them into regional pollutant time series as the driving data for the regional pollutant prediction model. The near-ground PM2.5 concentration estimation based on satellite big data uses the MODIS MAIAC AOD product as the main parameter, with MEIC emission inventory data, ERA5 meteorological data, surface characterization data, and social production characterization as auxiliary data. Considering the temporal and spatial heterogeneity of PM2.5, the random forest algorithm is used to estimate the PM2.5 concentration near the ground. To obtain full coverage of PM2.5 concentration, the inverse distance weighting (IDW) method constrained by elevation difference is used together with ground stations for spatial interpolation, yielding a long regional time series of PM2.5 concentration. After verification, the overall RMSE was less than 20 μg/m³, the MB was between -10.0 and 10.0 μg/m³, and the IOA was above 0.9. It is difficult for a single satellite to obtain information on multiple pollutants at the same time, and, owing to the limitations of sensor observation methods and temporal resolution, pollutants can only be estimated at the satellite overpass time (polar-orbiting satellites) or observation period (geostationary satellites). The atmospheric chemistry model, by contrast, can simulate a variety of pollutants. Therefore, this paper uses the WRF-CHEM numerical model to simulate the concentrations of PM2.5, O3, and NO2 and extracts long regional pollutant time series as training data for the pollutant prediction model. After verification, the simulated daily-average and hourly results of the three pollutants are in good agreement with the ground stations (IOA > 0.9), and the RMSE and MB are generally low (RMSE < 25 μg/m³, |MB| < 7.0 μg/m³).
(2) Construction of combined models of time series decomposition and the long short-term memory (LSTM) neural network: The LSTM neural network, with its memory of historical information, is combined with the STL (Seasonal-trend decomposition procedure based on Loess), WT (Wavelet Transform), and EEMD (Ensemble Empirical Mode Decomposition) decomposition methods to obtain STL-LSTM, EEMD-LSTM, and WT-LSTM. First, the pollutant time series is decomposed; then, combined with ERA5 meteorological elements (temperature, humidity, and wind speed), a three-step prediction is carried out on each decomposed component; finally, the component predictions are reconstructed into the pollutant forecast for the next three days. In single-component prediction, the prediction of the STL residual component and of the first EEMD component is not ideal. To improve the predictive ability of the model, a secondary WT decomposition is applied to the poorly predicted components to construct secondary-decomposition combined models (EEMD-WT-LSTM and STL-WT-LSTM). The PM2.5 estimated from satellite big data was used as the training data of the models. The improved EEMD-WT-LSTM and STL-WT-LSTM models showed good prediction results: the R² of the next-day prediction reached 0.97 and 0.95, and the RMSE (MAE) was only 6.76 (4.59) μg/m³ and 7.92 (5.47) μg/m³. Predictions with a relative error within ±25% accounted for 87.95% and 86.15%, respectively. For the third day, the overall R² was greater than 0.79 and the RMSE (MAE) was less than 17.0 (12.0) μg/m³.
(3) Further verification of the constructed models with the three pollutants output by WRF-CHEM: WRF-CHEM was used to output three pollutants (PM2.5, O3, and NO2) and meteorological elements (temperature, humidity, and wind speed). The predictions of the daily mean values of PM2.5, O3, and NO2 for the next three days further show that the overall performance of the five combined models ranks STL-WT-LSTM > EEMD-WT-LSTM > WT-LSTM > EEMD-LSTM > STL-LSTM. When the STL-WT-LSTM model was used to predict PM2.5, O3, and NO2 over the next 24 hours, it also showed strong generalization and robustness. Although the multi-step prediction error increases with the forecast horizon, multi-step prediction, unlike single-step prediction, does not need to predict each next time step in a loop from the previous predicted value, which improves prediction efficiency. Within a certain period, the multi-step forecast is also relatively stable on the whole, which reflects the advantages of the combined model.
(4) Spatiotemporal grid prediction of near-surface PM2.5 based on a dataset simulated by WRF-CHEM: Based on the above research and considering the spatial correlation of pollutants, the inverse distance weighting algorithm is added to the combined model of regional time series decomposition and the LSTM neural network to predict the spatiotemporal distribution of PM2.5 on a grid. The PM2.5 simulated by WRF-CHEM from November 1, 2018, to November 30, 2018, was used as the forecast case. The continuous predictions of PM2.5 concentration for the next three days showed that the STL-WT-LSTM-IDW and EEMD-WT-LSTM-IDW combined models inherit the advantages of the secondary-decomposition combined models: they describe the temporal and spatial variation of PM2.5 concentration better than the other combined models, and individual grid cells are in good agreement with ground observations. Combined with the HYSPLIT (Hybrid Single-Particle Lagrangian Integrated Trajectory model) backward trajectory, the STL-WT-LSTM-IDW prediction of the daily-average spatiotemporal variation of PM2.5 concentration from November 12, 2018, to November 16, 2018, effectively reflects the process of PM2.5 from growth to dissipation and captures the changes of high- and low-pollution areas in the region well. |
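
Inverse distance weighting, which the abstract uses both for gap-filling and for the gridded forecasts, can be sketched as follows. This is plain IDW only: the elevation-difference constraint mentioned in the abstract is not specified there and is therefore omitted. The function name, the power parameter of 2, and the sample points are assumptions for illustration, not the thesis implementation.

```python
import math

def idw_interpolate(known_points, target, power=2.0):
    """Plain inverse distance weighting (no elevation constraint).

    known_points: list of (x, y, value) tuples with known concentrations.
    target: (x, y) location at which to estimate the value.
    power: distance exponent; 2.0 is a common default, assumed here.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in known_points:
        dist = math.hypot(x - target[0], y - target[1])
        if dist == 0.0:
            # The target coincides with a known point: return its value.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical grid cells with predicted PM2.5 (ug/m3), illustration only.
cells = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0)]
print(idw_interpolate(cells, (1.0, 0.0)))  # midpoint: equal weights
print(idw_interpolate(cells, (0.5, 0.0)))  # closer to the first cell
```

Nearer points dominate the estimate, so a location midway between two cells receives their average, while a location close to one cell stays close to that cell's value.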
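Total number of references: | 199 |
Holding location: | Library thesis reading area (Main Library, South Zone, 3rd floor, Area BC) |
Call number: | 博0705Z2/22002 |
Release date: | 2023-06-01 |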