Paper title

Imputation of missing sub-hourly precipitation data in a large sensor network: a machine learning approach

Authors

Chivers, Benedict Delahaye; Wallbank, John; Cole, Steven J.; Sebek, Ondrej; Stanley, Simon; Fry, Matthew; Leontidis, Georgios

Abstract

Precipitation data collected at sub-hourly resolution represents specific challenges for missing data recovery by being largely stochastic in nature and highly unbalanced in the duration of rain vs non-rain. Here we present a two-step analysis utilising current machine learning techniques for imputing precipitation data sampled at 30-minute intervals by devolving the task into (a) the classification of rain or non-rain samples, and (b) regressing the absolute values of predicted rain samples. Investigating 37 weather stations in the UK, this machine learning process produces more accurate predictions for recovering precipitation data than an established surface fitting technique utilising neighbouring rain gauges. Increasing available features for the training of machine learning algorithms increases performance with the integration of weather data at the target site with externally sourced rain gauges providing the highest performance. This method informs machine learning models by utilising information in concurrently collected environmental data to make accurate predictions of missing rain data. Capturing complex non-linear relationships from weakly correlated variables is critical for data recovery at sub-hourly resolutions. Such pipelines for data recovery can be developed and deployed for highly automated and near instantaneous imputation of missing values in ongoing datasets at high temporal resolutions.
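The two-step idea in the abstract, (a) classify each 30-minute sample as rain or non-rain and (b) regress an amount only for samples classified as rain, can be illustrated with a minimal sketch. The abstract does not name the algorithms or features used, so everything concrete here is an assumption: a tiny k-nearest-neighbour model stands in for both the classifier and the regressor, and the two features (a neighbouring gauge reading in mm and relative humidity as a fraction) as well as all data values are made up for illustration.

```python
from typing import List, Optional, Sequence

Row = Sequence[float]


def knn_mean(train_x: List[Row], train_y: List[float], query: Row, k: int) -> float:
    """Mean target of the k training rows closest to `query` (squared Euclidean distance)."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    nearest = [target for _, target in ranked[:k]]
    return sum(nearest) / len(nearest)


class TwoStepImputer:
    """Step (a): classify rain vs non-rain. Step (b): regress the amount for rain samples.

    k-NN is a stand-in chosen for brevity, not the method used in the paper.
    """

    def __init__(self, k: int = 3) -> None:
        self.k = k

    def fit(self, X: List[Row], y: List[float]) -> "TwoStepImputer":
        self.X = list(X)
        # Binary labels for the classifier: 1.0 where any rain was recorded.
        self.wet = [1.0 if amount > 0 else 0.0 for amount in y]
        # The regressor is trained on the rain samples only.
        self.rain_X = [x for x, amount in zip(X, y) if amount > 0]
        self.rain_y = [amount for amount in y if amount > 0]
        return self

    def predict(self, x: Row) -> float:
        # Step (a): the fraction of wet neighbours acts as a rain probability.
        p_rain = knn_mean(self.X, self.wet, x, self.k)
        if p_rain < 0.5 or not self.rain_y:
            return 0.0
        # Step (b): estimate the amount from rain samples only.
        return knn_mean(self.rain_X, self.rain_y, x, self.k)


def impute(series: List[Optional[float]], features: List[Row], model: TwoStepImputer) -> List[float]:
    """Fill gaps (None) in a precipitation series; observed values are kept unchanged."""
    return [model.predict(f) if value is None else value for value, f in zip(series, features)]


# Hypothetical training data: (neighbouring gauge in mm, relative humidity as a fraction).
train_X = [
    (0.0, 0.60), (0.0, 0.65), (0.0, 0.70), (0.0, 0.72), (0.2, 0.90), (0.0, 0.75),
    (1.0, 0.95), (1.4, 0.96), (2.0, 0.97), (0.8, 0.93),
]
train_y = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.2, 1.6, 2.2, 0.8]  # target gauge, mm per 30 min

model = TwoStepImputer(k=3).fit(train_X, train_y)

# A short series with two gaps, plus the features observed at the same time steps.
series = [0.0, None, 1.6, None]
feats = [(0.0, 0.62), (0.0, 0.68), (1.4, 0.96), (1.2, 0.95)]
filled = impute(series, feats, model)
print(filled)
```

Splitting the task this way keeps the many zero-rain samples out of the regression step, which is one plausible way to handle the imbalance between rain and non-rain durations that the abstract describes. In practice the stand-in models would be replaced by whatever classifier and regressor perform best on the station data.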
