Prediction method based on machine learning and data augmentation for population relocation demand during floods

Mulin WANG, Wei LÜ, Xiaoting YANG, Ting YANG, Yajing ZHANG

Journal of Tsinghua University (Science and Technology), 2026, Vol. 66, Issue 1: 160-168.

Abstract

Objective: This study addresses the prediction of the number of people to be evacuated (the relocation number) during flood disasters. Accurate predictions of the relocation number are vital for timely resource allocation and efficient disaster management, particularly in flood-prone areas where rapid decision-making can substantially mitigate a disaster's adverse impacts.

Methods: The study developed a relocation number prediction framework that combines feature selection and data augmentation with the extreme gradient boosting (XGBoost) model, a widely used gradient-boosting machine learning algorithm. The model was built on historical data from flood events across China between 2014 and 2018; each record included meteorological and geographical features and the relocation number for the event. Feature selection used Shapley additive explanations (SHAP), a game-theoretic method for measuring the contribution of each feature to the model predictions, and the selected features were then used to train the XGBoost model. To address the limited number of training samples, a data augmentation strategy was introduced: Gaussian noise was injected using a weighted k-nearest neighbors method to generate synthetic data points that preserve the local structure of the data, thereby enhancing the model's robustness and generalization ability.

Results: The XGBoost model performs well with the selected features and augmented data. Trained on the original small dataset, the model reaches satisfactory accuracy but generalizes poorly. After data augmentation, its performance improves markedly, especially for extreme values in the data. In the testing phase, R² rises from 0.854 to 0.967 and the root mean square error (RMSE) falls from 0.296 to 0.123, indicating a substantial gain in predictive accuracy and a considerable reduction in prediction error. The SHAP-guided feature selection identifies several key predictors of population relocation demand. The two most influential are the maximum 3-day cumulative rainfall (MCR) and the maximum cumulative rainfall over the 15 days prior to the event (MRPE).

Conclusions: The proposed framework, which integrates SHAP-based feature selection with data augmentation, is an effective tool for forecasting the relocation number during flood disasters. After Bayesian hyperparameter tuning and data augmentation, the XGBoost model shows improved prediction accuracy and robustness. Better generalization to unseen data supports accurate predictions even in regions with limited historical data. The framework can therefore serve as a decision-support tool for emergency response teams, providing more reliable forecasts for resource allocation and evacuation planning and helping to reduce the impact of flood disasters on human lives and infrastructure.
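The augmentation step described in the Methods can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function name `knn_gaussian_augment`, the inverse-distance neighbor weights, the use of the weighted local spread as the noise scale, and the parameters `k`, `noise_scale`, and `n_copies` are all hypothetical and are not taken from the paper.

```python
import numpy as np


def knn_gaussian_augment(X, y, k=5, noise_scale=0.1, n_copies=1, seed=0):
    """Illustrative kNN-weighted Gaussian noise augmentation (not the paper's code).

    For each sample, the k nearest neighbors are found, weighted by inverse
    distance, and used to estimate a local spread per feature and for the
    target. Synthetic samples are the original samples plus Gaussian noise
    scaled by that local spread, so noise is small in dense regions.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    k = min(k, n - 1)  # a sample cannot be its own neighbor

    # Pairwise Euclidean distances; the diagonal is masked out.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)

    # Indices and distances of the k nearest neighbors of each sample.
    nbr_idx = np.argsort(dist, axis=1)[:, :k]
    nbr_dist = np.take_along_axis(dist, nbr_idx, axis=1)

    # Inverse-distance weights, normalized to sum to 1 per sample (assumed scheme).
    w = 1.0 / (nbr_dist + 1e-12)
    w /= w.sum(axis=1, keepdims=True)

    # Weighted local spread of features and target around each sample.
    dx = X[nbr_idx] - X[:, None, :]                      # shape (n, k, d)
    sx = np.sqrt((w[:, :, None] * dx ** 2).sum(axis=1))  # shape (n, d)
    dy = y[nbr_idx] - y[:, None]                         # shape (n, k)
    sy = np.sqrt((w * dy ** 2).sum(axis=1))              # shape (n,)

    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        X_parts.append(X + rng.normal(size=X.shape) * noise_scale * sx)
        y_parts.append(y + rng.normal(size=y.shape) * noise_scale * sy)
    return np.vstack(X_parts), np.concatenate(y_parts)
```

In this sketch the original samples are kept as the first rows of the output and `n_copies` noisy replicas are appended, so the augmented set could then be passed to any regressor. Scaling the noise by the local neighbor spread is one plausible way to "preserve the local structure of the data"; the authors may have used a different scheme.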

Key words

heavy rainfall and flooding disaster / extreme gradient boosting (XGBoost) / relocation number prediction / feature selection / data augmentation

Cite this article

Mulin WANG, Wei LÜ, Xiaoting YANG, et al. Prediction method based on machine learning and data augmentation for population relocation demand during floods[J]. Journal of Tsinghua University (Science and Technology), 2026, 66(1): 160-168. https://doi.org/10.16511/j.cnki.qhdxxb.2025.22.034

Footnotes

Data availability statement

All data used in this article are available from the first author upon reasonable request.

RIGHTS & PERMISSIONS

All rights reserved. Unauthorized reproduction is prohibited.