Objective: To establish a stroke prediction and feature analysis model that integrates XGBoost and SHAP to aid the clinical diagnosis and prevention of stroke. Methods: Using an open data set from Kaggle, we applied data preprocessing and grid-search parameter optimization, built an interpretable stroke risk prediction model by integrating XGBoost and SHAP, and performed an explanatory analysis of risk factors. Results: The XGBoost model achieved an accuracy of 96.71%, a sensitivity of 93.83%, a specificity of 99.59%, and an area under the receiver operating characteristic (ROC) curve (AUC) of 99.19%. The explanatory analysis showed that age, type of residence, and history of hypertension were key factors affecting the incidence of stroke. Conclusion: On this data set, the established model can be used to identify stroke, and the SHAP-based explanatory analysis increases the transparency of the model and helps medical practitioners assess its reliability.
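The four evaluation measures reported in the Results (accuracy, sensitivity, specificity, and AUC) can be illustrated with a minimal sketch. It is not the authors' code: the function names (`confusion_counts`, `roc_auc`, `evaluate`), the 0.5 decision threshold, and the eight-sample toy data are assumptions made for illustration only, and the AUC is computed with the pairwise-ranking (Mann-Whitney) formulation rather than any particular library routine.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels, where 1 = stroke and 0 = no stroke."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Compute the four measures from predicted probabilities.

    The 0.5 threshold is an assumption; the source does not state one.
    """
    auc = roc_auc(y_true, scores)  # also validates that both classes are present
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate (recall on stroke cases)
        "specificity": tn / (tn + fp),  # true negative rate
        "auc": auc,
    }


# Toy data (hypothetical, not taken from the Kaggle data set).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

metrics = evaluate(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the four figures are usually reported together.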