Explaining XGBoost predictions with SHAP value: a comprehensive guide to interpreting decision tree-based models
Abstract
In sectors where data and data science are crucial, it is often important to understand which factors affect Key Performance Indicators (KPIs) and how they affect them. Machine learning is used to model and predict the relevant KPIs for this purpose. Interpretability is nevertheless essential to fully understand how a model generates its predictions: it enables users to identify which features have contributed to the model’s ability to learn from the data. SHAP (SHapley Additive exPlanations) values have emerged as a practical approach for evaluating the contribution of input features to model learning; they offer an index of the influence of each feature on the model’s predictions. This paper demonstrates that, when SHAP values are used with decision tree-based models, which are frequently used for tabular data, the contribution of features to model learning can be precisely estimated.
Keywords: SHAP value, machine learning, decision tree-based model, feature importance
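To make the idea of per-feature contributions concrete, the following is a minimal sketch of an exact Shapley value computation for a toy decision tree. It is not the paper's method or the TreeSHAP algorithm used by SHAP libraries: the tree, the feature names (`price`, `promo`), the background data, and the helper functions are all hypothetical, and absent features are handled by averaging over the background rows (an interventional assumption). It only illustrates the additive property that the contributions plus a base value recover the prediction.

```python
from itertools import combinations
from math import factorial


def tree_predict(x):
    """Hypothetical toy decision tree over two features (illustrative only)."""
    if x["price"] <= 10:
        return 100.0 if x["promo"] == 1 else 60.0
    return 20.0


# Hypothetical background data used to average out "absent" features.
BACKGROUND = [
    {"price": 5, "promo": 0},
    {"price": 5, "promo": 1},
    {"price": 15, "promo": 0},
    {"price": 15, "promo": 1},
]


def value(subset, x):
    """Mean prediction when features in `subset` take x's values
    and all other features take each background row's values."""
    total = 0.0
    for row in BACKGROUND:
        mixed = {k: (x[k] if k in subset else row[k]) for k in x}
        total += tree_predict(mixed)
    return total / len(BACKGROUND)


def shapley_values(x):
    """Exact Shapley values by enumerating every coalition of the other features."""
    features = list(x)
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        contrib = 0.0
        for size in range(n):
            # Classic Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                with_f = value(set(s) | {f}, x)
                without_f = value(set(s), x)
                contrib += weight * (with_f - without_f)
        phi[f] = contrib
    return phi


x = {"price": 5, "promo": 1}
phi = shapley_values(x)
base_value = value(set(), x)  # mean prediction over the background rows

print("base value:", base_value)
print("contributions:", phi)
print("base + contributions:", base_value + sum(phi.values()))
print("model prediction:", tree_predict(x))
```

Exhaustive enumeration grows exponentially with the number of features, so it is only workable for toy cases like this one; tree-specific algorithms exist precisely to avoid that cost.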
This work is licensed under a Creative Commons Attribution 4.0 International License.