Explaining XGBoost predictions with SHAP value: a comprehensive guide to interpreting decision tree-based models

Serap Ergün

doi:10.3846/ntcs.2023.17901

Abstract

In sectors where data and data science are crucial, it is often important to understand which factors affect Key Performance Indicators (KPIs) and how. Machine learning is used to model and predict the relevant KPIs for this purpose. Interpretability is nevertheless essential for fully understanding how a model generates its predictions: it enables users to pinpoint which features have helped the model learn from the data. SHAP (SHapley Additive exPlanations) has emerged as a practical approach for evaluating the contribution of input features to model learning; SHAP values offer an index of the influence of each feature on the model's predictions. This paper demonstrates that the contribution of features to model learning can be precisely estimated when SHAP values are used with decision tree-based models, which are frequently used to model tabular data.

Keywords: SHAP value, machine learning, decision tree-based model, feature importance
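
To make the idea of a per-feature contribution index concrete, here is a minimal sketch of exact Shapley values computed by brute force for a tiny hand-written decision tree. It is an illustration only, not the paper's code and not the TreeSHAP algorithm: the tree, the feature names `age` and `income`, the thresholds, and the background rows are all hypothetical, and absent features are filled in from the background rows (one common convention among several).

```python
from itertools import combinations
from math import factorial


def toy_tree(x):
    """Hypothetical depth-2 decision tree over two made-up features."""
    if x["age"] < 30:
        return 10.0 if x["income"] < 50 else 20.0
    return 30.0 if x["income"] < 50 else 60.0


def value(model, x, background, subset):
    """Mean prediction when features in `subset` are fixed to x's values
    and every other feature is taken from each background row in turn."""
    total = 0.0
    for row in background:
        mixed = {f: (x[f] if f in subset else row[f]) for f in x}
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values by enumerating all feature coalitions.
    Cost grows exponentially with the number of features."""
    features = list(x)
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        contrib = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                with_f = value(model, x, background, set(s) | {f})
                without_f = value(model, x, background, set(s))
                contrib += weight * (with_f - without_f)
        phi[f] = contrib
    return phi


# Hypothetical background data and the instance to explain.
background = [
    {"age": 25, "income": 40},
    {"age": 25, "income": 70},
    {"age": 45, "income": 40},
    {"age": 45, "income": 70},
]
x = {"age": 45, "income": 70}

baseline = value(toy_tree, x, background, set())  # mean prediction: 30.0
phi = shapley_values(toy_tree, x, background)     # age: 17.5, income: 12.5
print(baseline, phi, toy_tree(x))
```

The contributions add up to the gap between the prediction for `x` and the baseline (17.5 + 12.5 = 60 - 30), which is the additivity property that makes SHAP values usable as a feature-influence index. Enumerating coalitions is only feasible for a handful of features; tree-specific algorithms exist precisely to avoid this cost.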

How to Cite

Ergün, S. (2023). Explaining XGBoost predictions with SHAP value: a comprehensive guide to interpreting decision tree-based models. New Trends in Computer Sciences, 1(1), 19–31. https://doi.org/10.3846/ntcs.2023.17901

Published in Issue

Apr 11, 2023

This work is licensed under a Creative Commons Attribution 4.0 International License.
