Academic Paper

Interpretability in HealthCare: A Comparative Study of Local Machine Learning Interpretability Techniques
Document Type
Conference
Source
2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), pp. 275-280, Jun. 2019
Subject
Bioengineering
Robotics and Control Systems
Signal Processing and Analysis
Measurement
Predictive models
Machine learning
Testing
Diabetes
Data models
Training
Black-Box Model
Machine Learning Interpretability
Model-Agnostic Interpretability
Language
English
ISSN
2372-9198
Abstract
Although complex machine learning models (e.g., Random Forest, Neural Networks) commonly outperform traditional, simpler interpretable models (e.g., Linear Regression, Decision Tree), clinicians in the healthcare domain find it hard to understand and trust these complex models because their predictions come with little intuition or explanation. With the new General Data Protection Regulation (GDPR), the plausibility and verifiability of predictions made by machine learning models have become essential. To tackle this challenge, several machine learning interpretability techniques have recently been developed and introduced. In general, these techniques aim to shed light on the prediction process of machine learning models and to explain how their predictions are produced. In practice, however, assessing the quality of the explanations provided by the various interpretability techniques remains an open question. In this paper, we present a comprehensive experimental evaluation of three recent and popular local model-agnostic interpretability techniques, namely LIME, SHAP, and Anchors, on different types of real-world healthcare data. Our evaluation compares the techniques along several dimensions: identity, stability, separability, similarity, execution time, and bias detection. The results of our experiments show that LIME achieves the lowest performance for the identity metric and the highest performance for the separability metric across all datasets included in this study. On average, SHAP has the smallest time to output an explanation across all datasets. For bias detection, SHAP enables participants to detect the bias most effectively.
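Illustrative Code Sketch
To make the identity metric concrete (identical inputs should receive identical explanations), the following minimal Python sketch, which is not the authors' evaluation code, applies LIME and SHAP to a random forest trained on scikit-learn's breast-cancer data as a stand-in healthcare classification task; the dataset, model, and parameter choices here are assumptions for demonstration only. Because LIME fits its local surrogate from random perturbations, two runs on the same instance can yield different feature weights, whereas SHAP's TreeExplainer attributions are deterministic for a fixed model, which is consistent with the low identity score reported for LIME above.

# Hypothetical sketch only; datasets/models are stand-ins, not the paper's setup.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# LIME: fits a local linear surrogate around one instance via random sampling.
lime_explainer = LimeTabularExplainer(X_train, mode="classification")
exp1 = lime_explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
exp2 = lime_explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
# Two runs on the *same* instance may disagree, probing the identity property.
print("LIME run 1:", exp1.as_list())
print("LIME run 2:", exp2.as_list())

# SHAP: exact tree-path attributions, deterministic for a fixed model.
# (Return shape for multi-class forests varies across shap versions.)
shap_explainer = shap.TreeExplainer(model)
sv1 = shap_explainer.shap_values(X_test[:1])
sv2 = shap_explainer.shap_values(X_test[:1])
print("SHAP identity holds:", np.allclose(np.asarray(sv1), np.asarray(sv2)))

Anchors (e.g., via the anchor or alibi packages) follows a similar explain-one-instance pattern; it is omitted here to keep the sketch self-contained.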