Academic Paper

Fairwashing Explanations with Off-Manifold Detergent

Document Type

Working Paper

Author

Anders, Christopher J.; Pasliev, Plamen; Dombrowski, Ann-Kathrin; Müller, Klaus-Robert; Kessel, Pan

Source

Subject

Computer Science - Machine Learning
Statistics - Machine Learning

Language

Abstract

Explanation methods promise to make black-box classifiers more transparent. As a result, it is hoped that they can act as proof for a sensible, fair and trustworthy decision-making process of the algorithm and thereby increase its acceptance by the end-users. In this paper, we show both theoretically and experimentally that these hopes are presently unfounded. Specifically, we show that, for any classifier $g$, one can always construct another classifier $\tilde{g}$ which has the same behavior on the data (same train, validation, and test error) but has arbitrarily manipulated explanation maps. We derive this statement theoretically using differential geometry and demonstrate it experimentally for various explanation methods, architectures, and datasets. Motivated by our theoretical insights, we then propose a modification of existing explanation methods which makes them significantly more robust.
Comment: 22 pages with 43 figures, to be published in ICML 2020
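
The central claim of the abstract can be illustrated with a toy sketch. This is not the authors' construction or code; it is a minimal example under stated assumptions: the data lie on a one-dimensional manifold (the line x2 = 0) embedded in two dimensions, the classifier is linear, and the explanation method is the plain input gradient. The names `g`, `g_tilde`, `w`, and `c` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data on a 1-D manifold embedded in 2-D: every sample has x2 = 0.
X = np.stack([rng.normal(size=200), np.zeros(200)], axis=1)

# Hypothetical weights of the original classifier g (an assumption).
w = np.array([1.0, 0.0])


def g(x):
    """Original classifier logit; uses only the on-manifold coordinate."""
    return x @ w


def g_tilde(x, c=5.0):
    """Manipulated classifier: adds a term that vanishes wherever x2 = 0."""
    return x @ w + c * x[..., 1]


def grad_g(x):
    """Gradient explanation of g (constant, since g is linear)."""
    return np.broadcast_to(w, x.shape)


def grad_g_tilde(x, c=5.0):
    """Gradient explanation of g_tilde; gains an off-manifold component c."""
    return np.broadcast_to(w + np.array([0.0, c]), x.shape)


# Identical behavior on the data, so train/validation/test error is unchanged.
same_logits = np.allclose(g(X), g_tilde(X))
same_predictions = np.array_equal(np.sign(g(X)), np.sign(g_tilde(X)))

# The explanations nevertheless differ by c in the off-manifold direction.
explanation_gap = np.abs(grad_g(X) - grad_g_tilde(X)).max()

print(same_logits, same_predictions, explanation_gap)
```

Because the added term is zero on every data point, no evaluation on the data can tell the two models apart, while the size of the gradient difference is set freely by `c`. In this toy setting, restricting the gradient to the on-manifold direction would remove the discrepancy, which is in the spirit of the more robust explanation methods the abstract mentions.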

Online Access

Open Access (arXiv)
