Academic Paper

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Document Type

Working Paper

Author

Jin, Qiao; Chen, Fangyuan; Zhou, Yiliang; Xu, Ziyang; Cheung, Justin M.; Chen, Robert; Summers, Ronald M.; Rousseau, Justin F.; Ni, Peiyun; Landsman, Marc J.; Baxter, Sally L.; Al'Aref, Subhi J.; Li, Yijia; Chen, Alex; Brejt, Josef A.; Chiang, Michael F.; Peng, Yifan; Lu, Zhiyong

Source

npj Digital Medicine, 2024

Subject

Computer Science - Computer Vision and Pattern Recognition
Computer Science - Artificial Intelligence
Computer Science - Computation and Language

Language

Abstract

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations focused primarily on the accuracy of multiple-choice answers alone. Our study extends the current scope by comprehensively analyzing GPT-4V's rationales across image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well on cases that physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the need for further in-depth evaluation of its rationales before such multimodal AI models are integrated into clinical workflows.

Online Access

Open Access (arXiv)