Academic Paper

Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale
Document Type
Article
Source
Canadian Journal of Emergency Medicine; January 2024, Vol. 26 Issue: 1 p40-46, 7p
ISSN
14818035; 14818043
Abstract
Purpose: The release of the ChatGPT prototype to the public in November 2022 drastically lowered the barrier to using artificial intelligence by providing easy access to a large language model through a simple web interface. One situation where ChatGPT could be useful is in triaging patients arriving at the emergency department. This study aimed to address the research question: “Can emergency physicians use ChatGPT to accurately triage patients using the Canadian Triage and Acuity Scale (CTAS)?” Methods: Six unique prompts were developed independently by five emergency physicians. An automated script was used to query ChatGPT with each of the 6 prompts combined with 61 validated and previously published patient vignettes. Thirty repetitions of each combination were performed for a total of 10,980 simulated triages. Results: A CTAS score was returned in 99.6% of the 10,980 queries. However, there was considerable variation in the results. Repeatability (use of the same prompt repeatedly) was responsible for 21.0% of overall variation. Reproducibility (use of different prompts) was responsible for 4.0% of overall variation. The overall accuracy of ChatGPT in triaging simulated patients was 47.5%, with a 13.7% under-triage rate and a 38.7% over-triage rate. More extensively detailed prompt text was associated with greater reproducibility but only a minimal increase in accuracy. Conclusions: This study suggests that the current ChatGPT large language model is not sufficient for emergency physicians to triage simulated patients using the Canadian Triage and Acuity Scale, owing to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may frequently provide false information.
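The query design and outcome categories described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' script: the function names (`build_query_plan`, `classify`, `triage_rates`) and the toy data are hypothetical, and no ChatGPT call is made. It assumes the usual CTAS convention that level 1 is the most urgent and level 5 the least urgent, so a returned level numerically higher than the reference level counts as under-triage and a lower one as over-triage.

```python
from itertools import product

# Study design figures taken from the abstract.
N_PROMPTS, N_VIGNETTES, N_REPEATS = 6, 61, 30


def build_query_plan(n_prompts=N_PROMPTS, n_vignettes=N_VIGNETTES,
                     n_repeats=N_REPEATS):
    """Enumerate every (prompt, vignette, repetition) index combination."""
    return list(product(range(n_prompts), range(n_vignettes), range(n_repeats)))


def classify(predicted, reference):
    """Label one triage result (assumes CTAS 1 = most urgent, 5 = least)."""
    if predicted == reference:
        return "correct"
    # A numerically higher level means the patient was rated less urgent.
    return "under" if predicted > reference else "over"


def triage_rates(pairs):
    """Return the fraction of correct, under- and over-triaged results."""
    counts = {"correct": 0, "under": 0, "over": 0}
    for predicted, reference in pairs:
        counts[classify(predicted, reference)] += 1
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


plan = build_query_plan()
print(len(plan))  # 6 * 61 * 30 = 10980

# Hypothetical (predicted, reference) pairs, for illustration only.
toy_results = [(2, 2), (3, 2), (1, 2), (4, 4)]
print(triage_rates(toy_results))
```

On the toy data, two of the four results match the reference level, one is under-triaged, and one is over-triaged. These numbers are illustrative and unrelated to the study's reported rates.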