Academic Article

Abstract 1122‐000089: Characterization of Critical Sequelae in Ischemic Stroke Using Natural Language Processing
Document Type
article
Source
Stroke: Vascular and Interventional Neurology, Vol 1, Iss S1 (2021)
Subject
Acute Stroke
New Technique
Stroke
Diagnostic Neuroradiology
Neurology. Diseases of the nervous system
RC346-429
Diseases of the circulatory (Cardiovascular) system
RC666-701
Language
English
ISSN
2694-5746
Abstract
Introduction: Automated processing of electronic health data to classify complications of ischemic stroke serves numerous purposes, including improved electronic phenotyping for clinical research. Here, we present a natural language processing (NLP) approach to identify critical findings in acute ischemic stroke from unstructured radiology reports of computed tomography (CT) and magnetic resonance imaging (MRI).
Methods: Text reports of CT and MRI scans taken from 2292 patients admitted for large (>1/2 middle cerebral artery territory), acute anterior circulation ischemic stroke were gathered from a single‐institution retrospective cohort. Reports were reviewed and labelled for the presence of hemorrhagic conversion, intracerebral edema, midline shift, intraventricular hemorrhage, and parenchymal hematoma as defined by European Cooperative Acute Stroke Study PH1 and PH2 categories. For binary classifications, we quantified co‐occurrence of individual words within reports using two separate NLP methods: Bag‐of‐Words (BOW) and Term Frequency‐Inverse Document Frequency (TF‐IDF). We then trained Lasso regression, random forest, and neural network classifiers to predict all complications based on word co‐occurrence. Classifier performance was measured by area under the receiver operating characteristic curve (AUC) using five separate folds of an internal test dataset. To predict midline shift as a continuous outcome, we developed a semantic rule‐based system (RBS) based on regular radiographic report expressions. This system was tested using an external validation dataset of 1472 acute large anterior circulation stroke reports from a separate hospital.
Results: A total of 2292 reports were fully labelled for the presence of all stroke complications. Lasso regression consistently displayed the best discrimination among all models. For BOW and TF‐IDF, Lasso yielded respective AUCs of 0.894 and 0.919 (hemorrhagic conversion), 0.935 and 0.950 (intracerebral edema), 0.968 and 0.963 (midline shift), 0.933 and 0.904 (intraventricular hemorrhage), and 0.873 and 0.879 (parenchymal hematoma). All models were well‐calibrated to underlying complication rates. The RBS also performed strongly in quantifying midline shift, with a mean absolute error (MAE) of 0.103 mm, sensitivity of 99.1%, and specificity of 97.5% in the original cohort. In the external validation set of 1472 additional stroke reports, the same system achieved an MAE of 0.126 mm, sensitivity of 99.5%, and specificity of 97.5% for midline shift. Wilcoxon rank sum testing on bootstrapped samples confirmed no statistically significant differences in RBS performance between institutions when comparing MAE (p = 0.918), sensitivity (p = 0.152), and specificity (p = 0.929).
Conclusions: A machine learning pipeline based on Lasso regression successfully identified critical complications of large anterior circulation ischemic stroke from unstructured radiology reports, while our RBS quantified midline shift with a high degree of generalized accuracy between different institutions. We propose that these systems may warrant prospective validation in care settings and data mining for stroke research.
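Illustrative sketch (not part of the published abstract): the abstract does not list the rules used by the RBS, so the Python example below only shows, under stated assumptions, how a rule-based extractor might read a midline shift measurement from report text and how MAE could then be computed. The patterns `NEGATED` and `MEASURED` and the functions `extract_midline_shift_mm` and `mean_absolute_error` are hypothetical names introduced for illustration; they are not the authors' implementation.

```python
import re
from typing import Optional, Sequence

# Hypothetical rule: an explicit negation within a few words of "midline shift".
NEGATED = re.compile(
    r"\b(?:no|without)\s+(?:\w+\s+){0,2}?midline\s+shift\b",
    re.IGNORECASE,
)

# Hypothetical rule: a number with a unit either after or before "midline shift",
# within the same sentence (the gap may not contain a period or another digit).
MEASURED = re.compile(
    r"\bmidline\s+shift\b[^.\d]*?(?P<a>\d+(?:\.\d+)?)\s*(?P<ua>mm|cm)\b"
    r"|(?P<b>\d+(?:\.\d+)?)\s*(?P<ub>mm|cm)\b[^.\d]*?\bmidline\s+shift\b",
    re.IGNORECASE,
)


def extract_midline_shift_mm(report: str) -> Optional[float]:
    """Return midline shift in mm, 0.0 if negated, None if not mentioned."""
    if NEGATED.search(report):
        return 0.0
    match = MEASURED.search(report)
    if match is None:
        return None
    value = match.group("a") or match.group("b")
    unit = (match.group("ua") or match.group("ub")).lower()
    millimetres = float(value)
    # Normalise centimetres to millimetres.
    return millimetres * 10.0 if unit == "cm" else millimetres


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean of |predicted - actual| over paired values."""
    pairs = list(zip(predicted, actual))
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up example sentences, not drawn from the study data.
    examples = [
        "There is 6 mm of rightward midline shift.",
        "Midline shift of 1.2 cm to the left.",
        "No evidence of midline shift.",
    ]
    for text in examples:
        print(text, "->", extract_midline_shift_mm(text))
```

A negation-first rule like this one is deliberately simple: a sentence that negates one finding and then measures a shift a word or two later would be misread as 0 mm, which is the kind of case a production rule set would need to handle explicitly.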