Academic Article
Learning debiased graph representations from the OMOP common data model for synthetic data generation.
Document Type
Article
Author
Schulz, Nicolas Alexander; Carus, Jasmin; Wiederhold, Alexander Johannes; Johanns, Ole; Peters, Frederik; Rath, Natalie; Rausch, Katharina; Holleczek, Bernd; Katalinic, Alexander; Nennecke, Alice; Kusche, Henrik; Heinrichs, Vera; Eberle, Andrea; Luttmann, Sabine; Abnaof, Khalid; Kim-Wanner, Soo-Zin; Handels, Heinz; Germer, Sebastian; Halber, Marco; Richter, Martin
Source
Subject
*REPRESENTATIONS of graphs
*MEDICAL informatics
*DATA modeling
*NURSING informatics
*MARKOV processes
ELECTRONIC health record standards
Language
ISSN
1471-2288
Abstract
Background: Generating synthetic patient data is crucial for medical research, but common approaches build on black-box models that do not allow for expert verification or intervention. We propose a highly available method that generates synthetic data from real patient records in a privacy-preserving and compliant fashion, is interpretable, and allows for expert intervention. Methods: Our approach ties together two established tools in medical informatics: OMOP, a data standard for electronic health records, and Synthea, a data synthesis method. For this study, data pipelines were built that extract data from OMOP, convert them into time series format, learn temporal rules using two statistical algorithms (Markov chain, TARM) and three causal discovery algorithms (DYNOTEARS, J-PCMCI+, LiNGAM), and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. Results: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found not to be applicable to the problem at hand. Conclusion: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. Because causal discovery is a method for debiasing purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable. [ABSTRACT FROM AUTHOR]
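Illustrative sketch (not taken from the article): the Markov chain step named in the abstract can be pictured as counting how often one clinical event follows another in each patient's time-ordered record, then turning the counts into transition probabilities on a graph. The event names, the function names `learn_transitions` and `to_graph`, and the output dictionary layout below are all assumptions made for illustration; the layout is only loosely inspired by a state-transition graph and is not the actual Synthea module schema or the authors' pipeline.

```python
from collections import Counter, defaultdict


def learn_transitions(sequences):
    """Estimate first-order Markov transition probabilities.

    `sequences` is a list of time-ordered event lists, one per patient
    (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each observed (current event -> next event) pair.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    # Normalize the counts of each state into probabilities.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


def to_graph(transitions):
    """Convert transition probabilities into a simple state graph (assumed layout)."""
    states = {}
    for state, followers in transitions.items():
        states[state] = {
            "transitions": [
                {"to": nxt, "probability": p}
                for nxt, p in sorted(followers.items())
            ]
        }
        # States that never have a successor become terminal nodes.
        for nxt in followers:
            states.setdefault(nxt, {"transitions": []})
    return states


# Hypothetical toy records; real input would come from an OMOP extract.
sequences = [
    ["diagnosis", "surgery", "follow_up"],
    ["diagnosis", "chemotherapy", "follow_up"],
    ["diagnosis", "surgery", "radiotherapy", "follow_up"],
]

transitions = learn_transitions(sequences)
graph = to_graph(transitions)
```

Because every observed pair of consecutive events becomes an edge, a graph built this way grows quickly with the number of distinct events, which is consistent with the abstract's remark that the Markov chain yields extremely large graphs.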