Academic Article

Topical hidden genome: discovering latent cancer mutational topics using a Bayesian multilevel context-learning approach.
Document Type
Academic Journal
Author
Chakraborty S; Department of Biostatistics, State University of New York at Buffalo, Buffalo, NY 14214, USA.
Guan Z; Biostatistics Center, Mass General Research Institute, Boston, MA 02114, USA.
Begg CB; Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA.
Shen R; Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA.
Source
Publisher: Biometric Society; Country of Publication: England; NLM ID: 0370625; Publication Model: Print; Cited Medium: Internet; ISSN: 1541-0420 (Electronic); Linking ISSN: 0006341X; NLM ISO Abbreviation: Biometrics; Subsets: MEDLINE
Language
English
Abstract
Inferring the cancer-type specificities of ultra-rare, genome-wide somatic mutations is an open problem. Traditional statistical methods cannot handle such data because of their ultra-high dimensionality and extreme sparsity. To harness information in rare mutations, we recently proposed a formal multilevel multilogistic "hidden genome" model. Through its hierarchical layers, the model condenses information in ultra-rare mutations through meta-features embodying mutation contexts to characterize cancer types. Consistent, scalable point estimation of the model can incorporate tens of millions of variants across thousands of tumors and permit impressive prediction and attribution. However, principled statistical inference is infeasible because of the volume, correlation, and noninterpretability of mutation contexts. In this paper, we propose a novel framework that leverages topic models from computational linguistics to reduce the dimension of mutation contexts, producing interpretable, decorrelated meta-feature topics. We propose an efficient MCMC algorithm for implementation that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of existing out-of-the-box inferential high-dimensional multi-class regression methods and software. Applying our model to the Pan Cancer Analysis of Whole Genomes dataset reveals interesting biological insights, including somatic mutational topics associated with UV exposure in skin cancer, aging in colorectal cancer, and a strong influence of epigenome organization in liver cancer. Under cross-validation, our model demonstrates highly competitive predictive performance against the black-box methods of random forest and deep learning.
(© The Author(s) 2024. Published by Oxford University Press on behalf of The International Biometric Society.)