Integration of Multiple High-dimensional Genomic Data Types for Survival Prediction
Year of Publication
Wong, Kin Yau
Large-scale genomic projects, including The Cancer Genome Atlas (TCGA), have collected many types of genomic data from patients with various cancer types, including copy number variation, somatic mutation, methylation, and mRNA, miRNA, and protein expression data. Integrating information across these data types may improve outcome prediction. Because genomic data are typically high dimensional, variable selection procedures such as the lasso and elastic net are commonly used. However, traditional variable selection procedures do not distinguish among data types and instead treat all predictors equally. To account for differences across data types, we propose a variable selection method called Integrative Boosting (I-Boost), which is based on statistical boosting. I-Boost considers each data type separately, so that small but predictive data types are not dominated by larger ones. I-Boost can be implemented efficiently using existing software packages for the elastic net. We show by simulation that I-Boost achieves better variable selection and prediction performance than the lasso and elastic net, especially when the signal is unevenly distributed across data types. Finally, we compare I-Boost with the lasso and elastic net on a TCGA data set and show that it provides more accurate survival prediction.
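As a rough illustration of the idea of boosting over data types, the following Python sketch fits an elastic net to each data type separately at every iteration and updates only the data type that most reduces the current residual error. It is not the author's implementation. It assumes a continuous outcome with squared-error loss rather than a survival model, it uses a small hand-written coordinate-descent elastic net in place of an existing package, and the names `elastic_net_cd`, `blockwise_boost`, `lam`, `alpha`, `step`, `n_iter`, and `n_sweeps` are hypothetical.

```python
import numpy as np


def elastic_net_cd(X, r, lam, alpha, n_sweeps=20):
    """Coordinate descent for
    (1/2n)||r - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2).
    Stand-in for a packaged elastic net solver (assumption for this sketch)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * x_j' x_j for each column
    resid = r.astype(float).copy()         # r - X b, with b = 0 at the start
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (excluding b[j]).
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            # Soft-threshold (L1 part), then shrink (L2 part).
            new = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)
            new /= col_sq[j] + lam * (1.0 - alpha)
            resid -= X[:, j] * (new - b[j])
            b[j] = new
    return b


def blockwise_boost(blocks, y, lam=0.1, alpha=0.5, step=0.5,
                    n_iter=20, n_sweeps=20):
    """Boosting sketch: one elastic net per data type per iteration.

    blocks: list of (n x p_k) arrays, one per data type (hypothetical layout).
    Returns one coefficient vector per data type; the intercept is y.mean().
    """
    y = np.asarray(y, dtype=float)
    coefs = [np.zeros(X.shape[1]) for X in blocks]
    resid = y - y.mean()
    for _ in range(n_iter):
        best = None
        for k, X in enumerate(blocks):
            # Fit the current residual using data type k only.
            b = elastic_net_cd(X, resid, lam, alpha, n_sweeps)
            sse = np.sum((resid - X @ b) ** 2)
            if best is None or sse < best[0]:
                best = (sse, k, b)
        _, k, b = best
        # Take a damped step for the best data type and leave the others unchanged.
        coefs[k] += step * b
        resid -= step * (blocks[k] @ b)
    return coefs
```

Because each data type is fitted and scored on its own, a small data type that carries most of the signal can be selected for updating even when a much larger data type is present. A pooled lasso or elastic net fit, in contrast, penalizes all predictors together.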