Session 28: Statistical Learning and Genomics – Conference on Statistical Learning and Data Science / Nonparametric Statistics

Session title: Statistical Learning and Genomics
Organizer: Ji Zhu (Umich)
Chair: Bing Li (Penn State)
Time: June 5th, 1:15pm – 2:45pm
Location: VEC 1403

Speech 1: Proteomics and Genomics Integration for Translational Cancer Research
Speaker: Umut Ozbek (Mount Sinai)
Abstract: Advances in biomedical research bring the opportunity to gather data on various platforms. Integrating these diverse data to understand complex biological systems has been a major challenge for statisticians. We propose a novel statistical tool, spaceMap, a conditional graphical model that learns the conditional dependency relationships between two types of high-dimensional omic profiles through a penalized multivariate regression framework. spaceMap infers an undirected graph among response variables in tandem with a directed graph encoding perturbations from predictor variables on the response network. In addition, it uses cross-validation and model aggregation to reduce the false discovery rate and consequently to improve reproducibility. We applied spaceMap to the copy number alteration (CNA), gene expression, and proteomics datasets from the CPTAC-TCGA ovarian cancer study. The results help to pinpoint crucial cancer genes and provide insights into the functional consequences of important CNAs in the disease.

Speech 2: What can we gain from proteogenomics prediction? The downstream analysis of NCI-CPTAC Proteogenomics DREAM Challenge
Speaker: Xiaoyu Song (MSSM)
Abstract: Background: Proteins are complex macromolecules responsible for nearly every task of cellular life, and thus play an essential role in the formation, progression, and metastasis of cancer. A community-based collaborative competition, the NCI-CPTAC DREAM Proteogenomics Challenge, is developing computational tools to answer “Can one predict abundance of any given protein from mRNA and genetic data?” in sub-challenge 2 and “Can one predict phosphoprotein abundances from protein abundance?” in sub-challenge 3. Methods and Results: In pairwise Pearson correlation analyses, we found that the correlation between true protein/phosphoprotein abundances and their predicted scores from the top-performing models varies dramatically from protein to protein. We therefore investigated the biological factors that influence prediction performance. We also applied the well-predicted proteins/phosphosites to 317 independent samples of ovarian cancer in TCGA for protein prediction and to 105 independent samples of ovarian cancer in CPTAC for phosphoprotein prediction. We found that the pathways most significantly associated with overall survival were repeatedly identified across the top-performing models, in both the models' training samples and the independent samples. The utility of these prediction models in drug sensitivity analyses, cell lines, and trans-cancer models has also been investigated. Conclusion: Proteogenomics prediction shows promise for improving our understanding of the molecular mechanisms of human cancer.
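
The per-protein Pearson correlation mentioned in this abstract can be illustrated with a minimal sketch. This is not the Challenge's scoring code: the function names, the protein labels, and the toy values below are all hypothetical, chosen only to show how one correlation per protein is computed across samples.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sxx * syy)


def per_protein_correlation(observed, predicted):
    """Return {protein: Pearson r across samples} for proteins present in both inputs."""
    return {
        protein: pearson(observed[protein], predicted[protein])
        for protein in observed
        if protein in predicted
    }


# Toy, made-up abundances for two hypothetical proteins over four samples.
observed = {"PROT_A": [1.0, 2.0, 3.0, 4.0], "PROT_B": [1.0, 2.0, 3.0, 4.0]}
predicted = {"PROT_A": [2.0, 4.0, 6.0, 8.0], "PROT_B": [4.0, 3.0, 2.0, 1.0]}

scores = per_protein_correlation(observed, predicted)
for protein, r in sorted(scores.items()):
    print(f"{protein}: r = {r:.2f}")
```

In this toy input one protein is predicted in perfect agreement and the other in perfect disagreement, which mimics, in exaggerated form, the protein-to-protein variation in prediction quality that the abstract describes.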

Speech 3: An empirical comparison of deep neural networks and other supervised learning methods
Speaker: Wei Pan (U of Minnesota)
Abstract: Deep convolutional neural networks (CNNs) have been proposed for supervised classification of high-throughput microscopy images to predict protein subcellular localization. We consider several CNN architectures in addition to a few traditional supervised learning methods such as random forests and gradient boosting. We compare their empirical performance when applied to a large dataset.