Session 6: Statistical methods of integrating -omics data – Conference on Statistical Learning and Data Science / Nonparametric Statistics

Session title: Statistical methods of integrating -omics data
Organizer: Wei, Ying (Columbia U)
Chair: X. Song (Mount Sinai)
Time: June 4th, 9:00-10:30am
Location: VEC 1403

Speech 1: A Statistical Framework for Leveraging Information across Multiple Traits in Genetic Studies
Speaker: Gen Li (Columbia)
Abstract: In genetic studies, pleiotropy occurs when a genetic variant affects multiple traits simultaneously. The true effect sizes for different traits usually have significant correlations. Most existing genome-wide association studies only focus on one trait at a time and fail to leverage the relationships between different traits. Motivated by the multi-tissue expression quantitative trait loci (eQTL) analysis in the Genotype-Tissue Expression (GTEx) project, we develop a two-stage method to address this limitation. It effectively borrows strength across multiple traits to identify genetic variants regulating a target trait. The method is based on summary statistics and allows genotype data to partially overlap between traits. We apply the proposed method to the GTEx data and identify more eQTLs with potential functionality.

Speech 2: A new method to study the change of miRNA–mRNA interactions due to environmental exposures
Speaker: Pei Wang (Icahn School of Medicine at Mount Sinai)
Abstract: Integrative approaches characterizing the interactions among different types of biological molecules have been demonstrated to be useful for revealing informative biological mechanisms. One such example is the interaction between microRNA (miRNA) and messenger RNA (mRNA), whose deregulation may be sensitive to environmental insult leading to altered phenotypes. In this work, we introduce a new network approach, integrative Joint Random Forest (iJRF), which characterizes the regulatory system between miRNAs and mRNAs using a network model. iJRF is designed to work under the high-dimension low-sample-size regime, and can borrow information across different treatment conditions to achieve more accurate network inference. It also effectively takes into account prior information on miRNA–mRNA regulatory relationships from existing databases. We then apply iJRF to data from an animal experiment designed to investigate the effect of low-dose environmental chemical exposure on normal mammary gland development. We detected a few important miRNAs that regulated a large number of mRNAs in the control group but not in the exposed groups, suggesting the disruption of miRNA activity due to chemical exposure. Effects of chemical exposure on two affected miRNAs were further validated using human breast cancer cell lines.

Speech 3: smFARM: sparse multivariate Factor Analysis Regression Model in integrative genomics analysis
Speaker: Peter Song (UMich)
Abstract: The multivariate regression model is a useful tool to explore complex associations between multiple response variables (e.g. gene expressions) and multiple predictors (e.g. SNPs). When the multiple responses are correlated, ignoring such dependency will impair statistical power in the data analysis. Motivated by an integrative genomic dataset, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which the covariance of the response variables is modeled by a factor analysis model with latent factors. The proposed method not only allows us to address the challenge that the number of genetic predictors is larger than the sample size, but also to adjust for unobserved genetic and/or non-genetic factors that potentially conceal the underlying real response-predictor associations. The proposed smFARM is implemented efficiently by utilizing the strength of the EM algorithm and the group-wise coordinate descent algorithm. In addition, the identified latent factors are explained by means of gene enrichment analysis. The proposed methodology is evaluated and compared to the existing methods through extensive simulation studies. We apply smFARM in an integrative genomics analysis of a breast cancer dataset on the relationship between DNA copy numbers and gene expression arrays to derive genetic regulatory patterns relevant to breast cancer.
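
For readers unfamiliar with this model class, a generic factor analysis regression can be sketched as below. This is only an illustration of the general structure the abstract describes, not necessarily the speaker's exact formulation; the symbols $B$, $\Lambda$, $f_i$, $\Psi$ and the number of factors $k$ are notation assumed here, with $y_i$ the vector of responses and $x_i$ the vector of predictors for sample $i$.

```latex
\begin{aligned}
y_i &= B^\top x_i + \Lambda f_i + \varepsilon_i,
  \qquad f_i \sim N(0, I_k), \quad
  \varepsilon_i \sim N(0, \Psi), \quad \Psi \text{ diagonal}, \quad
  f_i \perp \varepsilon_i, \\
\operatorname{Cov}(y_i \mid x_i) &= \Lambda \Lambda^\top + \Psi .
\end{aligned}
```

In such a formulation, the latent factors $f_i$ absorb unobserved sources of correlation among the responses, and sparsity would typically be imposed through a penalty on the coefficient matrix $B$ so that estimation remains feasible when the number of predictors exceeds the sample size.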