Poster session

Tuesday, June 5th, 5:00 pm – 6:30 pm
VEC 401 multi-purpose room

Sponsored by 

A name marked with * indicates the presenter when there is more than one author.

  1. Title: The Blessings of Multiple Causes: Causal Inference without Strong Ignorability
    Author: Yixin Wang (Columbia), David M. Blei (Columbia)

    Abstract: Causal inference from observational data often assumes “strong ignorability”: that all confounders are observed. This assumption is standard yet untestable. However, many scientific studies involve multiple causes, different variables whose effects are simultaneously of interest. We propose the deconfounder, an algorithm that combines unsupervised machine learning and predictive model checking to perform causal inference in multiple-cause settings. The deconfounder infers a latent variable as a substitute for unobserved confounders and then uses that substitute to perform causal inference. We develop theory for when the deconfounder leads to unbiased causal estimates, and show that it requires weaker assumptions than classical causal inference. We analyze its performance in three types of studies: semi-simulated data on smoking and lung cancer, semi-simulated data on genome-wide association studies, and a real dataset about actors and movie revenue. The deconfounder provides a checkable approach to estimating close-to-truth causal effects.
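The deconfounder's two-stage logic can be illustrated with a toy simulation (a hypothetical sketch, not the authors' implementation): fit a one-factor model to the causes alone to get a substitute confounder, then adjust for it in the outcome regression.

```python
# Hypothetical illustration of the deconfounder idea: a latent factor
# inferred from the causes serves as a substitute confounder.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, m = 500, 10
z = rng.normal(size=(n, 1))              # unobserved confounder
A = rng.normal(size=(n, m)) + z          # all causes share the confounder
y = A @ np.ones(m) + 3 * z[:, 0] + rng.normal(size=n)   # true effects = 1

# Step 1: infer a substitute confounder from the causes alone.
z_hat = FactorAnalysis(n_components=1, random_state=0).fit_transform(A)

# Step 2: estimate effects with and without the substitute.
naive = LinearRegression().fit(A, y)
adjusted = LinearRegression().fit(np.hstack([A, z_hat]), y)
print(naive.coef_[:m].mean(), adjusted.coef_[:m].mean())
```

In this toy setup the adjusted coefficients sit closer to the true value of 1 than the naive ones, because the shared factor absorbs most of the confounding.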


  1. Title: Which Genes Are Really Causing My Problems? Using Filtering with LASSO and Elastic Net to Find the Signal in Ultra High Dimensional Data
    Author: Jill Lundell (Utah State University)
    Abstract: LASSO and elastic net have been used in GWAS to identify SNPs that are linked to a specific disease. However, GWAS data are typically very wide, which creates problems for LASSO. It is also not clear how to effectively select the parameters alpha and lambda for this type of problem. This project examines the efficacy of various filter methods for selecting a trimmed dataset that can then have LASSO and elastic net applied for variable selection. Different methods for selecting alpha and lambda are also explored.
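The filter-then-penalize pipeline described above can be sketched as follows (a minimal, hypothetical example on simulated data, not the author's code; `l1_ratio` plays the role of alpha in scikit-learn's parameterization):

```python
# Sketch: trim an ultra-wide design matrix by marginal correlation,
# then apply cross-validated elastic net to the trimmed data.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(1)
n, p, k = 200, 5000, 500            # wide data: p >> n
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 2.0  # 5 truly associated "SNPs"
y = X @ beta + rng.normal(size=n)

# Filter step: rank variables by absolute marginal correlation with y.
cors = np.abs((X - X.mean(0)).T @ (y - y.mean())) / (n * X.std(0) * y.std())
keep = np.argsort(cors)[-k:]        # trimmed dataset of k columns

# Penalization step on the trimmed data; lambda chosen by CV.
fit = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X[:, keep], y)
selected = keep[fit.coef_ != 0]
print(sorted(selected[:20]))
```

With a strong simulated signal, the five true variables survive the filter and receive nonzero elastic net coefficients.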
  2. Title: Statistical Methods for Leveraging Public Controls in Epigenome-Wide Association Study
    Author: Ziqiao Wang*(MD Anderson), Donghui Li (MD Anderson), Peng Wei (MD Anderson)
    Abstract: It is challenging to perform high-dimensional hypothesis testing in the high-throughput genomics setting, which features a large number of variables (p) and a small sample size (n). To boost statistical power and reduce the cost of biological experiments, we propose a new statistical strategy that leverages public controls in a case-control study with limited sample size. In our case study, we conducted a genome-wide DNA methylation analysis of pancreatic cancer with 44 cases and 20 frequency-matched controls from the MD Anderson Cancer Center (MDACC). All data were collected with epigenome-wide methylation profiling of whole blood covering >850,000 CpG sites. We increased the number of controls from 20 to 633 by combining public data, whole-blood 450K methylation data from the Framingham Heart Study, with the MDACC controls. We successfully removed the batch effects between the two datasets based on the results of unsupervised learning. We then employed state-of-the-art statistical methods and identified significantly differentially methylated CpG sites and regions associated with pancreatic cancer, which led to a substantial increase in statistical power compared with using the MDACC controls alone. Finally, we applied the adaptive sum of powered score (aSPU) test to the combined methylation dataset and detected interesting genes and pathways associated with pancreatic cancer.
  3. Title: Joint Skeleton Estimation of Multiple Directed Acyclic Graphs for Heterogeneous Population
    Author: Jianyu Liu* (UNC), Wei Sun (Fred Hutch Cancer Center) and Yufeng Liu (UNC, Chapel Hill)
    Abstract: The directed acyclic graph (DAG) is a powerful tool to model the interactions of high-dimensional variables. While estimating edge directions in a DAG often requires interventional data, one can estimate the skeleton of a DAG (i.e., the undirected graph formed by removing the direction of each edge) using observational data. In real data analyses, samples of the high-dimensional variables may be collected from a mixture of multiple populations. Each population has its own DAG, while the DAGs across populations may have significant overlap. In this paper, we propose a two-step approach to jointly estimate the DAG skeletons of multiple populations when the population origin of each sample may or may not be labeled. In particular, our method allows a probabilistic soft label for each sample, which can be easily computed and often leads to more accurate skeleton estimation than hard labels. Compared with separate estimation of skeletons for each population, our method is more accurate and robust to labeling errors. We establish estimation consistency for our method. Simulation studies are performed to demonstrate the performance of the new method. Finally, we apply our method to analyze gene expression data from breast cancer patients of multiple cancer subtypes.


  1. Title: Randomized Algorithms of Maximum Likelihood Estimation with Spatial Autoregressive Models for Large-Scale Networks
    Author: Miaoqi Li (University of Cincinnati), Emily Lei Kang (University of Cincinnati)
    Abstract: Spatial autoregressive (SAR) models have been widely used in analyses of economic and environmental data, and more recently of social network data. Maximum likelihood estimation with SAR models has been extensively studied and utilized. However, when dealing with large amounts of data, direct evaluation of the log-likelihood function from the SAR models becomes computationally infeasible. To alleviate this challenge, we propose a randomized algorithm that provides an efficient way to obtain the maximum likelihood estimator, denoted the randomized maximum likelihood estimator (RMLE). Numerical studies with simulated and real data are carried out to investigate the performance of the proposed algorithm. It is shown that the RMLE performs favorably in comparison with existing methods.
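The flavor of such randomized likelihood evaluations can be illustrated with a standard trick (a hedged sketch in the spirit of Barry and Pace's Monte Carlo log-determinant estimator, not necessarily the proposed RMLE): the costly term log|I − ρW| in the SAR log-likelihood is expanded as a power series whose traces are approximated with random probe vectors.

```python
# Sketch: randomized estimation of log|I - rho*W| for a sparse,
# row-normalized spatial weight matrix W, using the identity
# log|I - rho*W| = -sum_k rho^k tr(W^k)/k with Hutchinson-style probes.
import numpy as np

rng = np.random.default_rng(3)
n = 200
W = rng.random((n, n)) * (rng.random((n, n)) < 0.05)   # sparse weights
W = W / W.sum(axis=1, keepdims=True).clip(min=1.0)     # row-normalize

def logdet_sar(rho, W, n_probes=500, order=30, rng=rng):
    """Truncated power series with Rademacher probe vectors."""
    est = 0.0
    for _ in range(n_probes):
        u = rng.choice([-1.0, 1.0], size=W.shape[0])
        v = u.copy()
        for k in range(1, order + 1):
            v = W @ v                       # v = W^k u
            est -= (rho ** k) * (u @ v) / k
    return est / n_probes

exact = np.linalg.slogdet(np.eye(n) - 0.4 * W)[1]
approx = logdet_sar(0.4, W)
print(exact, approx)
```

The randomized estimate only requires matrix-vector products with W, which is what makes large-network likelihood evaluations feasible.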
  2. Title: IMDB Review Mining and Movie Recommendation
    Author: Yingjun Guan (UIUC)
    Abstract: The world of movies contains enormous amounts of information and is worth mining for prediction and recommendation. The IMDb database is a good source of relevant features, including both item-based information (related only to the movies), such as the title, poster, trailer, genre, cast, and director, and user-related information, such as name, gender, age, occupation, reviews, and the number of times a user has watched the same movie. Our project aims to benefit both movie production corporations and audiences. For movie production corporations, our goal is to better predict movie ratings from the item-based and user-related information. Data mining and supervised machine learning techniques are applied for labelling and classifying the nominal features, and text mining and sentiment analysis techniques are used for the review text in the database. For the audience, a hybrid recommendation system algorithm is introduced. The missing patterns of the features in the dataset are studied via visualizations and analyses, and a special analysis of the “cold start” phenomenon (recommendation for new items or from insufficient information) is also conducted.


  1. Title: A Bottom-up Approach to Selecting Associated Subtrees with False Discovery Rate Control
    Author: Yunxiao Li* (Emory), Glen A. Satten (Emory), Yi-Juan Hu (Emory)
    Abstract: Motivation: In many fields of science, data are collected with an intrinsic tree structure, and hypotheses on association with a covariate are tested at multiple resolutions. For example, in microbiome research, after testing association at the species level, we may also wish to know whether there is evidence that higher taxonomic groups (e.g., genera, families, etc.) are also associated with the trait of interest. When testing these associations, we wish to control the false discovery rate (FDR) of all the tests we consider. While tree-based testing procedures that control the FDR are available, these are top-down tests that first test for association at the highest level in the tree and only test the next lower level if the association at the top level is significant. Here, a ‘bottom-up’ procedure is more appropriate, since a reasonable outcome may be that the species in some genera are associated, but that this association does not extend further up the taxonomic tree.
    Results: We propose a bottom-up sequential testing approach for tree-structured hypotheses. Given independent p values on null leaves, valid post-selection inference can be carried upward through a conditional Fisher’s combination test adjusting for selection events at lower levels. A series of Benjamini-Hochberg-type rules for each level guarantees the FDR control of all hypotheses tested. Using both simulated data and real microbiome datasets, we show that our approach controls for the overall FDR and has increased power to detect associations found higher in the tree, compared with existing methods.
  2. Title: Multicategory Angle-Based Direct Learning in Estimating Optimal Dynamic Treatment Regime
    Author: Weibin Mo (UNC)
    Abstract: A dynamic treatment regime (DTR) is a sequential decision-making process along which rewards resulting from the treatment history are collected. It fits many chronic disease management problems in which treatments are adaptively assigned, accounting for patients’ heterogeneous responses to treatments and the time-varying effects of treatments. The goal is to find the optimal dynamic treatment rules over time that maximize the overall collected rewards. The existing literature under the causal inference framework with structural nested mean models (SNMMs) is well developed for identifying regime effects semi-parametrically via G-estimation. The temporal difference technique in reinforcement learning also motivates the Q-learning approach with a more intuitive regression-based estimation. All such methods aim to identify regime consequences instead of finding the optimal regime directly. Similar to model-based methods, they may also suffer from potential model misspecification and the additional need for main effect estimation. Recent developments in the outcome-weighted formulation have broadened the way to find optimal decision rules through a classification problem. However, the classification perspective only provides solutions for the binary treatment scenario, where the stochastic signed weights make multicategory extensions nontrivial. Unbounded weights can also cause unstable estimation. To overcome these obstacles, we develop an angle-based direct learning (AD-learning) approach to model the decision function directly. We show that this model formulation targets the treatment effect contrast in a geometrically intuitive way. Our method avoids estimation of the main effect and is consequently more efficient. It also offers more insight into the intrinsic connection between G-estimation and regression-based methods.


  1. Title: Nonlinear Mixture Model for Modeling Trajectories of Ordinal Markers in Neurological Disorder
    Author: Qinxia Wang* (Columbia), Ming Sun (Columbia), Yuanjia Wang (Columbia)
    Abstract: Current diagnosis of neurological disorders often relies on late-stage clinical symptoms. However, recent research shows a relationship between underlying disease progression and measures of biological markers, or subtle changes in clinical markers, that can be used to assist early diagnosis. We propose a nonlinear mixture model to investigate the trajectories of such markers, allowing for subject-specific inflection points as indicators of disease severity. Specifically, we focus on markers with ordinal outcomes for which higher values imply higher levels of disease severity. The latent binary variable in the mixture model indicates disease susceptibility, whose probability is associated with individual-specific characteristics. If a subject is not susceptible, we assume that the ordinal outcome follows an adjacent-category logistic model. The odds of disease comparing adjacent categories depend on the subject’s baseline biological measures (e.g., genomic and neuroimaging measures), demographic measures, and a latent subject-specific vulnerability score shared among markers. Model parameters are estimated using the EM algorithm. We conduct simulation studies to demonstrate the validity of the proposed method and algorithm. Lastly, we apply our method to estimate the effect of personal characteristics on the trajectories of different markers collected in the Parkinson’s Progression Markers Initiative, and show its utility in aiding early personalized diagnostic decisions.
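The adjacent-category logistic link underlying the ordinal model can be sketched in a few lines (a generic illustration with hypothetical inputs, not the authors' full mixture model): given the adjacent log-odds eta_k, the category probabilities follow from cumulative sums.

```python
# Sketch of the adjacent-category logit link:
# log P(Y = k+1) - log P(Y = k) = eta_k, so the (unnormalized) log-pmf
# is the cumulative sum of the eta's.
import numpy as np

def adjacent_category_probs(eta):
    """P(Y = k+1)/P(Y = k) = exp(eta[k]); return the full pmf."""
    cum = np.concatenate([[0.0], np.cumsum(eta)])  # log P(Y=k) up to a constant
    w = np.exp(cum - cum.max())                    # numerically stabilized
    return w / w.sum()

p = adjacent_category_probs(np.array([0.0, 0.0]))
print(p)   # three equally likely categories
```

In the mixture model described above, each eta_k would be a linear predictor in the baseline covariates and the latent vulnerability score.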
  2. Title: Penalized linear regression with high-dimensional pairwise screening
    Author: Siliang Gong*(UNC), Kai Zhang(UNC), Yufeng Liu (UNC)
    Abstract: In variable selection, most existing screening methods focus on marginal effects and ignore the dependence between covariates. To improve selection performance, we incorporate pairwise effects among covariates for screening and penalization. We achieve this by studying the asymptotic distribution of the maximal absolute pairwise sample correlation among independent covariates. The novelty of the theory is that the convergence is with respect to the dimensionality p and is uniform with respect to the sample size n. Moreover, we obtain an upper bound for the maximal pairwise R-squared when regressing the response onto two different covariates. Based on these extreme-value results, we propose a screening procedure to detect covariate pairs that are potentially correlated and associated with the response. We further combine the pairwise screening with Sure Independence Screening and develop a new regularized variable selection procedure. Numerical studies show that our method is very competitive in terms of both prediction accuracy and variable selection accuracy.
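The motivation for pairwise screening can be seen in a toy example (an illustration of the idea, not the authors' procedure): when two correlated covariates have nearly cancelling effects, each marginal correlation with the response is weak, but the pairwise R-squared is large.

```python
# Toy pairwise screening: score every covariate pair by the R^2 from
# regressing y on that pair, and keep the top-scoring pairs.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, p = 300, 20
X = rng.normal(size=(n, p))
X[:, 1] = 0.95 * X[:, 0] + np.sqrt(1 - 0.95**2) * X[:, 1]
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n)   # weak marginal signal

def pair_r2(xj, xk, y):
    """R^2 from regressing y on an intercept plus (xj, xk)."""
    Z = np.column_stack([np.ones_like(y), xj, xk])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return 1 - ((y - Z @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

scores = {(j, k): pair_r2(X[:, j], X[:, k], y)
          for j, k in combinations(range(p), 2)}
best = max(scores, key=scores.get)
print(best)
```

Here the pair (0, 1) dominates all other pairs even though each of its members has only a small marginal correlation with y.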
  3. Title: AI for Earth – Keeping a Close Watch on Our Trees
    Author: Chengliang Tang (Columbia)
    Abstract: Climate change is anticipated to have profound implications for long-term forest ecosystem resilience and to increase carbon fluxes to the atmosphere. Our understanding of these critical problems has been limited by conventional plot-level, ground-based observations. Recently developed remote sensing techniques such as Light Detection and Ranging (LiDAR) and stereo photos provide large-scale, high-resolution data that allow researchers to carry out ecological surveys of forests at an unprecedented scale and to quantify key factors that mediate a storm’s impacts on forest systems. We propose a data science workflow to study the effects of Hurricane Maria on forested landscapes in Puerto Rico by applying machine learning tools to data from imaging and remote sensing technologies, combined with ground observation data from field plots. Innovative methods are being developed for feature engineering using deep representational learning tools, and for training deep learning models for species classification at the pixel level.
  4. Title: Community detection with dependent connectivity
    Author: Yubai Yuan* (UIUC), Annie Qu (UIUC)
    Abstract: Community detection is of great importance in network data analysis. One of the most popular probabilistic models for fitting community structure is the stochastic block model (SBM). However, the SBM is not able to fully capture the dependence among edges from the same community. Various SBM approaches using random effects have been proposed to incorporate correlation among edges; however, these are mainly designed for exchangeable dependence structures and suffer from high computational cost. In this ongoing project, we propose a new community detection approach that utilizes the dependence of network connectivity based on the estimating equation approach and the generalized method of moments. The proposed method provides greater flexibility in handling different types of within-community dependence structure. In addition, the proposed algorithm does not require specifying the likelihood function or directly estimating correlation parameters. We expect the new method to outperform SBM-based methods in network community detection.
  5. Title: Estimation of Individualized Decision Rules Based on an Optimized Covariate-Dependent Equivalent
    Author: Zhengling Qi* (UNC), Ying Cui (USC), Yufeng Liu (UNC), Jong-Shi Pang (USC)
    Abstract: Recent exploration of optimal individualized decision rules (IDRs) for patients in precision medicine has attracted a lot of attention due to the heterogeneous responses of patients to different treatments. In the existing precision medicine literature, an optimal IDR is defined as a decision function mapping from the patients’ covariate space into the treatment space that maximizes the expected outcome of each individual. Motivated by the concept of the Optimized Certainty Equivalent (OCE) introduced originally by Ben-Tal and Teboulle (1986), we propose a decision-rule-based optimized covariate-dependent equivalent (CDE) for individualized decision-making problems. Our proposed IDR-CDE broadens the existing expected-mean-outcome framework in precision medicine and enriches the previous concept of the OCE. Under a functional margin description of the decision rule modeled by an indicator function, as in the literature on large-margin classifiers, the empirical minimization problem for estimating the optimal IDRs involves a discontinuous objective function. We show that, under a mild condition at the population level, the epigraphical formulation of this empirical optimization problem is a difference-of-convex (dc) constrained dc program. A dc algorithm is adopted to solve the resulting dc program.
    Numerical experiments demonstrate that our overall approach outperforms existing methods in estimating optimal IDRs under heavy-tailed distributions of the data. In addition to providing a risk-based approach for individualized medical treatments, which is new in the area of precision medicine, the main contributions of this work are: the broadening of the concept of the OCE, the epigraphical description of the empirical IDR-CDE minimization problem, its equivalent dc formulation, and the sequential convergence proof of the dc algorithm for a (special) dc constrained dc program.
  6. Title: Decision-Making Considerations for the Trade-off Assessment of Diagnostic Errors when Comparing Diagnostic Tests
    Author: Norberto Pantoja-Galicia* (FDA), Gene Pennello (FDA)
    Abstract: In this paper we investigate the implicit or explicit trade-offs between false positive and false negative test errors provided by the information from the non-parametric Receiver Operating Characteristic (ROC) curve. We discuss its impact on the evaluation of the performance of a new medical diagnostic test in comparison with an already established test, as well as challenges and solutions.
  7. Title: Generalization bounds for the ERM algorithm with regenerative Markov chain samples
    Author: Gabriela Ciolek* (Télécom ParisTech), Patrice Bertail (Télécom ParisTech), Stephan Clemencon (Télécom ParisTech)
    Abstract: We present generalization bounds for the ERM algorithm with regenerative Markov chain samples. This result underpins the application of ERM-type learning algorithms. We introduce a new concentration inequality in order to show that learning rate bounds depend not only on the complexity of the class of candidate sets but also on the ergodicity rate of the chain X, expressed in terms of tail conditions on the length of the regenerative cycles. Finally, we show generalization bounds for the minimum volume set estimation problem when the data are Markovian. Minimum volume sets can be used to detect anomalies and outliers, and to determine highest posterior density regions, multivariate confidence regions, or clusterings.
  8. Title: Model Averaging Based on Ranks
    Author: Eddy Kwessi (Trinity University)
    Abstract: In this article, we investigate model selection and model averaging based on rank regression. Under mild conditions, we propose a focused information criterion and a frequentist model averaging estimator for the focused parameters in the rank regression model. Compared to the least squares method, the new method is not only highly efficient but also robust. The large-sample properties of the proposed procedure are established. The finite-sample properties are investigated via an extensive Monte Carlo simulation study. Finally, we use the Boston Housing Price Dataset to illustrate the use of the proposed rank methods.
  9. Title: Imputed Factor Regression for High-dimensional Block-wise Missing Data
    Author: Yanqing Zhang* (Yunnan University), Nian-Sheng Tang (Yunnan University), Annie Qu (UIUC)
    Abstract: Block-wise missing data arise frequently nowadays in high-dimensional biomedical, social, psychological, and environmental studies, and there is an urgent need for efficient dimension reduction to extract important information for prediction under such missingness. Existing dimension reduction and feature combination methods are ineffective for handling block-wise missing data. We propose a factor-model imputation approach that targets block-wise missingness, together with an imputed factor regression for dimension reduction and prediction. Specifically, we first perform screening to identify the important features, impute these features based on the factor model, and then build a factor regression model to predict the response variable based on the imputed important features. The proposed method utilizes the essential information from all observed data through the factor structure model and remains efficient even when the block-wise missing proportion is high. We show that the imputed factor regression model and its prediction are consistent under regularity conditions. We compare the proposed method with existing approaches through simulation studies and a real data application to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data. Our numerical studies confirm that the proposed method outperforms the existing competitive approaches.
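The factor-model imputation step can be illustrated on simulated block-missing data (a hypothetical sketch of the general idea, not the authors' estimator): learn loadings from complete cases, project incomplete cases onto them using their observed columns, and fill the missing block.

```python
# Sketch: impute a missing block of features using a low-rank factor model.
import numpy as np

rng = np.random.default_rng(6)
n, p, r = 200, 30, 2
F = rng.normal(size=(n, r))                 # latent factors
L = rng.normal(size=(p, r))                 # loadings
X = F @ L.T + 0.1 * rng.normal(size=(n, p))
X_miss = X.copy()
X_miss[100:, 20:] = np.nan                  # a block of cases lacks 10 features

# Loadings estimated from the complete cases via SVD.
_, _, Vt = np.linalg.svd(X_miss[:100], full_matrices=False)
Lhat = Vt[:r].T                             # p x r loading estimate

# Factor scores for incomplete cases from their observed columns only,
# then fill the missing block with the factor-model prediction.
scores, *_ = np.linalg.lstsq(Lhat[:20], X_miss[100:, :20].T, rcond=None)
X_imp = X_miss.copy()
X_imp[100:, 20:] = (Lhat[20:] @ scores).T

rmse = np.sqrt(np.mean((X_imp[100:, 20:] - X[100:, 20:]) ** 2))
print(rmse)
```

Because the missing block shares the same factor structure as the observed columns, the imputation error stays close to the noise level.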
  10. Title: Volatility Forecasting of Financial Data Using an Ornstein-Uhlenbeck Type Model
    Author: Md Al Masum Bhuiyan (The University of Texas at El Paso)
    Abstract: This work is devoted to the modeling of financial data. A stochastic technique with time-varying parameters is used to forecast the volatility of data arising in finance. Using daily closing prices from developed and emergent stock markets, we conclude that incorporating stochastic volatility into the time-varying parameter estimation improves forecasting performance via Maximum Likelihood Estimation (MLE). A class of stochastic differential equations arising from the superposition of two independent Gamma Ornstein-Uhlenbeck processes is used to simulate the time series data in a special case where the MLE does not fit the original data. The simulated data mimic the original financial time series, as observed from the root mean square error estimates. Furthermore, the stochastic model used in this study exhibits the physical and long-memory behavior of the data. We also conclude that the Ornstein-Uhlenbeck type models used in this study guarantee the convergence of the MLE technique, which makes the estimation algorithm feasible with large datasets and facilitates prediction.
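The Gamma Ornstein-Uhlenbeck machinery is involved, but the OU building block itself is easy to illustrate (a simplified, hypothetical Gaussian OU sketch, not the Gamma-OU superposition of the abstract): simulate the process by its exact AR(1) discretization and recover the mean-reversion rate by conditional maximum likelihood.

```python
# Sketch: exact discretization of a Gaussian OU process
# dX = -lam*X dt + sigma dB, followed by conditional MLE of lam.
import numpy as np

rng = np.random.default_rng(4)
lam, sigma, dt, T = 2.0, 0.5, 0.01, 50000   # mean reversion, vol, step, length
phi = np.exp(-lam * dt)                      # exact AR(1) coefficient
sd = sigma * np.sqrt((1 - phi**2) / (2 * lam))

x = np.empty(T)
x[0] = 0.0
eps = sd * rng.normal(size=T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + eps[t]

# Conditional MLE of phi (an AR(1) regression), mapped back to lambda.
phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
lam_hat = -np.log(phi_hat) / dt
print(lam_hat)
```

Because the discretization is exact, the MLE converges to the true mean-reversion rate as the series grows, which mirrors the convergence property emphasized in the abstract.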
  11. Title: Correlation tensor decomposition and its application in spatial imaging data
    Author: Yujia Deng*(UIUC), Xiwei Tang (University of Virginia) and Annie Qu (UIUC)
    Abstract: In this talk, we propose a new method to analyze spatially correlated imaging data. In contrast to conventional multivariate analysis, where the variables are treated as vectors and correlation is represented in matrix form, we formulate spatial correlation based on tensor decomposition to preserve the spatial information of the imaging data. Specifically, we propose an innovative algorithm to decompose the spatial correlation into a sum of rank-1 tensors and an identity core tensor, so that the structure of the spatial information can be captured more fully than with traditional approaches. Our method is effective in reducing the dimension of spatially correlated data, which is advantageous for computation. In addition, we show that the proposed method can test against the null hypothesis of an independence structure, and it identifies block patterns in the spatial correlations of imaging data effectively and efficiently. We compare the proposed method with other competing methods through simulations and optical image data for detecting early-stage breast cancer.
  12. Title: Smooth neighborhood recommender systems
    Author: Ben Dai (City University of Hong Kong), Junhui Wang (City University of Hong Kong), Xiaotong Shen (University of Minnesota) and Annie Qu (UIUC)
    Abstract: Recommender systems predict users’ preferences over a large number of items by pooling similar information from other users and/or items in the presence of sparse observations. One major challenge is how to utilize user-item-specific covariates and networks describing user-item interactions in a high-dimensional situation for accurate personalized prediction. In this article, we propose a smooth neighborhood recommender in the framework of latent factor models. A similarity kernel is utilized to borrow neighborhood information from continuous covariates over a user-item-specific network, such as a user’s social network, where the grouping information defined by discrete covariates is also integrated through the network. Consequently, user-item-specific information is built into the recommender to battle the “cold-start” issue in the absence of observations in collaborative and content-based filtering. Moreover, we develop a “divide-and-conquer” version of the alternating least squares algorithm to achieve scalable computation, and we establish asymptotic results for the proposed method, showing that it achieves superior prediction accuracy. Finally, we demonstrate that the proposed method gains substantial improvement over its competitors in simulated examples and on real benchmark music data.
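The alternating least squares backbone of latent factor recommenders can be sketched compactly (a generic illustration; the smooth-neighborhood kernel and divide-and-conquer scheme of the abstract are not reproduced here):

```python
# Sketch: plain ALS for a low-rank rating matrix with missing entries.
# Each sweep solves a small ridge system per user and per item.
import numpy as np

rng = np.random.default_rng(5)
n_users, n_items, r = 60, 40, 3
U_true = rng.normal(size=(n_users, r))
V_true = rng.normal(size=(n_items, r))
R = U_true @ V_true.T                       # noiseless rank-3 ratings
mask = rng.random(R.shape) < 0.5            # half the entries observed

lam = 0.1
U = rng.normal(size=(n_users, r))
V = rng.normal(size=(n_items, r))
for _ in range(50):
    for i in range(n_users):                # update each user factor
        idx = mask[i]
        A = V[idx].T @ V[idx] + lam * np.eye(r)
        U[i] = np.linalg.solve(A, V[idx].T @ R[i, idx])
    for j in range(n_items):                # update each item factor
        idx = mask[:, j]
        A = U[idx].T @ U[idx] + lam * np.eye(r)
        V[j] = np.linalg.solve(A, U[idx].T @ R[idx, j])

rmse = np.sqrt(np.mean((R - U @ V.T)[~mask] ** 2))
print(rmse)                                  # held-out reconstruction error
```

Each inner update is an r-by-r ridge solve, which is what makes divide-and-conquer variants of ALS attractive for scalable computation.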
  13. Title: Robust Estimation for Longitudinal Data with Covariate Measurement Errors and Outliers
    Author: Yuexia Zhang* (Fudan University), Guoyou Qin (Fudan University), Zhongyi Zhu (Fudan University) and Jiajia Zhang (University of South Carolina)
    Abstract: Measurement errors and outliers are commonly generated during longitudinal data collection, and ignoring them in data analysis may lead to serious bias in estimators. Therefore, it is important to account for measurement errors and outliers appropriately in longitudinal data analysis. In this paper, a robust estimating equation for analyzing longitudinal data with covariate measurement errors and outliers is proposed. Specifically, the biases caused by measurement errors are reduced by using replicate measurements, and the biases caused by outliers are corrected by centralizing the error-prone covariate matrix. The proposed method does not require specifying the distributions of the true covariates, the response, or the measurement error. In practice, it can easily be implemented with standard generalized estimating equations algorithms. The asymptotic normality of the proposed estimator is established under some regularity conditions. Extensive simulation studies show that the proposed method performs better in handling measurement errors and outliers than existing methods. For illustration, the proposed method is applied to a data set from the Lifestyle Education for Activity and Nutrition (LEAN) study.