Session 9: Advances in estimation and prediction for understanding complex disorders – Conference on Statistical Learning and Data Science / Nonparametric Statistics

Session title: Advances in estimation and prediction for understanding complex disorders
Organizer: Heping Zhang (Yale)
Chair: Naveen Narisetty (UIUC)
Time: June 4th, 11:00am-12:30pm
Location: VEC 902

Speech 1: Uncertainty Quantification of Treatment Regime in Precision Medicine by Confidence Distributions
Speaker: Min-ge Xie (Rutgers)
Abstract: A personalized decision rule in precision medicine can be viewed as a “discrete parameter”, for which theoretical development of statistical inference is lacking. In this talk, we propose a new way to quantify the estimation uncertainty in a personalized decision, based on recent developments in confidence distributions (CDs). Specifically, in a parametric regression model setup, suppose the decision between treatment and control for an individual x_a is determined by a linear decision rule D_a = I(x_a b > x_a g), where b and g are unknown regression coefficients in the models for the potential outcomes under treatment and control, respectively. The data-driven decision hat-D_a relies on the estimates of b and g, which in turn introduces uncertainty into the decision. In this work, we propose to find a CD for h_a = x_a b - x_a g and compute a “confidence measure” of the decision {D_a = 1} = {h_a > 0}. This measure takes a value between 0 and 1 and provides a frequency-based assessment of how reliable our decision is. For example, if the confidence measure of the decision {D_a = 1} is 63%, then we know that, out of 100 patients who are the same as patient x_a, 63 will benefit from the treatment and 37 will be better off in the control group. A numerical study suggests that this new measure is in line with classical assessments (such as sensitivity and specificity) but, unlike the classical assessments, it can be computed directly from the observed data without knowing the truth of {D_a = 1} or {D_a = 0}. The utility of this new measure will also be demonstrated in an application to an adaptive-design clinical trial. (Joint work with Yilei Zhan and Sijian Wang)
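Illustrative note (not part of the speaker's abstract): the confidence measure of {D_a = 1} = {h_a > 0} can be sketched under a simple large-sample normal approximation. The sketch assumes that the estimates of b and g come from independently fitted treatment and control models, and that the CD for h_a is the normal-approximation CD H(h) = Phi((h - hat-h_a) / se). The CD construction used in the talk may differ. The function names and the numbers in the usage note are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def quad_form(x, cov):
    """Quadratic form x' cov x for a vector x and a square matrix cov."""
    n = len(x)
    return sum(x[i] * cov[i][j] * x[j] for i in range(n) for j in range(n))


def decision_confidence(x_a, b_hat, g_hat, cov_b, cov_g):
    """Confidence measure of {h_a > 0} under a normal-approximation CD.

    Assumption (not from the abstract): b_hat and g_hat are independent,
    so Var(hat-h_a) = x_a' cov_b x_a + x_a' cov_g x_a.
    """
    # Point estimate of h_a = x_a b - x_a g.
    h_hat = sum(x * (b - g) for x, b, g in zip(x_a, b_hat, g_hat))
    se = math.sqrt(quad_form(x_a, cov_b) + quad_form(x_a, cov_g))
    # CD mass on {h > 0}: 1 - H(0) = Phi(h_hat / se).
    return normal_cdf(h_hat / se)
```

For instance, with the made-up inputs x_a = [1.0, 2.0], b_hat = [0.5, 0.3], g_hat = [0.4, 0.2] and both covariance matrices equal to [[0.04, 0.0], [0.0, 0.01]], the point estimate is hat-h_a = 0.3 with standard error 0.4, so the confidence in choosing treatment is Phi(0.75), roughly 0.77.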

Speech 2: Semiparametric Estimation in the Secondary Analysis of Case-Control Studies
Speaker: Yanyuan Ma (Penn State)
Abstract: We study the regression relationship among covariates in case-control data, an area known as the secondary analysis of case-control studies. The context is such that only the form of the regression mean is specified, so that we allow an arbitrary regression error distribution, which can depend on the covariates and thus can be heteroscedastic. Under mild regularity conditions we establish the theoretical identifiability of such models. Previous work in this context has either (a) specified a fully parametric distribution for the regression errors, (b) specified a homoscedastic distribution for the regression errors, (c) specified the rate of disease in the population (we refer to this as the true population), or (d) made a rare disease approximation. We construct a class of semiparametric estimation procedures that rely on none of these. The estimators differ from the usual semiparametric ones in that they draw conclusions about the true population while technically operating in a hypothetical superpopulation. We also construct estimators with a unique feature: they are robust against misspecification of the variance structure of the regression error distribution, while all other nonparametric effects are estimated despite the biased samples. We establish the asymptotic properties of the estimators and illustrate their finite-sample performance through simulation studies, as well as through an empirical example on the relation between red meat consumption and heterocyclic amines (HCAs). Our analysis verified the positive relationship between red meat consumption and two forms of HCA, indicating that increased red meat consumption leads to increased levels of MeIQx and PhIP, both of which are risk factors for colorectal cancer.

Speech 3: Quantile Decision Trees and Forests with Application to Predicting the Risk of Post-Traumatic Stress Disorder (PTSD) after Experiencing an Acute Coronary Syndrome
Speaker: Ying Wei (Columbia)
Abstract: Classification and regression trees (CART) are a classic statistical learning method that efficiently partitions the sample space into mutually exclusive subspaces with distinct means of an outcome of interest. CART is a powerful tool for efficient subgroup analysis and allows for complex associations and interactions to achieve high prediction accuracy and stability. Hence, trees are appealing tools for precision health applications that deal with large amounts of data from EMRs, genomics, and mobile devices and aim to provide a transparent decision mechanism. Although there is a vast literature on decision trees and random forests, most algorithms identify subspaces with distinct outcome means. The most vulnerable or high-risk groups for certain diseases are often patients with extremely high (or low) biomarker and phenotype values, and mean-based partitioning may not be effective for identifying such patients. We propose a new regression tree framework based on quantile regression \cite{KoenkerBassett1978} that partitions the sample space and predicts the outcome of interest based on conditional quantiles of the outcome variable. We implemented the conditional quantile trees/forests and evaluated their performance in predicting the risk of developing PTSD after experiencing an acute coronary syndrome (ACS), using observational cohort data from the REactions to Acute Care and Hospitalization (REACH) study \cite{onge2017depressive} at New York Presbyterian Hospital. The results show that the conditional quantile-based trees/forests have better discrimination power for identifying patients with severe PTSD symptoms than the classical mean-based CART.
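Illustrative note (not part of the speaker's abstract): one way to picture quantile-based partitioning is to choose the split that minimizes the total quantile-regression check loss in the two child nodes, instead of the squared error around child means. The sketch below does this for a single numeric feature. It is a generic illustration under that assumption, not the algorithm presented in the talk, and all function names are hypothetical.

```python
import math


def check_loss(residual, tau):
    """Quantile-regression check (pinball) loss for one residual."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def sample_quantile(values, tau):
    """Order statistic y_(ceil(n * tau)), a minimizer of the total check loss."""
    ordered = sorted(values)
    k = max(0, math.ceil(tau * len(ordered)) - 1)
    return ordered[k]


def node_loss(values, tau):
    """Total check loss of a node when it predicts its own tau-quantile."""
    q = sample_quantile(values, tau)
    return sum(check_loss(y - q, tau) for y in values)


def best_quantile_split(x, y, tau):
    """Return (threshold, loss) for the best single split on feature x."""
    pairs = sorted(zip(x, y))
    best_threshold, best_loss = None, float("inf")
    for i in range(1, len(pairs)):
        # Only split between distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = 0.5 * (pairs[i - 1][0] + pairs[i][0])
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        loss = node_loss(left, tau) + node_loss(right, tau)
        if loss < best_loss:
            best_threshold, best_loss = threshold, loss
    return best_threshold, best_loss
```

Setting tau close to 1 (or 0) steers the split toward subgroups that differ in their upper (or lower) tail rather than in their average, which is the motivation given in the abstract for targeting patients with extreme phenotype values.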