Session 2: High-dimensional inference: assumption-lean or assumption-laden?

Organizer: Ryan Tibshirani (CMU)
Chair: Jelena Bradic (UCSD)
Time: June 4th, 9:00-10:30am
Location: VEC 902/903

Speech 1: Inferential goals, targets, and principles in high-dimensional regression
Speaker: Todd Kuffner (Washington U)
Abstract: This talk will focus more on theory than methodology. We will analyze the inferential goals, the targets of inference, and the various principles employed to justify prominent methodologies. A new perspective, motivated by the philosophy of science, will be presented on how to distinguish between competing inferential procedures for high-dimensional regression.

Speech 2: Should We Model X in High-Dimensional Inference?
Speaker: Lucas Janson (Harvard)
Abstract: For answering questions about the relationship between a response variable Y and a set of explanatory variables X, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the joint distribution of X, especially for high-dimensional data. First, modeling X can lead to assumptions that are more realistic and verifiable. Second, there are substantial methodological payoffs in terms of much greater flexibility in the tools an analyst can bring to bear on their data while also being guaranteed exact (non-asymptotic) inference. I will briefly mention some of my recent and ongoing work on methods for high-dimensional inference that model X instead of Y, as well as some challenges and interesting directions for the future.
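
As a concrete illustration of what "modeling X" can buy, the sketch below implements a conditional randomization test, one well-known model-X method: if the distribution of X is known, column j can be redrawn from its conditional law given the other columns and any test statistic recomputed, yielding a finite-sample valid p-value for the null that X_j is conditionally independent of Y. This is a minimal sketch, not the speaker's specific methodology; the Gaussian model for X, the moment estimates taken from X itself, and the marginal-correlation statistic are all illustrative assumptions.

    import numpy as np

    def crt_pvalue(X, y, j, stat, n_resamples=500, seed=None):
        # Conditional randomization test of the null "Y independent of X_j
        # given the other columns", assuming the distribution of X is known.
        # Here we posit a multivariate Gaussian for X and, as an illustrative
        # shortcut, estimate its mean and covariance from X itself; the
        # exactness guarantee assumes X's law is truly known.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)
        others = [k for k in range(p) if k != j]
        S_oo = Sigma[np.ix_(others, others)]
        S_jo = Sigma[j, others]
        beta = np.linalg.solve(S_oo, S_jo)           # coefs of X_j on X_-j
        cond_sd = np.sqrt(Sigma[j, j] - S_jo @ beta)
        cond_mean = mu[j] + (X[:, others] - mu[others]) @ beta
        t_obs = stat(X, y, j)
        # Redraw column j from its conditional law and recompute the
        # statistic; y is never touched, so the p-value is exact up to
        # Monte Carlo error, with no asymptotics and no model for Y | X.
        count = 0
        for _ in range(n_resamples):
            X_tilde = X.copy()
            X_tilde[:, j] = cond_mean + cond_sd * rng.standard_normal(n)
            if stat(X_tilde, y, j) >= t_obs:
                count += 1
        return (1 + count) / (1 + n_resamples)

    def abs_corr(X, y, j):
        # Any statistic works; absolute marginal correlation is the
        # simplest possible choice.
        return abs(np.corrcoef(X[:, j], y)[0, 1])

    # Tiny simulated check: X_0 affects y, X_1 does not.
    rng = np.random.default_rng(1)
    n, p = 300, 10
    X = rng.standard_normal((n, p))
    y = 2.0 * X[:, 0] + rng.standard_normal(n)
    print(crt_pvalue(X, y, j=0, stat=abs_corr, seed=2))  # small p-value
    print(crt_pvalue(X, y, j=1, stat=abs_corr, seed=3))  # ~ Uniform(0, 1)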

Speech 3: Towards a Better Understanding of “High-Dimensional” Linear Least Squares Regression
Speaker: Andreas Buja (U Penn)
Abstract: High-dimensional regression is conventionally interpreted as an optimization problem, most commonly of a penalized or constrained least-squares (LS) criterion. Instead, we propose to reinterpret high-dimensional regression as follows: the data are first explored, either in a principled way (e.g., lasso or best subset selection) or in an exploratory/unprincipled way, to select a manageable set of variables; subsequently the reduced data are subjected to linear regression. The final set of variables is often much smaller than the sample size and the total number of initial variables. We will treat the combination of both steps as forming high-dimensional linear regression. A first question we consider is what the nature of the OLS estimator is when the regressors have been selected by some variable selection procedure. We answer this question in full generality by proving a deterministic uniform-in-model result about linear regression, which provides an interpretation of what is being estimated irrespective of the data-dependent variable selection procedure. A second question we consider is how to perform statistical inference using the OLS estimator obtained from a variable selection procedure. This is exactly the problem of valid Post-Selection Inference (PoSI). This talk will focus on an approach to PoSI based on an asymptotic linear representation and a high-dimensional central limit theorem. All our results are proved without assuming any probability models, and they allow for non-identically distributed random vectors. In addition, they apply equally to independent and functionally dependent data. Finally, our results do not require any sparsity assumptions. Joint work with the Wharton Linear Models Group, including Lawrence Brown, Edward George, and Linda Zhao. Some of this talk is based on https://arxiv.org/abs/1802.05801.
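
To make the two-step scheme concrete, here is a minimal simulated sketch of it: variables are first selected with the lasso, then plain OLS is refit on the selected columns. This illustrates the setup the abstract studies, not the PoSI methodology itself; the design, the coefficients, and the lasso penalty are illustrative assumptions, and the naive standard errors one would compute in the second step, which ignore the selection, are precisely what valid post-selection inference must correct.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Step 0: a p >> n design (simulated; design and coefficients are
    # illustrative assumptions, not from the talk).
    rng = np.random.default_rng(0)
    n, p = 200, 1000
    X = rng.standard_normal((n, p))
    y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(n)

    # Step 1: data-dependent variable selection; the lasso with a fixed
    # penalty stands in for any principled or unprincipled screening rule.
    selected = np.flatnonzero(Lasso(alpha=0.1, max_iter=10000).fit(X, y).coef_)

    # Step 2: plain OLS on the selected submodel, typically far smaller
    # than both n and p. Naive t-based inference computed here ignores
    # step 1 entirely; accounting for that selection is the PoSI problem.
    X_sel = X[:, selected]
    beta_hat, *_ = np.linalg.lstsq(X_sel, y, rcond=None)
    print(selected[:10], beta_hat[:10])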