Session 48: Causal inference and statistical learning – Conference on Statistical Learning and Data Science / Nonparametric Statistics

Session title: Causal inference and statistical learning
Organizer: Cynthia Rudin (Duke)
Chair: Cynthia Rudin (Duke)
Time: June 6th, 3:15pm – 4:45pm
Location: VEC 1402

Speech 1: Teaching History and Ethics of Data, with Python
Speaker: Chris Wiggins (Columbia & NY Times)
Abstract: Data-empowered algorithms are reshaping our professional, personal, and political realities. However, existing curricula are predominantly designed either for future technologists, focusing on functional capabilities; or for future humanists, focusing on critical and rhetorical context surrounding data. “Data: Past, Present, and Future” is a new course at Columbia which seeks to define a curriculum at present taught to neither group, yet of interest and utility to future statisticians, CEOs, and senators alike. The course has been co-developed by Matt Jones, Professor at Columbia’s History department, and myself. The intellectual arc traces from the 18th century to present day, beginning with examples of contemporary technological advances, disquieting ethical debates, and financial success powered by panoptic persuasion architectures. The weekly cadence of the course pairs primary and secondary readings with Jupyter notebooks in Python, engaging directly with the data and intellectual advances under study. Throughout, these intellectual technical advances are paired with critical inquiry into the forces which encouraged and benefited from these new capabilities, i.e., the political dimension of data and technology. In this talk I will give an overview of lessons learned from teaching the class, and argue that 1) the material can be engaged by students from a wide variety of curricular backgrounds and 2) the structure of the class — using history to make the present strange, then critiquing the ethics of the technology-enabled future we are building — can be useful for a variety of subjects. 
Syllabus, Jupyter notebooks, and additional info: https://data-ppf.github.io/
“Data: Past, Present, and Future” is supported by the Columbia University Collaboratory Fellows Fund. Jointly founded by Columbia University’s Data Science Institute and Columbia Entrepreneurship, The Collaboratory@Columbia is a university-wide program dedicated to supporting collaborative curricular innovations designed to ensure that all Columbia University students receive the education and training they need to succeed in today’s data-rich world.

Speech 2: Bayesian optimization and A/B tests
Speaker: Ben Letham (Facebook data science)
Abstract: Randomized experiments provide a direct, albeit time-consuming and noisy, measurement of the effect of changes to a system. We often want to optimize the parameters of systems that can only be evaluated via noisy experiments. I will describe how Bayesian optimization is used in this setting at Facebook, such as for optimizing web server compiler flags. I will then discuss our current efforts to expand the scope of optimization via field experiments.
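
Illustration (not from the talk): the general idea of optimizing a parameter that can only be measured through noisy experiments can be sketched as below. This is a minimal toy example under stated assumptions, not the method used at Facebook. It assumes a single parameter in [0, 1], a known noise level, a Gaussian-process model with a fixed RBF kernel, and an upper-confidence-bound rule for choosing the next experiment. The function noisy_experiment, the constant TRUE_OPTIMUM, and all numeric settings are hypothetical stand-ins for a real field experiment.

```python
import numpy as np

TRUE_OPTIMUM = 0.6   # hypothetical; unknown in a real experiment
NOISE_SD = 0.05      # assumed known measurement noise (standard deviation)
PRIOR_VAR = 1.0      # assumed prior variance of the GP
LENGTH_SCALE = 0.2   # assumed smoothness of the response curve


def noisy_experiment(x, rng):
    """Hypothetical stand-in for one randomized experiment at setting x."""
    return -4.0 * (x - TRUE_OPTIMUM) ** 2 + rng.normal(0.0, NOISE_SD)


def rbf_kernel(a, b):
    """Squared-exponential covariance between two 1-D arrays of settings."""
    diff = a[:, None] - b[None, :]
    return PRIOR_VAR * np.exp(-0.5 * (diff / LENGTH_SCALE) ** 2)


def gp_posterior(x_obs, y_obs, x_grid):
    """Posterior mean and standard deviation of the GP on x_grid."""
    y_mean = y_obs.mean()  # center outcomes so a zero-mean prior is reasonable
    K = rbf_kernel(x_obs, x_obs) + NOISE_SD ** 2 * np.eye(len(x_obs))
    K_star = rbf_kernel(x_obs, x_grid)  # shape (n_obs, n_grid)
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs - y_mean))
    mean = y_mean + K_star.T @ alpha
    v = np.linalg.solve(chol, K_star)
    var = PRIOR_VAR - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def run_bayes_opt(n_init=4, n_iter=20, kappa=2.0, seed=0):
    """Run a few random experiments, then pick each next one by UCB."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 1.0, 101)
    x_obs = rng.uniform(0.0, 1.0, size=n_init)
    y_obs = np.array([noisy_experiment(x, rng) for x in x_obs])
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, x_grid)
        # Upper confidence bound: favor settings that look good or are uncertain.
        x_next = x_grid[np.argmax(mean + kappa * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, noisy_experiment(x_next, rng))
    mean, _ = gp_posterior(x_obs, y_obs, x_grid)
    # Recommend the setting with the best modeled outcome, not the best noisy draw.
    return x_grid[np.argmax(mean)], x_obs, y_obs


best_x, x_obs, y_obs = run_bayes_opt()
print(f"recommended setting: {best_x:.2f} after {len(x_obs)} experiments")
```

The recommendation is read off the posterior mean rather than the single best observed outcome, since any one noisy measurement can look good by chance.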

Speech 3: Causal inference from complex observational data
Speaker: Alex Volfovsky (Duke)
Abstract: A classical problem in causal inference is that of matching treatment units to control units. Some of the main challenges in developing matching methods arise from the tension among (i) inclusion of as many covariates as possible in defining the matched groups, (ii) having matched groups with enough treated and control units for a valid estimate of Average Treatment Effect (ATE) in each group, (iii) computing the matched pairs efficiently for large datasets, and (iv) dealing with complicating factors such as non-independence among units. We propose the Fast Large-scale Almost Matching Exactly (FLAME) framework to tackle these problems. At its core this framework proposes an optimization objective for match quality that captures covariates that are integral for making causal statements while encouraging as many matches as possible. This objective can then be augmented to capture common complicating factors.
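
Illustration (not from the talk): the tension between matching on many covariates and keeping enough matched units can be seen in a toy "almost exact" matching loop. This is a simplified sketch, not the FLAME algorithm itself. FLAME chooses which covariates to keep through a match-quality objective, whereas this sketch assumes the covariates are already ranked by importance and simply drops the least important one each round. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def match_almost_exactly(units, covariates):
    """Match exactly on all covariates, then drop the last one and repeat.

    units: list of dicts with 'treated' (bool), 'outcome' (float), and one
    key per covariate. covariates: names ordered from most to least
    important (an assumption of this sketch).
    Returns (groups, unmatched); each group is (covariates_used, values, members).
    """
    remaining = list(units)
    active = list(covariates)
    groups = []
    while remaining and active:
        buckets = defaultdict(list)
        for unit in remaining:
            buckets[tuple(unit[c] for c in active)].append(unit)
        unmatched = []
        for values, members in buckets.items():
            has_treated = any(m["treated"] for m in members)
            has_control = any(not m["treated"] for m in members)
            if has_treated and has_control:
                groups.append((tuple(active), values, members))
            else:
                unmatched.extend(members)
        remaining = unmatched
        active = active[:-1]  # give up the least important covariate
    return groups, remaining


def group_effect(members):
    """Difference in mean outcome between treated and control units."""
    treated = [m["outcome"] for m in members if m["treated"]]
    control = [m["outcome"] for m in members if not m["treated"]]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical toy data: two binary-valued covariates, five units.
UNITS = [
    {"age": "young", "smoker": "no", "treated": True, "outcome": 5.0},
    {"age": "young", "smoker": "no", "treated": False, "outcome": 3.0},
    {"age": "old", "smoker": "yes", "treated": True, "outcome": 4.0},
    {"age": "old", "smoker": "no", "treated": False, "outcome": 3.0},
    {"age": "young", "smoker": "yes", "treated": False, "outcome": 1.0},
]

groups, leftover = match_almost_exactly(UNITS, ["age", "smoker"])
effects = [group_effect(members) for _, _, members in groups]
for (used, values, members), effect in zip(groups, effects):
    print(used, values, len(members), effect)
```

In the toy data, one pair matches on both covariates, a second pair matches only after "smoker" is dropped, and one control unit is left unmatched because no treated unit shares its remaining covariate value.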