Session 34: Data Science in IT Industries – Conference on Statistical Learning and Data Science / Nonparametric Statistics

Session title: Data Science in IT Industries
Organizer: David Banks (Duke)
Chair: Genevera Allen (Rice)
Time: June 5th, 3:15pm – 4:45pm
Location: VEC 1403

Speech 1: Using Data Science to Improve Streaming Quality at Netflix
Speaker: Julie Novak (Netflix)
Abstract: In this talk, I will begin by giving an overview of the data science challenges involved in providing an optimal streaming service at Netflix. There are many dimensions to this problem, including selecting the best picture quality based on network speed, determining the proper content to cache on Netflix’s Content Delivery Networks (CDNs), and improving each customer’s Quality of Experience (QoE). The talk will then dive deeper into the notion of QoE by explaining how to use statistical tools to measure it and gain a deeper understanding of it in the context of A/B testing.

Speech 2: Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem
Speaker: Tim Au (Google)
Abstract: One advantage of decision tree based methods like random forests is their ability to natively handle categorical predictors without having to first transform them (e.g., by using one-hot encoding). However, in this talk, we show how this capability can lead to an inherent “absent levels” problem for decision tree based methods that has never been thoroughly discussed, and whose consequences have never been carefully explored. This problem occurs whenever there is an indeterminacy over how to handle an observation that has reached a categorical split which was determined when the observation in question’s level was absent during training. Although these incidents may appear to be innocuous, by using Leo Breiman and Adele Cutler’s random forests FORTRAN code and the randomForest R package (Liaw and Wiener, 2002) as motivating case studies, we study how overlooking the absent levels problem can systematically bias a model. Furthermore, by using three real data examples, we illustrate how absent levels can dramatically alter a model’s performance in practice, and we empirically demonstrate how some simple heuristics can be used to help mitigate the effects of the absent levels problem until a more robust theoretical solution is found.
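
The indeterminacy described in this abstract can be illustrated with a minimal Python sketch of a single categorical split. This is not the speaker's code and does not reproduce the behavior of the FORTRAN code or the randomForest R package. The names make_split and route, and the policies "left", "right", "random", and "stop", are hypothetical and chosen only for illustration.

```python
import random


def make_split(levels_seen_at_node, left_levels):
    """Record the levels present at this node during training and which go left."""
    return {"seen": set(levels_seen_at_node), "left": set(left_levels)}


def route(split, level, policy="right", rng=None):
    """Send an observation down a categorical split.

    Levels seen during training follow the learned partition. A level that
    was absent at this node during training has no learned direction, so
    the outcome depends entirely on the (hypothetical) policy chosen here.
    """
    if level in split["seen"]:
        return "left" if level in split["left"] else "right"

    # Absent level: the split itself gives no guidance.
    if policy in ("left", "right"):
        return policy  # always send absent levels one way
    if policy == "random":
        rng = rng or random.Random()
        return rng.choice(["left", "right"])
    if policy == "stop":
        return "stop"  # predict from this node instead of descending
    raise ValueError(f"unknown policy: {policy!r}")


# Levels "a", "b", "c" reached this node in training; only "a" goes left.
split = make_split(["a", "b", "c"], ["a"])
seen_level = route(split, "b")                  # follows the learned partition
absent_right = route(split, "d")                # absent level, default policy
absent_left = route(split, "d", policy="left")  # same level, opposite branch
```

The level "d" ends up in a different branch depending only on the policy, which is the kind of silent, systematic choice the abstract warns about.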

Speech 3: The Challenge of Educating Data Scientists for Industry
Speaker: David Banks (Duke University and SAMSI)
Abstract: The statistical world is changing quickly, and our graduate programs are (generally) not keeping pace. This talk reviews some of the structural and cultural barriers that we need to overcome. Besides proposing a model curriculum, it also addresses ways in which our publication processes no longer serve the interests of our profession, and it discusses the commodification of analysis.