Session 3: Modern nonparametric statistics

Session title: Modern nonparametric statistics
Organizer: Richard Samworth (Cambridge)
Chair: Zhengling Qi (UNC)
Time: June 4th, 9:00-10:30am
Location: VEC 1202/1203

Speech 1: Regimes of label-noise determine the benefits of Active Learning
Speaker: Samory Kpotufe (Princeton)
Abstract: In active learning (for classification tasks), the learner has the ability to request the labels of carefully chosen points over the input space. Intuitively, this might speed up the learning process, in terms of the number of labels required to achieve a fixed error, over the usual passive setting where the learner accesses i.i.d. labeled data. Unfortunately, despite significant progress on the subject, the benefits of active over passive learning remain largely unclear: for example, in the usual PAC setting with VC classes, label requirements in active learning are of the same order as in passive learning outside of strong assumptions on label noise. However, a clearer picture of the benefits of active learning emerges under refined parametrizations of label noise; this is considered, e.g., in work by Hanneke and by Koltchinskii, although under the strong assumption of a bounded 'disagreement coefficient'. In this talk, we aim to gain a better picture of the benefits of active learning over passive learning. In particular, we consider parametrizations of label noise that capture a continuum from easy to hard classification problems, and elicit a clearer picture of the benefits of active learning along this continuum. Such parametrizations draw on intuition from the so-called 'cluster assumption' in ML, and more generally on the 'margin conditions' common in both ML and Statistics. Our results reveal interesting phase transitions (in label requirements) driven by the interaction between noise parameters, marginal distribution, and data dimension. In particular, we manage to address a previous conjecture about the existence of some such transitions. Furthermore, our algorithmic strategies are adaptive, i.e., they require no a priori knowledge of distributional parameters, yet are rate-optimal. The talk is based on recent collaborations with S. Ben-David, R. Urner, A. Locatelli, and A. Carpentier.
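
As a purely illustrative aside, the minimal Python sketch below shows pool-based uncertainty sampling, one simple way a learner can request the labels of carefully chosen points. It is a generic heuristic on invented data, not the adaptive, rate-optimal strategies analyzed in the talk.

```python
# Hypothetical sketch: pool-based uncertainty sampling, a generic active-learning
# heuristic. It only illustrates "requesting the labels of carefully chosen points";
# it is NOT the procedure studied in the talk.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pool: two Gaussian classes in two dimensions.
n_pool = 2000
X_pool = np.vstack([rng.normal(-1.0, 1.0, size=(n_pool // 2, 2)),
                    rng.normal(+1.0, 1.0, size=(n_pool // 2, 2))])
y_pool = np.repeat([0, 1], n_pool // 2)      # oracle labels, hidden until queried

labeled = [0, 1, 2, 3, 4, 1000, 1001, 1002, 1003, 1004]   # small seed set, both classes
budget = 100                                              # total number of label requests

clf = LogisticRegression(max_iter=1000)
while len(labeled) < budget:
    clf.fit(X_pool[labeled], y_pool[labeled])
    # Query the unlabeled point the current model is least certain about,
    # i.e. the one whose predicted probability is closest to 1/2.
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    uncertainty[labeled] = np.inf            # never re-query a labeled point
    labeled.append(int(np.argmin(uncertainty)))

clf.fit(X_pool[labeled], y_pool[labeled])
print(f"accuracy with {len(labeled)} labels: {clf.score(X_pool, y_pool):.3f}")
```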

Speech 2: Sampling design and stochastic gradient descent for relational data
Speaker: Peter Orbanz (Columbia)
Abstract: State-of-the-art learning procedures for relational data typically involve several steps that randomly subsample a data set: (1) during data acquisition from the underlying population; (2) during data splitting or cross-validation; (3) during training, if learning involves stochastic gradient descent. There are many natural ways to subsample relational data, and in practice it is not the exception but the rule that different sampling schemes are used in the different steps. That raises a number of problems: if the sampling schemes do not cohere in a suitable sense, the meaning of prediction becomes ambiguous, error estimates are biased, and so on. I will discuss what conditions are required to avoid such problems, and describe a new method for learning from relational data that incorporates the sampling scheme as an explicit model design choice.

Joint work with Victor Veitch, Wenda Zhou, Morgane Austern and David Blei.
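
As an illustrative aside (not the method described in the talk), the sketch below contrasts two natural ways of subsampling a toy graph when forming minibatches, e.g. for stochastic gradient descent, to make concrete why the choice of sampling scheme matters; all names and data are invented.

```python
# Hypothetical toy illustration: two natural but different ways to subsample
# relational data (here, an undirected graph) when forming minibatches. The point
# is only that the sampling scheme is a modeling choice with statistical consequences.
import numpy as np

rng = np.random.default_rng(0)

# Toy graph on n vertices: each possible edge is present with probability 0.05.
n = 100
edges = np.array([(i, j) for i in range(n) for j in range(i + 1, n)
                  if rng.random() < 0.05])

def edge_sampled_batch(edges, batch_size):
    """Scheme A: sample edges uniformly at random."""
    idx = rng.choice(len(edges), size=batch_size, replace=False)
    return edges[idx]

def vertex_induced_batch(edges, n, n_vertices):
    """Scheme B: sample a vertex subset and keep its induced subgraph."""
    keep = set(rng.choice(n, size=n_vertices, replace=False).tolist())
    return np.array([e for e in edges if e[0] in keep and e[1] in keep])

# The two schemes yield batches with very different statistics (e.g. edge density),
# so quantities estimated from them need not target the same population parameter.
print("edge-sampled batch size:   ", len(edge_sampled_batch(edges, 50)))
print("vertex-induced batch size: ", len(vertex_induced_batch(edges, n, 20)))
```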

Speech 3: Statistical Properties of Maximum Mean Discrepancy with Gaussian Kernels
Speaker: Tong Li (Columbia)
Abstract: Despite the popularity of reproducing-kernel-based techniques for nonparametric hypothesis testing, the choice of kernel in these approaches is usually ad hoc, and how to choose it in a more principled way remains one of the most critical challenges in practice. To overcome this difficulty, we provide justifications for one of the most common and successful choices: Gaussian kernels with a flexible shape parameter. More specifically, we study the statistical properties of maximum mean discrepancy (MMD) based testing procedures with Gaussian kernels. We show that they arise naturally when maximizing MMD over a general class of radial basis function kernels. Moreover, we show that when the underlying distributions are sufficiently smooth, MMD with Gaussian kernels gives rise to a test that is adaptive over different levels of smoothness, in that it attains the minimax optimal detection rates, up to a logarithmic factor, for any given smoothness index.
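
As an illustrative aside, the sketch below computes the standard unbiased estimate of squared MMD with a Gaussian kernel, whose bandwidth plays the role of the flexible shape parameter mentioned in the abstract; it is not the adaptive testing procedure analyzed in the talk, and the data and names are invented.

```python
# Minimal sketch of the standard unbiased estimator of squared MMD with a Gaussian
# kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)). The bandwidth sigma is the
# "flexible shape parameter"; the kernel maximization and test calibration studied
# in the talk are not shown here.
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma):
    """Unbiased estimate of squared MMD between samples X ~ P and Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Drop the diagonal terms so the within-sample averages are unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    term_xy = Kxy.mean()
    return term_xx + term_yy - 2.0 * term_xy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))   # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 1))   # sample from a shifted Q
print("estimated MMD^2:", mmd2_unbiased(X, Y, sigma=1.0))
```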