Session 5: Statistical Inference in Clustering Problems

Organizer: Jacob Bien (Cornell)
Chair: Jacob Bien (Cornell)
June 4th, 9:00-10:30am
Location: VEC 1402

Speech 1: Inference for variable clustering under correlation-like similarities

Speaker: Max G’Sell (CMU)
Abstract: Clustering is often applied to detect dependence structure among the variables in large data sets. However, it is typically difficult to determine the appropriate amount of clustering to carry out in a given application. We will take a selective inference approach to testing hierarchical clusterings of variables based on measures of their correlation. We will see that this yields reasonable goodness-of-fit stopping rules for selecting the number of clusters. We will consider weakening the required assumptions and generalizing the measure of correlation, and the computational issues that arise in this pursuit.
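
Illustration (not taken from the talk): the clustering step that the abstract takes as its starting point can be sketched as follows. This is a minimal sketch that assumes average linkage and the dissimilarity 1 - |Pearson correlation|; both choices, the function names `pearson` and `cluster_variables`, and the simulated data are hypothetical. The selective inference test itself is not implemented; here the number of clusters is supplied by hand, which is exactly the choice a stopping rule would replace.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_variables(columns, n_clusters):
    """Average-linkage agglomerative clustering of variables.

    columns: list of equal-length lists, one list per variable.
    Dissimilarity between variables i and j is 1 - |corr(i, j)| (an assumption).
    Returns the final clusters and the list of merges with their heights.
    """
    p = len(columns)
    d = [[0.0 if i == j else 1.0 - abs(pearson(columns[i], columns[j]))
          for j in range(p)] for i in range(p)]
    clusters = [[i] for i in range(p)]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise dissimilarity between the two clusters.
                total = sum(d[i][j] for i in clusters[a] for j in clusters[b])
                link = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or link < best[0]:
                    best = (link, a, b)
        link, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), link))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters], merges


# Simulated data: variables 0-2 share one latent factor, variables 3-4 another.
random.seed(0)
n = 200
factor_a = [random.gauss(0, 1) for _ in range(n)]
factor_b = [random.gauss(0, 1) for _ in range(n)]
columns = [[f + 0.3 * random.gauss(0, 1) for f in factor_a] for _ in range(3)]
columns += [[f + 0.3 * random.gauss(0, 1) for f in factor_b] for _ in range(2)]

clusters, merges = cluster_variables(columns, n_clusters=2)
print(sorted(clusters))
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The merge heights printed at the end are the quantities a goodness-of-fit stopping rule would examine when deciding where to cut the tree.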

Speech 2: Large scale cluster analysis via L1 fusion penalization

Speaker: Gourab Mukherjee (USC)
Abstract: We study the large-sample behavior of a convex clustering framework, which minimizes the sample within-cluster sum of squares under an $L_1$ fusion constraint on the cluster centroids. We establish that the sample procedure consistently estimates its population analog. We derive the corresponding rates of convergence and develop a novel methodology for feature screening in the clustering of massive datasets. We demonstrate empirically the applicability of our method to cluster analysis of big datasets arising in single-cell gene expression studies.
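
Illustration (not taken from the talk): the trade-off behind a fusion-type criterion can be sketched by evaluating a penalized objective at a few candidate centroid assignments. This sketch assumes the Lagrangian form, sum of squared distances from each point to its centroid plus `lam` times the sum of pairwise L1 distances between centroids; the abstract states a constraint, so the penalized form, the function name `fusion_objective`, the value of `lam`, and the toy data are all assumptions. No optimization is performed, and none of the three candidates need be the actual minimizer.

```python
def fusion_objective(points, centroids, lam):
    """Squared-error fit plus lam times the pairwise L1 fusion penalty.

    points, centroids: lists of equal-length coordinate tuples; centroids[i]
    is the centroid assigned to points[i].
    """
    fit = sum(sum((x - m) ** 2 for x, m in zip(p, c))
              for p, c in zip(points, centroids))
    penalty = 0.0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            penalty += sum(abs(a - b) for a, b in zip(centroids[i], centroids[j]))
    return fit + lam * penalty


# Toy data: two tight groups of two points each.
points = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]

own = list(points)                             # every point is its own centroid
grouped = [(0.1, 0.0)] * 2 + [(5.1, 5.0)] * 2  # centroids fused within each group
pooled = [(2.6, 2.5)] * 4                      # all centroids fused to the grand mean

lam = 0.5
values = {name: fusion_objective(points, cents, lam)
          for name, cents in [("own", own), ("grouped", grouped), ("pooled", pooled)]}
print(values)
```

With this `lam`, fusing centroids within each group scores lower than both leaving every point separate and pooling everything, which is the behavior that lets the fused centroids define clusters.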

Speech 3: Density Tree and Density Ranking in Singular Measures

Speaker: Yen-Chi Chen (UW)
Abstract: A density tree (also known as a cluster tree of a probability density function) is a tool in topological data analysis that uses a tree structure to represent the shape of a density function. Even if the density function is multivariate, a density tree can always be displayed on a two-dimensional plane, making it an ideal tool for visualizing the shape of a multivariate dataset. However, in complex datasets such as GPS data, the underlying distribution function is singular, so the usual density function and density tree no longer exist. To analyze this type of data and generalize the density tree, we introduce the concepts of density ranking and the ranking tree (also called an $\alpha$-tree). We then show that one can consistently estimate the density ranking and the ranking tree using a kernel density estimator. Based on the density ranking, we introduce several geometric and topological summary curves for analyzing GPS datasets.
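
Illustration (not taken from the talk): a density ranking estimate can be sketched as the fraction of sample points whose kernel density estimate is no larger than the estimate at the query point. This is a minimal one-dimensional sketch under that assumed definition, with a Gaussian kernel; the function names `kde` and `density_ranking`, the bandwidth `h`, and the toy data are hypothetical.

```python
import math


def kde(x, data, h):
    """Gaussian kernel density estimate at x for 1-D data with bandwidth h."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm


def density_ranking(x, data, h):
    """Fraction of sample points whose estimated density is <= that at x."""
    px = kde(x, data, h)
    return sum(1 for xi in data if kde(xi, data, h) <= px) / len(data)


# Toy sample: five points near 0.1 and one isolated point at 3.0.
data = [0.0, 0.1, 0.2, 0.1, 0.15, 3.0]
h = 0.3

rank_dense = density_ranking(0.1, data, h)   # query inside the dense group
rank_sparse = density_ranking(3.0, data, h)  # query at the isolated point
print(rank_dense, rank_sparse)
```

Because the ranking depends only on the ordering of the estimated values, it stays between 0 and 1 whatever the scale of the kernel estimate, which is what makes it usable when a density in the usual sense does not exist.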