Session 19: Big Data of different forms and different challenges

Session title: Big Data of different forms and different challenges
Organizer: Regina Liu (Rutgers)
Chair: Xuan Bi (Yale)
Time: June 5th, 8:30am – 10:00am
Location: VEC 1202/1203

Speech 1: Individualized Multilayer Learning with An Application in Breast Cancer Imaging
Speaker: Annie Qu (UIUC)
Abstract: This work is motivated by breast cancer imaging data produced by a multimodal multiphoton optical imaging technique. One unique aspect of breast cancer imaging is that different individuals may be imaged at different locations, which creates a technical difficulty because the imaging background can vary across individuals. We develop a multilayer tensor learning method that predicts disease status effectively by utilizing subject-wise imaging information. In particular, we construct an individualized multilayer model that leverages an additional layer of individual imaging structure on top of a high-order tensor decomposition shared across the population. Furthermore, to incorporate multimodal imaging data that profile the tissue, cellular, and molecular levels, we propose a higher-order tensor representation that combines multiple sources of information from different modalities, so that important features associated with disease status and clinical outcomes can be extracted effectively. One major advantage of our approach is that it captures the spatial information of microvesicles observed in certain optical imaging modalities by combining multimodal imaging data. This has medical and clinical significance because microvesicles are observed more frequently among cancer patients than among healthy individuals, and identifying microvesicles provides an effective diagnostic tool for early-stage cancer detection. This is joint work with Xiwei Tang and Xuan Bi.

Speech 2: Efficient estimation and fast algorithms for genetic microarray data with survival outcomes
Speaker: Catherine Chunling Liu (Polytech U of HK)
Abstract: In gene expression microarray studies, genetic and genomic data tend to be high- or ultrahigh-dimensional and are accompanied by randomly censored survival outcomes. To identify and evaluate influential features that affect the disease, it is imperative to develop new models, efficient estimation methodology, and feasible algorithms for this data setting. In this talk, we discuss three aspects. First, for ultrahigh-dimensional data modeled by the proportional hazards model, we present a non-monotone proximal gradient algorithm with a lasso-type initial value for feature screening and variable selection. Next, we recommend a single-index hazard model that does not specify the functional form; efficient estimation procedures for the index coefficients facilitate detecting the significance of individual effects. Finally, we consider jointly modeling the mean and intensity functions with a multiple-index structure and develop a unified methodology for dimension reduction. A normal acute myeloid leukemia data set is analyzed to demonstrate our approaches.

Speech 3: Nonparametric mean estimation for big-but-biased data
Speaker: Ricardo Cao (Universidade da Coruña)
Abstract: Some authors have recently warned about the risks of the claim that “with enough data, the numbers speak for themselves”. Some of the problems that arise from ignoring sampling bias in big data statistical analysis have recently been reported. This work considers the problem of nonparametric statistical inference in big data in the presence of sampling bias. The mean estimation problem is studied in this setup, in a nonparametric framework, both when the biasing weight function is known (unrealistic) and when it is unknown (realistic). Two different scenarios are considered to remedy the problem of ignoring the weight function: (i) having a small-sized simple random sample of the real population and (ii) having observed a sample from a doubly biased distribution. In both cases the problem is related to nonparametric density estimation. Asymptotic expressions for the mean squared error of the estimators proposed for scenario (i) are considered. This leads to asymptotic formulas for the optimal smoothing parameters. Some simulations illustrate the performance of the nonparametric methods proposed in this work.
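
The known-weight case mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the speaker's estimator; it assumes the standard ratio-type correction for a sample drawn with probability proportional to a known weight w(x), namely the sum of y_i / w(y_i) divided by the sum of 1 / w(y_i). The function name weighted_mean, the length-biased weight w(x) = x, and the Gamma population are all illustrative assumptions.

```python
import random


def weighted_mean(sample, w):
    """Ratio-type mean estimate from a sample biased by a known weight w.

    Assumes each y in `sample` was drawn with density proportional to
    w(y) * f(y), where f is the population density (illustrative setup).
    """
    inverse_weights = [1.0 / w(y) for y in sample]
    numerator = sum(y * v for y, v in zip(sample, inverse_weights))
    return numerator / sum(inverse_weights)


random.seed(0)
n = 20000

# Assumed population: Gamma(shape=3, scale=1), whose true mean is 3.
# Length-biased sampling (w(x) = x) from it yields Gamma(shape=4, scale=1).
biased_sample = [random.gammavariate(4.0, 1.0) for _ in range(n)]

# Ignoring the bias estimates the mean of the biased distribution instead.
naive_mean = sum(biased_sample) / n

# Reweighting by 1 / w(y) targets the population mean.
corrected_mean = weighted_mean(biased_sample, lambda y: y)

print("naive mean:", naive_mean)
print("corrected mean:", corrected_mean)
```

In this setup the naive average stays near the mean of the biased distribution no matter how large n is, while the reweighted estimate approaches the population mean. The realistic case of an unknown weight function, which the talk addresses through density estimation, is not covered by this sketch.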