
Preparing for a career in "Big Data"

This comes from a Statistics PhD student of ours who is also a Data Scientist for a computing company in Research Park. Here are his thoughts on what students can do to prepare.

“I would recommend taking Java courses, since Java is relatively simple and widely used these days, and then maybe getting some training in Hadoop and Hive. There are also some good books, such as Thinking in Java, Hadoop: The Definitive Guide, and Programming Hive. There may also be some open source projects using “Big Data”; we can usually learn a lot from others’ design and code.”

Bohrer Workshop – Nov. 15

One of the most notable activities in our department for graduate students is the Bohrer Workshop. A subset of the graduate students are selected to give a quality presentation of their current research in statistics. This is an excellent opportunity for undergraduate students to get a peek at what might lie ahead if you too choose to pursue an advanced degree. You don’t have to be there the whole day. Just pick one or maybe two that sound interesting. Schedule below.

Bohrer Workshop Schedule


The power of Statistics

A post from one of your fellow Statistics students…

I wanted to bring this to your attention, as you may have seen it during the election. Nate Silver used some awesome yet simple models to predict the election. This LA Times article has some awesome stuff about math/statistics and its future use!,0,2926239.story
I thought it was very relevant and interesting as this just shows the power of numbers.  I think some of our grad students (even undergrad) would be interested in it so feel free to pass this on.  I followed that website for the last 6 months, and it was really cool to see how his numbers and projections changed.  He sure did stand his ground with his methods.


Statistics Seminar — John Lafferty at AIIS this Friday!

This week we are hosting Prof. John Lafferty. He’ll deliver a talk at the AIIS seminar. Please note that the talk venue has been moved from 3405 SC to 2405 SC as we are expecting a larger audience. Following are the details:

Nov 9, Friday. 4 pm.

2405, Siebel Center

Graphical Model Estimation

The graphical model has proven to be a useful abstraction in statistics and machine learning.  The starting point is the graph of a distribution.  While often the graph is assumed given, we have been studying the problem of estimating the graph from data.  In this talk we present several nonparametric and semi-parametric methods for graph estimation.  One approach is a nonparametric extension of the Gaussian graphical model that allows arbitrary graphs.  For the discrete Gaussian (Ising model), we use parallel neighborhood selection with L1-regularized logistic regression.  Alternatively, we can restrict the family of graphs to spanning forests, enabling the use of fully nonparametric density estimation in high dimensions.  When additional covariates are available, we propose a framework for graph-valued regression.  The resulting methods are easy to understand and use, theoretically well supported, and effective for modeling and exploring high dimensional data.  Joint work with Han Liu, Pradeep Ravikumar, Martin Wainwright, and Larry Wasserman.

John Lafferty is the Louis Block Professor in the Departments of Statistics, Computer Science, and the College at The University of Chicago. His research area is machine learning, with a focus on computational and statistical aspects of nonparametric methods, high-dimensional data, graphical models, and applications.  An associate editor of the Journal of Machine Learning Research, Dr. Lafferty served as program co-chair and general co-chair of the Neural Information Processing Systems Foundation conferences in 2009 and 2010. Dr. Lafferty received his doctoral degree in mathematics from Princeton University, where he was a member of the Program in Applied and Computational Mathematics.  Prior to joining the University of Chicago in 2011, he was Professor of Computer Science, Machine Learning, and Statistics at Carnegie Mellon University, where he is currently an Adjunct Professor.

Department of Statistics Weekly Seminar

Lee DeVille (University of Illinois at Urbana-Champaign): Stochastic dynamics on networks. Emergence of collective behaviors
Date                Nov 1, 2012
Time               4:00 pm – 4:50 pm
Location         156 Henry
Sponsor         Statistics Department
Event type     Seminar
Dynamical systems defined on networks have applications in many fields in science and engineering. In particular, it is important to understand when networks exhibit synchronous or other types of coherent collective behaviors. Other questions include whether such coherent behavior is stable with respect to random perturbation, or what the detailed structure of this behavior is as it evolves. We will examine several models of networked dynamical systems and present a mixture of results that range from rigorous theorems for abstract models to quantitative comparisons of models and data.

Department of Statistics Weekly Seminar

Wei Sun (University of North Carolina): Statistical methods for RNA-seq data
Date            Oct 18, 2012
Time            4:00 pm – 5:00 pm
Location     156 Henry
Sponsor      Statistics Department
Event type   Seminar
RNA-seq is replacing gene expression microarrays as the most commonly used technique to assess genome-wide transcription abundance. RNA-seq delivers two novel features. First, it provides information on allele-specific expression (ASE), which is not available from gene expression microarrays. Second, it generates unprecedentedly rich data to study RNA-isoform expression. I will present statistical methods for joint study of allele-specific expression and total expression of a gene, transcriptome reconstruction, isoform abundance estimation, and Differential isOform usage Testing (DOT).

Department of Statistics Weekly Seminar

Yuan Ji, Ph.D. (NorthShore University HealthSystem): Bayesian Models for Next-Generation Sequencing Data on Histone Modifications
Speaker         Yuan Ji, Ph.D. (NorthShore University HealthSystem )
Date                Oct 11, 2012
Time               4:00 pm – 4:50 pm
Location         156 Henry
Sponsor         Statistics Department
Event type     Seminar
In this talk, I will describe how Bayesian models are successfully applied to the field of epigenetics, which is concerned with the regulatory mechanisms of gene expression. Epigenetics, one of the most heavily researched and challenging fields in biology, increasingly draws attention from statisticians due to breakthroughs in bioengineering and biotechnology that allow large-scale and high-throughput experiments to be routinely conducted at affordable cost. A central topic of epigenetics is to understand the chromatin state: modifications to histones and other proteins that package the DNA. A complex mechanism called the “histone code” is believed to dictate the dynamics of DNA expression. As a step towards deciphering the histone code, we develop Bayesian models based on genome-wide mapping of histone modifications. Such models are only initial attempts to decipher the complex histone code, but they highlight the need for Bayesian inference in research on gene regulation, an area that has received relatively little attention from statisticians. I will summarize our recent work and results using a comprehensive ChIP-Seq data set.

Department of Statistics Weekly Seminar

Heike Hofmann, Ph.D. (Iowa State University)
Speaker           Heike Hofmann, Ph.D. (Iowa State University)
Date                Oct 4, 2012
Time                4:00 pm – 5:00 pm
Location          156 Henry
Sponsor           Statistics Department
A Discussion of Graphical Inference 
How do you know if something that you see in a data plot is really there? 
Statistical inference for exploratory data analysis allows us to quantitatively assess the strength of a visual finding, and places statistical graphics in the context of classical inference. New work builds on the lineup protocol, which puts graphics into an inference framework that examines the data plot in relation to null plots. This talk describes various aspects of the development of graphical inference: definitions of terminology and concepts, experiments conducted to validate the lineup protocol, and how to compute p-values and power. Applications of visual inference in practice will be discussed, including how to choose the best display and scenarios where no classical test exists because critical assumptions are violated.

Department of Statistics Weekly Seminar

Sewoong Oh (University of Illinois at Urbana-Champaign): Budget-Optimal Task-Allocation for Reliable Crowdsourcing Systems
Speaker           Sewoong Oh, University of Illinois at Urbana-Champaign
Date                Sep 27, 2012
Time                4:00 pm – 5:00 pm
Location          156 Henry
Sponsor           Statistics Department
Event type        Seminar
This talk is on my ongoing research on designing reliable and cost-efficient crowdsourcing systems. Crowdsourcing is a novel paradigm for solving large scale problems by breaking them down into small tasks that are electronically distributed to numerous on-demand human contributors. In typical crowdsourcing, these tasks are submitted to an electronic labor market and completed by any worker choosing to pick them up for a small reward. However, since typical crowdsourced tasks are tedious and the reward is small, errors are common even among those who make an effort. Thus, all taskmasters need to devise schemes to increase confidence in their answers. A common approach is to assign each task multiple times and combine the answers in some way, such as majority voting. For such systems, there is a fundamental problem of interest: how can we achieve a certain reliability in our answers at minimum cost? Under a general model, we provide an optimal algorithm based on low-rank matrix approximation and belief propagation. We prove that our approach significantly outperforms majority voting and, in fact, is asymptotically order-optimal through comparison to an oracle that knows the reliability of every worker. We also provide experimental results on synthetic and real datasets that support the optimality of our approach.
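
For readers new to the topic, the majority-voting baseline mentioned in the abstract can be sketched in a few lines of Python. This is only an illustration of the simple baseline, not the speaker's algorithm (which is based on low-rank matrix approximation and belief propagation). The task names, answers, and function names below are made up for the example.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label among the answers collected for one task."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the most frequent label.
    return counts.most_common(1)[0][0]


def aggregate(answers):
    """Map each task id to its majority-vote answer."""
    return {task: majority_vote(labels) for task, labels in answers.items()}


# Hypothetical data: each task was assigned to three workers.
# An odd number of workers avoids ties for yes/no questions.
answers = {
    "task_1": ["yes", "yes", "no"],
    "task_2": ["no", "no", "no"],
    "task_3": ["yes", "no", "no"],
}

result = aggregate(answers)
print(result)
```

Majority voting treats every worker as equally reliable, which is the limitation the talk's approach aims to overcome.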

Department of Statistics Weekly Seminar

Song-Xi Chen (Iowa State University): High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data
Speaker           Song-Xi Chen, Iowa State University
Date                Sep 20, 2012
Time                3:30 pm – 4:30 pm
Location          122 Illini Hall
Sponsor           Statistics Department
Event type        Seminar
This paper studies the maximum empirical likelihood estimation (MELE) and inference on parameters identified by generalized estimating equations with weakly dependent data when the dimensions of the estimating equations and the parameters are diverging. Our theory greatly extends a wide range of existing results to the new time series framework of growing dimensions of the parameters, the estimating equations and the observed covariates. We obtain the consistency with rates and the asymptotic normality of the MELE by properly restricting the growth rates of the dimensions of the parameters and the estimating equations, as well as the degree of dependence. We also show that, even in this high dimensional nonlinear time series setting, the empirical likelihood ratio still behaves like a Chi-square random variable asymptotically. (Note that time and location are different from usual)