Erin Molloy

University of Illinois at Urbana-Champaign

PhD Candidate

I am a PhD candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. I work at the intersection of computational biology (advised by Tandy Warnow) and parallel computing (advised by William Gropp), and my dissertation research focuses on scalable algorithm design (combining divide-and-conquer with techniques from graph theory and statistical learning) in order to build evolutionary trees from large genomic datasets (for example, the Tree of Life). I have been awarded the National Science Foundation Graduate Research Fellowship and two exploratory allocations on the Blue Waters supercomputer in support of my thesis research. More broadly, I am interested in the design and evaluation of (parallel) algorithms to enable scientific discovery from (big) datasets. Last fall, I spent three months at the Institute for Pure and Applied Mathematics for the long program: Science at Extreme Scales—Where Big Data Meets Large Scale Computing. Before coming to Illinois, I was a neuroimaging researcher at the Health Emotions Research Institute (University of Wisconsin-Madison). My bachelor’s degree is in physics (University of Chicago). Learn more on my webpage: http://erinkmolloy.web.illinois.edu/

Research Abstract:

Graphical models of evolution (called phylogenies) can be used to study evolutionary processes, for example, how do species evolve and adapt to their environments. In addition, phylogenies are important for many applications from protein structure and function prediction to metagenomics. While recent advances in sequencing technologies have generated large datasets (including full genomes for an increasingly large number of species), phylogenetic inference can be computationally challenging. Many of the best methods are heuristics for solving NP-hard optimization problems, and such methods typically have limited parallelism when the number of species is large.

For my dissertation research, I address this challenge by introducing a novel divide-and-conquer phylogeny estimation pipeline. The pipeline operates by (1) dividing the set of n species into k pairwise disjoint subsets, (2) building a tree on each subset, and (3) combining the subset trees together using some auxiliary information (also computed from the dataset). For step 3, I present two polynomial-time methods (called NJMerge and TreeMerge) with the later method having improved parallel efficiency and substantially fewer rounds of collective communication—O(1) instead of O(n). Finally, I show that both of these methods perform well in practice (via a simulation study) and prove that they can be used to create statistically consistent phylogeny estimation pipelines.