Laure Thompson is a PhD candidate in computer science at Cornell University where she is advised by David Mimno. Her research interests are in the areas of natural language processing, machine learning, and digital humanities. More specifically, Laure’s research focuses on understanding what computational models learn and how we can intentionally change what they learn. Driven by humanistic applications, her work uses a wide range of cultural heritage corpora: from texts of science fiction novels and the Patrologia Graeca to images of avant-garde journals and engraved gemstones. Laure is a recipient of an NSF Graduate Research Fellowship and a Cornell University Fellowship. She received her bachelor’s degrees in computer science and electrical engineering with minors in mathematics and classical studies from the University of Washington in 2013.
Predicting and Directing What Machine Learning Learns
Although a vast amount of digital content is available for scholars to study, it is often underused because it is difficult to explore and analyze efficiently at scale. Machine learning and statistical methods, such as unsupervised semantic models and convolutional neural networks, can help by organizing large collections into lower-dimensional latent vector spaces, giving scholars a way to determine which items merit a closer look and to identify characteristics of whole collections. However, it is often unclear what underlying patterns these spaces capture, and which patterns they are more likely to capture, making it difficult for scholars to know when these methods are applicable to their work. Moreover, since not every learned structure is useful for a given line of scholarly inquiry, how can we directly influence what models learn?

My research addresses all of these issues, with particular emphasis on the final question. In recent work, I have focused on a common problem in the topic modeling of literary collections: many topics (learned word distributions) are highly correlated with known metadata such as authorship. For example, a topic whose top words are “robot”, “human”, and “brain” corresponds to Isaac Asimov’s Robots series rather than to a general topic on artificial intelligence. I developed metrics to identify these correlated topics and a method for discouraging models from learning these correlations in the first place. Currently, I am investigating how linguistic features are encoded in word embeddings across languages, as well as how features extracted from neural networks can be tailored to cultural heritage collections.
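To illustrate the kind of diagnostic involved, here is a minimal sketch of one plausible metadata-correlation metric (not the author's actual method): for each topic, take its most strongly associated documents and compute the entropy of their author labels. A near-zero entropy suggests the topic tracks a single author rather than a cross-collection theme. The function name, the toy data, and the choice of entropy are all illustrative assumptions.

```python
import numpy as np
from collections import Counter

def topic_author_entropy(doc_topics, authors, top_n=20):
    """Hypothetical diagnostic: for each topic, take the top_n documents
    by topic weight and compute the Shannon entropy of their author
    labels. Low entropy suggests an author-specific topic."""
    n_docs, n_topics = doc_topics.shape
    entropies = []
    for k in range(n_topics):
        # Documents most strongly associated with topic k
        top_docs = np.argsort(doc_topics[:, k])[::-1][:top_n]
        counts = Counter(authors[i] for i in top_docs)
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        entropies.append(-sum(p * np.log2(p) for p in probs))
    return entropies

# Toy example: 6 documents, 2 topics, authors "A" and "B"
doc_topics = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
    [0.1, 0.9], [0.2, 0.8], [0.6, 0.4],
])
authors = ["A", "A", "A", "B", "B", "A"]
ent = topic_author_entropy(doc_topics, authors, top_n=3)
# Topic 0's top documents are all by author A, so its entropy is 0,
# flagging it as a candidate author-correlated topic.
```

In a real setting, `doc_topics` would come from a fitted topic model's document-topic matrix, and the entropy could be compared against a permutation baseline to judge significance.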