On Monday, March 5, the Scholarly Commons Interns (Matt and Clay) hosted the first Project Forum Discussion. In order to address the variety of projects and scholarly backgrounds, we decided that our conversations should be organized around presentations of projects and related readings from other Digital Humanities scholars or related research.
We began by discussing some consistent topics or questions that appear in each of our Digital Humanities projects and how we conceptualize them. These questions will guide not only our discussion of this article, but also further conversations as we read work under the DH umbrella.
1. How does the article make its DH work legible to other scholars / fields?
2. How does the article display information?
3. What affordances or impact does the digital platform (artifact) have on the study?
4. How does the article conceptualize gaps in the data?
If you would like to participate in our next discussion, please join us Monday, March 26, at 2 pm in Library 220.
Preparing Your Data for Topic Modeling
In keeping with my series of blog posts on my research project, this post is about how to prepare your data for input into a topic modeling package. I used Twitter data in my project, which is relatively sparse at only 140 characters per tweet, but the principles can be applied to any document or set of documents that you want to analyze.
Topic Models:
Topic models work by identifying and grouping words that co-occur into “topics.” As David Blei writes, Latent Dirichlet allocation (LDA) topic modeling makes two fundamental assumptions: “(1) There are a fixed number of patterns of word use, groups of terms that tend to occur together in documents. Call them topics. (2) Each document in the corpus exhibits the topics to varying degree. For example, suppose two of the topics are politics and film. LDA will represent a book like James E. Combs and Sara T. Combs’ Film Propaganda and American Politics: An Analysis and Filmography as partly about politics and partly about film.”
Topic models do not have any actual semantic knowledge of the words, and so they do not “read” the sentences. Instead, topic models use math: tokens/words that tend to co-occur are statistically likely to be related to one another. However, that also means the model is susceptible to “noise,” falsely identifying patterns of co-occurrence when unimportant but highly repeated terms appear. As with most computational methods, “garbage in, garbage out.”
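To make this concrete, here is a minimal sketch of what an LDA package does with a toy corpus. I use the gensim library here purely for illustration (my own modeling, described below, used MALLET), and the toy documents are made up:

    # A minimal LDA sketch on a toy corpus, using gensim for illustration only.
    # (Install with: pip install gensim)
    from gensim import corpora
    from gensim.models import LdaModel

    # Toy "documents" -- in practice these would be cleaned tweets
    docs = [
        "gun control debate congress vote",
        "film propaganda politics analysis",
        "congress politics vote election",
    ]
    tokenized = [doc.split() for doc in docs]

    dictionary = corpora.Dictionary(tokenized)               # map each word to an id
    corpus = [dictionary.doc2bow(doc) for doc in tokenized]  # bag-of-words counts

    # Ask for 2 topics; LDA estimates which words cluster together and
    # how much of each topic appears in each document.
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)

Each printed topic is a weighted mix of words, and the model also assigns every document its own mixture of topics, which is the “varying degree” Blei describes.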
In order to make sure that the topic model is identifying interesting or important patterns instead of noise, I had to accomplish the following pre-processing or “cleaning” steps.
- First, I removed the punctuation marks, like “,.;:?!”. Without this step, commas started showing up in all of my results. Since they didn’t add to the meaning of the text, they were not necessary to analyze.
- Second, I removed the stop-words, like “I,” “and,” and “the,” because those words are so common in any English sentence that they tend to be over-represented in the results. Many of my tweets were emotional responses written in the first person, which tended to skew my results. You should be careful about which stop-words you remove, though: filtering them out without checking the list first can accidentally remove important data.
- Finally, I removed overly common words that were specific to my data. For example, many of my tweets were retweets and therefore contained the word “rt.” I also ended up removing mentions of other users, because highly retweeted texts meant that Twitter handles were showing up as significant words in my results. (A combined sketch of these three steps follows this list.)
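Put together, the three steps look roughly like this on a single made-up tweet (a simplified sketch; the actual cleaning, described below, was done with OpenRefine and NLTK):

    # A rough, single-tweet sketch of the three cleaning steps above.
    import re

    tweet = "RT @someuser: I think the gun control debate is back, again!"
    text = tweet.lower()

    # 1. Remove punctuation (keep @ and # so the handle can still be found below)
    text = re.sub(r"[^\w\s@#]", "", text)

    # 2. Remove common English stop-words (a tiny hand-picked set; NLTK has a full list)
    stops = {"i", "the", "is", "and", "again"}
    tokens = [t for t in text.split() if t not in stops]

    # 3. Remove corpus-specific noise: "rt" and @mentions
    tokens = [t for t in tokens if t != "rt" and not t.startswith("@")]

    print(" ".join(tokens))  # -> "think gun control debate back"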
Cleaning the Data:
My original data set was 10 Excel files of 10,000 tweets each. In order to clean and standardize all these data points, as well as to combine my files into one single document, I used OpenRefine. OpenRefine is a powerful tool, and it makes it easy to work with all your data at once, even when there are a large number of entries. I uploaded all of my data sets, then performed some quick cleaning available under the “Common Transformations” option in the triangle dropdown at the head of each column: I changed everything to lowercase, unescaped HTML characters (to make sure that I didn’t get errors when trying to process the text in Python), and removed extra white spaces between words.
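If you are working without OpenRefine, these same “Common Transformations” come down to a few lines of Python (a rough equivalent, not the exact GREL that OpenRefine runs):

    # Rough Python equivalents of the OpenRefine "Common Transformations" used here:
    # to lowercase, unescape HTML entities, and collapse extra whitespace.
    import html
    import re

    def common_transformations(text):
        text = text.lower()               # "To lowercase"
        text = html.unescape(text)        # "Unescape HTML entities" (&amp; -> &)
        text = re.sub(r"\s+", " ", text)  # "Collapse consecutive whitespace"
        return text.strip()

    print(common_transformations("Gun control &amp; the   Senate "))
    # -> "gun control & the senate"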
OpenRefine also lets you use regular expressions, a kind of search pattern for finding specific strings of characters inside other text. This allowed me to remove punctuation, hashtags, and author mentions by running find-and-replace commands.
- Remove punctuation: grel:value.replace(/(\p{P}(?<!')(?<!-))/, "")
- Any punctuation character is removed, except apostrophes and hyphens (the two lookbehinds keep those).
- Remove users: grel:value.replace(/(@\S*)/, "")
- Any string that begins with an @ is removed. It ends at the space following the word.
- Remove hashtags: grel:value.replace(/(#\S*)/, "")
- Any string that begins with a # is removed. It ends at the space following the word.
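These patterns can also be sanity-checked in Python before deploying them in OpenRefine. The sketch below uses the third-party regex module (pip install regex), since Python’s built-in re module does not understand \p{P}, and the sample tweet is made up:

    # Trying the three OpenRefine patterns on a made-up tweet in Python.
    # The third-party "regex" module supports \p{P}; the built-in "re" does not.
    import regex

    tweet = "rt @drlawyercop: sensible, national gun control? #guncontrolnow"

    print(regex.sub(r"@\S*", "", tweet))               # strips the @mention
    print(regex.sub(r"#\S*", "", tweet))               # strips the #hashtag
    print(regex.sub(r"\p{P}(?<!')(?<!-)", "", tweet))  # strips punctuation, keeps ' and -

One thing to watch: @ and # are themselves punctuation characters in Unicode, so it makes sense to remove mentions and hashtags before running the punctuation replacement.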
Regular expressions, commonly abbreviated as “regex,” can take a little getting used to. Fortunately, OpenRefine itself has some solid documentation on the subject, and I also found this cheatsheet valuable as I was trying to get it to work. If you want to create your own regex search strings, regex101.com has a tool that lets you test your expression before you actually deploy it in OpenRefine.
After downloading the entire data set as a Comma Separated Value (.csv) file, I used the Natural Language ToolKit (NLTK) for Python to remove stop-words. The code itself can be found here. I first saved the content of the tweets as a single text file, and then told NLTK to go over every line of the document and remove words that are in its common stop-word dictionary. The output is then saved in another text file, which is ready to be fed into a topic modeling package, such as MALLET.
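The script boils down to something like the sketch below (a simplified version; the file names are placeholders, and it assumes one cleaned tweet per line):

    # Simplified sketch of the NLTK stop-word removal step.
    # Assumes one cleaned tweet per line in tweets.txt (placeholder file names).
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # only needed the first time
    stops = set(stopwords.words("english"))

    with open("tweets.txt", encoding="utf-8") as infile, \
         open("tweets_nostops.txt", "w", encoding="utf-8") as outfile:
        for line in infile:
            kept = [word for word in line.split() if word not in stops]
            outfile.write(" ".join(kept) + "\n")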
At the end of all these cleaning steps, my resulting data is essentially composed of unique nouns and verbs, so, for example, @Phoenix_Rises13’s tweet “rt @drlawyercop since sensible, national gun control is a steep climb, how about we just start with orlando? #guncontrolnow” becomes instead “since sensible national gun control steep climb start orlando.” This means that the topic modeling will be more focused on the particular words present in each tweet, rather than commonalities of the English language.
Now that my data is cleaned of additional noise, it is ready to be input into a topic modeling program.
Interested in working with topic models? There are two Savvy Researcher topic modeling workshops, on December 6 and December 8, that focus on the theory and practice of using topic models to answer questions in the humanities. I hope to see you there!
Studying Rhetorical Responses to Terrorism on Twitter
As part of my internship at the Scholarly Commons, I’m going to do a series of posts describing the tools and methodologies that I’ve used to work on my dissertation project. This write-up serves as an introduction to my project, its larger goals, and the tools I use to start working with my data.
The Dissertation Project
In general, my dissertation draws on computational methodologies to account for the digital circulation and fragmentation of political movement texts in new media environments. In particular, I will examine the rhetorical responses on Twitter to three terrorist attacks in the U.S.: the 2013 Boston Marathon Bombing, the 2015 San Bernardino Shooting, and the 2016 Orlando Nightclub shooting. I begin with the idea that terrorism is a kind of message directed at an audience, and I am interested in how digital audiences in the U.S. come to understand, make meaning of, and navigate uncertainty following a terrorist attack. I am interested in the patterns of narratives, community construction, and expressions of affect that characterize terrorism as a social media phenomenon.
I am interested in the following questions: What methods might rhetorical scholars use to better understand the vast numbers of texts, posts, and “tweets” that make up our social media? How do digital audiences construct meanings in light of terrorist attacks? How does the interwoven agency and materiality of digital spaces influence forms of rhetorical action, such as invention and style? In order to address such challenges, I turn to the tools and techniques of the Digital Humanities as computational modes of analysis for examining the digitally circulated rhetoric surrounding terror events. Investigating this rhetoric with topic models will help scholars not only understand particular aspects of terrorism as a social media phenomenon, but also better see the ways that community and identity are themselves formed amid digitally circulated texts.
At the beginning of this project, I had no experience working with textual data, so the following posts represent a cleaned and edited version of the learning process I went through. There was a lot of mess and exploration involved, but working through it meant I came to understand a lot more.
Gathering The Tools
I use a Mac, so accessing the command line is as simple as firing up Terminal.app. Windows users have to do a bit more work to get all these tools, but plenty of tutorials can be found with a quick search.
Python (Anaconda)
The first big choice was whether to learn R or Python. I’d heard that Python was better for text and R was better for statistical work, but it seems that it mostly comes down to personal preference, since you can find people doing both kinds of work in either language. Both R and Python have a bit of a learning curve, but a quick search for topic modeling in Python gave me a ton of useful results, so I chose to start there.
Anaconda is a package and environment management system for the Python language. What’s great about Anaconda is not only that it has a robust package manager (so I can easily download the tools and libraries I need without having to worry about dependencies or other errors), but also that it encourages the creation of “environments” to work in. This means I can make mistakes, or install and uninstall packages, without having to worry about messing up my overall system or my other environments.
Instructions for downloading Anaconda can be found here, and I found this cheat-sheet very useful in setting up my initial environments. Python has a ton of documentation, so these pages are useful, and there are plenty of tutorials online. Each environment comes with a few default packages, and I quickly added some toolkits for processing text and plotting graphs.
[Screenshot: Confirming the Conda installation in Terminal, activating an environment, and listing the installed packages.]
StackOverflow
Lots of people working with Python have the same problems or issues that I did. Whenever my code encountered an error, or when I didn’t know how to do something like write to a .txt file, searching StackOverflow usually got me on the right track. Most answers link to the Python documentation that relates to the question, so not only did I fix what was wrong but I also learned why.
GitHub
Scholars often put their code on GitHub to share it, advance research, and confirm their findings. I found code here for topic modeling in Python, and I also set up repositories for my own work. GitHub also provides useful version control, which meant that I never “lost” old code and could track changes over time.
Programming Historian
This is a site for scholars interested in learning how to use tools for Digital Humanities work. There are some great tutorials here on a range of topics, including how to set up and use Python. It’s approachable and does a good job of covering everything you need to know.
These tools, taken together, form the basis of my workspace for dealing with my data. Upcoming topics will cover Data Collection, Cleaning the Data, Topic Models, and Graphing the Results.