Lightning Review: Text Analysis with R for Students of Literature

Cover of Text Analysis with R book

My undergraduate degree is in Classical Humanities and French, and like many humanities and liberal arts students, I mostly used computers for accessing Oxford Reference Online and double-checking that “bonjour” meant “hello” before turning in term papers. Actual critical analysis of literature came from my mind and my research, and nothing else. More recently, scholars in the humanities have begun to see the potential of computational methods for their work and have coined the term “digital humanities” for these approaches. Computational text analysis provides insights that, in many cases, a human mind could not produce on its own. When was the last time you read 100 books to count occurrences of a certain word, or looked at thousands of documents to group their contents by topic? In Text Analysis with R for Students of Literature, Matthew Jockers presents programming concepts specifically as they relate to the study of literature, with plenty of help to make even the most technophobic English student a digital humanist.

Jockers’ book caters to the beginning coder. You download practice texts from his website that are already formatted for use in the tutorials, and he doesn’t dwell too long on pounding programming concepts into your head. I came to this text having already taken a course on Python, where we edited text and completed exercises similar to the ones in this book, but even a complete beginner would find Jockers’ explanations perfect for diving into computational text analysis. Some advanced statistical concepts are presented that may put off the less mathematically inclined, but these are included only to deepen your understanding of what R does in the background, and can be left to the computer scientists. Practice-based and easy to get through, Text Analysis with R for Students of Literature serves its primary purpose of bringing the possibilities of programming to those used to traditional literary research methods.

Ready to start using a computer to study literature? Visit the Scholarly Commons to view the physical book, or download the eBook through the Illinois library.

Puentes/Bridges: Highlights from DH2018

At the end of June, the Alliance of Digital Humanities Organizations (ADHO) held its annual international DH conference, Digital Humanities 2018, in Mexico City. DH2018 was the first conference in the organization’s history to be held in Latin America and in the Global South. With a theme of Puentes/Bridges, DH2018 emphasized transnational discourse and inclusivity. Here are some highlights from the event!

Latin@ voices in the Midwest: Ohio Habla Podcast
Elena Foulis of Ohio State University discussed Ohio Habla, a podcast project that seeks to educate others on the Latin@ experience in the Midwest with interviews conducted in English and Spanish (and a mixture of the two).

Visualizing the Digital Humanities Community
What does the DH community look like? Researchers from University College London’s Centre for Digital Humanities visualized how authors of DH articles cite each other and interact with each other on Twitter, and compared the two networks.

Network Analysis of Javanese Traditional Theatre
How do characters in Javanese traditional theatre relate to one another? In an excellent example of non-traditional digital publishing, Miguel Escobar Varela of the National University of Singapore communicates his research findings on an interactive webpage.

Mayan hieroglyphs as a computer font

Achieving Machine-Readable Mayan Text Via Unicode
Carlos Pallan Gayol of the University of Bonn and Deborah Anderson of UC Berkeley are working to create Unicode equivalents of Mayan hieroglyphs, making the script machine-readable and ensuring reliable access to this language across devices.

Hurricane Memorial: Chronicling the Hurricane of 1928
A massive hurricane devastated Florida, Puerto Rico, and other parts of the Caribbean in 1928, but the story of this storm shifts depending on who you ask. Most of the storm’s victims were black migrant workers from Puerto Rico and Caribbean islands, whose deaths are minimized in most accounts. Christina Boyles of Trinity College seeks to “bring the stories of the storm’s underrepresented victims back into our cultural memory.”

Does “Late Style” Exist? New Stylometric Approaches to Variation in Single-Author Corpora
Jonathan Pearce Reeve presented some preliminary findings from his research investigating whether an author has a true “late style.” Late style is a term best known from the work of Edward Said, referring to an author’s shift later in life to a writing style distinct from their “early” style. Read a review of Said’s book On Late Style. Code and other supplemental materials from Reeve’s research are available on GitHub.

Screenshot from the 4 Ríos webpage showing drawings of people

4 Ríos: El Naya
A digital storytelling project about the impacts of armed conflict in Colombia, 4 Ríos is a transmedia project that includes a website, short film, and an interactive web-comic.

Researchers from our own University of Illinois participated in the conference, including Megan Senseney and Dan Tracy. Senseney, along with other Illinois researchers, presented “Audiences, Evidence, and Living Documents: Motivating Factors in Digital Humanities Monograph Publishing,” a survey of the motivations behind humanities scholars’ digital publishing actions and needs. Senseney also participated in a panel, “Unanticipated Afterlives: Resurrecting Dead Projects and Research Data for Pedagogical Use,” a discussion of how we might use unmaintained DH projects and data for learning purposes.

Tracy and other Illinois researchers presented a poster, “Building a Bridge to Next Generation DH Services in Libraries with a Campus Needs Assessment,” reporting results from a survey of the need for future DH services at research institutions and of how the library might facilitate this evolution. View Tracy’s poster in IDEALS.

ADHO gathered all of the resources tweeted out during the conference for you to browse. You can also view a detailed schedule of presentations with descriptions, read the paper abstracts, or search #DH2018 on Twitter to see all the happenings!

Our Graduate Assistants: Kayla Abner

This interview is part of a new series introducing our graduate assistants to our online community. These are some of the people you will see when you visit our space, who will greet you with a smile and a willingness to help! Say hello to Kayla Abner!

What is your background education and work experience?

I have a Bachelor’s degree in Classical Humanities and French from Wright State University in Dayton (Go Raiders!). My original plan was to teach high school French or Latin, but after completing a student teaching practicum, I decided that wasn’t for me. During undergrad and after graduation, I always wound up in a job role that involved research or customer service in some capacity, which I really enjoyed.

What led you to your field?

Knowing that I enjoyed working on research, I considered going back to school for Library Science, but wanted to be sure before taking the jump. It was always interesting to see the results of the research I helped conduct, and I enjoyed helping people find answers, whether it was a coworker or a client. After a visit to an American Library Association conference in 2016, I fell in love with the collaborative and share-alike nature of librarianship, and was accepted to this program the next year!

What are your research interests?

Library science has so many interesting topics that it’s hard to choose just one. But I like looking at how people seek information, and how libraries can use that knowledge to enhance their services and resources. I’m a browser when it comes to bookshelves, and it’s interesting to see how libraries are succeeding (or failing) at bringing that experience to the digital realm.

What are your favorite projects you’ve worked on?

I have two positions here in the library, one in the Scholarly Commons, and one working with our (current) Digital Humanities librarian, Dan Tracy. In both roles, I’ve worn a lot of hats, so to speak. My favorites have been creating resources like library guides, and assisting with creating content for our Savvy Researcher workshop series. Maintaining our library guides requires some experience with the software, so I enjoy learning new cool things that our programs can do. I also do a lot of graphic design work, which is a lot of fun!

Completing some of these tasks lets me use some Python knowledge from my coursework, which is sort of like a fun puzzle (how do I get this to work??). I’m really interested in using digital methods and tools in research, like text mining and data visualization. Coming from a humanities background, it is very exciting to see the cool things humanists can do beyond traditional scholarship. Digital humanities is a really interesting field that bridges the gap between computer science and the humanities.

What are some of your favorite underutilized resources that you would recommend?

Our people! They aren’t underutilized, but I love an opportunity to let campus know that we are an excellent point of contact between you and an expert. If you have a weird research question in one of our service areas, we can put you in contact with the best person to help you.

When you graduate, what would your ideal job position look like?

I would love to work in an academic research library in a unit similar to the Scholarly Commons, where researchers can get the support they need to use digital methods and data in their research, especially in the humanities. There is such a breadth of digital techniques that humanities researchers can utilize, and they don’t necessarily replace traditional research methods. Distant reading a text yields different observations than traditional close reading, and both are equally useful.

What is the one thing you would want people to know about your field?

Librarians are happy to help you; don’t let a big desk intimidate you away from asking a question. That’s why we’re here!

New Digital Humanities Books in the Scholarly Commons!

Is there anything quite as satisfying as a new book? We just got a new shipment of books here in the Scholarly Commons that complement all our services, including digital humanities. Our books are non-circulating, so you cannot check them out, but these DH books are always available for your perusal in our space.

Stack of books in the Scholarly Commons

Two brand new and two mostly new DH books

Digital Humanities: Knowledge and Critique in a Digital Age by David M. Berry and Anders Fagerjord

Two media studies scholars examine the history and future of digital humanities. DH is a relatively new field, and one that is still not clearly defined. Berry and Fagerjord take a deep dive into the methods digital humanists gravitate toward and critique their use in relation to the broader cultural context. They are more critical of the “digital” than the “humanities,” meaning they focus more on how the use of digital tools affects society as a whole (there’s that media studies background!) than on how scholars use digital methods in humanities work. They caution against using digital tools just because they are “better,” and instead encourage readers to examine their own role in the DH field so they can contribute to its ongoing growth. Berry previously edited Understanding Digital Humanities (eBook available through the Illinois library), which discusses similar issues. For a theoretical understanding of digital humanities, and to examine the issues in the field, read Digital Humanities.

Text Mining with R: A Tidy Approach by Julia Silge and David Robinson

Working with data can be messy, and text is even messier. It never behaves the way you expect it to, so approaching text analysis in a “tidy” manner is crucial. In Text Mining with R, Silge and Robinson present their tidytext framework for R and instruct the reader in applying this package to natural language processing (NLP). NLP can be used to derive meaning from unstructured text by way of unsupervised machine learning (wherein you train the computer to organize or otherwise analyze your text, then go get coffee while it does all the work). This book is most helpful for those with programming experience, but no knowledge of text mining or natural language processing is required. With practical examples and easy-to-follow, step-by-step guides, Text Mining with R serves as an excellent introduction to tidying text for use in sentiment analysis, topic modeling, and classification.

No programming or R experience? Try some of our other books, like R Cookbook for an in-depth introduction, or Text Analysis with R for Students of Literature for a step-by-step learning experience geared toward humanists.

Visit us in the Scholarly Commons, 306 Main Library, to read some of our new books. Summer hours are Monday through Friday, 10 AM-5 PM. Hope to see you soon!

Using an Art Museum’s Open Data

*Original idea and piece by C. Berman; edits by Billy Tringali

As a former art history student, I’m incredibly interested in how the study of art history can be aided by the digital humanities. More and more museums have started allowing the public to access a portion of their data. When it comes to open data, museums seem to be lagging a bit behind other cultural heritage institutions, but many are providing great open data for humanists.

For art museums, the data provided varies widely. Some museums go the extra mile and release much of their metadata to the public. Others pick and choose aspects of their collection, such as the Museum of Modern Art’s Exhibition and Staff Histories.

Many museums, especially those that collect modern and contemporary art, can have their hands tied by copyright laws when it comes to the data they present. A few of the data sets currently available from art museums are the Cooper Hewitt’s Collection Data, the Minneapolis Institute of Arts metadata, the Rijksmuseum API, the Tate Collection metadata, and the Getty Vocabularies.

The Metropolitan Museum of Art has recently released all images of the museum’s public domain works under a Creative Commons Zero license.

More museum data can be found here!

Using Reddit’s API to Gather Text Data

The Reddit logo.

I started my research with an eye to using digital techniques to analyze an encyclopedia that collects a number of conspiracy theories, in order to determine what the typical features of conspiracy theories are. At this point, I have realized there were two flaws in my original plan. First, as discussed in a previous blog post, the book I selected failed to provide the sort of evidence I required to establish typical features of conspiracy theories. Second, the book, though sizable, was nowhere near long enough to provide a corpus from which a topic model could derive interesting information.

My hope is that I can shift to online sources of text in order to solve both of these problems. Specifically, I will be collecting posts from Reddit. The first problem was that my original book merely stated the content of a number of conspiracy theories, without making any effort to convince the reader that they were true. As a result, there was little evidence of typical rhetorical and argumentative strategies that might characterize conspiracy theories. Reddit, on the other hand, will provide thousands of instances of people interacting in an effort to convince other Redditors of the truth or falsity of particular conspiracy theories. The sorts of strategies that were absent from the encyclopedia of conspiracy theories will, I hope, be present on Reddit.
The second problem was that the encyclopedia failed to provide a sufficient amount of text. Utilizing Reddit will certainly solve this problem; in less than twenty-four hours, there were over 1,300 comments on a recent post alone. If anything, the solution to this problem represents a whole new problem: how to deal with such a vast (and rapidly changing) body of information.

Before I worry too much about that, it is important that I be able to access the information in the first place. To do this, I’ll need to use Reddit’s API. API stands for Application Programming Interface, and it’s essentially a tool for letting a user interact with a system. In this case, the API allows a user to access information on the Reddit website. Of course, we can already do this with a web browser. The API, however, allows for more fine-grained control than a browser. When I navigate to a Reddit page with my web browser, my requests are interpreted in a very pre-scripted manner. This is convenient; when I’m browsing a website, I don’t want to have to specify what sort of information I want to see every time a new page loads. However, if I’m looking for very specific information, it can be useful to use an API to home in on just the relevant parts of the website.
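To make this concrete, here is a minimal Python sketch of what one such fine-grained request might look like, using the requests library against a subreddit’s public JSON listing. The subreddit name and User-Agent string are placeholders, and this is only an illustration of the idea rather than the exact request format I will end up using.

```python
import requests

# Reddit's API rules ask for a descriptive User-Agent; this one is a placeholder.
HEADERS = {"User-Agent": "conspiracy-research-script/0.1 (by u/example_user)"}

# Fetch one page of recent posts from an example subreddit as structured JSON
# instead of the rendered HTML a browser would receive.
response = requests.get(
    "https://www.reddit.com/r/conspiracy/new.json",
    headers=HEADERS,
    params={"limit": 25},
)
response.raise_for_status()

for child in response.json()["data"]["children"]:
    post = child["data"]
    # Because the response is structured data, we can pull out exactly
    # the fields we care about and ignore everything else.
    print(post["title"], post["author"], post["created_utc"])
```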

For my purposes, I’m primarily interested in downloading massive numbers of Reddit posts, with just their text body, along with certain identifiers (e.g., the name of the poster, timestamp, and the relation of that post to other posts). The first obstacle to accessing the information I need is learning how to request just that particular set of information. In order to do this, I’ll need to learn how to write a request in Reddit’s API format. Reddit provides some help with this, but I’ve found these other resources a bit more helpful. The second obstacle is that I will need to write a program that automates my requests, to save myself from having to perform tens of thousands of individual requests. I will be attempting to do this in Python. While doing this, I’ll have to be sure that I abide by Reddit’s regulations for using its API. For example, a limited number of requests per minute are allowed so that the website is not overloaded. There seems to be a dearth of example code on the Internet for text acquisition of this sort, so I’ll be posting a link to any functional code I write in future posts.
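As a rough sketch of what that automation might look like (under the same assumptions as above, with an example subreddit and an arbitrary pause between requests), the following Python loop pages through a listing using the `after` cursor Reddit returns, keeps only the text body and a few identifiers for each post, and sleeps between requests to respect the rate limit mentioned above.

```python
import time
import requests

HEADERS = {"User-Agent": "conspiracy-research-script/0.1 (by u/example_user)"}
LISTING_URL = "https://www.reddit.com/r/conspiracy/new.json"  # example subreddit

def collect_posts(max_pages=10, pause=2.0):
    """Page through a subreddit listing, keeping just the fields we need."""
    posts, after = [], None
    for _ in range(max_pages):
        params = {"limit": 100}
        if after:
            params["after"] = after  # cursor pointing at the next page
        resp = requests.get(LISTING_URL, headers=HEADERS, params=params)
        resp.raise_for_status()
        data = resp.json()["data"]
        for child in data["children"]:
            p = child["data"]
            posts.append({
                "id": p["name"],                # Reddit's "fullname" identifier
                "author": p.get("author"),
                "created_utc": p["created_utc"],
                "text": p.get("selftext", ""),  # body text of the post
            })
        after = data.get("after")
        if after is None:       # no further pages to fetch
            break
        time.sleep(pause)       # stay well under the requests-per-minute limit
    return posts

if __name__ == "__main__":
    print(len(collect_posts(max_pages=3)), "posts collected")
```

Collecting comment threads rather than posts would follow the same general pattern, just against a different listing.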

An Obstacle and (Hopefully) a Solution in Digital Research

This post is part of an ongoing series about my research on conspiracy theories and the tools I use to pursue it. You can read Part I: What is a Conspiracy Theory and Part II: Why Are Conspiracy Theories So Compelling? on Commons Knowledge.


Part of my research project, in which I am attempting to give an empirically informed account of what constitutes a conspiracy theory, involves reading through a text that compiles a few hundred different conspiracy theories and gives brief accounts of them. By coding each entry for the presence of various features, I hoped to learn which features are most typical of conspiracy theories. My own suspicion is that an important part of the appeal of conspiracy theories is that, in general, we tend to find appeals to coincidence unconvincing. For example, if a student is repeatedly absent from class on test days and gives as an excuse a series of illnesses, we are inclined to find this unconvincing. It seems very coincidental that their illnesses always occur on test days. Of course, it’s possible that it really is a coincidence, but we strongly discount the explanatory weight of such an appeal. If we cast about for another theory to explain their absences, we quickly happen across another one: the student didn’t prepare for the tests and so wanted to avoid coming to class. Again, it’s possible this theory is incorrect, but it is much more satisfying than the theory that the student just coincidentally gets sick on test days.

I suspect that conspiracy theories derive much of their appeal from the unsatisfactory character of appeals to coincidence. To pick just one example: the plane crash that killed Senator Paul Wellstone in 2002 has been the focal point of a number of conspiracy theories. The standard account is that the crash was due to pilot error. One way that suspicion has been raised about this account is by noting a number of seeming coincidences. One coincidence is that Wellstone was one of the most outspoken voices against the Bush administration at the time. It raised some conspiracy theorists’ eyebrows that such a prominent liberal voice “just happened” to die unexpectedly in a plane crash. Alternative theories propose that members of the Bush administration arranged for Wellstone’s assassination. Another purported coincidence involved accounts of electronic malfunction: cell phones and automatic garage door openers in the vicinity supposedly malfunctioned at roughly the same time the plane crashed. Some conspiracy theories account for this by appealing to an electromagnetic frequency weapon that disabled the controls of the plane while also causing malfunctions in nearby electronic equipment. My hypothesis is that this style of explanation and theory development is typical of conspiracy theories in general.

Federal investigators sift through debris in this Oct. 27, 2002 file photo from the twin-engine plane that crashed two days earlier near Eveleth, Minn., killing Sen. Paul Wellstone, his wife Sheila, daughter Marcia and several others. The National Transportation Safety Board is ready to vote on the likely cause of the 2002 accident. (AP Photo/Jim Mone, File)

In my study, I have noted whether each conspiracy theory in my chosen compilation points to an appeal to coincidence in the rival (usually “standard”) account and, if it does, whether it then appeals to a conspiracy in order to provide a “better” theory to replace the one that relies on coincidence. Unfortunately, this strategy has hit an obstacle. While there is a strong correlation between pointing to an appeal to coincidence as a problem with a theory and substituting an appeal to conspiracy in its place, relatively few theories appeared to do this, based on the text. Even in cases where I knew that criticism of appeals to coincidence played a large role in the justification of particular conspiracy theories, I often found no evidence of this in the brief accounts of the theories given in the book. It could be, of course, that my hypothesis is just mistaken; given that it didn’t match up with other conclusions I’d drawn from other sources, however, I am inclined to think the problem is the source text.

Thinking it over, it was clear that the nature of the text was to present the accounts “objectively,” stating the content of the views, normally without any effort to convince the reader one way or the other. Occasionally, talk of coincidences finds its way into the entries. Even then, it is only rarely explicit: for example, there are zero appearances of the word ‘coincidence’ or the phrase ‘just happened’, only three appearances of ‘coincidental’, and all appearances of ‘happened to’ are, upon checking the context, unrelated to appeals to coincidence in explanation. No other typically “coincidental” language makes a significant appearance. My concern is that this reveals only that my chosen text doesn’t address whether the presented theories are explanatorily superior to their rivals or explore how they developed in the first place. Since those are the areas in which coincidence would play a larger role, I’ve concluded that my chosen text is misleading as a source of data about conspiracy theories (at least with regard to the role of coincidence; in other areas, such as whether a theory is an official or unofficial account, it is much more reliable).
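For what it’s worth, the kind of check described above can be done with a few lines of Python. The sketch below counts a handful of “coincidence” phrases in a plain-text version of the book and prints the surrounding context so each hit can be inspected by hand; the filename and phrase list are placeholders, not the actual files or coding scheme I used.

```python
import re

TEXT_FILE = "conspiracy_encyclopedia.txt"   # placeholder path to a plain-text copy
PHRASES = ["coincidence", "coincidental", "just happened", "happened to"]

with open(TEXT_FILE, encoding="utf-8") as f:
    text = " ".join(f.read().lower().split())  # collapse whitespace for matching

for phrase in PHRASES:
    matches = list(re.finditer(re.escape(phrase), text))
    print(f"{phrase!r}: {len(matches)} occurrences")
    for m in matches:
        # Show ~60 characters of context on either side so each hit can be
        # checked by hand for whether it is an explanatory appeal to
        # coincidence or an unrelated use of the phrase.
        start, end = m.start(), m.end()
        print("   ...", text[max(0, start - 60):end + 60], "...")
```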

In order to overcome this obstacle, I have settled on using primary sources that are more likely to involve attempts to persuade the reader that the theory they contain is correct and superior to its rivals. These include books and websites that present a particular conspiracy theory, as well as online fora where proponents of various conspiracy theories argue and collaborate in developing them. This body of material is obviously vastly larger than the single anthology I initially intended to use. My current focus is finding a way to carve out a manageable chunk of this gigantic data set, most likely from online message boards like Reddit, and to use text from these fora to find evidence for my hypothesis about appeals to coincidence. This will necessitate the use of at least two kinds of digital techniques: web scraping, in order to extract usable text from a large number of individual websites, and topic modeling, in order to find meaningful relationships within an otherwise unmanageably large corpus. In my next post, I will talk about my initial forays into these techniques.
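As a preview of the topic-modeling side, here is a minimal sketch using scikit-learn’s LDA implementation in Python; the example documents are stand-ins for real scraped posts, and the number of topics is an arbitrary choice for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in documents; in practice these would be thousands of scraped posts.
documents = [
    "the crash was no accident and the evidence points to a cover up",
    "cell phones and garage doors malfunctioned at the same time as the crash",
    "it is too much of a coincidence that the witness disappeared last week",
]

# Convert the corpus into a bag-of-words matrix, dropping common English stopwords.
vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(documents)

# Fit a small LDA model; two topics is an arbitrary choice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term_matrix)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top_terms)}")
```

On a corpus this small the topics are meaningless, of course; the point is only to show the shape of the workflow once a large body of scraped text is in hand.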

Introducing the Scholarly Commons Project Forum

A logo for the Scholarly Commons Project Forum.

The Scholarly Commons Project Forum is an hour-long bi-weekly meeting space for scholars who are interested in Digital Humanities questions regarding data and text. These meetings are an opportunity for informal, open-ended conversations about research where we will discuss conceptual, methodological, and workflow issues for projects. Those projects may be at any stage of development, whether still formative or largely complete. The goal is to think together about how to develop robust Digital Humanities research, whether as beginners interested in trying out DH techniques or as researchers with more experience, and to make that research more legible to others.

These conversations will be facilitated by Interns at the Scholarly Commons, and will be held Mondays in Main Library 220 from 2:00-3:00 pm, starting March 5, and every two weeks following. Please RSVP to sc@library.illinois.edu.

Celebrating Frederick Douglass with Crowdsourced Transcriptions

A flier advertising the Transcribe-a-thon, which includes a photo of Frederick Douglass

On February 14, 2018, the world celebrated Frederick Douglass’ 200th birthday. Douglass, the famed Black social reformer, abolitionist, writer, and statesman, did not know the date of his birth, and chose February 14, 1818 as the date to celebrate his birthday. This year, to celebrate the 200th anniversary of his birth, Colored Conventions, the Smithsonian Transcription Center, and the National Museum of African American History & Culture partnered to host a Transcribe-a-thon of the Freedmen’s Bureau Papers in Douglass’ honor.

The Freedmen’s Bureau Papers consist of 2 million papers digitized through a partnership between the Smithsonian Transcription Center and the National Museum of African American History and Culture. Transcribing them is the largest crowdsourcing initiative ever hosted by the Smithsonian. The Freedmen’s Bureau helped solve the everyday problems of formerly enslaved individuals, from obtaining clothing and food to finding lost family members. The Bureau operated from 1865 to 1872 and closed due to opposition from Congress and President Andrew Johnson.

The Transcribe-a-thon was held on February 14th from 12 to 3 PM EST. According to the Smithsonian Transcription Center, over 779 pages of the Freedmen’s Bureau Papers were transcribed during this time, 402 pages were reviewed and approved, and 600 new volunteers registered for the project. Over sixty institutions hosted Transcribe-a-thon locations, many of which bought birthday cakes in Douglass’ honor from African American-owned bakeries in their area. Meanwhile, Colored Conventions livestreamed participants during the event. If you’re interested in seeing more from Douglass Day 2018, check out the Smithsonian Transcription Center’s Twitter Moment.

The Douglass Day Transcribe-a-thon was a fantastic example of people coming together to do digital humanities work for a great cause. While crowdsourced transcription projects are not new, the enthusiasm for Douglass Day is certainly unique and infectious, and we’re excited to see where this project goes in the future and to get involved ourselves!


Digital Timeline Tools

Everyone has a story to tell. For many of us doing work in the humanities and social sciences, presenting our research as a timeline can bring it new depth and a wider audience. Today, I’ll be talking about two unique digital storytelling options that you can use to add dimension to your research project.

Timeglider

An image of Timeglider's sample timeline on the Wright Brothers

Timeglider is an interactive timeline application. It allows you to move in and out of time, letting you see time in large or small spans. It also allows events to overlap, so you can show how things relate to each other in time. Timeglider also offers some great aesthetic options, including what they call their “special sauce”: the way they relate the size of an event to its importance. This option emphasizes certain events in the timeline, making it simpler to get important ideas across.

Started in 2002 as a Flash-based app, Timeglider is one of the older timeline options on the web. After a major redesign in 2010, Timeglider is now written in HTML5 and JavaScript. A basic package is free for students, and non-students can choose to pay either $5/month or $50/year.

Overall, Timeglider is an interesting timeline application with numerous options. Give it a try!

myHistro

A screenshot from a myHistro project on the Byzantine Empire.

myHistro uses text, video, and pictures on maps and timelines to tell stories. Some of the power of myHistro comes from the sheer amount of information you can provide in one presentation. Presentations can include introductory text, an interactive timeline, a Google Maps-powered annotated map, and a comment section, among other features. The social aspect, in particular, makes myHistro powerful. You can open your work up to a large audience, or simply ask students and scholars to comment on your work for an assignment. Another interesting aspect of myHistro is the sheer variety of projects people have created with it, from histories of the French Revolution to a biography of Justin Bieber, and everything in between!

myHistro is free, and you can sign up using your email or social network information.