Cool Text Data – Music, Law, and News!

Computational text analysis can be done in virtually any field, from biology to literature. You may use topic modeling to determine which areas are the most heavily researched in your field, or attempt to determine the author of an orphan work. Where can you find text to analyze? So many places! Read on for sources to find unique text content.

Woman with microphone

Genius – the song lyrics database

Genius started as Rap Genius, a site where rap fans could gather to annotate and analyze rap lyrics. It expanded to include other genres in 2014, and now manages a massive database covering Ariana Grande to Fleetwood Mac, and includes both lyrics and fan-submitted annotations. All of this text can be downloaded and analyzed using the Genius API. Using Genius and a text mining method, you could see how themes present in popular music changed over recent years, or understand a particular artist’s creative process.

homepage of, with Ohio highlighted, 147,692 unique cases. 31 reporters. 713,568 pages scanned.

Homepage of – the case law database

The Caselaw Access Project (CAP) is a fairly recent project that is still ongoing, and publishes machine-readable text digitized from over 40,000 bound volumes of case law from the Harvard Law School Library. The earliest case is from 1658, with the most recent cases from June 2018. An API and bulk data downloads make it easy to get this text data. What can you do with huge amounts of case law? Well, for starters, you can generate a unique case law limerick:

Wheeler, and Martin McCoy.
Plaintiff moved to Illinois.
A drug represents.
Pretrial events.
Rocky was just the decoy.

Check out the rest of their gallery for more project ideas.

Newspapers and More

There are many places you can get text from digitized newspapers, both recent and historical. Some newspaper are hundreds of years old, so there can be problems with the OCR (Optical Character Recognition) that will make it difficult to get accurate results from your text analysis. Making newspaper text machine readable requires special attention, since they are printed on thin paper and have possibly been stacked up in a dusty closet for 60 years! See OCR considerations here, but the newspaper text described here is already machine-readable and ready for text mining. However, with any text mining project, you must pay close attention to the quality of your text.

The Chronicling America project sponsored by the Library of Congress contains digital copies of newspapers with machine-readable text from all over the United States and its territories, from 1690 to today. Using newspaper text data, you can analyze how topics discussed in newspapers change over time, among other things.

newspapers being printed quickly on a rolling press

Looking for newspapers from a different region? The library has contracts with several vendors to conduct text mining, including Gale and ProQuest. Both provide newspaper text suitable for text mining, from The Daily Mail of London (Gale), to the Chinese Newspapers Collection (ProQuest). The way you access the text data itself will differ between the two vendors, and the library will certainly help you navigate the collections. See the Finding Text Data library guide for more information.

The sources mentioned above are just highlights of our text data collection! The Illinois community has access to a huge amount of text, including newspapers and primary sources, but also research articles and books! Check out the Finding Text Data library guide for a more complete list of sources. And, when you’re ready to start your text mining project, contact the Scholarly Commons (, and let us help you get started!

Preparing Your Data for Topic Modeling

In keeping with my series of blog posts on my research project, this post is about how to prepare your data for input into a topic modeling package. I used Twitter data in my project, which is relatively sparse at only 140 characters per tweet, but the principles can be applied to any document or set of documents that you want to analyze.

Topic Models:

Topic models work by identifying and grouping words that co-occur into “topics.” As David Blei writes, Latent Dirichlet allocation (LDA) topic modeling makes two fundamental assumptions: “(1) There are a fixed number of patterns of word use, groups of terms that tend to occur together in documents. Call them topics. (2) Each document in the corpus exhibits the topics to varying degree. For example, suppose two of the topics are politics and film. LDA will represent a book like James E. Combs and Sara T. Combs’ Film Propaganda and American Politics: An Analysis and Filmography as partly about politics and partly about film.”

Topic models do not have any actual semantic knowledge of the words, and so do not “read” the sentence. Instead, topic models use math. The tokens/words that tend to co-occur are statistically likely to be related to one another. However, that also means that the model is susceptible to “noise,” or falsely identifying patterns of cooccurrence if non-important but highly-repeated terms are used. As with most computational methods, “garbage in, garbage out.”

In order to make sure that the topic model is identifying interesting or important patterns instead of noise, I had to accomplish the following pre-processing or “cleaning” steps.

  • First, I removed the punctuation marks, like “,.;:?!”. Without this step, commas started showing up in all of my results. Since they didn’t add to the meaning of the text, they were not necessary to analyze.
  • Second, I removed the stop-words, like “I,” “and,” and “the,” because those words are so common in any English sentence that they tend to be over-represented in the results. Many of my tweets were emotional responses, so many authors wrote in the first person. This tended to skew my results, although you should be careful about what stop words you remove. Simply removing stop-words without checking them first means that you can accidentally filter out important data.
  • Finally, I removed too common words that were uniquely present in my data. For example, many of my tweets were retweets and therefore contained the word “rt.” I also ended up removing mentions to other authors because highly retweeted texts tended to mean that I was getting Twitter user handles as significant words in my results.

Cleaning the Data:

My original data set was 10 Excel files of 10,000 tweets each. In order to clean and standardize all these data points, as well as combining my file into one single document, I used OpenRefine. OpenRefine is a powerful tool, and it makes it easy to work with all your data at once, even if it is a large number of entries. I uploaded all of my datasets, then performed some quick cleaning available under the “Common Transformations” option under the triangle dropdown at the head of each column: I changed everything to lowercase, unescaped HTML characters (to make sure that I didn’t get errors when trying to run it in Python), and removed extra white spaces between words.

OpenRefine also lets you use regular expressions, which is a kind of search tool for finding specific strings of characters inside other text. This allowed me to remove punctuation, hashtags, and author mentions by running a find and replace command.

  • Remove punctuation: grel:value.replace(/(\p{P}(?<!’)(?<!-))/, “”)
    • Any punctuation character is removed.
  • Remove users: grel:value.replace(/(@\S*)/, “”)
    • Any string that begins with an @ is removed. It ends at the space following the word.
  • Remove hashtags: grel:value.replace(/(#\S*)/,””)
    • Any string that begins with a # is removed. It ends at the space following the word.

Regular expressions, commonly abbreviated as “regex,” can take a little getting used to in order to understand how they work. Fortunately, OpenRefine itself has some solid documentation on the subject, and I also found this cheatsheet valuable as I was trying to get it work. If you want to create your own regex search strings, has a tool that lets you test your expression before you actually deploy it in OpenRefine.

After downloading the entire data set as a Comma Separated Value (.csv) file, I then used the Natural Language ToolKit (NLTK) for Python to remove stop-words. The code itself can be found here, but I first saved the content of the tweets as a single text file, and then I told NLTK to go over every line of the document and remove words that are in its common stop word dictionary. The output is then saved in another text file, which is ready to be fed into a topic modeling package, such as MALLET.

At the end of all these cleaning steps, my resulting data is essentially composed of unique nouns and verbs, so, for example, @Phoenix_Rises13’s tweet “rt @drlawyercop since sensible, national gun control is a steep climb, how about we just start with orlando? #guncontrolnow” becomes instead “since sensible national gun control steep climb start orlando.” This means that the topic modeling will be more focused on the particular words present in each tweet, rather than commonalities of the English language.

Now my data is cleaned from any additional noise, and it is ready to be input into a topic modeling program.

Interested in working with topic models? There are two Savvy Researcher topic modeling workshops, on December 6 and December 8, that focus on the theory and practice of using topic models to answer questions in the humanities. I hope to see you there!

CITL Workshops and Statistical Consulting Fall 2017

CITL is back at it again with the statistics, survey, and data consulting services! They have a busy fall 2017, with a full schedule of workshops on the way, as well as their daily consulting hours in the Scholarly Commons.

Their workshops are as follows:

  • 9/19: R I: Getting Started with R
  • 10/17: R I: Getting Started with R
  • 9/26: R II: Inferential Statistics
  • 10/24: R II: Inferential Statistics
  • 10/3: SAS I: Getting Started with SAS
  • 10/10: SAS II: Inferential Statistics with SAS
  • 10/4: STATA I: Getting Started with Stata
  • 9/20: SPSS I: Getting Started with SPSS
  • 9/27: SPSS II: Inferential Statistics with SPSS
  • 10/11: ATLAS.ti I: Qualitative Data analysis
  • 10/12: ATLAS.ti II: Data Exploration and Analysis

Workshops are free, but participants must register beforehand. For more information about each workshop, and to register, head to the CITL Workshop Details and Resources page.

And remember that CITL is at the Scholarly Commons Monday – Friday, 10 AM – 4 PM.You can always request a consultation, or walk-in.

Adventures at the Spring 2017 Library Hackathon

This year I participated in an event called HackCulture: A Hackathon for the Humanities, which was organized by the University Library. This interdisciplinary hackathon brought together participants and judges from a variety of fields.

This event is different than your average campus hackathon. For one, it’s about expanding humanities knowledge. In this event, teams of undergraduate and graduate students — typically affiliated with the iSchool in some way — spend a few weeks working on data-driven projects related to humanities research topics. This year, in celebration of the sesquicentennial of the University of Illinois at Urbana-Champaign, we looked at data about a variety of facets of university life provided by the University Archives.

This was a good experience. We got firsthand experience working with data; though my teammates and I struggled with OpenRefine and so we ended up coding data by hand. I now way too much about the majors that are available at UIUC and how many majors have only come into existence in the last thirty years. It is always cool to see how much has changed and how much has stayed the same.

The other big challenge we had was not everyone on the team had experience with design, and trying to convince folks not to fall into certain traps was tricky.

For an idea of how our group functioned, I outlined how we were feeling during the various checkpoints across the process.


We had grand plans and great dreams and all kinds of data to work with. How young and naive we were.

Midpoint Check:

Laura was working on the Python script and sent a well-timed email about what was and wasn’t possible to get done in the time we were given. I find public speaking challenging so that was not my favorite workshop. I would say it went alright.


We prevailed and presented something that worked in public. Laura wrote a great Python script and cleaned up a lot of the data. You can even find it here. One day in the near future it will be in IDEALS as well where you can already check out projects from our fellow humanities hackers.

Key takeaways:

  • Choose your teammates wisely; try to pick a team of folks you’ve worked with in advance. Working with a mix of new and not-so-new people in a short time frame is hard.
  • Talk to your potential client base! This was definitely something we should have done more of.
  • Go to workshops and ask for help. I wish we had asked for more help.
  • Practicing your presentation in advance as well as usability testing is key. Yes, using the actual Usability Lab at Scholarly Commons is ideal but at the very least take time to make sure the instructions for using what you created are accurate. It’s amazing what steps you will leave off when you have used an app more than twice. Similarly make sure that you can run your program and another program at the same time because if you can’t chances are it means you might crash someone’s browser when they use it.

Overall, if you get a chance to participate in a library hackathon, go for it, it’s a great way to do a cool project and get more experience working with data!

Scholarly Smackdown: StoryMap JS vs. Story Maps

In today’s very spatial Scholarly Smackdown post we are covering two popular mapping visualization products, Story Maps and StoryMap JS.Yes they both have “story” and “map” in the name and they both let you create interactive multimedia maps without needing a server. However, they are different products!

StoryMap JS

StoryMap JS, from the Knight Lab at Northwestern, is a simple tool for creating interactive maps and timelines for journalists and historians with limited technical experience.

One  example of a project on StoryMap JS is “Hockey, hip-hop, and other Green Line highlights” by Andy Sturdevant for the Minneapolis Post, which connects the stops of the Green Line train to historical and cultural sites of St. Paul and Minneapolis Minnesota.

StoryMap JS uses Google products and map software from OpenStreetMap.

Using the StoryMap JS editor, you create slides with uploaded or linked media within their template. You then search the map and select a location and the slide will connect with the selected point. You can embed your finished map into your website, but Google-based links can deteriorate over time! So save copies of all your files!

More advanced users will enjoy the Gigapixel mode which allows users to create exhibits around an uploaded image or a historic map.

Story Maps

Story maps is a custom map-based exhibit tool based on ArcGIS online.

My favorite example of a project on Story Maps is The Great New Zealand Road Trip by Andrew Douglas-Clifford, which makes me want to drop everything and go to New Zealand (and learn to drive). But honestly, I can spend all day looking at the different examples in the Story Maps Gallery.

Story Maps offers a greater number of ways to display stories than StoryMap JS, especially in the paid version. The paid version even includes a crowdsourced Story Map where you can incorporate content from respondents, such as their 2016 GIS Day Events map.

With a free non-commercial public ArcGIS Online account you can create a variety of types of maps. Although it does not appear there is to overlay a historical map, there is a comparison tool which could be used to show changes over time. In the free edition of this software you have to use images hosted elsewhere, such as in Google Photos. Story Maps are created through their wizard where you add links to photos/videos, followed by information about these objects, and then search and add the location. It is very easy to use and almost as easy as StoryMap JS. However, since this is a proprietary software there are limits to what you can do with the free account and perhaps worries about pricing and accessing materials at a later date.

Overall, can’t really say there’s a clear winner. If you need to tell a story with a map, both software do a fine job, StoryMap JS is in my totally unscientific opinion slightly easier to use, but we have workshops for Story Maps here at Scholarly Commons!  Either way you will be fine even with limited technical or map making experience.

If you are interested in learning more about data visualization, ArcGIS Story Maps, or geopatial data in general, check out these upcoming workshops here at Scholarly Commons, or contact our GIS expert, James Whitacre!

Love and Big Data

Can big data help you find true love?

It’s Love Your Data Week, but did you know people have been using Big Data for to optimize their ability to find their soul mate with the power of data science! Wired Magazine profiled mathematician and data scientist Chris McKinlay in “How to Hack OkCupid“.There’s even a book spin-off from this! “Optimal Cupid”, which unfortunately is not at any nearby libraries.

But really, we know you’re all wondering, where can I learn the data science techniques needed to find “The One”, especially if I’m not a math genius?


What did Chris McKinlay do?

Methods used:

  • Automating tasks, such as writing a python script to answer questions on OKCupid
  • Scraping data from dating websites
  • Surveying
  • Statistical analysis
  • Machine learning to figure out how to rank the importance of answers of questions
  • Bots to visit people’s pages
  • Actually talking to people in the real world!

Things we can help you with at Scholarly Commons:

Selected workshops and resources, come by the space to find more!

Whether you reach out to us by email, phone, or in-person our experts are ready to help with all of your questions and helping you make the most of your data! You might not find “The One” with our software tools, but we can definitely help you have a better relationship with your data!

Register for Spring 2017 Workshops at CITL!

Exciting news for anyone interested in learning the basics of statistical and qualitative analysis software! Registration is open for workshops to be held throughout spring semester at the Center for Innovation in Teaching and Learning! There will be workshops on ATLAS.ti, R, SAS, Stata, SPSS, and Questionnaire Design on Tuesdays and Wednesdays in February and March from 5:30-7:30 pm. To learn more details and to register click here to go to the workshops offered by CITL page. And if you need a place to use these statistical and qualitative software packages, such as to practice the skills you gained at the workshops stop by Scholarly Commons, Monday-Friday 9 am- 6 pm! And don’t forget, you can also schedule a consultation with our experts here for specific questions about using statistical and qualitative analysis software for your research!

JMP Pro Pilot Run at the Scholarly Commons Through March 14

Library patrons have the opportunity to use JMP Pro predictive analytics software through March 14 at the Scholarly Commons. JMP Pro is a sophisticated statistical discover tool from SAS designed for advanced data science. Two Scholarly Commons computers are currently equipped with version 12. To learn more about the software, visit the JMP Pro website or the Webstore’s product page.

During this trial period, we are hoping to get a sense of whether this software would provide a useful tool for our patrons. We encourage anyone who is interested in this opportunity to visit the Scholarly Commons to take JMP Pro for a test drive and share your thoughts about the software with us. If you’re an experienced JMP Pro user, we’d also love to hear about your impressions of the package. You can contact us by email, or leave a message in the comments below.



ICPSR 2014 Summer Program in Quantitative Methods of Social Research

Still making a list of summer plans? As you gear up for summer, keep in mind that the Institute for Social Research at the University of Michigan is offering a wide range of classes on quantitative data-analysis. Whether you are a beginner or you are ready to study more advanced techniques, the program has something unique to offer each individual. Course instruction is centered around interactive, participatory data-analysis within a broader context of substantive social research.

Courses for the summer 2014 program are offered in two four-week sessions, May through August. These sessions include lecture, seminar, and workshop formats with participants from a diverse range of departments, universities, and organizations.

The following are a few examples of courses that will be offered:

Basic Foundation
Introduction to Statistics and Data Analysis
Introduction to Regression
Introduction to Computing

Linear Models and Beyond
Regression Analysis
Hierarchical Linear and Multilevel Models
Categorical Data Analysis

Substantive Topics
Race and Ethnicity
Curating Data & Providing Data Services
Designing, Conducting, and Analyzing Field Experiments

Advanced Techniques
Applied Bayesian Modeling
Advanced Time Series
The R Statistical Computing Environment

Multivariate Techniques
Multivariate Statistical methods
Scaling and Dimensional Analysis
Intro & Advanced Network Analysis

Formal Modeling
Game Theory
Rational Choice
Empirical Modeling for Theory Evaluation

Registration is now open. There are also a few free workshops that will be offered over the summer, but registration for those sessions ends May 15, 2014 and seats are limited!

For a full list of courses, fee and discount information, and to fill out an application visit the website.

Call: (734) 763-7400