Cool Text Data – Music, Law, and News!

Computational text analysis can be done in virtually any field, from biology to literature. You may use topic modeling to determine which areas are the most heavily researched in your field, or attempt to determine the author of an orphan work. Where can you find text to analyze? So many places! Read on for sources to find unique text content.

Woman with microphone

Genius – the song lyrics database

Genius started as Rap Genius, a site where rap fans could gather to annotate and analyze rap lyrics. It expanded to include other genres in 2014 and now manages a massive database spanning artists from Ariana Grande to Fleetwood Mac, with both lyrics and fan-submitted annotations. All of this text can be downloaded and analyzed using the Genius API. Using Genius and a text mining method, you could see how the themes in popular music have changed over recent years, or better understand a particular artist's creative process.
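If you want to experiment with the Genius API yourself, a minimal Python sketch like the one below is enough to search for songs and collect their metadata and page URLs. The access token is a placeholder you would get by registering an API client with Genius, and the endpoint and response fields follow Genius's public API documentation, so verify them before building on this.

```python
import requests

# Placeholder token; register an API client with Genius to obtain a real one.
GENIUS_TOKEN = "YOUR_ACCESS_TOKEN"

def search_genius(term):
    """Search the Genius API and return (title, url) pairs for the top hits."""
    response = requests.get(
        "https://api.genius.com/search",
        params={"q": term},
        headers={"Authorization": f"Bearer {GENIUS_TOKEN}"},
    )
    response.raise_for_status()
    hits = response.json()["response"]["hits"]
    # The API returns song metadata and links; lyrics and annotations are
    # fetched from the song pages and annotation endpoints separately.
    return [(hit["result"]["full_title"], hit["result"]["url"]) for hit in hits]

if __name__ == "__main__":
    for title, url in search_genius("Fleetwood Mac"):
        print(title, "->", url)
```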

homepage of case.law, with Ohio highlighted, 147,692 unique cases. 31 reporters. 713,568 pages scanned.

Homepage of case.law

Case.law – the case law database

The Caselaw Access Project (CAP) is a fairly recent, ongoing effort that publishes machine-readable text digitized from over 40,000 bound volumes of case law held by the Harvard Law School Library. The earliest case is from 1658, and the most recent cases are from June 2018. An API and bulk data downloads make it easy to get this text data. What can you do with huge amounts of case law? Well, for starters, you can generate a unique case law limerick:

Wheeler, and Martin McCoy.
Plaintiff moved to Illinois.
A drug represents.
Pretrial events.
Rocky was just the decoy.

Check out the rest of their gallery for more project ideas.
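If you would like to try the CAP API mentioned above, a rough Python sketch along these lines is a reasonable starting point. The endpoint, parameters, and the "ill" jurisdiction slug follow the documentation at case.law, but treat them as assumptions to verify before relying on the results.

```python
import requests

# Search CAP for Illinois cases mentioning a phrase and print a few of them.
# Full case text beyond the free daily allowance requires an API key.
response = requests.get(
    "https://api.case.law/v1/cases/",
    params={"search": "habeas corpus", "jurisdiction": "ill", "page_size": 5},
)
response.raise_for_status()
for case in response.json()["results"]:
    print(case["decision_date"], "-", case["name_abbreviation"])
```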

Newspapers and More

There are many places you can get text from digitized newspapers, both recent and historical. Some newspapers are hundreds of years old, so there can be problems with the OCR (Optical Character Recognition) that will make it difficult to get accurate results from your text analysis. Making newspaper text machine-readable requires special attention, since newspapers are printed on thin paper and have possibly been stacked up in a dusty closet for 60 years! See OCR considerations here; the newspaper text described below is already machine-readable and ready for text mining. Still, with any text mining project, you must pay close attention to the quality of your text.

The Chronicling America project sponsored by the Library of Congress contains digital copies of newspapers with machine-readable text from all over the United States and its territories, from 1690 to today. Using newspaper text data, you can analyze how topics discussed in newspapers change over time, among other things.
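Chronicling America also exposes a public search API that returns JSON, so a short Python script is enough to start pulling OCR'd newspaper pages on a topic. The endpoint and field names below follow the site's documented API; double-check them before building a larger pipeline on top.

```python
import requests

# Search digitized newspaper pages for a keyword and print basic metadata.
response = requests.get(
    "https://chroniclingamerica.loc.gov/search/pages/results/",
    params={"andtext": "influenza", "format": "json", "rows": 5},
)
response.raise_for_status()
for page in response.json()["items"]:
    print(page["date"], "-", page["title"])
```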

newspapers being printed quickly on a rolling press

Looking for newspapers from a different region? The library has contracts with several vendors that permit text mining, including Gale and ProQuest. Both provide newspaper text suitable for text mining, from The Daily Mail of London (Gale) to the Chinese Newspapers Collection (ProQuest). How you access the text data itself differs between the two vendors, and the library will certainly help you navigate the collections. See the Finding Text Data library guide for more information.

The sources mentioned above are just highlights of our text data collection! The Illinois community has access to a huge amount of text, including newspapers and primary sources, but also research articles and books! Check out the Finding Text Data library guide for a more complete list of sources. And, when you’re ready to start your text mining project, contact the Scholarly Commons (sc@library.illinois.edu), and let us help you get started!

Wikidata and Wikidata Human Gender Indicators (WHGI)

Wikipedia is a central player in online knowledge production and sharing. Since its founding in 2001, Wikipedia has been committed to open access and open editing, which has made it the most popular reference work on the web. Though students are still warned away from using Wikipedia as a source in their scholarship, it presents well-researched information in an accessible and ostensibly democratic way.

Most people know Wikipedia from its high ranking in most internet searches and tend to use it for its encyclopedic value. The Wikimedia Foundation—which runs Wikipedia—has several other projects which seek to provide free access to knowledge. Among those are Wikimedia Commons, which offers free photos; Wikiversity, which offers free educational materials; and Wikidata, which provides structured data to support the other wikis.

The Wikidata logo

Wikidata provides structured data to support Wikipedia and other Wikimedia Foundation projects

Wikidata is a great tool for studying how Wikipedia is structured and what information is available through the online encyclopedia. Since it is presented as structured data, it can be analyzed quantitatively more easily than Wikipedia articles can. This has led to many projects that allow users to explore the data through visualizations, queries, and other means. Wikidata offers a page of Tools that can be used to analyze Wikidata more quickly and efficiently, as well as Data Access instructions for how to use data from the site.
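As a small illustration, a SPARQL query against the Wikidata Query Service can be run straight from Python. The sketch below counts astronauts by gender; the property and item IDs (P31 "instance of", Q5 "human", P106 "occupation", Q11631 "astronaut", P21 "sex or gender") are my choices for a compact example.

```python
import requests

# Count astronauts on Wikidata, grouped by gender.
QUERY = """
SELECT ?gender ?genderLabel (COUNT(?person) AS ?count) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P106 wd:Q11631 ;
          wdt:P21 ?gender .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?gender ?genderLabel
ORDER BY DESC(?count)
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "text-data-demo/0.1 (example script)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["genderLabel"]["value"], row["count"]["value"])
```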

The webpage for the Wikidata Human Gender Indicators project

The home page for the Wikidata Human Gender Indicators project

An example of a project born out of Wikidata is the Wikidata Human Gender Indicators (WHGI) project. The project uses metadata from Wikidata entries about people to analyze trends in gender disparity over time and across cultures. The project presents the raw data for download, as well as charts and an article written about the discoveries the researchers made while compiling the data. Some of the visualizations they present are confusing (perhaps they could benefit from reading our Lightning Review of Data Visualization for Success), but they succeed in conveying important trends that reveal a bias toward articles about men, as well as an interesting phenomenon surrounding celebrities. Some regions have a higher ratio of biographies about women to biographies about men because so many articles are written about actresses and female musicians, which reflects cultural differences surrounding fame and gender.

Of course, like many data sources, Wikidata is not perfect. The creators of the WHGI project frequently discovered that articles did not have complete metadata related to gender or nationality, which greatly influenced their ability to analyze the trends present on Wikipedia related to those areas. Since Wikipedia and Wikidata are open to editing by anyone and are governed by practices that the community has agreed upon, it is important for Wikipedians to consider including more metadata in their articles so that researchers can use that data in new and exciting ways.

An animated gif of the Wikipedia logo bouncing like a ball

New Uses for Old Technology at the Arctic World Archive

In this era of rapid technological change, it is easy to fall into the mindset that the “big new thing” is always an improvement on the technology that came before it. Certainly this is often true, and here in the Scholarly Commons we are always seeking innovative new tools to help you out with your research. However, every now and then it’s nice to just slow down and take the time to appreciate the strengths and benefits of older technology that has largely fallen out of use.

A photo of the arctic

There is perhaps no better example of this than the Arctic World Archive, a facility on the Norwegian archipelago of Svalbard. Opened in 2017, the Arctic World Archive seeks to preserve the world’s most important cultural, political, and literary works in a way that will ensure that no manner of catastrophe, man-made or otherwise, could destroy them.

If this is all sounding familiar to you, that’s because you’ve probably heard of the Arctic World Archive’s older sibling, the Svalbard Global Seed Vault. The Global Seed Vault, which is much better known and older than the Arctic World Archive, is an archive of seeds from around the world, meant to ensure that humanity will be able to continue growing crops and producing food in the event of a catastrophe that wipes out plant life.

Indeed, the two archives have a lot in common. The World Archive is housed deep within a mountain in an abandoned coal mine that once served as the location of the seed vault, and was founded to be for cultural heritage what the seed vault is for crops. But the Arctic World Archive has made such innovative use of old technology that it is an impressive site in its own right.

A photo of the arctic

Perhaps the coolest (pun intended) aspect of the Arctic World Archive is the fact that it does not require electricity to operate. Its extreme northern location (it is near the northernmost town of at least 1,000 people in the world) means that the temperature inside the facility stays very cold year-round. As any archivist or rare book librarian who brings a fleece jacket to work in the summer will happily tell you, colder temperatures are ideal for preserving documents, and the ability to store items in a very cold climate without the use of electricity makes the World Archive perfect for sustainable, long-term storage.

But that’s not all: in a real blast from the past, all information stored in this facility is kept on microfilm. Now, I know what you’re thinking: “it’s the 21st century, grandpa! No one uses microfilm anymore!”

It’s true that microfilm is used by a very small minority of people nowadays, but nevertheless it offers distinct advantages that newer digital media just can’t compete with. For example, microfilm is rated to last for at least 500 years without corruption, whereas digital files may not last anywhere near that long. Beyond that, the film format means that the archive is totally independent from the internet, and will outlast any major catastrophe that disrupts part or all of our society’s web capabilities.

A photo of a seal

The Archive is still growing, but it is already home to film versions of Edvard Munch’s The Scream, Dante’s The Divine Comedy, and an assortment of government documents from many countries including Norway, Brazil, and the United States.

As it continues to grow, its importance as a place of safekeeping for the world’s cultural heritage will hopefully serve as a reminder that sometimes, older technology has upsides that new tech just can’t compete with.

Using an Art Museum’s Open Data

*Edits by Billy Tringali on an original idea and original piece by C. Berman

As a former art history student, I’m incredibly interested in how the study of art history can be aided by the digital humanities. More and more museums have started allowing the public to access a portion of their data. When it comes to open data, museums seem to be lagging a bit behind other cultural heritage institutions, but many are providing great open data for humanists.

The amount of data art museums provide varies widely. Some museums go the extra mile and release much of their metadata to the public. Others pick and choose aspects of their collection, such as the Museum of Modern Art’s Exhibition and Staff Histories.

Many museums, especially those that collect modern and contemporary art, can have their hands tied by copyright laws when it comes to the data they present. A few of the data sets currently available from art museums are the Cooper Hewitt’s Collection Data, the Minneapolis Institute of Arts metadata, the Rijksmuseum API, the Tate Collection metadata, and the Getty Vocabularies.

The Metropolitan Museum of Art has recently released all images of the museum’s public domain works under a Creative Commons Zero license.
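The Met also publishes a public collection API (documented at metmuseum.github.io), so you can script against its open data directly. A hedged sketch, assuming those endpoints are still current:

```python
import requests

BASE = "https://collectionapi.metmuseum.org/public/collection/v1"

# Search the collection for a term, then fetch the first matching object record
# and check whether it is flagged as public domain.
search = requests.get(f"{BASE}/search", params={"q": "sunflowers"})
search.raise_for_status()
object_ids = search.json().get("objectIDs") or []

if object_ids:
    obj = requests.get(f"{BASE}/objects/{object_ids[0]}").json()
    print(obj.get("title"), "-", obj.get("artistDisplayName"))
    print("Public domain:", obj.get("isPublicDomain"))
```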

More museum data can be found here!

Using Reddit’s API to Gather Text Data

The Reddit logo.

I initially started my research with an eye to using digital techniques to analyze an encyclopedia that collects a number of conspiracy theories, in order to determine the typical features of conspiracy theories. At this point, I realize there were two flaws in my original plan. First, as discussed in a previous blog post, the book I selected failed to provide the sort of evidence I required to establish typical features of conspiracy theories. Second, the length of the book, though sizable, was nowhere near large enough to provide a corpus on which I could run a topic model to derive interesting information.

My hope is that I can shift to online sources of text in order to solve both of these problems. Specifically, I will be collecting posts from Reddit. The first problem was that my original book merely stated the content of a number of conspiracy theories, without making any effort to convince the reader that they were true. As a result, there was little evidence of typical rhetorical and argumentative strategies that might characterize conspiracy theories. Reddit, on the other hand, will provide thousands of instances of people interacting in an effort to convince other Redditors of the truth or falsity of particular conspiracy theories. The sorts of strategies that were absent from the encyclopedia of conspiracy theories will, I hope, be present on Reddit.
The second problem was that the encyclopedia failed to provide a sufficient amount of text. Utilizing Reddit will certainly solve this problem; in less than twenty-four hours, there were over 1,300 comments on a recent post alone. If anything, the solution to this problem represents a whole new problem: how to deal with such a vast (and rapidly changing) body of information.

Before I worry too much about that, it is important that I be able to access the information in the first place. To do this, I’ll need to use Reddit’s API. API stands for Application Programming Interface, and it’s essentially a tool for letting a user interact with a system. In this case, the API allows a user to access information on the Reddit website. Of course, we can already do this with a web browser. The API, however, allows for more fine-grained control than a browser. When I navigate to a Reddit page with my web browser, my requests are interpreted in a very pre-scripted manner. This is convenient; when I’m browsing a website, I don’t want to have to specify what sort of information I want to see every time a new page loads. However, if I’m looking for very specific information, it can be useful to use an API to home in on just the relevant parts of the website.

For my purposes, I’m primarily interested in downloading massive numbers of Reddit posts, with just their text body, along with certain identifiers (e.g., the name of the poster, timestamp, and the relation of that post to other posts). The first obstacle to accessing the information I need is learning how to request just that particular set of information. In order to do this, I’ll need to learn how to write a request in Reddit’s API format. Reddit provides some help with this, but I’ve found these other resources a bit more helpful. The second obstacle is that I will need to write a program that automates my requests, to save myself from having to perform tens of thousands of individual requests. I will be attempting to do this in Python. While doing this, I’ll have to be sure that I abide by Reddit’s regulations for using its API. For example, a limited number of requests per minute are allowed so that the website is not overloaded. There seems to be a dearth of example code on the Internet for text acquisition of this sort, so I’ll be posting a link to any functional code I write in future posts.
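As a starting point, here is a simplified sketch of the kind of collection script I have in mind. It uses Reddit’s public JSON listings rather than the full OAuth API, and the subreddit, field selection, and delay are placeholders rather than a final design; a production collector should follow Reddit’s API rules and rate limits.

```python
import time
import requests

HEADERS = {"User-Agent": "conspiracy-corpus-research/0.1 (contact: your_email)"}

def fetch_posts(subreddit, pages=3, delay=2):
    """Collect id, author, timestamp, title, and self-text across listing pages."""
    posts, after = [], None
    for _ in range(pages):
        resp = requests.get(
            f"https://www.reddit.com/r/{subreddit}/new.json",
            params={"limit": 100, "after": after},
            headers=HEADERS,
        )
        resp.raise_for_status()
        listing = resp.json()["data"]
        for child in listing["children"]:
            post = child["data"]
            posts.append({
                "id": post["id"],
                "author": post.get("author"),
                "created_utc": post["created_utc"],
                "title": post["title"],
                "selftext": post.get("selftext", ""),
            })
        after = listing.get("after")
        if after is None:  # no more pages
            break
        time.sleep(delay)  # stay well under the request-rate limit
    return posts

if __name__ == "__main__":
    collected = fetch_posts("conspiracy", pages=2)
    print(len(collected), "posts collected")
```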

Whimsical Data

Photograph of a Yorkshire terrier in a field of yellow flowers.

It’s finally springtime!

It’s April! After what felt like an eternity, it’s starting to warm up here at the University of Illinois at Urbana-Champaign. So today, in celebration of spring, we’re going to take a look at a few whimsical data sets that have made us laugh, smile, and think.

Dogs of NYC

Dogs of NYC was published by the NYC Department of Health and Mental Hygiene in 2013. The department collected data on 50,000 New York dogs, including their name, gender, breed, birth date, dominant, secondary, and third colors, whether they are spayed/neutered or a guard dog, the borough they live in, and their zip code. WNYC used this data to explore dog names and breeds by area, and Kaylin Pavlik used the data to show the relationship between dog names and dog breeds.

What made us laugh: How high the TF-IDF score for the name Pugsley was for Pugs as compared to other breeds.

What made us think: Does the perceived danger of a dog breed influence what people name them?
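For the curious, the TF-IDF comparison behind the Pugsley finding can be sketched in a few lines of Python: treat all the names registered for one breed as a single "document" and see which names score highest for that breed relative to the others. The records below are made up, and this is only an illustration of the idea, not Pavlik’s actual analysis.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy (breed, name) records standing in for the real registration data.
records = [
    ("Pug", "Pugsley"), ("Pug", "Pugsley"), ("Pug", "Bella"),
    ("Labrador Retriever", "Bella"), ("Labrador Retriever", "Max"),
    ("Chihuahua", "Coco"), ("Chihuahua", "Bella"),
]

names_by_breed = defaultdict(list)
for breed, name in records:
    names_by_breed[breed].append(name.lower())

breeds = list(names_by_breed)
documents = [" ".join(names_by_breed[b]) for b in breeds]  # one "document" per breed

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(documents)
vocab = tfidf.get_feature_names_out()

for i, breed in enumerate(breeds):
    scores = matrix[i].toarray().ravel()
    print(f"{breed}: most distinctive name = {vocab[scores.argmax()]}")
```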

UK Government Hospitality wine cellar annual statement

Each year, the UK publishes an annual statement on the Government Wine Cellar, which they describe as being “used to support the work of Government Hospitality in delivering business hospitality for all government ministers and departments”. The first report was published in July 2014, and the latest was published in September 2017.

What made us laugh: Government Hospitality has an advisory committee, known as the Masters of Wine, that meets four times a year. Its members are unpaid.

What made us think: With threats to government transparency across the globe, it is nice to see data that some may brush off as inconsequential, but actually deals with large sums of money.

Most Popular Christmas Toys According to Search Data

Published by Reckless in November 2017, this data set is based on search data for the Toys R Us catalog (RIP) and shows which toys, video games, and board games were most popular among different age groups. Favorite toys included the Barbie Dreamhouse, Furby Connect, Razor Crazy Cart, and R2D2 Interactive Robotic Droid.

What made us laugh: The Silly Sausage game was one of the most searched board games during this period.

What made us think: Toys play a pivotal role during childhood development. It’s a little astonishing to see that, despite all of her critics, Barbie still reigns supreme in the 2-4 year-old age group.

Do you have a favorite data set? Let us know in the comments!

Endangered Data Week is Coming

The Endangered Data Week logo

Did you know that Endangered Data Week is happening from February 26 to March 2? Endangered Data Week is a collaborative effort to highlight public datasets that are in danger of being deleted, repressed, mishandled, or lost. Inspired by recent events that have shown how fragile publicly administered data can be, Endangered Data Week hopes to promote care for endangered collections by publicizing datasets, increasing engagement with them, and advocating for political activism.

The Endangered Data Week organizers hope to cultivate a broad community of supporters who champion access to public data, advocate for open data policies, and help build data skills and competencies among students and colleagues. During Endangered Data Week, librarians, scholars, and activists will use the #EndangeredData Twitter hashtag, as well as host events across the country.

While this is the first year of Endangered Data Week, the organizers hope both to build on the momentum of similar movements, such as Sunshine Week, Open Access Week, and #DataRescue, and to continue organizing events into the future.

What are you doing during Endangered Data Week? Let us know in the comments!

Spotlight: Unexpected Surprises in the Internet Archive

Image of data banks with the Internet Archive logo on them.

The Internet Archive.

For most of us, our introduction to the Internet Archive was the Wayback Machine, a search engine that can show you snapshots of websites from the past. It’s always fun to watch a popular webpage like Google evolve from November 1998 to July 2004 to today, but there is so much more that the Internet Archive has to offer. Today, I’m going to go through a few highlights from the Internet Archive’s treasure trove of material, just to get a glimpse at all of the things you can do and see on this amazing website.
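Incidentally, the Wayback Machine has a small availability API you can script against. The sketch below asks for the snapshot of google.com closest to November 1998; the URL and timestamp are just examples.

```python
import requests

# Ask the Wayback Machine for the snapshot closest to a date (YYYYMMDD).
response = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "google.com", "timestamp": "19981101"},
)
response.raise_for_status()
closest = response.json().get("archived_snapshots", {}).get("closest")
if closest:
    print(closest["timestamp"], closest["url"])
```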

Folksoundomy: A Library of Sound

Folksoundomy is the Internet Archive’s collection of sounds, music, and speech. The collection is collaboratively created and tagged, with many participants from outside the library sphere. There are more than 155,960 items in the Folksoundomy collection, with items dating back to Thomas Edison’s invention of recorded sound in 1877. From Hip Hop Mixtapes to Russian audiobooks, sermons to stand-up comedy, music, podcasts, radio shows and more, Folksoundomy is an incredible resource for scholars studying the history of recorded sound.
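Collections like this one can also be explored programmatically with the internetarchive Python package. The sketch below assumes the collection identifier is "folksoundomy"; check an item’s page on archive.org if it differs.

```python
from internetarchive import search_items  # pip install internetarchive

# List the identifiers of the first ten items in the Folksoundomy collection.
for i, result in enumerate(search_items("collection:folksoundomy")):
    print(result["identifier"])
    if i >= 9:
        break
```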

TV News Archive

With over 1,572,000 clips collected since 2009, the TV News Archive includes everything from this morning’s news to curated series of fact-checked clips. Special collections within the TV News Archive include Understanding 9/11, Political Ads, and the TV NSA Clip Library. With the ability to search closed captions from US TV news shows, the Internet Archive provides a unique research opportunity for those studying modern US media.

Software Library: MS-DOS Games

Ready to die of dysentery on The Oregon Trail again? Now is your chance! The Internet Archive’s MS-DOS Games Software Library uses an EM-DOSBOX in-browser emulator that lets you go through and play games that would otherwise seem lost to time. Relive your childhood memories or start researching trends in video games throughout the years with this incredible collection of playable games!

National Security Internet Archive (NSIA)

Created in March 2015, the NSIA collects files from muckraking and national security organizations, as well as historians and activists. With over 2 million files split into 36 collections, the NSIA gathers everything from CIA Lessons Learned from Czechoslovakia to the UFO Files, a collection of declassified UFO files from around the world. Having these files accessible and together is incredibly helpful to researchers studying the history of national security, both in the US and elsewhere in the world.

University of Illinois at Urbana-Champaign

That’s right! We’re also on the Internet Archive. The U of I adds content in several areas: Illinois history, culture and natural resources; US railroad history; rural studies and agriculture; works in translation; as well as 19th century “triple-decker” novels and emblem books. Click on the above link to see what your alma mater is contributing to the Internet Archive today!

Of course, this is nowhere near everything! With Classic TV Commercials, the Grateful Dead, Community Software, and more, it’s definitely worth your time to see what the Internet Archive has that will help you!

Got Bad Data? Check Out The Quartz Guide

Photograph of a man working on a computer.

If you’re working with data, chances are there will be at least a few times when you encounter the “nightmare scenario”. Things go awry — values are missing, your sample is biased, there are inexplicable outliers, or the sample wasn’t as random as you thought. Some issues you can solve; others are less clear. But before you tear your hair out — or, before you tear all of your hair out — check out The Quartz guide to bad data. Hosted on GitHub, The Quartz guide lists possible problems with data, and how to solve them, so that researchers have an idea of what their next steps can be when their data doesn’t work as planned.

With translations into six languages and a Creative Commons 4.0 license, The Quartz guide divides problems into four categories: issues that your source should solve, issues that you should solve, issues a third-party expert should help you solve, and issues a programmer should help you solve. From there, the guide lists specific issues and explains how they can or cannot be solved.

One of the greatest things about The Quartz guide is the language. Rather than pontificating and making an already frustrating problem more confusing, the guide lays out options in plain terms. While you may not get everything you need for fixing your specific problem, chances are you will at least figure out how you can start moving forward after this setback.

The Quartz guide does not mince words. For example, in the “Data were entered by humans” entry, it gives an example of messy data entry and then says, “Even with the best tools available, data this messy can’t be saved. They are effectively meaningless… Beware human-entered data.” Even if it’s probably not what a researcher wants to hear, sometimes the cold, hard truth can lead someone to a new step in their research.
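As a hedged companion to the guide, a few generic pandas checks will surface several of the problems it catalogs before you dig into the harder ones. The file name and columns here are placeholders for your own data.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # placeholder path

print(df.isna().sum())                        # missing values per column
print("duplicate rows:", df.duplicated().sum())

# Rough outlier screen: flag numeric values more than three standard
# deviations from their column mean.
numeric = df.select_dtypes("number")
zscores = (numeric - numeric.mean()) / numeric.std()
print((zscores.abs() > 3).sum())
```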

So if you’ve hit a block with your data, check out The Quartz guide. It may be the thing that will help you move forward with your data! And if you’re working with data, feel free to contact the Scholarly Commons or Research Data Service with your questions!

Open Source Tools for Social Media Analysis

Photograph of a person holding an iPhone with various social media icons.

This post was guest authored by Kayla Abner.


Interested in social media analytics, but don’t want to shell out the bucks to get started? There are a few open source tools you can use to dabble in this field, and some even integrate data visualization. Recently, we at the Scholarly Commons tested a few of these tools, and as expected, each one has strengths and weaknesses. For our exploration, we exclusively analyzed Twitter data.

NodeXL

NodeXL’s graph for #halloween (2,000 tweets)

tl;dr: Light system footprint and provides some interesting data visualization options. Useful if you don’t have a pre-existing data set, but the one generated here is fairly small.

NodeXL is essentially a complex Excel template (it’s classified as a Microsoft Office customization), which means it doesn’t take up a lot of space on your hard drive. It has its advantages: it’s easy to use, requiring only a simple search to retrieve tweets for you to analyze. However, its capabilities for large-scale analysis are limited; the user is restricted to retrieving the most recent 2,000 tweets. For example, searching Twitter for #halloween imported 2,000 tweets, every single one from the date of this writing. It is worth mentioning that there is a fancy paid version that will expand your limit to 18,000 (the maximum allowed by Twitter’s API) or 7 to 8 days back, whichever comes first. Even then, you cannot restrict your data retrieval by date. NodeXL is a tool best suited to pulling recent social media data. In addition, if you want to study something besides Twitter, you will have to pay to get any other type of dataset, e.g., Facebook, YouTube, or Flickr.

Strengths: Good for a beginner, differentiates between Mentions/Retweets and original Tweets, provides a dataset, some light data visualization tools, offers Help hints on hover

Weaknesses: 2,000 Tweet limit, free version restricted to Twitter Search Network

TAGS

TAGSExplorer’s data graph (2,902 tweets). It must mean something…

tl;dr: Add-on for Google Sheets, giving it a light system footprint as well. Higher tweet limit than NodeXL. TAGS has the added benefit of automated data retrieval, so you can track trends over time. Its data visualization tool is in beta and needs more development.

TAGS is another complex spreadsheet template, this time created for use with Google Sheets. TAGS does not have a paid version with more social media options; it can only be used for Twitter analysis. However, it does not have the same tweet retrieval limit as NodeXL. The only limit is 18,000 or seven days ago, which is dictated by Twitter’s Terms of Service, not the creators of this tool. My same search for #halloween with a limit set at 10,000 retrieved 9,902 tweets within the past seven days.

TAGS also offers a data visualization tool, TAGSExplorer, that is promising but still needs work to realize its potential. As it stands now in beta, even a dataset of 2,000 records puts so much strain on the program that it cannot keep up with the user, though it can handle smaller datasets. It does offer a few interesting analysis parameters that NodeXL lacks, such as the ability to see Top Tweeters and Top Hashtags, which works better than the graph.

Image of hashtag search

These graphs have meaning!

Strengths: More data fields, such as the user’s follower and friend count, location, and language (if available), better advanced search (Boolean capabilities, restrict by date or follower count), automated data retrieval

Weaknesses: data visualization tool needs work

Hydrator

Simple interface for Documenting the Now’s Hydrator

tl;dr: A tool used for “re-hydrating” tweet IDs into full tweets, to comply with Twitter’s Terms of Service. Not used for data analysis; useful for retrieving large datasets. Limited to datasets already available.

Documenting the Now, a group focused on collecting and preserving digital content, created the Hydrator tool to comply with Twitter’s Terms of Service. Download and distribution of full tweets to third parties is not allowed, but distribution of tweet IDs is allowed. The organization manages a Tweet Catalog with files that can be downloaded and run through the Hydrator to view the full Tweet. Researchers are also invited to submit their own dataset of Tweet IDs, but this requires use of other software to download them. This tool does not offer any data visualization, but is useful for studying and sharing large datasets (the file for the 115th US Congress contains 1,430,133 tweets!). Researchers are limited to what has already been collected, but multiple organizations provide publicly downloadable tweet ID datasets, such as Harvard’s Dataverse. Note that the rate of hydration is also limited by Twitter’s API, and the Hydrator tool manages that for you. Some of these datasets contain millions of tweet IDs, and will take days to be transformed into full tweets.

Strengths: Provides full tweets for analysis, straightforward interface

Weaknesses: No data analysis tools
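If you would rather script the hydration step than use the desktop app, Documenting the Now also maintains the twarc library, which does the same job from Python. The credentials below are placeholders for your own Twitter API keys, and the method names follow twarc’s 1.x API, so check the current documentation before running it.

```python
import json
from twarc import Twarc  # pip install twarc

# ids.txt holds one tweet ID per line; hydrated tweets are written as JSON lines.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

with open("ids.txt") as ids, open("tweets.jsonl", "w") as out:
    for tweet in t.hydrate(ids):  # twarc handles Twitter's rate limits for you
        out.write(json.dumps(tweet) + "\n")
```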

Crimson Hexagon

If you’re looking for more robust analytics tools, Crimson Hexagon is a data analytics platform that specializes in social media. Not limited to Twitter, it can retrieve data from Facebook, Instagram, Youtube, and basically any other online source, like blogs or forums. The company has a partnership with Twitter and pays for greater access to their data, giving the researcher higher download limits and a longer time range than they would receive from either NodeXL or TAGS. One can access tweets starting from Twitter’s inception, but these features cost money! The University of Illinois at Urbana-Champaign is one such entity paying for this platform, so researchers affiliated with our university can request access. One of the Scholarly Commons interns, Matt Pitchford, uses this tool in his research on Twitter response to terrorism.

Whether you’re an experienced text analyst or just want to play around, these open source tools are worth considering for different uses, all without you spending a dime.

If you’d like to know more, researcher Rebekah K. Tromble recently gave a lecture at the Data Scientist Training for Librarians (DST4L) conference on how different (paid) platforms influence or bias analyses of social media data. As you start a real project analyzing social media, you’ll want to know how the data you have gathered may be limited, so you can adjust your analysis accordingly.