Introductions: What is Data Analysis, anyway?

This post is part of a series where we introduce you to the various topics that we cover in the Scholarly Commons. Maybe you’re new to the field, or maybe you’re at the point where you’re just too afraid to ask… Fear not! We are here to take it back to the basics!

So, what is Data Analysis, anyway?

Data analysis is the process of examining, cleaning, transforming, and modeling data in order to make discoveries and, in many cases, support decision making. One key part of the data analysis process is separating the signal (meaningful information you are trying to discover) from the noise (random, meaningless variation) in the data.
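To make the signal-versus-noise idea concrete, here is a small, purely illustrative Python sketch: it buries a simple "signal" (a sine wave) in random noise, then recovers an approximation of it with a moving average. All of the numbers and names here are invented for demonstration.

```python
# Illustration: separating signal from noise with a moving average
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
signal = np.sin(x)                     # the meaningful pattern
noise = rng.normal(0, 0.3, x.size)     # random, meaningless variation
observed = signal + noise              # what we actually "measure"

# Smooth by averaging each point with its neighbors
window = 10
smoothed = np.convolve(observed, np.ones(window) / window, mode="same")

# The smoothed series sits much closer to the underlying signal
print(np.abs(observed - signal).mean())   # larger average error
print(np.abs(smoothed - signal).mean())   # smaller average error
```

Real projects use far more sophisticated methods, but the core task is the same: finding the meaningful pattern underneath the random variation.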

The form and methods of data analysis can vary widely, and some form of data analysis is present in nearly every academic field. Here are some examples of data analysis projects:

  • Taylor Arnold, Lauren Tilton, and Annie Berke in “Visual Style in Two Network Era Sitcoms” (2019) used large-scale facial recognition and image analysis to examine the centrality of characters in the 1960s sitcoms Bewitched and I Dream of Jeannie. They found that Samantha is the distinctive lead character of Bewitched, while Jeannie is positioned under the domination of Tony in I Dream of Jeannie.
  • Allen Kim, Charuta Pethe, and Steven Skiena in “What time is it? Temporal Analysis of Novels” (2020) used the full text of 52,183 fiction books from Project Gutenberg and the HathiTrust to examine the time of day at which events in each book took place. They found that events from 11pm to 1am became more common after 1880, which the authors attribute to the invention of electric lighting.
  • Wouter Haverals and Lindsey Geybels in “A digital inquiry into the age of the implied readership of the Harry Potter series” (2021) used various statistical methods to examine whether the Harry Potter books did in fact progressively become more mature and adult with successive books, as often believed by literature scholars and reviewers. While they did find that the text of the books implied a more advanced reader with later books, the change was perhaps not as large as would be expected.

How can Scholarly Commons help?

If all of this is new to you, don’t worry! The Scholarly Commons can help you get started.

We offer a variety of data services in the Scholarly Commons to support projects like these.

As always, if you’re interested in learning more about data analysis and how to support your own projects you can fill out a consultation request form, attend a Savvy Researcher Workshop, Live Chat with us on Ask a Librarian, or send us an email. We are always happy to help!

Comparison: Human vs. Computer Transcription of an “It Takes a Campus” Episode

Providing transcripts of audio or video content is critical for making these experiences accessible to a wide variety of audiences, especially those who are deaf or hard of hearing. Even those with perfect hearing might sometimes prefer to skim a text transcript rather than listen to audio. However, the slowest part of the audio and video publishing process is often the transcription portion of the workflow. This was certainly true with the recent interview I did with Ted Underwood, which I conducted on March 2 but did not release until March 31. The majority of that time was spent transcribing the interview; editing and quality control were significantly less time consuming.

Theoretically, one way we could speed up this process is to have computers do it for us. Over the years I’ve had many people ask me whether automatic speech-to-text transcription is a viable alternative to human transcription in dealing with oral history or podcast transcription. The short answer to that question is: “sort of, but not really.”

Speech-to-text, or speech recognition, technology has come a long way, particularly in recent years. Its performance has improved to the point where human users can give auditory commands to a virtual assistant such as Alexa, Siri, or Google Home, and the device usually gives an appropriate response to the person’s request. However, recognizing a simple command like “Remind me at 5 pm to transcribe the podcast” is not quite the same as correctly recognizing and transcribing a 30-minute interview, which requires handling differences between two speakers and lengthy blocks of speech.

To see how well the best speech recognition tools perform today, I decided to have one of these tools attempt to transcribe the Ted Underwood podcast interview so I could compare the result to the transcript I produced by hand. The specific tool I selected was Amazon Transcribe, which is part of the Amazon Web Services (AWS) suite of tools. This service is considered one of the best options available and uses cloud computing to convert audio data to textual data, presumably much like how Amazon’s Alexa works.

It’s important to note that Amazon Transcribe is not free; however, at $0.0004 per second of audio, Ted Underwood’s interview only cost me 85 cents to transcribe. For more on Amazon Transcribe’s costs, see this page.
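For a rough sense of scale, that pricing is easy to sanity-check in a few lines of Python. The rate below is simply the per-second figure quoted above, and the interview length is approximate:

```python
# Rough cost estimate for Amazon Transcribe at the quoted rate
RATE_PER_SECOND = 0.0004  # dollars per second of audio, as cited above

def transcribe_cost(minutes: float) -> float:
    """Estimated cost in dollars to transcribe audio of the given length."""
    return minutes * 60 * RATE_PER_SECOND

# A roughly 35-minute interview comes out to about 85 cents
print(round(transcribe_cost(35.4), 2))  # → 0.85
```

At that rate, even a lengthy oral history collection would cost only a few dollars per hour of audio, which is why the real bottleneck is the cleanup afterwards rather than the price.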

In any case, here is a comparison between my manual transcript vs. Amazon Transcribe. To begin, here is the intro to the podcast as spoken and later transcribed by me:

Ben Ostermeier: Hello and welcome back to another episode of “It Takes
a Campus.” My name is Ben, and I am currently a graduate assistant at
the Scholarly Commons, and today I am joined with Dr. Ted Underwood,
who is a professor at the iSchool here at the University of Illinois.
Dr. Underwood, welcome to the podcast and thank you for taking time
to talk to me today.

And here is Amazon Transcribe’s interpretation of that same section of audio, with changes highlighted:

Hello and welcome back to another episode of it takes a campus. My
name is Ben, and I am currently a graduate assistant at Scali Commons.
And today I'm joined with Dr Ted Underwood, who is a professor 
at the high school here at the University of Illinois. 
Dr. Underwood, welcome to the podcast. Thank you for taking 
time to talk to me today.

As you can see, Amazon Transcribe did a pretty good job, but there are some mistakes and changes from the transcript I wrote by hand. It particularly had trouble with proper nouns like “Scholarly Commons” and “iSchool,” along with some minor issues like not putting a period after “Dr” and dropping the “and” conjunction in the last sentence.


Screenshot of text comparison between Amazon-generated (left) and human-generated (right) transcripts of the podcast episode.

You can see the complete changes between the two transcripts at this link.

Please note that the raw text I received from Amazon Transcribe was not separated into paragraphs initially. I had to do that myself in order to make the comparison easier to see.

In general, Amazon Transcribe does a pretty good job of recognizing speech but makes a decent number of mistakes that require cleaning up afterwards. I actually find it faster and less frustrating to transcribe by hand than to correct a ‘dirty’ transcript, but others may prefer the alternative. Additionally, an institution may have a very large number of untranscribed oral histories, for example, and if the choice is between a dirty transcript and no transcript at all, a dirty transcript is naturally preferable.

Also, while I did not have time to do this, there are ways to train Amazon Transcribe to do a better job with your audio, particularly with proper nouns like “Scholarly Commons.” You can read more about it on the AWS blog.

That said, there is very much an art to transcription, and I’m not sure if computers will ever be able to totally replicate it. When transcribing, I often have to make judgment calls about whether to include aspects of speech like “um”s and “uh”s. People also tend to start a thought and then stop and say something else, so I have to decide whether or not to include “false starts” like these. All of these judgment calls can have a significant impact on how researchers interpret a text, and to me it is crucial that a human sensitive to their implications makes these decisions. This is especially critical when transcribing an oral history that involves a power imbalance between the interviewer and interviewee.

In any case, speech-to-text technology is becoming increasingly powerful, and there may come a day, perhaps very soon, when computers can do just as good a job as humans. In the meantime, though, we will still need to rely on at least some human input to make sure transcripts are accurate.

Meet our Graduate Assistants: Ben Ostermeier

What is your background education and work experience?

I graduated from Southern Illinois University Edwardsville with a Bachelor of Arts in History, with a minor in Computer Science. I was also the first SIUE student to receive an additional minor in Digital Humanities and Social Sciences. In undergrad I worked on a variety of digital humanities projects with the IRIS Center for the digital humanities, and after graduating I was hired as the technician for the IRIS Center. In that role, I was responsible for supporting the technical needs of digital humanities projects affiliated with the IRIS Center and provided guidance to professors and students starting their own digital scholarship projects.

What led you to your field?

I have been drawn to applied humanities, particularly history, since high school, and I have long enjoyed tinkering with software and making information available online. When I was young this usually manifested in reading and writing information on fan wikis. More recently, I have particularly enjoyed working on digital archives that focus on local community history, such as the SIUE Madison Historical project at madison-historical.siue.edu.

What are your favorite projects you’ve worked on?

While working for the Scholarly Commons, I have had the opportunity to work with my fellow graduate assistant Mallory Untch to publish our new podcast, It Takes a Campus, on iTunes and other popular podcast libraries. Recently, I recorded and published an episode with Dr. Ted Underwood. Mallory and I also created an interactive timeline showcasing the history of the Scholarly Commons for the unit’s tenth anniversary last fall.

What are some of your favorite underutilized Scholarly Commons resources that you would recommend?

We offer consultations to patrons looking for in-depth assistance with their digital scholarship. You can request a consultation through our online form!

When you graduate, what would your ideal job position look like?

I would love to work as a Digital Archivist in some form, responsible for ensuring the long term preservation of digital artifacts, as well as the best way to make these objects accessible to users. It is especially important to me that these digital spaces relate to and are accessible to the people and cultures represented in the items, so I hope I am able to make these sorts of community connections wherever I end up working.

The Art Institute of Chicago Launches Public API

Application Programming Interfaces, or APIs, are a major feature of the web today. Almost every major website has one, including Google Maps, Facebook, Twitter, Spotify, Wikipedia, and Netflix. If you Google the name of your favorite website along with “API,” chances are you will find an API for it.

Last week, another institution joined the ranks of the millions of public APIs available today: The Art Institute of Chicago. While they are not the first museum to release a public API, their blog article announcing the release states that it is the largest amount of data a museum has released to the public through an API. It is also the first museum API to hold all of the institution’s public data in one location, including data about their art collection, every exhibition ever held by the Institute since 1879, blog articles, full publication texts, and more than 1,000 gift shop products.

But what exactly is an API, and why should we be excited that we can now interact with the Art Institute of Chicago in this way? An API is essentially a structured way for programs to interact with a software application, usually a website. Normally when you visit a website in a browser, such as wikipedia.org, the browser requests an HTML document in order to render the images, fonts, text, and many other bits of data related to the appearance of the web page. This is a useful way to interact with a site as a human consuming information, but it makes any sort of data analysis much more difficult. For example, answering even a simple question like “Which US president has the longest Wikipedia article?” would be time consuming if you had to view each webpage the traditional way.

Instead, an API allows you or other programs to request just the data from a web server. Using a programming language, you could use the Wikipedia API to request the text of each US President’s Wikipedia page and then simply calculate which text is the longest. API responses usually come in the form of data objects with various attributes, and the format of these objects varies between websites.
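Here is a hypothetical Python sketch of that approach using Wikipedia’s public API. The helper names and the short list of presidents are my own illustrations, not part of any official client, and the network calls are left commented out:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def extract_url(title: str) -> str:
    """Build an API URL requesting the plain-text extract of one article."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
    }
    return f"{API}?{urlencode(params)}"

def article_length(title: str) -> int:
    """Fetch an article's plain text and return its length in characters."""
    with urlopen(extract_url(title)) as response:
        pages = json.load(response)["query"]["pages"]
    return len(next(iter(pages.values()))["extract"])

# To run the comparison (requires a network connection):
# presidents = ["George Washington", "Abraham Lincoln", "Barack Obama"]
# longest = max(presidents, key=article_length)
```

Instead of loading dozens of full web pages by hand, a loop like the commented-out one answers the question in seconds.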

“A Sunday on La Grande Jatte” by Georges Seurat, the data for which is now publicly available from the Art Institute of Chicago’s API.

The same is now true for the vast collections of the Art Institute of Chicago. As a human user you can view the web page for the work “A Sunday on La Grande Jatte” by Georges Seurat at this URL:

 https://www.artic.edu/artworks/27992/a-sunday-on-la-grande-jatte-1884

If you wanted to get the data for this work through an API to do data analysis though, you could make an API request at this URL:

https://api.artic.edu/api/v1/artworks/27992

Notice how both URLs contain “27992”, which is the unique ID for that artwork.

If you open that link in a browser, you will get a bunch of formatted text (if you’re interested, it’s formatted as JSON, a format that is designed to be manipulated by a programming language). If you were to request this data in a program, you could then perform all sorts of analysis on it.
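A minimal Python sketch of such a request might look like the following. The field names at the end, like `title` and `artist_display`, are assumptions drawn from the API’s public documentation, so check the docs before relying on them:

```python
import json
from urllib.request import urlopen

API_BASE = "https://api.artic.edu/api/v1/artworks"

def artwork_url(artwork_id: int) -> str:
    """Build the request URL for a single artwork record."""
    return f"{API_BASE}/{artwork_id}"

def fetch_artwork(artwork_id: int) -> dict:
    """Request an artwork's JSON record and return its 'data' object."""
    with urlopen(artwork_url(artwork_id)) as response:
        record = json.load(response)
    return record["data"]

# "A Sunday on La Grande Jatte" has ID 27992 (requires a network connection):
# artwork = fetch_artwork(27992)
# print(artwork["title"], "-", artwork["artist_display"])
```

From there, the same record could be loaded into a spreadsheet or a pandas DataFrame alongside thousands of others for analysis.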

To get an idea of what’s possible with an art museum API, check out this FiveThirtyEight article about the collections of New York’s Metropolitan Museum of Art, which includes charts of which countries are most represented at the Met and which artistic mediums are most popular.

It is now possible to ask the same questions about the Art Institute of Chicago’s collections, along with many others, such as “What is the average size of an Impressionist painting?” or “In which years was Surrealist art most popular?” The possibilities are endless.

To get started with their API, check out their documentation. If you’re familiar with Python and perhaps its data analysis library pandas, you could check out this article about using APIs in Python to perform data analysis to start playing with the Art Institute’s API. You may also want to look at our LibGuide about qualitative data analysis to see what you could do with the data once you have it.

Holiday Data Visualizations

The fall 2020 semester is almost over, which means that it is the holiday season again! We would especially like to wish everyone in the Jewish community a happy first night of Hanukkah tonight.

To celebrate the end of this semester, here are some fun Christmas and Hanukkah-related data visualizations to explore.

Popular Christmas Songs

First up, in 2018 data journalist Jon Keegan analyzed a dataset of 122 hours of airtime from a New York radio station in early December. He was particularly interested in discovering if there was a particular “golden age” of Christmas music, since nowadays it seems that most artists who release Christmas albums simply cover the same popular songs instead of writing a new song. This is a graph of what he discovered:

Based on this dataset, 65% of popular Christmas songs were originally released in the 1940s, 50s, and 60s. With the notable exception of Mariah Carey’s “All I Want for Christmas is You” from the 90s, most of the beloved “Holiday Hits” come from the mid-20th century.

As for why this is the case, the popular webcomic XKCD claims that every year American culture tries to “carefully recreate the Christmases of Baby Boomers’ childhoods.” Regardless of whether Christmas music reflects the enduring impact of the postwar generation on America, Keegan’s dataset is available online to download for further exploration.

Christmas Trees

Last year, Washington Post reporters Tim Meko and Lauren Tierney wrote an article about where Americans get their live Christmas trees from. The article includes this map:

The green areas are forests primarily composed of evergreen Christmas trees, and the purple dots represent choose-and-cut Christmas tree farms. 98% of Christmas trees in America are grown on farms, whether a choose-and-cut farm where Americans come to select trees themselves or a farm that ships trees to stores and lots.

This next map shows which counties produce the most Christmas trees:

As you can see, the biggest Christmas tree producing areas are New England, the Appalachians, the Upper Midwest, and the Pacific Northwest, though there are farms throughout the country.

The First Night of Hanukkah

This year, Hanukkah starts tonight, December 10, but its start date varies every year. This is not the case on the lunisolar Hebrew calendar, on which Hanukkah always begins on the 25th of the month of Kislev. It is only on other calendars, particularly the solar-based Gregorian calendar, that the days of Hanukkah vary from year to year: the first night can occur as early as November 28 and as late as December 26.

In 2016, Hanukkah began on December 24, Christmas Eve, so Vox author Zachary Crockett created this graphic to show the varying dates on which the first night of Hanukkah took place from 1900 to 2016:

The Spelling of Hanukkah

Hanukkah is a Hebrew word, so there is no definitive spelling of the word in the Latin alphabet I am using to write this blog post. In Hebrew it is written as חנוכה and pronounced /ˈhɑːnəkə/ in the International Phonetic Alphabet.

According to Encyclopædia Britannica, when transliterating the spoken word into English writing, the first letter, ח, is pronounced like the ch in loch. As a result, 17th-century transliterations spell the holiday as Chanukah. However, ח does not sound the way ch does at the start of an English word, such as in chew, so in the 18th century the spelling Hanukkah became common. The H on its own is not quite correct either, and more than twenty other spelling variations have been recorded due to various other transliteration issues.

It’s become pretty common to use Google Trends to discover which spellings are most common, and various journalists have explored this in past years. Here is the most recent Google search data comparing the two most common spellings, Hanukkah and Chanukah, going back to 2004:

You can also click this link if you are reading this article after December 2020 and want even more recent data.

As you would expect, searches for both terms spike every December. It warrants further analysis, but it appears that Chanukah is becoming less common in favor of Hanukkah, possibly reflecting some ongoing standardization; at some point, the latter may be considered the standard spelling.

You can also use Google Trends to see what the data looks like for Google searches in Israel:

Again, here is a link to see the most recent version of this data.

In Israel, it also appears as though the Hanukkah spelling is becoming increasingly common, though early on there were years in which Chanukah was the more popular spelling.


I hope you’ve enjoyed seeing these brief explorations into data analysis related to Christmas and Hanukkah and the quick discoveries we made with them. But more importantly, I hope you have a happy and relaxing holiday season!

Free, Open Source Optical Character Recognition with gImageReader

Optical Character Recognition (OCR) is a powerful tool for transforming scanned, static images of text into machine-readable data, making it possible to search, edit, and analyze text. If you’re using OCR, chances are you’re working with either ABBYY FineReader or Adobe Acrobat Pro. However, both ABBYY and Acrobat are proprietary software with a steep price tag, and while they are both available in the Scholarly Commons, you may want to perform OCR beyond your time at the University of Illinois.

Thankfully, there’s a free, open source alternative for OCR: Tesseract. By itself, Tesseract only works through the command line, which creates a steep learning curve for those unaccustomed to working with a command-line interface (CLI). Additionally, it is fairly difficult to transform a jpg into a searchable PDF with Tesseract.

Thankfully, there are many free, open source programs that provide Tesseract with a graphical user interface (GUI), which not only makes Tesseract much easier to use but, in some cases, adds layout editors that make it possible to create searchable PDFs. You can see the full list of programs on this page.

The program logo for gImageReader


In this post, I will focus on one of these programs, gImageReader, but as you can see on that page, there are many options available on multiple operating systems. I tried all of the Windows-compatible programs and decided that gImageReader was the closest to what I was looking for: a free alternative to ABBYY FineReader that does a pretty good job of letting you correct OCR mistakes and export to a searchable PDF.

Installation

gImageReader is available for Windows and Linux. Though they do not include a Mac compatible version in the list of releases, it may be possible to get it to work if you use a package manager for Mac such as Homebrew. I have not tested this though, so I do not make any guarantees about how possible it is to get a working version of gImageReader on Mac.

To install gImageReader on Windows, go to the project’s releases page. From there, go to the most recent release of the program at the top and click Assets to expand the list of files included with the release. Then select the file that has the .exe extension to download it. You can then run that file to install the program.

Manual

The installation of gImageReader comes with a manual as an HTML file that can be opened by any browser. As of the date of this post, the Fossies software archive is hosting the manual on its website.

Setting OCR Mode

gImageReader has two OCR modes: “Plain Text” and “hOCR, PDF”. Plain Text is the default mode and only recognizes the text itself without any formatting or layout detection. You can export this to a text file or copy and paste it into another program. This may be useful in some cases, but if you want to export a searchable PDF, you will need to use hOCR, PDF mode. hOCR is a standard for formatting OCR text using either XML or HTML and includes layout information, font, OCR result confidence, and other formatting information.
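For a sense of what hOCR looks like under the hood, here is a simplified, hypothetical fragment for a single recognized word; the bounding-box coordinates and confidence value are invented for illustration:

```html
<div class='ocr_page' title='bbox 0 0 2480 3508'>
  <span class='ocr_line' title='bbox 210 400 1200 460'>
    <span class='ocrx_word' title='bbox 210 400 430 460; x_wconf 67'>acknowledgment</span>
  </span>
</div>
```

The `title` attribute carries the layout information: `bbox` gives pixel coordinates on the page, and `x_wconf` is Tesseract’s confidence in the word, the same confidence value gImageReader displays in its hOCR tree.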

To set the recognition to hOCR, PDF mode, go to the toolbar at the top. It includes a section for “OCR mode” with a dropdown menu. From there, click the dropdown and select hOCR, PDF:

gImageReader Toolbar

This is the toolbar for gImageReader. You can set OCR mode by using the dropdown that is the third option from the right.

Adding Images, Performing Recognition, and Setting Language

If you have images already scanned, you can add them to be recognized by clicking the Add Images button on the left panel, which looks like a folder. You can then select multiple images if you want to create a multipage PDF. You can always add more images later by clicking that folder button again.

On that left panel, you can also click the Acquire tab button, which allows you to get images directly from a scanner, if the computer you’re using has a scanner connected.

Once you have the images you want, click the Recognize button to recognize the text on the page. Please note that if you have multiple images added, you’ll need to click this button for every page.

If you want to perform recognition on a language other than English, click the arrow next to Recognize. You’ll need to have that language installed, but you can install additional languages by clicking “Manage Languages” in the dropdown that appears. If the language is already installed, you can select it from the first option listed in the dropdown.

Viewing the OCR Result

In this example, I will be performing OCR on this letter by Franklin D. Roosevelt:

Raw scanned image of a typewritten letter signed by Franklin Roosevelt

This 1928 letter from Franklin D. Roosevelt to D. H. Mudge Sr. is courtesy of Madison Historical: The Online Encyclopedia and Digital Archive for Madison County Illinois. https://madison-historical.siue.edu/archive/items/show/819

Once you’ve performed OCR, there will be an output panel on the right. There are a series of buttons above the result. Click the button on the far right to view the text result overlaid on top of the image:

The text result of performing OCR on the FDR letter overlaid on the original scan.

Here is the text overlaid on an image of the original scan. Note how the scan is slightly transparent now to make the text easier to read.

Correcting OCR

The OCR process did a pretty good job with this example, but there are a handful of errors. You can click on any of the words of text to show them on the right panel. I will click on the “eclnowledgment” at the end of the letter to correct it. It will then jump to that part of the hOCR “tree” on the right:

hOCR tree in gImageReader, which shows the recognition result of each word in a tree-like structure.


Note that in this screenshot I have clicked the second button from the right to show the confidence values, where a higher number means Tesseract has higher confidence in the result. In this case, it is 67% sure that “eclnowledgment” is correct. Since it obviously isn’t, we can fix it by double-clicking the word in this panel and typing “acknowledgment.” You can do this for any errors on the page.

Other correction tips:

  1. If there are any regions that are not text that it is still recognizing, you can right click them on the right and delete them.
  2. You can change the recognized font and its size by going to the bottom area labeled “Properties.” Font size is controlled by the x_fsize field, and x_font has a dropdown where you can select a font.
  3. It is also possible to change the area of the blue word box once it is selected, simply by clicking and dragging the edges and corners.
  4. If there is an area of text that was not captured by the recognition, you can also right click in the hOCR “tree” to add text blocks, paragraphs, textlines, and words to the document. This allows you to draw a box on the image and then type what the text says.

Exporting to PDF

Once you are done making OCR corrections, you can export to a searchable PDF. To do so, click the Export button above the hOCR “tree,” which is the third button from the left. Then, select export to PDF. It then gives you several options to set the compression and quality of the PDF image, and once you click OK, it should export the PDF.

Conclusion

Unfortunately, there are some limitations to gImageReader, as can often be the case with free, open source software. Here are some potential problems you may have with this program:

  1. While you can add new areas to recognize with OCR, there is not a way to change the order of these elements inside the hOCR “tree,” which could be an issue if you are trying to make the reading order clear for accessibility reasons. One potential workaround could be to use the Reading Order options in Adobe Acrobat, which you can read about in this LibGuide.
  2. You cannot show the areas of the document that are in a recognition box unless you click on a word, unlike ABBYY FineReader which shows all recognition areas at once on the original image.
  3. You cannot perform recognition on all pages at once. You have to click the recognition button individually for each page.
  4. Though there are some image correction options to improve OCR, such as brightness, contrast, and rotation, it does not have as many options as ABBYY FineReader.

gImageReader is not nearly as user-friendly as ABBYY FineReader, nor does it have all of FineReader’s features, so you will probably want to use ABBYY if it is available to you. However, I find gImageReader a pretty good program that can meet most general OCR needs.