Cool Text Data – Music, Law, and News!

Computational text analysis can be done in virtually any field, from biology to literature. You may use topic modeling to determine which areas are the most heavily researched in your field, or attempt to determine the author of an orphan work. Where can you find text to analyze? So many places! Read on for sources to find unique text content.


Genius – the song lyrics database

Genius started as Rap Genius, a site where rap fans could gather to annotate and analyze rap lyrics. It expanded to include other genres in 2014, and now manages a massive database of lyrics and fan-submitted annotations covering artists from Ariana Grande to Fleetwood Mac. All of this text can be downloaded and analyzed using the Genius API. Using Genius and a text mining method, you could see how themes present in popular music have changed over recent years, or better understand a particular artist’s creative process.

homepage of case.law, with Ohio highlighted, 147,692 unique cases. 31 reporters. 713,568 pages scanned.


Case.law – the case law database

The Caselaw Access Project (CAP) is a fairly recent, ongoing project that publishes machine-readable text digitized from over 40,000 bound volumes of case law from the Harvard Law School Library. The earliest case is from 1658, with the most recent cases from June 2018. An API and bulk data downloads make it easy to get this text data. What can you do with huge amounts of case law? Well, for starters, you can generate a unique case law limerick:

Wheeler, and Martin McCoy.
Plaintiff moved to Illinois.
A drug represents.
Pretrial events.
Rocky was just the decoy.

Check out the rest of their gallery for more project ideas.

Newspapers and More

There are many places you can get text from digitized newspapers, both recent and historical. Some newspapers are hundreds of years old, so there can be problems with the OCR (Optical Character Recognition) that make it difficult to get accurate results from your text analysis. Making newspaper text machine-readable requires special attention, since newspapers are printed on thin paper and may have been stacked up in a dusty closet for 60 years! See OCR considerations here. The newspaper text described below is already machine-readable and ready for text mining, but as with any text mining project, you must pay close attention to the quality of your text.

The Chronicling America project sponsored by the Library of Congress contains digital copies of newspapers with machine-readable text from all over the United States and its territories, from 1690 to today. Using newspaper text data, you can analyze how topics discussed in newspapers change over time, among other things.
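As a minimal sketch of that kind of analysis, the Python snippet below measures how large a share of the words printed in each year a given term accounts for. The `pages` records and the function name `term_share_by_year` are hypothetical placeholders made up for illustration; they are not part of any Chronicling America tooling, and real text would come from the project's own downloads.

```python
import re
from collections import Counter


def term_share_by_year(pages, term):
    """Return {year: share of all tokens equal to `term`} for (year, text) pairs."""
    hits = Counter()
    totals = Counter()
    for year, text in pages:
        # Lowercase and keep alphabetic runs only; a deliberately crude tokenizer.
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical sample records standing in for downloaded newspaper text.
pages = [
    (1918, "Influenza closes schools. Influenza cases rise."),
    (1918, "The war news from France."),
    (1925, "Radio sets sell quickly. No influenza reported."),
]

shares = term_share_by_year(pages, "influenza")
for year, share in shares.items():
    print(year, round(share, 3))
```

Dividing by the total number of words per year, rather than reporting raw counts, keeps a year with more surviving pages from looking like a year with more interest in the topic.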


Looking for newspapers from a different region? The library has contracts with several vendors that allow text mining, including Gale and ProQuest. Both provide newspaper text suitable for text mining, from The Daily Mail of London (Gale) to the Chinese Newspapers Collection (ProQuest). The way you access the text data itself will differ between the two vendors, and the library can help you navigate the collections. See the Finding Text Data library guide for more information.

The sources mentioned above are just highlights of our text data collection! The Illinois community has access to a huge amount of text, including newspapers and primary sources, but also research articles and books! Check out the Finding Text Data library guide for a more complete list of sources. And, when you’re ready to start your text mining project, contact the Scholarly Commons (sc@library.illinois.edu), and let us help you get started!

Lightning Review: Optical Character Recognition: An Illustrated Guide to the Frontier



Stephen V. Rice, George Nagy, and Thomas A. Nartker’s work on OCR, though written in 1999, is still a remarkably valuable bedrock text for diving into the technology. Though OCR systems have evolved, and continue to evolve, with each passing day, the study presented in their book still highlights some of the major issues one faces when performing optical character recognition: text is in an unusual typeface or contains stray marks, or print is too heavy or too light. The book gives those interested in the general problems that arise in OCR a great guide to what they and their patrons might encounter.

The book opens with a quote from C-3PO, and a discussion of how our collective sci-fi imagination believes technology will have “cognitive and linguistic abilities” that match and perhaps even exceed our own (Rice et al., 1999, p. 1).


The human eye is the most powerful character identifier in existence. As the authors note, “A seven year old child can identify characters with far greater accuracy than the leading OCR systems” (Rice et al., 1999, p. 165). I found this simple explanation so helpful for when I get questions here in the Scholarly Commons from patrons who are confused as to why their document, even after being run through OCR software, is not perfectly recognized. It is very easy, with our human eyes, to discern when a mark on a page is nothing of importance, and when it is a letter. Ninety-nine percent character accuracy doesn’t mean ninety-nine percent page accuracy.
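To see why character accuracy and page accuracy diverge so sharply, here is a rough back-of-the-envelope sketch in Python. It assumes that errors on individual characters are independent of one another (a simplification; real OCR errors tend to cluster) and that a page holds 1,800 characters, a figure chosen purely for illustration. The function name is hypothetical.

```python
def error_free_page_probability(char_accuracy, chars_per_page):
    """Chance that every character on a page is right, assuming independent errors."""
    return char_accuracy ** chars_per_page


# 1,800 characters per page is an assumed figure for illustration.
p = error_free_page_probability(0.99, 1800)
print(f"Chance of a completely clean page: {p:.2e}")
```

Under these assumptions, the chance of a page coming through with no errors at all is well under one in a million, even though 99 out of every 100 characters are read correctly.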


In summary, this work presents a great starting point for those with an interest in understanding OCR technology, even at almost two decades old.

Give it, and the many other fabulous books in our reference collection, a read!

What To Do When OCR Software Doesn’t Seem To Be Working

Optical character recognition can enhance your research!

While optical character recognition (OCR) is a powerful tool, it’s not a perfect one. Inputting a document into OCR software doesn’t guarantee that the software will output something useful. Though most documents come out without a hitch, we have a few tips on what to do if your document just isn’t coming out.

Scanning Issues

The problem may be less with your program and more with your initial scan. Low-quality scans are less likely to be read by OCR software. Here are a few considerations to keep in mind when scanning a document you will be using OCR on:

  • Make sure your document is scanned at 300 DPI
  • Keep your brightness level at 50%
  • Try to keep your scan as straight as possible
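If you are unsure whether an existing scan meets the 300 DPI guideline, the arithmetic can be sketched in a few lines of Python. The function names and the US Letter page size below are assumptions made for illustration; substitute the physical dimensions of your own document.

```python
def pixels_needed(width_in, height_in, dpi=300):
    """Pixel dimensions a scan should have for a page of the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width, width_in):
    """Resolution implied by an image's pixel width and the page's physical width."""
    return pixel_width / width_in


# Assumed example: a US Letter page, 8.5 by 11 inches.
target = pixels_needed(8.5, 11)
print(target)

# A scan of the same page that is only 1,275 pixels wide falls short of 300 DPI.
print(effective_dpi(1275, 8.5))
```

An image much smaller than the target dimensions was probably scanned at a lower resolution, and rescanning is usually a better fix than enlarging it in an image editor.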

If you’re working with a document that you cannot create another scan for, there’s still hope! OCR engines with a GUI tend to have photo editing tools in them. If your OCR software doesn’t have those tools, or if their provided tools aren’t cutting it, try using a photo manipulation tool such as Photoshop or GIMP to edit your document. Also, remember OCR software tends to be less effective when used on photographs than on scans.

Textual Issues

The issues you’re having may not stem from the scanning, but from the text itself. These issues can be more difficult to solve, because you cannot change the content of the original document, but they’re still good tips to know, especially when diagnosing issues with OCR.

  • Make sure that your document is in a language and from a period that your OCR software recognizes; not all engines are trained to recognize all languages
  • Low contrast in documents can reduce OCR accuracy; contrast can be adjusted in a photo manipulation tool
  • Text created prior to 1850 or with a typewriter can be more difficult for OCR software to read
  • OCR software cannot read handwriting; while we’d all like to digitize our handwritten notes, OCR software just isn’t there yet

Working with Digital Files

Digital files can, in many ways, be more complicated to use OCR software on, simply because someone else may have made the file. This means that a file may be lower-quality to begin with, or that whoever scanned the file may have made errors. Most likely, you will run into scenarios that are easy fixes using photo manipulation tools. But there will be times that the images you come across just won’t work. It’s frustrating, but you’re not alone. Check out your options!

Always Remember that OCR is Imperfect

Even with perfect documents that you think will yield perfect results, there will be a certain percentage of mistakes. Most OCR software packages have an accuracy rate between 97% and 99% per character. While this may seem like it’s not many errors, in a page with 1,800 characters, there will be between 18 and 54 errors. In a 300 page book with 1,800 characters per page, that’s between 5,400 and 16,200 errors. So always be diligent and clean up your OCR!
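The arithmetic above can be reproduced with a short Python sketch, treating 97% and 99% as the share of characters read correctly. The function name is hypothetical, and the page and book sizes are the ones used in the paragraph above.

```python
def expected_errors(accuracy, chars_per_page, pages=1):
    """Expected number of misread characters at a given per-character accuracy."""
    # round() guards against floating-point noise in (1 - accuracy).
    return round((1 - accuracy) * chars_per_page * pages)


for accuracy in (0.99, 0.97):
    per_page = expected_errors(accuracy, 1800)
    per_book = expected_errors(accuracy, 1800, pages=300)
    print(f"{accuracy:.0%} accurate: {per_page} errors per page, {per_book} per book")
```

Plugging in your own page count and an accuracy estimate for your software gives a rough sense of how much cleanup time to budget.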

The Scholarly Commons

Here at the Scholarly Commons, we have Adobe Acrobat Pro installed on every computer, and ABBYY FineReader installed on several. We can also help you set up Tesseract on your own computer. If you would like to learn more about OCR, check out our LibGuide and keep your eye open for our next Making Scanned Text Machine Readable through Optical Character Recognition Savvy Researcher workshop!
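Once Tesseract is installed, one way to run it is from a short Python script that calls its command-line program. The sketch below assumes the `tesseract` executable is on your PATH and that the English language data is installed; the file names and helper function names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the CLI call: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH; install it first")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # By default Tesseract appends .txt to the output base name.
    return output_base + ".txt"


# Hypothetical file names, shown only to illustrate the command that would run.
cmd = build_tesseract_command("scan_page1.png", "scan_page1")
print(" ".join(cmd))
```

Calling `run_ocr("scan_page1.png", "scan_page1")` would then write the recognized text to `scan_page1.txt`, provided the image exists and Tesseract is installed.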

Learning to Make Documents Accessible with OCR Software


Accessibility in the digital age can be difficult for people to understand, especially given the sheer number of ways to present information on the computer. However, creating content that is accessible to all individuals should be a priority for researchers. Creating accessible documents is an easy process, and the Scholarly Commons has the software you need to make that happen.

Optical character recognition software (otherwise known as OCR) has the ability to convert scanned documents, PDF documents, and image documents into editable and searchable documents. Documents that have gone through OCR software can then be recognized and read by screen reader software. Screen readers are tools often used by those with visual impairments; they convert textual content into ‘synthesized’ speech, which is then read aloud to the user.

One trick to see whether or not a digital document is accessible is to try to highlight a line of text and then copy-paste it into another document. If you can successfully do that, your document is ready to be read by a screen reader. If you cannot highlight a single line of text and/or copy-paste it, you may want to consider putting your document through OCR software. However, if you have a “protected” PDF, you will not be able to reformat the document for accessibility.

OCR readers can read more than just digital documents – they are powerful tools that can also perform their function on scanned documents, either typed or handwritten. That is not to say that they are infallible, however. OCR software may have difficulties reading documents created before 1850, and may not always be 100% accurate. The user must be vigilant to make sure that mistakes don’t creep their way into the final product.

The Scholarly Commons is outfitted with two OCR programs: ABBYY FineReader and Adobe Acrobat. To read more on the specifics of each program, see the ABBYY FineReader LibGuide or Adobe Acrobat’s Guide to OCR. There are also numerous PDF readers available online; look around and find the option that works best for you. Just a little time with this user-friendly software can make not only your research more accessible, but the world a little more accessible as a whole.