What To Do When OCR Software Doesn’t Seem To Be Working

Optical character recognition can enhance your research!

While optical character recognition (OCR) is a powerful tool, it’s not a perfect one. Feeding a document into OCR software doesn’t guarantee that the software will output something useful. Though most documents come out without a hitch, we have a few tips on what to do if yours just isn’t coming out.

Scanning Issues

The problem may be less with your program and more with your initial scan. Low-quality scans are less likely to be read by OCR software. Here are a few considerations to keep in mind when scanning a document you will be using OCR on:

  • Make sure your document is scanned at 300 DPI
  • Keep your brightness level at 50%
  • Try to keep your scan as straight as possible

If you’re working with a document that you cannot create another scan for, there’s still hope! OCR engines with a GUI tend to have photo editing tools in them. If your OCR software doesn’t have those tools, or if their provided tools aren’t cutting it, try using a photo manipulation tool such as Photoshop or GIMP to edit your document. Also, remember OCR software tends to be less effective when used on photographs than on scans.

Textual Issues

The issues you’re having may not stem from the scanning, but from the text itself. These issues can be more difficult to solve, because you cannot change the content of the original document, but they’re still good tips to know, especially when diagnosing issues with OCR.

  • Make sure that your document is in a language and from a period that your OCR software recognizes; not all engines are trained to recognize all languages
  • Low contrast in documents can reduce OCR accuracy; contrast can be adjusted in a photo manipulation tool
  • Text created prior to 1850 or with a typewriter can be more difficult for OCR software to read
  • OCR software cannot read handwriting; while we’d all like to digitize our handwritten notes, OCR software just isn’t there yet
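The contrast adjustment mentioned in the list above is usually done in a photo manipulation tool, but the underlying idea is simple enough to sketch. The example below is a minimal illustration, not what any particular tool does internally: it assumes a grayscale image represented as a plain list of rows with values from 0 (black) to 255 (white), and the function name `stretch_contrast` is our own invention.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values so the darkest pixel
    becomes 0 and the lightest becomes 255 (a min-max stretch)."""
    flat = [value for row in pixels for value in row]
    darkest, lightest = min(flat), max(flat)
    if darkest == lightest:
        # A completely flat image has no contrast to stretch.
        return [row[:] for row in pixels]
    scale = 255 / (lightest - darkest)
    return [[round((value - darkest) * scale) for value in row]
            for row in pixels]


# Hypothetical faded scan: every value is bunched between 100 and 160.
faded = [[100, 120],
         [140, 160]]
print(stretch_contrast(faded))  # [[0, 85], [170, 255]]
```

After stretching, the same four pixels span the full range from black to white, which is the kind of separation between ink and paper that tends to help OCR.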

Working with Digital Files

Digital files can, in many ways, be more complicated to use OCR software on, simply because someone else may have made the file. This means that a file may be lower-quality to begin with, or that whoever scanned the file may have made errors. Most likely, you will run into scenarios that are easy fixes using photo manipulation tools. But there will be times that the images you come across just won’t work. It’s frustrating, but you’re not alone. Check out your options!

Always Remember that OCR is Imperfect

Even with perfect documents that you think will yield perfect results, there will be a certain percentage of mistakes. Most OCR software packages have an accuracy rate between 97% and 99% per character, which means 1% to 3% of characters come out wrong. While this may not seem like many errors, in a page with 1,800 characters, there will be between 18 and 54 errors. In a 300-page book with 1,800 characters per page, that’s between 5,400 and 16,200 errors. So always be diligent and clean up your OCR!
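The arithmetic above is easy to redo for your own document. The short sketch below reproduces the figures from this section; the function name `expected_errors` is ours, and it assumes accuracy is given as a whole-number percentage per character.

```python
def expected_errors(characters, accuracy_percent):
    """Estimate how many characters OCR will get wrong,
    given a per-character accuracy as a whole-number percentage."""
    return characters * (100 - accuracy_percent) // 100


CHARS_PER_PAGE = 1800
PAGES = 300

for accuracy in (99, 97):
    per_page = expected_errors(CHARS_PER_PAGE, accuracy)
    per_book = expected_errors(CHARS_PER_PAGE * PAGES, accuracy)
    print(f"{accuracy}% accuracy: {per_page} errors per page, "
          f"{per_book} errors per book")
```

Swap in your own page count and characters per page to get a rough sense of how much cleanup to budget for.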

The Scholarly Commons

Here at the Scholarly Commons, we have Adobe Acrobat Pro installed on every computer, and ABBYY FineReader installed on several. We can also help you set up Tesseract on your own computer. If you would like to learn more about OCR, check out our LibGuide and keep your eye open for our next Making Scanned Text Machine Readable through Optical Character Recognition Savvy Researcher workshop!
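If you do set up Tesseract on your own computer, it runs from the command line rather than through a GUI. As a rough sketch, assuming Tesseract is installed and on your PATH, a script along these lines could call it for you. The helper names and the file names `scan.png` and `scan` are hypothetical; `-l` is Tesseract’s option for choosing the recognition language.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the command line: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract on one image and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract is not installed or not on your PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, language),
        check=True,  # raise an error if Tesseract reports a failure
    )
    # Tesseract appends ".txt" to the output base name.
    return output_base + ".txt"


# Hypothetical file names; this only prints the command that would be run.
print(" ".join(build_tesseract_command("scan.png", "scan")))
```

Calling `run_ocr("scan.png", "scan")` would then produce `scan.txt`, which you should still proofread for the errors described above.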

Learning to Make Documents Accessible with OCR Software

Accessibility in the digital age can be difficult for people to understand, especially given the sheer number of ways to present information on the computer. However, creating content that is accessible to all individuals should be a priority for researchers. Creating accessible documents is an easy process, and the Scholarly Commons has the software you need to make that happen.

Optical character recognition software (otherwise known as OCR) can convert scanned documents, PDF documents, and image documents into editable and searchable documents. Documents that have gone through OCR software can then be recognized and read by screen reader software. Screen readers are tools often used by those with visual impairments; they convert textual content into ‘synthesized’ speech, which is then read aloud to the user.

One trick to see whether or not a digital document is accessible is to try to highlight a line of text and then copy-paste it into another document. If you can successfully do that, your document is ready to be read by a screen reader. If you cannot highlight a single line of text and/or copy-paste it, you may want to consider putting your document through OCR software. However, if you have a “protected” PDF, you will not be able to reformat the document for accessibility.

OCR readers can read more than just digital documents: they are powerful tools that can also work on scanned documents, either typed or handwritten. That is not to say that they are infallible, however. OCR software may have difficulties reading documents created before 1850, and may not always be 100% accurate. The user must be vigilant to make sure that mistakes don’t creep into the final product.

The Scholarly Commons is outfitted with two OCR programs: ABBYY FineReader and Adobe Acrobat. To read more on the specifics of each program, see the ABBYY FineReader LibGuide or Adobe Acrobat’s Guide to OCR. There are also numerous PDF readers available online, so look around and find the option that works best for you. Just a little time with this user-friendly software can make not only your research more accessible, but the world a little more accessible as a whole.