While optical character recognition (OCR) is a powerful tool, it’s not a perfect one. Inputting a document into an OCR software doesn’t necessarily mean that the software will actually output something useful 100% of the time. Though most documents come out without a hitch, we have a few tips on what to do if your document just isn’t coming out.
The problem may be less with your program and more with your initial scan. Low-quality scans are less likely to be read by OCR software. Here are a few considerations to keep in mind when scanning a document you will be using OCR on:
- Make sure your document is scanned at 300 DPI
- Keep your brightness level at 50%
- Try to keep your scan as straight as possible
If you’re working with a document that you cannot create another scan for, there’s still hope! OCR engines with a GUI tend to have photo editing tools in them. If your OCR software doesn’t have those tools, or if their provided tools aren’t cutting it, try using a photo manipulation tool such as Photoshop or GIMP to edit your document. Also, remember OCR software tends to be less effective when used on photographs than on scans.
The issues you’re having may not stem from the scanning, but from the text itself. These issues can be more difficult to solve, because you cannot change the content of the original document, but they’re still good tips to know, especially when diagnosing issues with OCR.
- Make sure that your document is in a language, and from a period that your OCR software recognizes; not all engines are trained to recognize all languages
- Low contrast in documents can reduce OCR accuracy; contrast can be adjusted in a photo manipulation tool
- Text created prior to 1850 or with a typewriter can be more difficult for OCR software to read
- OCR software cannot read handwriting; while we’d all like to digitize our handwritten notes, OCR software just isn’t there yet
Working with Digital Files
Digital files can, in many ways, be more complicated to use OCR software on, just because someone else may have made the file. This means that a file is lower-quality to begin with, or that whoever scanned the file may have made errors. Most likely, you will run into scenarios that are easy fixes using photo manipulation tools. But there will be times that the images you come across just won’t work. It’s frustrating, but you’re not alone. Check out your options!
Always Remember that OCR is Imperfect
Even with perfect documents that you think will yield perfect results, there will be a certain percentage of mistakes. Most OCR software packages have an error rate between 97-99% per character. While this may seem like it’s not many errors, in a page with 1,800 characters, there will be between 18 and 54 errors. In a 300 page book with 1,800 characters per page, that’s between 5,400 and 16,200. So always be diligent and clean up your OCR!
The Scholarly Commons
Here at the Scholarly Commons, we have Adobe Acrobat Pro installed on every computer, and ABBYY FineReader installed on several. We can also help you set up Tesseract on your own computer. If you would like to learn more about OCR, check out our LibGuide and keep your eye open for our next Making Scanned Text Machine Readable through Optical Character Recognition Savvy Researcher workshop!