What is OCR? OCR stands for Optical Character Recognition. This is the electronic identification and digital encoding of typed or printed text by means of an optical scanner or specialized software. Performing OCR allows computers to read static images of text and convert them into readable, editable, and searchable text. There are many applications of OCR, including the creation of more accessible documents for people who are blind or visually impaired, text/data mining projects, textual comparisons, and large-scale digitization projects.
There are several software options to consider when you are performing OCR on your documents, and it can be challenging to understand which one is best for you. So let’s break it down.
When deciding which OCR software to use, first think about what the outcome of your project should be. Here are a few questions to ask yourself before you get started:
- What do I want my output to be?
- What type of file do I need and how will it be used?
- Are precision and accuracy a priority?
- Are there standards for my project/repository?
Answering these questions will help you determine which software will be best for your project.
Let’s now look at some software!
I will cover two of the most popular programs, ABBYY FineReader and Adobe Acrobat Pro.
ABBYY FineReader
ABBYY is seen as the more advanced option. If precision is a necessity, ABBYY is a good tool for you. While it is not 100% accurate, it includes many editing options to help you fix mistakes. ABBYY works well with text analysis programs such as ATLAS.ti and NVivo. ABBYY can process documents or images scanned straight into the program, as well as existing image or PDF files.
- Can process documents in over 190 languages
- Text and image are displayed side by side in separate boxes so you can review and fix “uncertain” items
- Uncertain items are highlighted
- Can identify structural components of documents, such as images or tables
- “Training Mode” to recognize decorative or special fonts or symbols
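If you are curious what “uncertain items” means in practice, here is a minimal sketch of the general idea. It is not ABBYY's actual method or API. It assumes a hypothetical OCR engine that reports a confidence score between 0 and 1 for each word. The sample words, the scores, the 0.80 threshold, and the function names are all made up for illustration.

```python
from typing import List, Tuple

# Hypothetical OCR output: (recognized word, confidence between 0 and 1).
# These values are invented for illustration, not produced by a real engine.
SAMPLE_WORDS: List[Tuple[str, float]] = [
    ("Optical", 0.98),
    ("Charactcr", 0.61),   # likely a misread of "Character"
    ("Recognition", 0.95),
    ("l0", 0.42),          # likely a misread of "10"
]


def uncertain_words(words: List[Tuple[str, float]], threshold: float = 0.80) -> List[str]:
    """Return the words whose confidence falls below the threshold."""
    return [word for word, confidence in words if confidence < threshold]


def mark_uncertain(words: List[Tuple[str, float]], threshold: float = 0.80) -> str:
    """Join the words into one line, wrapping low-confidence words in [[ ]]."""
    marked = []
    for word, confidence in words:
        if confidence < threshold:
            marked.append(f"[[{word}]]")  # stands in for on-screen highlighting
        else:
            marked.append(word)
    return " ".join(marked)


if __name__ == "__main__":
    print(mark_uncertain(SAMPLE_WORDS))
```

Lowering the threshold flags fewer words for review, while raising it flags more. That is the same trade-off you face when deciding how closely to proofread OCR output.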
Adobe Acrobat Pro
Most people are already familiar with Adobe interfaces, which makes the learning process much easier than with ABBYY. You are more likely to see Adobe used outside of an academic setting, and it is best used when you just need to make a PDF searchable. Adobe can process documents or images scanned straight into the program, as well as existing image or PDF files.
- Connects with the Adobe suite and document editing tools
- Can identify multiple languages
- Easy-to-use interface
- Converts documents of almost any type into tagged PDFs
While neither of these programs is free, both are available for your use in the Scholarly Commons!
Once you choose which software is best for your project, you can run your documents through OCR!
For more information on OCR, check out this guide on OCR Basics or attend a Savvy Researcher Workshop on “Making Searchable PDFs with OCR.”
If you’re having trouble with your OCR, visit this previous post, “What To Do When OCR Software Doesn’t Seem To Be Working,” or reach out to the Scholarly Commons directly at firstname.lastname@example.org.