Free, Open Source Optical Character Recognition with gImageReader

Optical Character Recognition (OCR) is a powerful tool to transform scanned, static images of text into machine-readable data, making it possible to search, edit, and analyze text. If you’re using OCR, chances are you’re working with either ABBYY FineReader or Adobe Acrobat Pro. However, both ABBYY and Acrobat are proprietary software with a steep price tag, and while they are both available in the Scholarly Commons, you may want to perform OCR beyond your time at the University of Illinois.

Thankfully, there’s a free, open source alternative for OCR: Tesseract. By itself, Tesseract only works through the command line, which creates a steep learning curve for those unaccustomed to working with a command-line interface (CLI). Additionally, it is fairly difficult to transform a jpg into a searchable PDF with Tesseract.
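To give a sense of what the command-line route involves, here is a minimal sketch in Python that builds and runs a Tesseract call. It assumes the tesseract executable is installed and on your PATH. The function names and the file name letter.jpg are hypothetical and only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_format="txt", language="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG FORMAT."""
    image = Path(image_path)
    # Tesseract appends the file extension itself, so pass the path without one.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", language, output_format]


def run_ocr(image_path, output_format="txt", language="eng"):
    """Run Tesseract if it is installed and return the path it should write to."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_format, language), check=True)
    return Path(image_path).with_suffix("." + output_format)


if __name__ == "__main__":
    # Hypothetical file name; "hocr" asks for hOCR output instead of plain text.
    print(build_tesseract_command("letter.jpg", "hocr"))
```

Even with a script like this, you get no chance to review or correct the recognized text before it is written out, which is where a GUI helps.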

Fortunately, there are many free, open source programs that provide Tesseract with a graphical user interface (GUI). A GUI not only makes Tesseract much easier to use; some of these programs also come with layout editors that make it possible to create searchable PDFs. You can see the full list of programs on this page.

The program logo for gImageReader

In this post, I will focus on one of these programs, gImageReader, but as you can see on that page, there are many options available on multiple operating systems. I tried all of the Windows-compatible programs and decided that gImageReader was the closest to what I was looking for: a free alternative to ABBYY FineReader that does a pretty good job of letting you correct OCR mistakes and export to a searchable PDF.

Installation

gImageReader is available for Windows and Linux. Though they do not include a Mac compatible version in the list of releases, it may be possible to get it to work if you use a package manager for Mac such as Homebrew. I have not tested this though, so I do not make any guarantees about how possible it is to get a working version of gImageReader on Mac.

To install gImageReader on Windows, go to the releases page on GitHub. From there, go to the most recent release of the program at the top and click Assets to expand the list of files included with the release. Then select the file that has the .exe extension to download it. You can then run that file to install the program.

Manual

The installation of gImageReader comes with a manual as an HTML file that can be opened by any browser. As of the date of this post, the Fossies software archive is hosting the manual on its website.

Setting OCR Mode

gImageReader has two OCR modes: “Plain Text” and “hOCR, PDF”. Plain Text is the default mode and only recognizes the text itself without any formatting or layout detection. You can export this to a text file or copy and paste it into another program. This may be useful in some cases, but if you want to export a searchable PDF, you will need to use hOCR, PDF mode. hOCR is a standard for formatting OCR text using either XML or HTML and includes layout information, font, OCR result confidence, and other formatting information.
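To make the hOCR format more concrete, here is a minimal sketch in Python that reads the words, bounding boxes, and confidence values out of a small hOCR fragment. The fragment is hand-written for illustration in the style Tesseract produces (ocrx_word spans whose title attribute holds bbox and x_wconf), and the coordinates and scores are made up. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Hand-written sample; the numbers are invented for illustration.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 120 900 860 940'>
  <span class='ocrx_word' title='bbox 120 900 380 940; x_wconf 67'>eclnowledgment</span>
  <span class='ocrx_word' title='bbox 400 900 440 940; x_wconf 96'>of</span>
</span>
"""


def parse_title(title):
    """Split 'bbox 1 2 3 4; x_wconf 67' into {'bbox': [...], 'x_wconf': [...]}."""
    properties = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            properties[tokens[0]] = tokens[1:]
    return properties


class WordCollector(HTMLParser):
    """Collect the text, bounding box, and confidence of each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("class") == "ocrx_word":
            properties = parse_title(attributes.get("title") or "")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in properties.get("bbox", [])),
                "confidence": float(properties.get("x_wconf", ["0"])[0]),
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    """Return one dict per recognized word in the hOCR markup."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_HOCR):
        print(word["text"], word["bbox"], word["confidence"])
```

The bounding box is what lets a searchable PDF place invisible text over the right spot on the scanned image, and the confidence value is what the program can show you when you review the result.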

To set the recognition to hOCR, PDF mode, go to the toolbar at the top. It includes a section for “OCR mode” with a dropdown menu. From there, click the dropdown and select hOCR, PDF:

gImageReader Toolbar

This is the toolbar for gImageReader. You can set OCR mode by using the dropdown that is the third option from the right.

Adding Images, Performing Recognition, and Setting Language

If you have images already scanned, you can add them to be recognized by clicking the Add Images button on the left panel, which looks like a folder. You can then select multiple images if you want to create a multipage PDF. You can always add more images later by clicking that folder button again.

On that left panel, you can also click the Acquire tab button, which allows you to get images directly from a scanner, if the computer you’re using has a scanner connected.

Once you have the images you want, click the Recognize button to recognize the text on the page. Please note that if you have multiple images added, you’ll need to click this button for every page.

If you want to perform recognition on a language other than English, click the arrow next to Recognize. You’ll need to have that language installed, but you can install additional languages by clicking “Manage Languages” in the dropdown that appears. If the language is already installed, you can go to the first option listed in the dropdown to select a different language.

Viewing the OCR Result

In this example, I will be performing OCR on this letter by Franklin D. Roosevelt:

Raw scanned image of a typewritten letter signed by Franklin Roosevelt

This 1928 letter from Franklin D. Roosevelt to D. H. Mudge Sr. is courtesy of Madison Historical: The Online Encyclopedia and Digital Archive for Madison County Illinois. https://madison-historical.siue.edu/archive/items/show/819

Once you’ve performed OCR, there will be an output panel on the right. There are a series of buttons above the result. Click the button on the far right to view the text result overlaid on top of the image:

The text result of performing OCR on the FDR letter overlaid on the original scan.

Here is the text overlaid on an image of the original scan. Note how the scan is slightly transparent now to make the text easier to read.

Correcting OCR

The OCR process did a pretty good job with this example, but there are a handful of errors. You can click on any of the words of text to show them on the right panel. I will click on the “eclnowledgment” at the end of the letter to correct it. It will then jump to that part of the hOCR “tree” on the right:

hOCR tree in gImageReader, which shows the recognition result of each word in a tree-like structure.

The hOCR tree in gImageReader, which also shows OCR result.

Note that in this screenshot I have clicked the second button from the right to show the confidence values, where the higher the number, the higher the confidence Tesseract has in the result. In this case, it is 67% sure that eclnowledgement is correct. Since it obviously isn’t correct, we can enter new text by double-clicking on the word in this panel and typing “acknowledgement.” You can do this for any errors on the page.
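If you ever work with confidence values outside the program, the same idea can be sketched in a few lines of Python: list the words Tesseract was least sure about so you know where to look first. The function name, the threshold of 80, and the sample scores other than the 67 above are assumptions for illustration, not recommended settings.

```python
def words_to_review(words, threshold=80.0):
    """Return (text, confidence) pairs below the threshold, least confident first."""
    flagged = [(text, confidence) for text, confidence in words if confidence < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up scores, apart from the 67 reported in the example above.
    sample = [("your", 78.5), ("eclnowledgment", 67.0), ("of", 96.0)]
    for text, confidence in words_to_review(sample):
        print(f"{confidence:5.1f}  {text}")
```

A list like this is only a starting point: a word can be wrong even when its confidence is high, so it is still worth reading through the whole page.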

Other correction tips:

  1. If there are any regions that are not text that it is still recognizing, you can right click them on the right and delete them.
  2. You can change the recognized font and its size by going to the bottom area labeled “Properties.” Font size is controlled by the x_fsize field, and x_font has a dropdown where you can select a font.
  3. It is also possible to change the area of the blue word box once it is selected, simply by clicking and dragging the edges and corners.
  4. If there is an area of text that was not captured by the recognition, you can also right click in the hOCR “tree” to add text blocks, paragraphs, textlines, and words to the document. This allows you to draw a box on image and then type what the text says.

Exporting to PDF

Once you are done making OCR corrections, you can export to a searchable PDF. To do so, click the Export button above the hOCR “tree,” which is the third button from the left. Then, select export to PDF. It then gives you several options to set the compression and quality of the PDF image, and once you click OK, it should export the PDF.

Conclusion

Unfortunately, there are some limitations to gImageReader, as can often be the case with free, open source software. Here are some potential problems you may have with this program:

  1. While you can add new areas to recognize with OCR, there is not a way to change the order of these elements inside the hOCR “tree,” which could be an issue if you are trying to make the reading order clear for accessibility reasons. One potential workaround could be to use the Reading Order options on Adobe Acrobat, which you can read about in this libguide.
  2. You cannot show the areas of the document that are in a recognition box unless you click on a word, unlike ABBYY FineReader which shows all recognition areas at once on the original image.
  3. You cannot perform recognition on all pages at once. You have to click the recognition button individually for each page.
  4. Though there are some image correction options to improve OCR, such as brightness, contrast, and rotation, it does not have as many options as ABBYY FineReader.
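If the third limitation is a dealbreaker, one possible workaround outside gImageReader is to call Tesseract itself once per image from a short script. The sketch below is in Python and assumes the tesseract executable is installed and on your PATH. The function names and the folder name scans are hypothetical. Note that this route skips the correction step entirely.

```python
import subprocess
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_pages(folder):
    """Return the image files in a folder, sorted by name.

    Names sort as text, so zero-pad page numbers (page01, page02, ...) to keep order.
    """
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)


def recognize_all(folder, language="eng", runner=subprocess.run):
    """Run Tesseract once per page and return the hOCR paths it should write."""
    outputs = []
    for page in find_pages(folder):
        # Tesseract adds the .hocr extension to the output base on its own.
        output_base = page.with_suffix("")
        runner(["tesseract", str(page), str(output_base), "-l", language, "hocr"], check=True)
        outputs.append(page.with_suffix(".hocr"))
    return outputs


if __name__ == "__main__":
    # Hypothetical folder of scanned pages.
    for path in recognize_all("scans"):
        print("wrote", path)
```

The runner parameter exists only so the loop can be tried out without Tesseract installed, by passing a stand-in function in place of subprocess.run.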

gImageReader is not nearly as user friendly as ABBYY FineReader and does not have all of its features, so you will probably want to use ABBYY if it is available to you. However, I find gImageReader a pretty good program that can meet most general OCR needs.

Scholarly Commons Software: Open Source Alternatives

Hello from home to all my fellow (new) work-from-homers!

In light of measures taken to protect public health, it can feel as though our work schedules have been shaken up. However, we are here to help you get back on track and the first thing to do is make sure you have all the tools necessary to be successful at home.

Featured Resource: QGIS, a Free, Open Source Mapping Platform

This week, geographers around the globe took some time to celebrate the software that allows them to analyze, well, that very same globe. November 13th marked the 20th annual GIS Day, an “international celebration of geographic information systems,” as the official GIS Day website puts it.

the words "GIS day" in a stylized font appear below a graphic of a globe with features including buildings, trees, and water

But while GIS technology has revolutionized the way we analyze and visualize maps over the past two decades, the high cost of ArcGIS products, long recognized as the gold standard for cartographic analysis tools, is enough to deter many people from using them. At the University of Illinois and other colleges and universities, access to ArcGIS can be taken for granted, but many of us will not remain in the academic world forever. Luckily, there’s a high-quality alternative to ArcGIS for those who want the benefits of mapping software without the price tag!

the QGIS logo

QGIS is a free, open source mapping software that has most of the same functionality as ArcGIS. While some more advanced features included in ArcGIS do not have analogues in QGIS, developers are continually updating the software and new features are always being added. As it stands now, though, QGIS includes everything that the casual GIS practitioner could want, along with almost everything more advanced users need.

As is often the case with open source software alternatives, QGIS has a large, vibrant community of supporters, and its developers have put together tons of documentation on how to use the program, such as this user guide. Generally speaking, if you have any experience with ArcGIS it’s very easy to learn QGIS—for a picture of the learning curve, think somewhere along the lines of switching from Microsoft Word to Google Docs. And if you don’t have experience, the community is there to help! There are many guides to getting started, including the one listed in the above link, and more forum posts of users working through questions together than anyone could read in a lifetime. 

For more help, stop by to take a look at one of the QGIS guidebooks in our reference collection, or send us an email at sc@library.illinois.edu!

Have you made an interesting map in QGIS? Send us pictures of your creations on Twitter @ScholCommons!

 

Review: Docear

We’ve talked about Docear the Visual Citation Manager on the blog before, before my time, but it’s been a while, so we’ll revisit it. The most recent major update to the software was in 2015, and based on the forums it seems that Docear has struggled to find funding. However, the researchers behind this project are still active. And in the worst case scenario, Docear is an open source project, so if things went south, you could still get your information out. If you are considering relying on this software to organize very long-term research projects, you will need to use an external cloud backup service, as their My Docear service is no longer available or supported, if it ever existed at all.

Docear

Screenshot of Docear demo mindmap

Docear paper demo mindmap showing linked annotated PDF

Docear is an open source mind mapping, reference, and citation management software for those who want a visual way to keep their research organized. It is available for Windows, Mac, and Linux computers. Docear provides plenty of support and useful instructions through their official user manual. The examples in the app itself for trying out the mind map and PDF capability incorporate some of the research behind the product itself and make for an informative, if somewhat meta, experience. Docear staff like to compare the software to Zotero and Mendeley, but it’s a very different type of beast. Specifically, it is a combination of JabRef (without the OpenOffice support) and Freeplane for mind maps, and, depending on what type of PDF viewer you use, a document annotation software. To enjoy the full capability of this software you also have to download PDF X-change viewer, though you can still do some annotating with other less supported PDF editors. Docear also uses Mr. DLib, or Machine-readable digital library cataloging. While Mr. DLib has not really caught on elsewhere, it is featured as part of JabRef and specifically powers the article recommendation function. If they ever get their funding together, Docear could become a space where you can research, organize, and write an article. And unlike some of the software options discussed on this blog and in our LibGuides, you can download Docear from a zip file and run it to full capacity on Scholarly Commons computers.

Although Docear is not quite the all-encompassing research suite the creators envisioned, there are still lots of funky little features not found in other services. For example, in the Tools and Settings tab you can add map locations with OpenMaps (unfortunately there is no search function, so you have to zoom and select your location) to add a geographic component to your otherwise mental map, which you can see by clicking on “View Open Maps Location” later.

Screenshot of Docear Open Maps features

You can also add time alerts for time management in Tools and Settings. But before we get ahead of ourselves, it’s easy to add a node with keyboard shortcuts and the node panel in the toolbar. You can add links to websites and other nodes right in your mind map by right clicking on a node. Apparently, you can add formulas to your mind map using LaTeX, but I didn’t try it, as I am not one of the people who cares about that sort of thing.

You have the option of writing in Docear itself, and there is also a plugin for MS Word, but only on Windows. On the one hand, the plugin is old and hasn’t been updated in a few years, and it doesn’t work on the computers at Scholarly Commons. But on the other hand, since it’s based in BibTeX, if it actually does work the way they say it does, you should be able to use it with any BibTeX bibliography, and not just Docear. This means it could give you that MS Word integration that you might be lacking with another reference manager.

Overall, if you want a reference manager and document annotator that is easy to get started on, this is NOT the one for you, but for those patient enough to deal with the learning curve, Docear can be a good addition to your research strategy. I really hope this project gets the funding it needs to fully live up to its potential, but for now it’s still a solid option for researchers looking for a unique way to organize their work.

Choosing GIMP as a Photoshop Alternative

The GIMP logo.

Image manipulation is a handy skill, but sinking time and money into Adobe Photoshop may not be an option for some people. If you’re looking for an alternative to Photoshop, GIMP is a great bet. Available for almost every operating system, GIMP is open source and free with lots of customization and third party plugin options.

One major thing you lose when moving from Photoshop to GIMP is a large community and widespread knowledge of the software. While GIMP has its dedicated loyalists and a staff, they lack the same kind of institutional power that Adobe has to answer questions, fix bugs, and provide support. While Lynda.com does provide tutorials on GIMP, there are fewer tutorials and help resources overall than there are for Photoshop.

That being said, GIMP can still be a more powerful tool than Photoshop, especially if you have a programming background (or can convince someone else to do some programming for you). Theoretically, you could add or subtract any features that you so choose by changing the GIMP source code, and you are free to distribute a version of GIMP with those changes to whomever you choose.

There are a number of pros/cons for choosing GIMP over Photoshop, so here’s a handy list.

GIMP Pros:

  • Free
  • Highly customizable and flexible (with coding expertise)
  • Motivated user community run by volunteers
  • High usability
  • Easier to contact leadership regarding issues

GIMP Cons:

  • Less recognized
  • Changes are more slowly implemented
  • No promise that the software will always be maintained in perpetuity

Of course, there are more pros and cons to using GIMP, but this list will give you a basic idea of what to expect when switching over to this open-source software.

For more information on GIMP, you can check out the GIMP Wiki, which is maintained by GIMP developers, or The GTK+ Project, which is a toolkit for the creation of graphical user interfaces (GUI). GIMP also provides a series of Tutorials. If you’re still loyal to Adobe, you can look at the Adobe products available on the UIUC WebStore, as well as tutorials on Lynda.com.

Do you have opinions on GIMP vs. Photoshop? Let us know in the comments! And stop by the Scholarly Commons, where you can use either (or both!) software for free.

Spotlight on DiRT Directory: Digital Research Tools

The DiRT logo.

As a researcher, it can sometimes be frustrating knowing that someone out there has created a useful tool that will help you with what you’re working on, but being unable to find it. Google searches prove fruitless, and your network of friends doesn’t necessarily know what you’re talking about. In that moment of panic and frustration, you may just need to get a little DiRT-y.

DiRT Directory: Digital Research Tools is a directory of research tools for scholarly use. Using TaDiRAH (the Taxonomy of Digital Research Activities in the Humanities), DiRT breaks down the stages of a research project, and groups tools that are relevant to each stage: Capture, Creation, Enrichment, Analysis, Interpretation, Storage, and Dissemination. Users can search for tools using these categories (broken down into subcategories whose specificity helps to narrow down the many tools found in the DiRT Directory), through a search box, or by tag. Personally, I feel that searching through the TaDiRAH categories not only allows you to find relevant tools but also allows you to explore options that you may not have previously thought of as being available, making it the most fruitful way to browse tools.

One nice aspect of DiRT is its search platform. After you choose your category, you have the option to search within the category for these criteria: Platform, Cost, Exclude, License, and Research Objects, as well as sort order. For researchers concerned with cost, this tool is especially useful, as you can limit your search to what is in your budget.

After you complete your search, you are offered a list of different tools. Tools range from well-known sources, like Google Docs, to things you have probably never heard of before. Each source includes a description, outlining what kind of tool it is — online, software, etc. — what its capabilities are, and in many cases, a note on its past or future development. Each entry also includes a link to the tool’s website, their license, and the date of DiRT’s most recent update on the source information.

An example tool entry on DiRT for Scrivener writing software.

An example tool entry on DiRT for Scrivener writing software on the search page.

Finally, each tool has its own page that you can access from the search function. This page holds a wealth of information, including an expanded description that outlines the nitty gritty aspects of the tool — from platforms to cost bracket to tags. It also includes screenshots of the tool in action, a list of recent edits to the page, and a comments section. However, not all tools have the same level of detail in their pages.

Scrivener’s page, which includes a description, screenshots, a list of contributors, and a comments section.

While the selection presented on DiRT can be almost overwhelming, digging through DiRT can help you find the perfect tools for your project.

If you still can’t find what you want in DiRT Directory, or need some guidance in what to search for in the first place, stop by the Scholarly Commons, located in Main Library Room 306, open from 9am-6pm on weekdays. Or, email us! We are always happy to help you with your research needs.