I recently ran into a need to turn several book pages, photographed with a smartphone, into a text-searchable format. That meant running text recognition (OCR) on the JPG files, which usually results in a PDF file that looks the same but can be searched. In this case the files are far from optimal quality: the lighting is not uniform and the pages are bent, so they make an excellent “worst case” test. The text is still easily readable by the human eye.
I remembered from the past that Adobe has such a product, but I also know that Adobe’s products aren’t always the best, even though some of them are considered good. I found a web page reviewing nine OCR applications, updated in 2016, so I began downloading trial versions of them to find a good one. I tested the programs in the same order as the review but drew my own conclusions. I also tried some free and open source options.
1. Omnipage Standard
This was the best in the review. I tried to obtain a trial version using two different e-mail addresses, but never received an e-mail with a download link. I also checked that the messages had not been caught as spam.
2. Adobe Acrobat
The download took a long time, but the product basically worked. Some pages had slightly darker areas where OCR didn’t work, even though they were not THAT dark. Most of the text was searchable, which was better than nothing, but I expected more from such a big company name and the highest price tag. Changing the settings did not help and only made the recognition worse, so the defaults were pretty good.
3. ABBYY FineReader
The download was fast, but the installer left a 600MB temporary folder lying around, which I deleted myself after making sure it was no longer needed. The software makes use of all four CPU cores by recognizing four pages at a time, consuming 100% of CPU time. Adobe did only one page at a time, and its CPU usage was close to 25% (one of four cores in use).
ABBYY FineReader was also able to recognize even the darkest areas, so the result was “perfect” and you could not expect more from the recognition quality. That is why I consider this to be the winner! It can, however, save only 3 pages at a time in trial mode. It is also on the cheaper side, at less than half the price of Adobe Acrobat.
4. ReadIris Pro
This application performed the worst so far with the cell phone photographed document, producing only gibberish, and the settings were quite limited. It’s one of the cheapest.
5. Nuance Power PDF Advanced
So far the largest download, over 850MB in size. I wondered why a text recognition algorithm, plus whatever data it needs, would have to be this large. It appears to be because it bundles a huge collection of integrations for many office suites and business e-mail applications. It also has text-to-speech for a few languages. I did not want those, and it is possible to uncheck them in the installer. The installer is also the first so far that wants you to restart your computer just for installing a new application. The application seems to start up fine without a restart anyway. It uses only one core while processing multiple pages. The program performs amazingly well and is a top choice so far.
6. ABBYY PDF Transformer+
Installing was easy. It uses a single thread. This seems to be a cheaper version of ABBYY FineReader, which was very good at OCR. This version is just as good, and it is actually able to save more than 3 pages while in trial!
7. Soda PDF
A nice installer with good design. The application requires a free registration in order to create PDF documents, which takes additional time. This program, again, uses only one core at a time, even though the process consumes only 200MB of memory. I wonder why it is so difficult to run the recognition on several threads, one page per thread, which would make the total OCR duration significantly shorter. This application is the slowest so far, but at least the OCR runs unattended, so you can go for an extended coffee break when processing over 50 pages on a modern computer. The OCR operation crashed when it was about 50% done, and I did not try again because at this point I had already found two good applications.
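To illustrate what I mean by recognizing several pages at the same time, here is a minimal sketch in Python. It is not how any of the reviewed products work internally; the function names and the page file names are made up for illustration, and `recognize` stands for whatever per-page OCR call you have. Threads are enough when each call starts an external OCR process; for recognition code written in pure Python, a process pool would be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def ocr_pages_in_parallel(pages, recognize, max_workers=None):
    """Run recognize(page) for every page, several pages at a time.

    `recognize` is a hypothetical per-page OCR callable supplied by the
    caller. Results are returned in the same order as `pages`.
    """
    if max_workers is None:
        # Default to one worker per CPU core reported by the system.
        max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the input order even if pages finish out of order.
        return list(pool.map(recognize, pages))


def fake_recognize(page):
    # Stand-in for a real OCR call so the sketch runs on its own.
    return f"text of {page}"


if __name__ == "__main__":
    pages = [f"page{n}.jpg" for n in range(1, 5)]  # hypothetical file names
    for text in ocr_pages_in_parallel(pages, fake_recognize, max_workers=4):
        print(text)
```

With four workers on a four-core machine this corresponds to the "four pages at a time" behaviour I saw in FineReader, assuming the pages can be processed independently of each other.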
8. Presto PageManager
After installing, trying to launch the application opened a web registration form right away. The form had multiple pages with a lot of questions, which seemed a bit odd, and the layout reminded me of the earlier days of the Internet. I filled out the form to the best of my ability and submitted it. Nothing happened. I tried to restart the application, but again nothing happened, and it no longer opened any web pages either. It was just totally dead. I guess it might not be Windows 10 compatible, which would not be surprising given the outdated look of the web form. I wish there had been an error message instead of nothing at all. I am not able to review this one.
9. PaperPort Professional
As with Omnipage Standard, I never got an e-mail when I requested a trial, not even in the spam folder. I wonder if these companies have something wrong with their e-mail systems or if they are actually using manual labor to deliver the trials. In any case, it is taking far too long, so I cannot review this one either.
Free or open source applications
It seems that many of the free or open source OCR applications do the actual recognition with the same libraries, Tesseract and GOCR. Tesseract was originally developed at HP and later released as open source, after which Google took over sponsoring its development. There are many front ends available for Tesseract, some promoted and designed like the commercial applications, others more like typical open source projects in style. I tried one of them and compared the results to running the same photo through the OCR in Google Docs on Google’s cloud service. Both turned out rather poorly, producing gibberish with a few recognizable words. With both Google Docs and the local Tesseract front end, the result got slightly better, by about the same amount, when I manually fine-tuned the photo in an image editor before processing. Because of that, and because of Google’s involvement with Tesseract, I think Google Docs is internally using Tesseract as well, even though this is not clearly advertised.
Tesseract is unfortunately not an option for photographed material at the moment, unlike a few of the commercial applications, but it is totally free and open. GOCR seemed to be no better than Tesseract. You might want to seriously give these a go, though, if your source material is finely scanned paper instead of a quickly photographed book page! These are open source solutions, which is always a good thing, especially if they work well enough.
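If you want to skip the front ends and try Tesseract directly, a minimal sketch in Python could look like the following. It assumes the `tesseract` command-line program is installed and on your PATH, and it uses the `pdf` output option, which writes a PDF with the page image and a searchable text layer. The function names and file names are my own illustration, not part of Tesseract.

```python
import shutil
import subprocess


def tesseract_pdf_command(image_path, output_base, language="eng"):
    """Build the command line: tesseract IMAGE OUTPUTBASE -l LANG pdf"""
    return ["tesseract", image_path, output_base, "-l", language, "pdf"]


def make_searchable_pdf(image_path, output_base, language="eng"):
    """Run Tesseract on one image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(
        tesseract_pdf_command(image_path, output_base, language), check=True
    )
    # Tesseract appends the .pdf extension to the output base name itself.
    return output_base + ".pdf"


if __name__ == "__main__":
    # page1.jpg is a hypothetical file name.
    print(" ".join(tesseract_pdf_command("page1.jpg", "page1")))
```

Calling `make_searchable_pdf("page1.jpg", "page1")` would then produce `page1.pdf`, with the recognition quality caveats for photographed pages described above.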
ABBYY’s FineReader and PDF Transformer are both great choices, and Nuance Power PDF deserves to join them as an equal third. It’s difficult to say which of the three is best, because they all performed OCR excellently in poor conditions. Instead, the differences lie in points other than technical OCR ability: Nuance Power PDF is fully functional as a trial version, but FineReader converts only three pages at a time. PDF Transformer from the same company did not have that limitation and is also the cheapest, so I recommend trying out PDF Transformer and then purchasing it if you want to save some money! Technically, all three are good products. Also, if you have problems while trialing any of them, it is good to have two other options.
If your source material is very clean, well scanned and cropped paper in mint quality, you might want to seriously give a go to the Google Docs cloud OCR or one of the many free front end applications that use the open source libraries. They are open and free and might be just good enough!
The commercial programs don’t have any Linux versions, even as closed source. It is also very difficult to find the free trial versions on most of the company or product web pages, so I recommend typing the name of the application and the word “trial” into Google search to easily find the right page in the mess.
A common hint for using the applications: they usually don’t open a bunch of JPG photos and perform OCR directly. Instead, you first have to combine the photos into a single PDF file with a function in the application, and only then can you choose to do the OCR. It is sometimes possible to replace parts of the page with the recognized text, but for a more beautiful and consistent look it is better to keep the photos intact for the reader and just add an overlaid, searchable text layer.
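One practical detail of that combining step is page order: camera file names often sort badly as plain text, with page10 ending up before page2. Here is a minimal sketch in Python that sorts the photos in natural order and writes the list to a text file. The file names and the helper names `natural_key` and `write_page_list` are hypothetical.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares digit runs as numbers, so page2 < page10."""
    # re.split with a capture group alternates text and digit chunks.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def write_page_list(image_names, list_path):
    """Write the image paths, one per line, in natural page order."""
    ordered = sorted(image_names, key=natural_key)
    Path(list_path).write_text("\n".join(ordered) + "\n", encoding="utf-8")
    return ordered


if __name__ == "__main__":
    photos = ["page10.jpg", "page2.jpg", "page1.jpg"]  # hypothetical names
    print(sorted(photos, key=natural_key))
```

The list is handy for checking the order before importing the photos into an application. As far as I know, the Tesseract command line can also take such a text file of image paths in place of a single image and produce one multi-page result, but treat that as an assumption and check it against your version.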