12/10/10

Does Google Docs OCR work?



Google announced in June that you can perform OCR (optical character recognition) on files when you upload them to Google Docs: just check the box on upload.

Does Google Docs OCR work? Yes, but we recommend it only with serious qualifications.

We have now tried the feature and have a brief report. Our test used 22 pages scanned from a book into PDF. Each scan showed two book pages side by side: one in Hebrew, apart from some English footnotes at the bottom, and the other in English.

We chose to upload the pages individually, a single page in each PDF file. Google imposes a 10-page limit, so you cannot just upload a large book and have Google run OCR on it.

Our test pages were admittedly more complex than average. The results were acceptable but not great.

The original scan was replicated as an image at the top of each resulting page in Google Docs. Just above that, Google inserted the disclaimer, "This document contains text automatically extracted from a PDF or image file. Formatting may have been lost and not all text may have been recognized."

The text that was recognized appeared below the image. Not surprisingly, none of the Hebrew text was recognized. More disconcerting, in the English blocks whole lines were skipped, in no apparent pattern, about 5-10% of the time, and individual words were skipped about 5% of the time. Some paragraphing was preserved, but the rest of the formatting, including bold and italics, was gone.

To begin with, we had easy access to a copier with a feeder that scanned the 22 pages and mailed them to us in PDF format, so that part of the process was not onerous. Our investment of time and effort in getting the pages scanned out of the book was not immense. Still, the question is: did using this facility result in any net gain in time or effort for us?

We had to go over all of the text and edit it with some care, comparing it against the original. Could we have saved time by just sitting down and brute force typing in the text? For this sample, we think the answer is yes.
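For anyone who wants to put numbers on a comparison like ours, here is a minimal sketch in Python. It is not how we arrived at the figures above, which were estimates made while proofreading. It assumes you have a correct, hand-typed transcript of a sample page to compare against the OCR text, and the function names `missing_word_rate` and `missing_lines` are our own invention for illustration.

```python
import difflib


def missing_word_rate(reference: str, ocr_output: str) -> float:
    """Return the fraction of reference words with no aligned match in the OCR text."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        return 0.0
    # Align the two word sequences; autojunk is disabled so that common
    # words are not discarded as "junk" on longer pages.
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / len(ref_words)


def missing_lines(reference: str, ocr_output: str) -> list:
    """Return the reference lines that never appear verbatim in the OCR text."""
    ref_lines = [ln.strip() for ln in reference.splitlines() if ln.strip()]
    ocr_lines = {ln.strip() for ln in ocr_output.splitlines() if ln.strip()}
    return [ln for ln in ref_lines if ln not in ocr_lines]


if __name__ == "__main__":
    # Made-up sample text, not taken from our scanned pages.
    reference = "the quick brown fox\njumps over the lazy dog\nand runs away"
    ocr_output = "the quick brown fox\nand runs away"
    print(f"missing words: {missing_word_rate(reference, ocr_output):.0%}")
    print("missing lines:", missing_lines(reference, ocr_output))
```

The line check is strict: a line with a single misread word counts as missing, so it overstates skipped lines when the OCR garbles text rather than dropping it. The word-level rate is the more forgiving of the two measures.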

A larger question comes to mind, based on this small experiment. If Google uses the same technology for its own Google Books scanning that it makes available to us end users, then we are missing lots of text when we search the scanned Google Books. That's not good.

1 comment:

traintalk said...

See more about this here:
http://www.google.com/support/forum/p/Google%20Docs/thread?tid=41a2b66e05570680&hl=en

When I checked this, I found that searching for the OCR'd error also returns zero hits. However, I found better search results in general when searching target PDFs within Google Docs than when using the universal Google search engine, so those who have download rights might experiment with uploading a favorite Google Book or two to Google Docs and seeing whether there is an improvement.