Saturday, November 01, 2008

Convert Scanned PDF Documents to Text with Google OCR

Google has started to index scanned PDF files on the web, using a technique called Optical Character Recognition (OCR) to extract their text.

Digital Inspiration has a tip on how to get Google to convert your scanned PDF files into recognizable text without having to use third-party OCR software.
  1. Create a folder on your website (say abc.com/pdf) and upload all the scanned PDF files to that folder. Now create a public web page that links to all the PDF files. Wait for the Google bots to crawl the page and the files.
  2. Once they have been indexed, search for "site:abc.com/pdf filetype:pdf" to view the PDF documents as HTML.
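If you have many files, the link page from step 1 and the query from step 2 can be generated with a few lines of Python. This is only a rough sketch: the function names and the sample file names are made up for illustration, and it assumes the page will be uploaded into the same folder as the PDFs so that relative links work.

```python
from html import escape
from urllib.parse import quote


def build_index_page(pdf_names, title="Scanned PDFs"):
    """Return a simple HTML page with one relative link per PDF file name."""
    items = "\n".join(
        # quote() percent-encodes spaces etc. in the URL; escape() protects the visible text
        f'    <li><a href="{quote(name)}">{escape(name)}</a></li>'
        for name in sorted(pdf_names)
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head><title>{escape(title)}</title></head>\n"
        f"<body>\n  <ul>\n{items}\n  </ul>\n</body>\n</html>\n"
    )


def build_search_query(site_path):
    """Return the Google query from step 2 for the given site path."""
    return f"site:{site_path} filetype:pdf"


if __name__ == "__main__":
    # Hypothetical file names; replace them with the PDFs you actually uploaded.
    print(build_index_page(["invoice-2008.pdf", "scan 01.pdf"]))
    print(build_search_query("abc.com/pdf"))
```

Save the generated HTML as a page inside the folder (for example abc.com/pdf/index.html), make sure it is publicly reachable, and paste the printed query into Google once the files have been crawled.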
Convert Scanned PDF Documents to Text with Google OCR [via]
