Converting scanned documents to editable text

One of the things people most often ask me to do is convert a scanned document into an editable text document in a format such as Microsoft Word. In this post I will go through a simple process for getting this done that you can follow yourself without spending any money.

When you scan a document, the default output is an image of the scanned page. Some of the more recent scanners let you change the settings so that the scanner reads the document straight into a Word or Excel file. But sometimes you don’t have that option. What you need is a way to convert that scanned image into a text file, and the technology that does this is called “Optical Character Recognition”, or OCR for short.

The OCR program will take your scanned image and attempt to read it, giving you its best estimate of the original in text format. You then save the output and check it against the original for accuracy.
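
The checking step can be partly automated. The sketch below is a minimal Python illustration and is not part of any of the OCR programs discussed in this post. It assumes you have the OCR output as plain text and a short passage typed by hand from the paper original, and it uses the standard library's difflib module to estimate how closely the two match. The function names and sample strings are made up for illustration.

```python
import difflib


def similarity(ocr_text, reference_text):
    """Return a score from 0.0 to 1.0 for how closely two texts match."""
    # Collapse runs of whitespace so that line breaks added or removed
    # by the OCR program are not counted as errors.
    a = " ".join(ocr_text.split())
    b = " ".join(reference_text.split())
    return difflib.SequenceMatcher(None, a, b).ratio()


def mismatches(ocr_text, reference_text):
    """List (reference words, OCR words) pairs where the texts disagree."""
    ref_words = reference_text.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words)
    return [
        (" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    reference = "The quick brown fox"   # typed by hand from the paper original
    ocr_output = "The qu1ck brown fox"  # made-up example of an OCR misread
    print(round(similarity(ocr_output, reference), 2))
    print(mismatches(ocr_output, reference))
```

A score close to 1.0 on a sample passage suggests the rest of the document is probably in good shape, while the list of mismatches shows which words to look at first. It is only a spot check and does not replace reading the output against the original.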

Most OCR programs work pretty well, and the accuracy of the output, in my experience, depends on two main factors:

  1. The clarity of the scanned image – as long as the scanned image is clear and of a high resolution, the output will be good. The higher the resolution, the better the match.
  2. Font size – the other factor that comes into play is the size of the font. The larger the font in your image, the higher the accuracy will be.
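
Both factors come down to how many pixels each character gets in the scanned image. As a rough illustration, not taken from any OCR program's documentation, the Python sketch below uses the fact that one typographic point is 1/72 of an inch to estimate how tall text of a given font size is, in pixels, at a given scan resolution. The function name is hypothetical and the example values are arbitrary.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def text_height_px(font_size_pt, scan_dpi):
    """Approximate pixel height that a given font size occupies in a scan."""
    # Font size is a nominal height; individual letters are somewhat smaller.
    return font_size_pt * scan_dpi / POINTS_PER_INCH


if __name__ == "__main__":
    for font_size_pt, scan_dpi in [(8, 150), (12, 150), (12, 300)]:
        height = text_height_px(font_size_pt, scan_dpi)
        print(f"{font_size_pt} pt at {scan_dpi} dpi: about {height:.0f} px tall")
```

Doubling the scan resolution doubles the pixel height of every character, which is the same gain you would get from a document printed in a font twice the size.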

There are several different OCR programs. I will list a few for you:

  1. SimpleOCR – One of the ones I have used before and would recommend for simple jobs is SimpleOCR. This is a free program with a good OCR engine.
  2. ABBYY FineReader 8.0 Professional – This is my personal favorite, and the one that I use most often. ABBYY FineReader has a powerful OCR engine, and a lot of features and options to work with different files. I find it to have very good accuracy even with documents that aren’t very clear. It’s also good with PDF files.
  3. Readiris Pro 11 – This is yet another popular OCR program which I recommend. It also has great features.
  4. Other popular options that I have not used include ScanSoft OmniPage 15 OCR and TextBridge Pro 11.

A glance at these programs will show you that most of them are somewhat costly to purchase (SimpleOCR is the free exception), so you would need to decide whether you do enough conversion to justify the cost.

And remember, if you just want someone to do the task for you, contact us. It’s what we do.

12 Comments

  1. arnie  •  September 18, 2007 @12:50 pm

    Thanks for the useful info. This will come in handy!! Can you tell me how to convert an ordinary pdf file to word? The document I have doesn’t allow me to copy and paste. I googled pdf convertors but nothing useful came up.

    thanks

  2. admin  •  September 19, 2007 @5:01 pm

    arnie,
    Thanks for your comment. What you are asking is actually not that difficult to do. If you have the complete Adobe Acrobat program installed, and the PDF file you have was created with Adobe software, then it would simply be a matter of opening the file, doing a “save as” operation, and choosing the format you want from the options offered. Bear in mind, though, that you can’t use the free Adobe Reader for this. You need to have the full Adobe Acrobat program installed.

    If you don’t have the Adobe Acrobat program and are not interested in buying it, then your second option is to get a converter program such as pdf2word. There are many of them out there, just search pdf to word on Google and you’ll see many results. A lot of them are shareware, so you have a limited trial edition that you can use and then buy if you decide to keep it. I’m not sure what kind of limitations you face with these programs, but make sure you get it from a trusted source, and have your antivirus and trojan detectors running before you install freeware or shareware.

    Your third option, and the one that I use most of the time when I am on a computer other than the one I have Adobe Acrobat installed on, is to use OCR, which I discussed above. I have used SimpleOCR for a long time and I trust the source, so I don’t hesitate to install it on any computer that I am using.

    There is another option that I just can’t think of right at this moment, but if it comes to me I will add it here.

    Good luck to you!

  3. Chris  •  November 4, 2007 @7:08 am

    Hello,
    I wonder if you might be able to help with this one. When I OCR a doc into Word 2003, the File Conversion window asks to Select Encoding to Make the Document Readable. I cannot find any selection that makes the doc readable. Any ideas? Cheers, Chris.

  4. admin  •  November 4, 2007 @9:26 am

    Chris,
    Thanks for your question. What OCR program are you using to do the conversion? Can you walk me through the steps of what you’re doing? The fact that you’re getting this message may suggest that the file has not been converted to text by the OCR program.

    Just a hint, when you’re doing OCR, and you select portions of your document, there is usually the option to designate tables as tables, images as images, and text as text, etc., so that the OCR program doesn’t try to convert a picture to text, for example.

    Tell me what program you’re using and we’ll try and solve it from there.

  5. grace  •  December 20, 2007 @5:18 pm

    I have the Adobe Pro version, which has the OCR operation. I had to convert a scanned PDF doc to Word, so of course the entire doc becomes an image. So I went through the entire OCR operation, then went to Save As so I could save to Word. But I get the error ‘Save as Failed to process document. No file was created’. What am I doing wrong?

  6. admin  •  December 20, 2007 @11:44 pm

    Grace,
    What version of Adobe Pro are you using?

    My personal preference is to not use Adobe for trying to OCR documents, especially scanned files. However, if you want to use Adobe, we can try to figure out why it’s not working for you. Depending on the version of Adobe Pro that you have you may need the Paper Capture plugin.

    If this is a one-time job and you don’t care how it gets done, what you probably want to do is take your PDF file and try to use one of the tools I mentioned above to open and recognize it, or contact us and we can do it for you.

    Let me know what version you’re using either way, and I’ll walk you through the OCR process for now and for future reference.

  7. Pradeep Reddy  •  April 4, 2008 @2:36 am

    Hi, I am new to this blog. Can you please suggest some algorithm or pseudocode so that I can write my own OCR engine?

  8. admin  •  April 4, 2008 @6:59 pm

    Pradeep,
    Thanks for your comment. I am not really sure how to program an OCR engine and have never tried it myself. An article that might help or at least start you in the right direction:
    http://www.codeproject.com/KB/dotnet/simple_ocr.aspx

  9. vidhi  •  December 14, 2008 @7:19 am

    hiii

    Is this OCR program free to download to convert a scanned document to an editable text document? What kind of OCR program should I actually download from Google?

    thanks !

  10. admin  •  December 14, 2008 @8:30 am

    Vidhi,
    The only free one to download on the list is SimpleOCR, available at http://www.simpleocr.com/

    Other free ones you might want to check out, which I have never tried, include GOCR – http://jocr.sourceforge.net/ and FreeOCR – http://softi.co.uk/freeocr.htm

    Good luck!

  11. Holly Leakey  •  May 23, 2010 @8:09 am

    Well, thank you lots, I have found this extremely good! :-P

  12. balu  •  April 3, 2011 @1:35 am

    Hi, I want to convert a GIF image to an editable Word document. Can you tell me how to do it and what software there is for it? If you know, please send the software name to my mail id balumgn@gmail.com
