
Document Conversion Service 3.0


OCR Images and Scanned PDF Files to Searchable PDF


Starting with Document Conversion Service 3.0.031, the Watch Folder Service now includes a sample folder, OCR to AdobePDF Watch Folder, that will use Optical Character Recognition (OCR) on scanned PDF files and images to create searchable PDF files.

Optical Character Recognition searches for and recognizes text (characters) on scanned pages or images and extracts it as digital text. Outside factors such as image quality, the font used, and any image background on the pages will all affect the accuracy of the OCR results.

The PDF files created using OCR consist of each page embedded as an image, with the page text as an invisible layer over the top of the image. The invisible text layer can be searched, and text content can be copied from the PDF if the document permissions allow it.

Optical Character Recognition can only be used when creating PDF files. OCR will increase the processing time for file conversion and is supported by the following converters:

Built-in PDF Converter

Built-in Image Converter


Caution

This feature is not supported on Microsoft® Windows Server 2008 R2 and Microsoft® Windows 7.

OCR Languages and Adding Additional Languages

When recognizing text, the OCR engine has to know which languages to look for on the page. OCR works by analyzing the patterns, shapes, and curves of the text characters on the page and matching them to predefined information for different characters in each language. It assigns a confidence score for each language, with the highest score determining the language chosen.

Document Conversion Service comes with files to support recognizing Arabic, English, French, German, Hebrew, Hindi, Italian, and Spanish.

To download individual language files, go to Tesseract Languages Code and Traineddata Files. That page also includes a table listing the language code and the traineddata file for each language. You can download complete sets of language files by going to Traineddata Files for Tesseract.

To add languages to Document Conversion Service, copy the desired *.traineddata files into the following folder:

%PROGRAMDATA%\PEERNET\Document Conversion Service\tessdata
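The copy step can also be scripted. The following is a minimal Python sketch, not part of Document Conversion Service: the function name copy_traineddata and the download folder in the usage note are hypothetical, and only the destination path comes from the text above. Note that os.path.expandvars expands the %PROGRAMDATA% form on Windows only.

```python
import os
import shutil
from pathlib import Path

# Destination folder as given above; %PROGRAMDATA% is expanded at run time.
TESSDATA_DIR = r"%PROGRAMDATA%\PEERNET\Document Conversion Service\tessdata"


def copy_traineddata(source_dir, dest_dir=None):
    """Copy every *.traineddata file from source_dir into dest_dir.

    Returns the sorted list of file names that were copied.
    copy_traineddata is a hypothetical helper, not a product API.
    """
    if dest_dir is None:
        dest_dir = os.path.expandvars(TESSDATA_DIR)
    dest = Path(dest_dir)
    if not dest.is_dir():
        # Do not create the folder: a missing folder usually means a wrong path.
        raise FileNotFoundError(f"tessdata folder not found: {dest}")
    copied = []
    for item in sorted(Path(source_dir).glob("*.traineddata")):
        shutil.copy2(item, dest / item.name)
        copied.append(item.name)
    return copied
```

For example, copy_traineddata(r"C:\Downloads\tessdata") would copy the downloaded files from that (hypothetical) folder into the default location. Writing under %PROGRAMDATA% may require an elevated prompt, depending on how the folder permissions are set.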

Enabling OCR and Page Selection

OCR is disabled by default and stays off until the options below are added to your watch folder definition. The sample watch folder already has these options set.

ConverterPlugIn.PNBuiltinsOCRPDF.Enabled

Set this to 1 to enable OCR, 0 to turn it off. Default value is 0.

ConverterPlugIn.PNBuiltinsOCRPDF.FirstPageOnly

Set this to 1 to OCR only the first page of each document. Set it to 0, or leave it unset, to OCR every page in the document. Default value is 0.

Setting Languages

The OCR engine needs to know which languages you want to try to recognize on the page. The more languages you list, the longer the OCR process will take, as it tries to match each character against every language listed.

ConverterPlugIn.PNBuiltinsOCRPDF.Languages

To run OCR on your text and look for multiple languages, list the language code for each language you want, separated by a plus sign. For example, the sample watch folder looks only for English, "eng". To look for English, French, and Spanish, you would use the string "eng+fra+spa". When this option is not supplied, the default is English only, "eng".
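As an illustration of how the three option values fit together, here is a minimal Python sketch that builds them from a list of language codes. The function ocr_options is hypothetical and not part of the product; only the option names and the plus-separated format come from the text above. How the values are written into a watch folder definition is not shown here.

```python
def ocr_options(languages=("eng",), first_page_only=False):
    """Return the OCR option names mapped to their string values.

    ocr_options is a hypothetical helper; the option names are the ones
    documented above.
    """
    if isinstance(languages, str):
        # Accept an already-joined string such as "eng+fra".
        languages = languages.split("+")
    # Trim whitespace, drop empty entries, remove duplicates, keep order.
    codes = list(dict.fromkeys(c.strip() for c in languages if c.strip()))
    if not codes:
        raise ValueError("at least one language code is required")
    return {
        "ConverterPlugIn.PNBuiltinsOCRPDF.Enabled": "1",
        "ConverterPlugIn.PNBuiltinsOCRPDF.FirstPageOnly": "1" if first_page_only else "0",
        "ConverterPlugIn.PNBuiltinsOCRPDF.Languages": "+".join(codes),
    }


if __name__ == "__main__":
    for name, value in ocr_options(["eng", "fra", "spa"]).items():
        print(f"{name} = {value}")
```

Keeping the list as short as the documents allow matters here, since every extra code adds to the OCR processing time.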

The language codes for the provided languages are as follows.

Language    Language Code
Arabic      ara
English     eng
French      fra
German      deu
Hebrew      heb
Hindi       hin
Italian     ita
Spanish     spa
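Before adding a code to the Languages string, it can help to confirm that the matching language file is present in the tessdata folder. The Python sketch below is illustrative only: missing_languages is a hypothetical helper, and it assumes each language file is named with its language code followed by .traineddata (for example, eng.traineddata), which is the usual Tesseract naming.

```python
from pathlib import Path


def missing_languages(languages, tessdata_dir):
    """Return the codes in a plus-separated string that have no language file.

    Assumes files are named <code>.traineddata inside tessdata_dir.
    missing_languages is a hypothetical helper, not a product API.
    """
    codes = [c.strip() for c in languages.split("+") if c.strip()]
    folder = Path(tessdata_dir)
    return [c for c in codes if not (folder / f"{c}.traineddata").is_file()]
```

An empty list means every requested code has a file in the folder; any codes returned would need their traineddata files copied in first.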