Document Conversion Service 3.0

Optical Character Recognition (OCR) with Document Conversion Service

Optical Character Recognition, or OCR for short, searches for and recognizes text (characters) on scanned pages or images and extracts it as digital text.

With this digital text, we can create searchable PDF files from images or PDF documents containing scanned pages. A searchable PDF is a PDF file in which you can select and copy the text on the pages and use the search function to look for specific words and phrases in the file.

When recognizing text, the OCR engine has to know which languages to look for on the page. OCR works by analyzing the patterns, shapes, and curves of the text characters on the page and matching them to predefined information for different characters in each language. It assigns a confidence score for each language, with the highest score determining the language chosen.
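The selection rule described above, where the highest confidence score decides the language, can be illustrated with a minimal sketch. This is not the OCR engine's actual code; the function name `pick_language` and the score values are hypothetical and exist only to show the rule.

```python
def pick_language(scores):
    """Return the language code with the highest confidence score.

    `scores` maps a language code to a confidence value. The values used
    below are invented for illustration, not produced by a real OCR run.
    """
    if not scores:
        raise ValueError("at least one language score is required")
    return max(scores, key=scores.get)


# Hypothetical scores for a page that is mostly English text.
example_scores = {"eng": 0.91, "fra": 0.64, "spa": 0.58}
print(pick_language(example_scores))  # eng
```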

External factors such as image quality, the font used, and any background image on the pages all affect the accuracy of the OCR results.

Using OCR to Create Searchable PDF Files

When you convert images or scanned PDF files to searchable PDF, the digital text found by the OCR engine is added to each page of the new PDF file as an invisible text layer. The new PDF therefore contains both the original page image and the invisible text, and it is this text layer that makes the file's content searchable.

Optical Character Recognition can only be used when creating PDF files. OCR can increase the processing time for file conversion and is supported by the following converters:

Built-in PDF Converter

Built-in Image Converter

Caution

This feature is not supported on Microsoft® Windows Server 2008 R2 and Microsoft® Windows 7.

Searchable PDF in the Watch Folder Service

The Watch Folder Service includes a sample conversion folder, OCR to AdobePDF Watch Folder, that is already configured for OCR and creates searchable PDF for English, French and Spanish text. See OCR Images and Scanned PDF Files to Searchable PDF.

Creating Searchable PDF Using Profiles

When converting with the desktop conversion tools (Convert File or Drop Files Converter), the command line tools, or the PEERNET.ConvertUtility.dll, you use a profile to tell Document Conversion Service what type of file to create. A profile is a group of settings stored as a collection of name-value pairs in an XML document.

Document Conversion Service includes a collection of profiles for converting to various output formats and performing other actions when converting documents, such as e-Discovery and OCR.

Desktop Conversion

The desktop conversion tools use the selected profile to determine what type of file to create. Two profiles, Adobe PDF OCR to Searchable and Adobe PDF OCR to Searchable Serialized, are already configured for OCR and create searchable PDF for English text.

To OCR images and scanned PDF files using the desktop tools, choose one of these profiles when converting.

Command Line Tools

The command line tools also use profiles to determine what type of file to create. On the command line, the desired profile is specified by passing in the name of the profile XML file, with or without the XML extension. To use the OCR profiles on the command line, you would pass /P="Adobe PDF OCR to Searchable" or /P="Adobe PDF OCR to Searchable Serialized".
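If you build command lines from a script, the /P switch can be assembled as in the following minimal sketch. The helper name `profile_switch` is hypothetical, and the sketch produces only the switch text described above. The name of the command line tool and any other arguments are not shown here and must come from the command line tools documentation.

```python
def profile_switch(profile):
    """Build the /P switch for a profile name.

    The profile may be given with or without the XML extension. This
    helper strips the extension so the result is the same either way.
    """
    name = profile.strip()
    if name.lower().endswith(".xml"):
        name = name[:-4]
    if not name:
        raise ValueError("a profile name is required")
    return f'/P="{name}"'


print(profile_switch("Adobe PDF OCR to Searchable.xml"))
# /P="Adobe PDF OCR to Searchable"
```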

The PEERNET.ConvertUtility.dll

Like the desktop and command line tools, the PEERNET.ConvertUtility.dll also uses profiles to tell Document Conversion Service what type of file to create. Pass in Adobe PDF OCR to Searchable or Adobe PDF OCR to Searchable Serialized to create a searchable PDF file.

OCR Profile Settings

OCR is disabled until these options are added to your profile. The sample profiles already have these options set.

ConverterPlugIn.PNBuiltinsOCRPDF.Enabled - Set this to 1 to enable OCR, 0 to turn it off. Default value is 0.

ConverterPlugIn.PNBuiltinsOCRPDF.FirstPageOnly - Set this to 1 to OCR only the first page of each document. Set it to 0, or leave it unset, to OCR every page in the document. Default value is 0.

ConverterPlugIn.PNBuiltinsOCRPDF.Languages - Lists the languages you want to try to recognize on the page. To look for multiple languages, list the language code for each language, separated by plus signs. For example, the sample profile only looks for English, "eng". To look for English, French, and Spanish, you would use the string "eng+fra+spa". The default when this is not supplied is English only, "eng". The more languages you list, the longer the OCR process takes.
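As a rough illustration of how the three options fit together, the sketch below collects them as name-value pairs and joins the language codes with plus signs. The function name `ocr_profile_settings` is hypothetical. The sketch only computes the values and does not write them into a profile XML file, whose exact layout is not described in this topic.

```python
def ocr_profile_settings(languages=("eng",), first_page_only=False, enabled=True):
    """Return the three OCR options as name-value pairs (values as strings)."""
    if isinstance(languages, str):
        # Accept a single code such as "eng" as well as a list of codes.
        languages = [languages]
    codes = [code.strip() for code in languages]
    if not codes or any(not code for code in codes):
        raise ValueError("at least one non-empty language code is required")

    prefix = "ConverterPlugIn.PNBuiltinsOCRPDF."
    return {
        prefix + "Enabled": "1" if enabled else "0",
        prefix + "FirstPageOnly": "1" if first_page_only else "0",
        # Multiple languages are joined with plus signs, e.g. "eng+fra+spa".
        prefix + "Languages": "+".join(codes),
    }


settings = ocr_profile_settings(["eng", "fra", "spa"])
print(settings["ConverterPlugIn.PNBuiltinsOCRPDF.Languages"])  # eng+fra+spa
```

Because each extra language lengthens OCR processing, a script like this would normally be given only the languages you expect to appear in the documents.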

Adding New Languages for OCR

The OCR engine needs to know which languages you want to try to recognize on the page. The languages provided with Document Conversion Service and their language codes are as follows.

Language    Language Code
Arabic      ara
English     eng
French      fra
German      deu
Hebrew      heb
Hindi       hin
Italian     ita
Spanish     spa

To download individual language files, go to Tesseract Languages Code and Traineddata Files. That page also includes a table listing the language code and traineddata file for each language. To download complete sets of language files, go to Traineddata Files for Tesseract.

To add them to Document Conversion Service, copy the desired *.traineddata files into the following folder:

%PROGRAMDATA%\PEERNET\Document Conversion Service\tessdata
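The copy step can also be scripted. The following is a minimal sketch, assuming the downloaded *.traineddata files sit together in one folder. The function name `install_traineddata` and the example download path are hypothetical. The default target is the tessdata folder named above, which only resolves on Windows, and writing to it may require administrator rights.

```python
import os
import shutil
from pathlib import Path

# The tessdata folder given in this topic; %PROGRAMDATA% is expanded on Windows.
DEFAULT_TESSDATA = r"%PROGRAMDATA%\PEERNET\Document Conversion Service\tessdata"


def install_traineddata(source_dir, tessdata_dir=None):
    """Copy every *.traineddata file from source_dir into the tessdata folder.

    Returns the sorted list of file names that were copied.
    """
    if tessdata_dir is None:
        tessdata_dir = os.path.expandvars(DEFAULT_TESSDATA)
    target = Path(tessdata_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {target}")

    copied = []
    for language_file in sorted(Path(source_dir).glob("*.traineddata")):
        shutil.copy2(language_file, target / language_file.name)
        copied.append(language_file.name)
    return copied


# Example call (hypothetical download folder):
# install_traineddata(r"C:\Downloads\tessdata")
```

After the files are copied, add the matching language codes to the ConverterPlugIn.PNBuiltinsOCRPDF.Languages setting in your profile so the OCR engine looks for them.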