Document Conversion Service 3.0

Optical Character Recognition (OCR) with Document Conversion Service

Optical Character Recognition, or OCR for short, searches for and recognizes text (characters) on scanned pages or images and extracts it as digital text.

With this digital text, we can create searchable PDF files from images or PDF documents containing scanned pages. A searchable PDF is a PDF file in which you can select and copy the text on the pages and use the search function to look for specific words and phrases in the file.

When recognizing text, the OCR engine has to know which languages to look for on the page. OCR works by analyzing the patterns, shapes, and curves of the text characters on the page and matching them to predefined information for different characters in each language. It assigns a confidence score for each language, with the highest score determining the language chosen.
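As a rough illustration of the selection step described above, the following Python sketch picks the language with the highest confidence score. The function name and the scores are made up for the example; they are not output from the actual OCR engine, and the engine's internal scoring is not documented here.

```python
def pick_language(scores):
    """Return the language code with the highest confidence score."""
    if not scores:
        raise ValueError("no language scores supplied")
    # max() over the keys, ranked by each key's score
    return max(scores, key=scores.get)


# Hypothetical confidence scores for one page (illustrative values only).
example_scores = {"eng": 0.91, "fra": 0.64, "spa": 0.58}
best = pick_language(example_scores)
print(best)
```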

Outside factors such as image quality, the font used, and any background image on the pages all affect the accuracy of the OCR results.

•Using OCR to Create Searchable PDF Files

•Adding New Languages for OCR

Using OCR to Create Searchable PDF Files

When you convert images or PDF files to searchable PDF, the digital text found by the OCR engine is added as an invisible text layer to each page in the new PDF file. The new PDF contains the original image plus this invisible layer of text, and it is this layer that makes the file's content searchable.

Optical Character Recognition can only be used when creating PDF files. OCR can increase the processing time for file conversion and is supported by the following converters:

•Built-in PDF Converter

•Built-in Image Converter

	Caution
This feature is not supported on Microsoft® Windows Server 2008 R2 and Microsoft® Windows 7.

Searchable PDF in the Watch Folder Service

The Watch Folder Service includes a sample conversion folder, OCR to AdobePDF Watch Folder, that is already configured for OCR and creates searchable PDF for English, French and Spanish text. See OCR Images and Scanned PDF Files to Searchable PDF.

Creating Searchable PDF Using Profiles

When converting with the desktop conversion tools (Convert File or Drop Files Converter), the command line tools, or the PEERNET.ConvertUtility.dll, you use a profile to tell Document Conversion Service what type of file to create. A profile is a group of settings stored as a collection of name-value pairs in an XML document.

Document Conversion Service includes a collection of profiles for converting to various output formats and performing other actions when converting documents, such as e-Discovery and OCR.

Desktop Conversion

The desktop conversion tools use the selected profile to determine what type of file to create. Two profiles, Adobe PDF OCR to Searchable and Adobe PDF OCR to Searchable Serialized, are already configured for OCR and create searchable PDF for English text.

To OCR images and scanned PDF files using the desktop tools, choose one of these two profiles when converting.

Command Line Tools

The command line tools also use profiles to determine what type of file to create. On the command line, the desired profile is specified by passing in the name of the profile XML file, with or without the XML extension. To use the OCR profiles on the command line, you would pass /P="Adobe PDF OCR to Searchable" or /P="Adobe PDF OCR to Searchable Serialized".
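If you assemble command lines from a script, the profile switch can be built programmatically. The Python sketch below only formats the /P argument from a profile name, given with or without the .xml extension. The helper name profile_switch is hypothetical, and the rest of the command line (the tool name and the input file) is intentionally left out because it is not covered in this topic.

```python
def profile_switch(profile_name):
    """Build the /P="..." switch for a profile name or profile XML file name."""
    name = profile_name.strip()
    if not name:
        raise ValueError("profile name must not be empty")
    if '"' in name:
        # The value is wrapped in double quotes, so it cannot contain one.
        raise ValueError("profile name must not contain double quotes")
    return f'/P="{name}"'


print(profile_switch("Adobe PDF OCR to Searchable"))
print(profile_switch("Adobe PDF OCR to Searchable Serialized.xml"))
```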

The PEERNET.ConvertUtility.dll

Like the desktop and command line tools, the PEERNET.ConvertUtility also uses profiles to tell Document Conversion Service what type of file to create. Pass in Adobe PDF OCR to Searchable or Adobe PDF OCR to Searchable Serialized to create a searchable PDF file.

OCR Profile Settings

OCR is disabled until these options are added to your profile. The sample profiles already have these options set.

ConverterPlugIn.PNBuiltinsOCRPDF.Enabled - Set this to 1 to enable OCR, 0 to turn it off. Default value is 0.

ConverterPlugIn.PNBuiltinsOCRPDF.FirstPageOnly - Set this to 1 to OCR only the first page of any document. Set it to 0, or leave it unset, to OCR every page in the document. Default value is 0.

ConverterPlugIn.PNBuiltinsOCRPDF.Languages - List the languages you want to try to recognize on the page. To look for multiple languages, list the language code for each language separated by a plus sign. For example, the sample profile only looks for English, "eng". To look for English, French, and Spanish, you would use the string "eng+fra+spa". The default when this is not supplied is English only, "eng". The more languages listed, the longer the OCR process will take.
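The three options above can be sketched as a set of name-value pairs. The following Python example is a minimal sketch, not part of Document Conversion Service: the helper ocr_settings is hypothetical, and representing every value as a string is an assumption. It joins the language codes with a plus sign and rejects codes that are not in the list of languages provided with the product.

```python
# Language codes provided with Document Conversion Service, taken from the
# table in "Adding New Languages for OCR".
BUNDLED_CODES = {"ara", "eng", "fra", "deu", "heb", "hin", "ita", "spa"}

PREFIX = "ConverterPlugIn.PNBuiltinsOCRPDF."


def ocr_settings(languages=("eng",), first_page_only=False,
                 known_codes=BUNDLED_CODES):
    """Return the OCR options as name-value pairs (values as strings)."""
    if not languages:
        raise ValueError("at least one language code is required")
    unknown = [code for code in languages if code not in known_codes]
    if unknown:
        raise ValueError("unknown language code(s): " + ", ".join(unknown))
    return {
        PREFIX + "Enabled": "1",
        PREFIX + "FirstPageOnly": "1" if first_page_only else "0",
        # Multiple languages are separated by a plus sign, e.g. eng+fra+spa
        PREFIX + "Languages": "+".join(languages),
    }


settings = ocr_settings(["eng", "fra", "spa"])
for name, value in settings.items():
    print(f"{name} = {value}")
```

If you add your own traineddata files, pass a larger set of codes as known_codes. How the pairs are written into the profile XML is not shown here.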

Adding New Languages for OCR

The OCR engine needs to know which languages you want to try to recognize on the page. The languages provided with Document Conversion Service and their language codes are as follows.

Language	Language Code
Arabic	ara
English	eng
French	fra
German	deu
Hebrew	heb
Hindi	hin
Italian	ita
Spanish	spa

To download individual language files, go to Tesseract Languages Code and Traineddata Files. This link also includes a table listing the language code for each traineddata file for each language. To download complete sets of language files go to Traineddata Files for Tesseract.

To add them to Document Conversion Service, copy the desired *.traineddata files into the following folder: