Starting with Document Conversion Service 3.0.031, the Watch Folder Service includes a sample folder, OCR to AdobePDF Watch Folder, that uses Optical Character Recognition (OCR) on scanned PDF files and images to create searchable PDF files.
Optical Character Recognition finds and recognizes text characters on scanned pages or images and extracts them as digital text. External factors such as image quality, the font used, and any background image on the page all affect the accuracy of the OCR results.
Each page of a PDF file created using OCR is embedded as an image, with the page text placed as an invisible layer on top of the image. The invisible text layer can be searched, and text can be copied from the PDF if the document's permissions allow it.
Optical Character Recognition can only be used when creating PDF files. OCR will increase the processing time for file conversion and is supported by the following converters:
• Built-in PDF Converter
• Built-in Image Converter
Caution: This feature is not supported on Microsoft® Windows Server 2008 R2 and Microsoft® Windows 7.
When recognizing text, the OCR engine has to know which languages to look for on the page. OCR works by analyzing the patterns, shapes, and curves of the text characters on the page and matching them to predefined information for different characters in each language. It assigns a confidence score for each language, with the highest score determining the language chosen.
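The selection rule described above (highest confidence score wins) can be illustrated with a minimal Python sketch. The function name `pick_language` and the scores are hypothetical and invented for illustration only. They do not come from the OCR engine or from a real OCR run.

```python
def pick_language(scores):
    """Return the language code with the highest confidence score."""
    if not scores:
        raise ValueError("at least one language score is required")
    return max(scores, key=scores.get)


# Hypothetical per-language confidence scores for one page (made-up values).
example_scores = {"eng": 0.91, "fra": 0.64, "deu": 0.38}
chosen = pick_language(example_scores)
print(chosen)  # eng
```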
Document Conversion Service comes with files to support recognizing Arabic, English, French, German, Hebrew, Hindi, Italian, and Spanish.
To download individual language files, go to Tesseract Languages Code and Traineddata Files. That page also includes a table listing the language code and traineddata file for each language. You can download complete sets of language files by going to Traineddata Files for Tesseract.
To add language files to Document Conversion Service, copy the desired *.traineddata files into the following folder:
%PROGRAMDATA%\PEERNET\Document Conversion Service\tessdata
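If you have many language files to add, the copy step can be scripted. The following is a minimal Python sketch, not a tool shipped with the product. The function name `install_language_files` is hypothetical, and the sketch assumes the tessdata folder already exists and that your account is allowed to write to it, which may require an elevated prompt.

```python
import os
import shutil
from pathlib import Path

# Folder named in this documentation; %PROGRAMDATA% is expanded on Windows.
DEFAULT_TESSDATA = r"%PROGRAMDATA%\PEERNET\Document Conversion Service\tessdata"


def install_language_files(source_dir, tessdata_dir=None):
    """Copy every *.traineddata file from source_dir into tessdata_dir.

    Returns the sorted list of file names that were copied.
    """
    if tessdata_dir is None:
        tessdata_dir = os.path.expandvars(DEFAULT_TESSDATA)
    target = Path(tessdata_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {target}")
    copied = []
    for src in sorted(Path(source_dir).glob("*.traineddata")):
        # Files that are not *.traineddata are never touched.
        shutil.copy2(src, target / src.name)
        copied.append(src.name)
    return copied
```

To use it, call `install_language_files` with the folder that holds your downloaded files, for example `install_language_files(r"C:\Downloads\traineddata")`, where the download path is only a placeholder.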
OCR is disabled by default and stays off until the following options are added to your watch folder definition. The sample watch folder already has these options set.
ConverterPlugIn.PNBuiltinsOCRPDF.Enabled
Set this to 1 to enable OCR or 0 to turn it off. The default value is 0.
ConverterPlugIn.PNBuiltinsOCRPDF.FirstPageOnly
Set this to 1 to OCR only the first page of any document. Set it to 0, or leave it unset, to OCR every page in the document. The default is 0.
The OCR engine needs to know which languages you want it to try to recognize on the page. The more languages you list, the longer the OCR process takes, because it tries to match each character against every language listed.
ConverterPlugIn.PNBuiltinsOCRPDF.Languages
To have OCR look for more than one language, list the language code for each language you want, separated by plus signs. For example, the sample watch folder looks only for English, "eng". To look for English, French, and Spanish, you would use the string "eng+fra+spa". When this option is not supplied, the default is English only, "eng".
The language codes for the provided languages are as follows.
Language | Language Code
Arabic | ara
English | eng
French | fra
German | deu
Hebrew | heb
Hindi | hin
Italian | ita
Spanish | spa
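As a rough illustration of how the three settings fit together, the Python sketch below assembles them into a name-to-value dictionary and builds the plus-separated Languages string. The helper `build_ocr_options` and its parameters are hypothetical. Only the setting names and language codes come from this page, and the sketch does not show how the values are written into a watch folder definition.

```python
# Language codes for the files that ship with Document Conversion Service.
BUNDLED_CODES = {"ara", "eng", "fra", "deu", "heb", "hin", "ita", "spa"}

PREFIX = "ConverterPlugIn.PNBuiltinsOCRPDF."


def build_ocr_options(languages=("eng",), first_page_only=False, extra_codes=()):
    """Return the three OCR settings as a name -> value dictionary.

    extra_codes lists codes for traineddata files you installed yourself.
    """
    known = BUNDLED_CODES | set(extra_codes)
    codes = []
    for code in languages:
        code = code.strip()
        if code not in known:
            raise ValueError(f"unknown language code: {code!r}")
        if code not in codes:  # drop duplicates while keeping order
            codes.append(code)
    if not codes:
        raise ValueError("at least one language code is required")
    return {
        PREFIX + "Enabled": "1",
        PREFIX + "FirstPageOnly": "1" if first_page_only else "0",
        PREFIX + "Languages": "+".join(codes),
    }


options = build_ocr_options(["eng", "fra", "spa"])
print(options["ConverterPlugIn.PNBuiltinsOCRPDF.Languages"])  # eng+fra+spa
```

Rejecting unknown codes up front is a design choice for this sketch: it catches a mistyped code before it reaches the watch folder definition. Listing only the languages you actually expect also keeps processing time down.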