Batch OCR PDF – Make PDF Searchable

When you need to batch OCR multiple PDF files, look no further than Document Conversion Service’s latest feature: making PDF files searchable.


We’re excited to announce that batch OCR PDF is now available in Document Conversion Service. And the best part? You can say goodbye to the days of relying on Adobe Reader for your PDF conversions. Our new integrated PDF converter does more in less time without needing additional software.

This release adds other new features as well. Visit PDF, Text and Cadd Converters, OCR Support, and a Dashboard to see what else is new.

Part 1 – What is OCR?

OCR is an acronym for Optical Character Recognition. This process interprets the text within images or scanned PDF documents and converts it into editable and searchable data. An invisible layer containing this text data is placed on top of each page in the PDF, making the file’s content both selectable and searchable.

OCR technology has advanced to the point where it can accurately process documents in multiple languages and various fonts, styles, and layouts. Document Conversion Service includes OCR support for English, French, Italian, German, Spanish, Hebrew, Hindi, and Arabic. Additional languages are available to download as needed.

Part 2 – Use the Adobe PDF – Builtin Converter to Batch OCR

Our new PDF converter, Adobe PDF – Builtin, is a pivotal piece of the batch OCR to PDF process. It supersedes our original PDF converters that used Adobe Reader and Ghostscript to convert PDF files.

If you are new to Document Conversion Service, the Adobe PDF – Builtin converter is already enabled. Existing users can enable the converter in the DCS Dashboard. Set the Adobe PDF – Builtin converter to Auto or On to use it instead of Adobe Reader or Ghostscript. This converter must be on to batch OCR PDF files.

Batch OCR PDF files using the new built-in PDF converter that replaces Adobe Reader.

Part 3 – How to Batch OCR PDF Using a Drop Folder

To help you quickly start using this new feature, the Watch Folder Service included with the Document Conversion Service comes with a drop folder already set up to create searchable PDF files.

Go to the DCS Dashboard and start the DCS Service and the Watch Folder Service by clicking the green play icon. The services are started when you see the text Running and the square green stop icon is enabled.

Use the Drop folder in Watch Folder Service to batch OCR PDF files.

When both services have started, copy your scanned PDF files into the Input folder under the drop folder OCRtoAdobePDF to batch OCR them to searchable PDF files.

Drag and drop or copy files into the Input folder to batch OCR PDF files into searchable PDF.


The Watch Folder Service picks up your scanned PDF files and passes them to the Document Conversion Service to OCR. After the OCR process, it copies the new files to the Output folder under OCRtoAdobePDF and the original file to the Completed folder.

Searchable PDF files are copied to the Output folder, original files to the Completed folder.
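The drop folder workflow above can also be scripted. The following is a minimal Python sketch, not part of Document Conversion Service: the function names `submit_pdfs` and `wait_for_output` are hypothetical, the folder paths are supplied by you, and it assumes that each searchable PDF appears in the Output folder under the same file name as the original.

```python
import shutil
import time
from pathlib import Path


def submit_pdfs(source_dir, input_dir):
    """Copy every PDF in source_dir into the drop folder's Input folder.

    Returns the list of file names that were copied.
    """
    source = Path(source_dir)
    target = Path(input_dir)
    copied = []
    for pdf in sorted(source.glob("*.pdf")):
        shutil.copy2(pdf, target / pdf.name)
        copied.append(pdf.name)
    return copied


def wait_for_output(output_dir, names, timeout=300.0, poll=2.0):
    """Poll the Output folder until every expected name exists or time runs out.

    Assumption: the output file keeps the input file's name.
    Returns the set of names still missing (empty on success).
    """
    output = Path(output_dir)
    pending = set(names)
    deadline = time.monotonic() + timeout
    while pending:
        # Keep only the names that have not shown up yet.
        pending = {name for name in pending if not (output / name).exists()}
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(poll)
    return pending
```

To use it, pass the full paths of the Input and Output folders under your OCRtoAdobePDF drop folder. Both services must already be running in the DCS Dashboard, since the script only copies files and watches for results.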

Part 4 – Changing the OCR Language and Other Settings

We have included language support for eight prevalent languages: English, French, Italian, German, Spanish, Hebrew, Hindi, and Arabic. By default, the OCRtoAdobePDF drop folder is set to recognize only English.

You can easily add other languages or change the language used to run OCR on the page. Each language has a unique character code, such as eng for English and fra for French.

To change languages, use the character code for your desired language. To run OCR for multiple languages at a time, list each code separated by a plus (+) sign. For example, to OCR both English and French text, you would enter eng+fra.

You can add as many languages to this list as you want. However, the more languages you add, the longer the OCR process will take. A better practice is creating multiple drop folders where each targets a different language or a small subset of languages.
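If you generate the language setting from a script, the plus-separated format described above is easy to build. This is a small Python sketch; the helper name `build_language_setting` is hypothetical, and it only produces the string, so you still enter the result in the drop folder's OCR settings yourself.

```python
def build_language_setting(codes):
    """Join language codes with '+', skipping blanks and duplicates.

    The order of the codes is preserved and their case is left unchanged.
    """
    unique = []
    for code in codes:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


print(build_language_setting(["eng", "fra"]))  # eng+fra
```

Keeping each list short follows the advice above: fewer languages per drop folder means less OCR time per document.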

The other OCR setting is the ability to only OCR the first page instead of the whole document. Processing only the first page saves time and computing resources when all the necessary information is on the first page of your documents. Documents with cover pages containing all the information for indexing or technical documents with abstracts are examples of where you may only want to OCR the first page.

The OCR settings for language and for first page or whole document can be changed.

Part 5 – Downloading Additional OCR Languages

Additional languages not included with Document Conversion Service can be downloaded from Traineddata Files for Tesseract. That page also includes a table listing the language code for each traineddata file, and links at the top of the page let you download complete sets of language files.

To add the new data files to Document Conversion Service, copy the *.traineddata files into the following folder.

%PROGRAMDATA%\PEERNET\Document Conversion Service\tessdata
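The copy step can be scripted when you have several language files to install. The Python sketch below is only an illustration: the function name `install_traineddata` and the download folder are hypothetical, and the only path taken from this article is the tessdata folder shown above. It assumes the file name without the .traineddata extension is the language code, which is how Tesseract names its data files.

```python
import os
import shutil
from pathlib import Path

# Folder named in this article; os.path.expandvars expands %PROGRAMDATA% on Windows.
TESSDATA_DIR = r"%PROGRAMDATA%\PEERNET\Document Conversion Service\tessdata"


def install_traineddata(download_dir, tessdata_dir=None):
    """Copy all *.traineddata files from download_dir into the tessdata folder.

    Returns the language codes that were installed, taken from the file names
    (for example, deu.traineddata gives "deu").
    """
    if tessdata_dir is None:
        tessdata_dir = os.path.expandvars(TESSDATA_DIR)
    target = Path(tessdata_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {target}")
    codes = []
    for data_file in sorted(Path(download_dir).glob("*.traineddata")):
        shutil.copy2(data_file, target / data_file.name)
        codes.append(data_file.stem)
    return codes
```

Depending on how the folder permissions are set on your system, writing under %PROGRAMDATA% may require an elevated prompt. The returned codes are the values you would then use in the OCR language setting.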