OCR (Optical Character Recognition)

Using OCR in PDFpen

OCR (Optical Character Recognition) converts a bitmap image of text, such as a scanned document, into text that PDFpen and other text editing software can select, copy, and search. Once the text has been recognized by OCR, it is placed on an invisible layer above the visible image of the text. When you copy text, it is copied from this invisible OCR layer. OCR does not produce a perfect rendering of the bitmapped text, so you will need to proofread and edit the resulting text.
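
For readers who want a mental model of the invisible layer, the following Python sketch may help. It is an illustration only and does not show how PDFpen stores OCR results; the names ScannedPage, Word, search, and copy_text are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """One recognized word and where it sits on the page (hypothetical model)."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class ScannedPage:
    """A visible bitmap plus an invisible layer of recognized words."""
    image_name: str
    text_layer: list = field(default_factory=list)

    def search(self, term):
        # Searching looks only at the text layer, never at the pixels.
        needle = term.lower()
        return [w for w in self.text_layer if needle in w.text.lower()]

    def copy_text(self):
        # Copying also reads from the text layer.
        return " ".join(w.text for w in self.text_layer)


page = ScannedPage("invoice-scan.png")
print(repr(page.copy_text()))  # empty string: nothing to copy before OCR

# After OCR, the recognized words sit on top of the image.
page.text_layer = [Word("Invoice", 72, 700, 60, 14), Word("Total", 72, 120, 40, 14)]
print(page.copy_text())
print([w.text for w in page.search("total")])
```

In this model, an OCR mistake appears as a wrong word in text_layer while the image stays unchanged, which is why proofreading the recognized text matters.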

Automatic OCR

  1. Open a scanned PDF in PDFpen.
  2. An alert box opens with the message:
    "This document appears to be scanned. Would you like to perform optical character recognition (OCR) on it? OCR will allow you to select the text."
  3. You have three options:
  • Cancel: No OCR will be performed.
  • OCR Page: OCR will be performed on the current page.
  • OCR Document: If your document has multiple pages, OCR will be performed on all of the pages.

Choose which languages OCR recognizes in Preferences > OCR (see User Preferences).

While PDFpen is performing the OCR, a progress bar will appear. The operation can take a few seconds or much longer, depending on the size and contents of the scanned document.

Manual OCR

To perform OCR manually, choose Edit > OCR Page. PDFpen begins the OCR operation and the progress bar appears.

Forcing OCR

PDFpen examines the document, and if it finds a single image the size of a page, it assumes that the document is a scan and automatically offers to perform OCR. In some cases, PDFpen may not recognize a scanned document, and OCR Page under the Edit menu will be grayed out and unavailable. To force OCR in that case:

  1. Hold down the Command and Option keys together.
  2. Choose Edit > OCR Page from the menu.
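
The detection rule described in this section (one image the size of a page means a scan) can be sketched in Python to show why it sometimes misses a scanned document. This is a rough illustration, not PDFpen's actual code; the Page and Image types, the looks_scanned function, and the 5% tolerance are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Image:
    width: float
    height: float


@dataclass
class Page:
    width: float
    height: float
    images: list


def looks_scanned(page, tolerance=0.05):
    """Guess that a page is a scan when it holds exactly one image
    whose size roughly matches the page size (tolerance is an assumption)."""
    if len(page.images) != 1:
        return False
    img = page.images[0]
    return (abs(img.width - page.width) <= tolerance * page.width
            and abs(img.height - page.height) <= tolerance * page.height)


# A letter-size page covered by a single full-page image is treated as a scan.
full_page = Page(612, 792, [Image(612, 792)])
print(looks_scanned(full_page))

# A scan stored as two half-page strips fails the check, so OCR is not offered.
two_strips = Page(612, 792, [Image(612, 396), Image(612, 396)])
print(looks_scanned(two_strips))
```

A page like two_strips is the kind of document where forcing OCR with Command and Option would be needed.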

Batch OCR

Batch OCR is an advanced feature of PDFpenPro. See Batch OCR.

Tips to Improve the OCR Results

  • The quality of the original document affects the quality of the OCR performance. Crisp, clean originals with clear text will produce much better results than crumpled, faded photocopies.
  • Place your original document on the scanner as straight as possible. If you have a scanned page that is not straight, you can "deskew", or straighten, the image in PDFpen by choosing Edit > Deskew and Adjust Image.
  • Increase the contrast of your scanned document so that the background is as white as possible. You can adjust the contrast of the image by choosing Edit > Deskew and Adjust Image.
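
To see what increasing contrast does to a scan, here is a minimal Python sketch of a linear contrast stretch on grayscale pixel values (0 is black, 255 is white). It illustrates the general idea only; whether Deskew and Adjust Image works this way is not stated in this guide, and the stretch_contrast function is hypothetical.

```python
def stretch_contrast(pixels, low=0, high=255):
    """Linearly rescale grayscale values so the darkest pixel maps to
    `low` and the lightest maps to `high`."""
    darkest, lightest = min(pixels), max(pixels)
    if darkest == lightest:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    scale = (high - low) / (lightest - darkest)
    return [round(low + (p - darkest) * scale) for p in pixels]


# A faded scan: gray "ink" (60) on a dingy background (180).
faded = [60, 120, 180]
stretched = stretch_contrast(faded)
print(stretched[0], stretched[-1])  # ink becomes 0, background becomes 255
```

After stretching, the background is pure white and the text is pure black, which gives an OCR engine a cleaner boundary between the two.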

Dictionaries and OCR

Medical and legal dictionaries are included in PDFpen’s OCR engine to improve the quality of OCR output for scanned documents by recognizing words specific to the medical and legal professions. This feature is built in, so there is no need to turn on or adjust any setting. If you choose to edit OCR text, misspelled words in the selected text may be displayed with a red squiggly underline.