OCR (optical character recognition) is a technology used to convert printed material into editable file formats such as .doc or .txt. It involves reading text from a scanned document and translating the image of that text into an electronic file that can be edited with a word processor. In this procedure, an optical scanner captures the page as a bitmap, and OCR software analyzes that bitmap to identify the characters. Advanced OCR software can read text in a large variety of fonts, but it generally cannot read handwritten text. After scanning, the software can be trained on the characters it finds, so that it can work out the shape of each letter even in unusual fonts. Many OCR programs also consult a lexicon during conversion. An advantage of OCR is that the results can be saved in a wide variety of text and image formats, including PDF, which makes it possible to build a searchable database of scanned documents.
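To make the lexicon step more concrete, here is a minimal sketch in Python of how recognized words might be checked against a word list. It is only an illustration under stated assumptions, not how any particular OCR product works: the tiny `LEXICON`, the `correct_word` and `correct_text` helpers, and the 0.8 similarity cutoff are all hypothetical, and the matching relies on the standard library's `difflib` rather than on a real OCR engine.

```python
import difflib

# Hypothetical lexicon; a real OCR program would consult a far larger word list.
LEXICON = ["optical", "character", "recognition", "document", "scanner", "searchable"]


def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon entry for a recognized word.

    If no entry is similar enough (similarity below `cutoff`),
    the word is returned unchanged.
    """
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply the lexicon check to every whitespace-separated word."""
    return " ".join(correct_word(word) for word in text.split())


# Typical OCR slips: "0" read instead of "o", "rn" read instead of "m".
print(correct_text("0ptical docurnent scanner"))
```

Words that have no close match in the lexicon pass through untouched, so the check only steps in when a recognized word is nearly, but not exactly, a known one.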
PDF is a file format that captures a printed document electronically so that it looks like the original. A searchable PDF contains text data that can be found with the viewer's search function. It holds the original scanned image together with a separate text layer produced by OCR, which comes in handy when one has to deal with large volumes of data. Once scanned documents have been converted with OCR into editable, searchable files, any piece of information in them can be found with a simple search.
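The idea of an image paired with a text layer can be sketched in a few lines of Python. This is a simplified model rather than the actual PDF file structure: the `ScannedPage` class, the `search` function, and the sample file names and page texts are hypothetical and exist only to show why the text layer makes a collection of scans searchable.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    image: bytes  # the original scanned image, which is what the reader sees
    text: str     # the separate text layer produced by OCR


def search(documents, query):
    """Return (document name, page number) pairs whose text layer contains the query."""
    needle = query.lower()
    hits = []
    for name, pages in documents.items():
        for number, page in enumerate(pages, start=1):
            if needle in page.text.lower():
                hits.append((name, number))
    return hits


# Hypothetical sample collection; the image bytes are left empty for brevity.
documents = {
    "invoice.pdf": [
        ScannedPage(b"", "Invoice total: 120 EUR"),
        ScannedPage(b"", "Terms and conditions"),
    ],
    "memo.pdf": [
        ScannedPage(b"", "Please review the attached invoice"),
    ],
}

print(search(documents, "invoice"))
```

The search never looks at the image data. It reads only the text layer, which is why a scanned page without an OCR step cannot be searched this way.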
A searchable PDF file can be of two types: Exact and Compact. With the Exact method the file size is large but, as the name suggests, the result is very accurate: the page appears exactly as it did when it was scanned, only now it is searchable. With the Compact method the file is smaller, and the general look and feel of the original image is retained while the page becomes searchable, but the image quality is not as good as with the Exact method. A searchable PDF file lets users locate scanned page images through full-text search, and it can be stored in a document management system.