How Optical Character Recognition (OCR) Works
OCR is a technology that converts images containing text into editable formats. It lets you process scanned books, screenshots, and photos of text and produce editable documents such as TXT, DOC, or PDF files. The technology is widely used, and the most advanced OCR systems can handle almost any type of image, even complex ones such as scanned magazine pages with pictures and columns, or photos taken with a mobile phone.
How does modern OCR work? Converting an image to an editable document is divided into several steps; each step is a set of related algorithms that performs one part of the OCR job. The general steps in the OCR process are:
Loading the image as a bitmap from a given source. The source can be a file or a pointer to a memory block. A good OCR system must also understand many image formats: BMP, TIFF (both single-page and multi-page images), JPEG, PNG, and so on. PDF files must be supported as well: many documents are stored as images in PDF format, and the only way to extract text from such files is to perform OCR.
Detecting the most important image features, such as resolution and inversion. Many OCR algorithms expect a predefined range of font sizes and particular foreground/background colors, so the image must be rescaled and inverted before processing when necessary.
Deskewing and denoising. The image may be skewed or contain a lot of noise, so deskew and denoising algorithms are applied to improve image quality.
Binarization. Many OCR algorithms require a bi-tonal image, so a color or grayscale image must be converted to a black-and-white image. This process is called "binarization", and it is a very important step because errors introduced here propagate to every algorithm that works on the bi-tonal image.
Line detection and removal. This step is required to improve page layout analysis, to achieve better recognition quality for underlined text, to detect tables, and so on.
Page layout analysis, also called "zoning". At this stage the OCR system must detect the positions and types of all important areas in the image.
Detection of text lines and words. This is sometimes not an easy task because of differing font sizes and small spaces between words.
Analysis of merged and broken characters. It is very common for a character to be broken into several parts, or for several characters to touch each other; such cases must be detected so that the correct position of every character can be found.
Recognition of characters. This is the main algorithm of OCR: the image of every character must be converted to the appropriate character code. Sometimes this algorithm produces several candidate codes for an uncertain image. For instance, recognizing the image of an "I" character can produce the codes "I", "|", "1", and "l", and the final character code is selected later.
Dictionary support. This step can improve recognition quality: some characters, such as "1" and "I" or "C" and "G", can look very similar, and a dictionary can help make the decision.
Saving the results to the selected output format, for instance searchable PDF, DOC, RTF, or TXT. It is important to preserve the original page layout: columns, fonts, colors, pictures, background, and so on.
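The interplay between character candidates and the dictionary, described in the recognition steps above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular OCR system works: the function name resolve_word, the tiny word set DICTIONARY, and the idea that the recognizer returns candidates ordered from most to least likely are all hypothetical.

```python
from itertools import product

# Hypothetical word list standing in for a real dictionary.
DICTIONARY = {"ill", "in", "it", "cat", "get"}


def resolve_word(candidates, dictionary=DICTIONARY):
    """Pick one spelling from per-character candidate lists.

    candidates: one list per character position, ordered from most to
    least likely (an assumption about the recognizer's output).
    """
    # Try every combination of candidates and return the first one that
    # is a dictionary word. The number of combinations grows quickly, so
    # this brute-force search is only suitable for short words.
    for combo in product(*candidates):
        word = "".join(combo)
        if word.lower() in dictionary:
            return word
    # No dictionary match: fall back to the top candidate at each position.
    return "".join(position[0] for position in candidates)


if __name__ == "__main__":
    # The first glyph is ambiguous between "1", "I", "l" and "|".
    print(resolve_word([["1", "I", "l", "|"], ["n"]]))
    # The first glyph is ambiguous between "G" and "C".
    print(resolve_word([["G", "C"], ["a"], ["t"]]))
```

When no combination is found in the dictionary, the sketch keeps the recognizer's first choice for every position, so words outside the dictionary (names, numbers) are not discarded.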
This list of steps is not complete. Many other minor algorithms must also be implemented to achieve good recognition on various image types, but in most cases they are not essential, and they vary between OCR systems.
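To make one of the steps above more concrete, here is a minimal sketch of binarization in plain Python. It uses Otsu's method, a common global thresholding technique chosen here purely for illustration; the text does not say which binarization algorithm a given OCR system uses. The function names and the representation of the image as a list of rows of gray levels (0 to 255) are assumptions.

```python
def otsu_threshold(pixels):
    """Return the gray level (0..255) that maximizes between-class variance.

    pixels: iterable of integer gray levels in the range 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with gray level <= t
    sum_dark = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold=None):
    """Convert rows of gray levels to rows of 0/1, where 1 means ink (dark)."""
    if threshold is None:
        threshold = otsu_threshold(p for row in gray for p in row)
    return [[1 if p <= threshold else 0 for p in row] for row in gray]


if __name__ == "__main__":
    page = [[10, 12, 200],
            [205, 11, 210]]
    print(binarize(page))
```

A single global threshold like this assumes fairly even lighting. The optional threshold argument lets a caller override the automatic choice when the properties of the images are known.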
Every OCR step is very important: the whole OCR process will fail if even one of its steps cannot handle the given image correctly. Every algorithm must work correctly on the widest possible range of images, which is why only a few good universal OCR systems are available. On the other hand, if some features of the images are known in advance, the task becomes much easier, and better recognition quality is possible when only one kind of image must be processed. To get the best results in that case, a good OCR system must allow the most important parameters of every algorithm to be adjusted; sometimes this is the only way to improve recognition quality. Unfortunately, no current OCR system is comparable to the human eye, and it seems unlikely that one will be created in the near future.
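The point about adjustable parameters can be illustrated with word detection. The following is a minimal Python sketch, not the method of any particular OCR system: it splits one binarized text line into words using a vertical projection. The function name split_words, the bitmap representation (rows of 0/1 values, 1 meaning ink), and the tunable parameter min_gap are all assumptions made for the example.

```python
def split_words(line_bitmap, min_gap=3):
    """Split one binarized text line into word column ranges.

    line_bitmap: rows of 0/1 values (1 = ink), all rows the same width.
    min_gap: number of consecutive empty columns that counts as a space
             between words; narrower gaps are treated as gaps between
             letters of the same word.
    Returns a list of (start_col, end_col_exclusive) tuples.
    """
    width = len(line_bitmap[0]) if line_bitmap else 0
    # Vertical projection: a column is inked if any row has a 1 in it.
    inked = [any(row[x] for row in line_bitmap) for x in range(width)]

    words = []
    start = None     # first column of the word being built
    last_ink = None  # last inked column seen so far
    for x, has_ink in enumerate(inked):
        if not has_ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty run before this column is wide enough: close the word.
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


if __name__ == "__main__":
    line = [[1, 1, 0, 1, 0, 0, 0, 0, 1, 1]]
    print(split_words(line, min_gap=3))
    print(split_words(line, min_gap=1))
```

With the same input, a different min_gap yields a different segmentation. If the font size and spacing of the documents are known in advance, the value can be set to match them instead of being guessed for arbitrary images.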