Chapter 1 Introduction


1.1 Preamble

Interpreting the textual content of document images by computer is the central purpose of optical technologies such as Optical Character Readers (OCR). The motivation behind the development and evolution of OCR systems is the transformation of text document images into an equivalent editable format. Initial attempts at OCR development succeeded in the automatic reading and data entry of many Roman and Latin language scripts. Subsequent work has extended OCR to various Indian language scripts. Although there have been many successful investigations into OCR for Indian scripts, many challenging research issues remain. One of the critical barriers to high recognition rates is the complexity of the script itself. This is especially true of many South Indian Language (SIL) scripts, owing to their large character sets and structural diacritics. It is therefore essential to address the complexities of SIL scripts in order to reach higher recognition rates. This research explores solutions to the research challenges involved in the interpretation and recognition of one SIL script, Telugu.

Diverse business needs and technological advances require human activities to be simulated efficiently by computing machines. OCR is one such software system: it simulates the human visual function of reading and interpreting text. Reading depends on knowledge of a language and the ability to interpret its script. Knowledge of a script can be instilled in an OCR system by processing images with techniques drawn from Digital Image Processing (DIP) and Machine Learning (ML).

In addition to this computer science knowledge, it is also necessary to understand the script of the particular language and its characteristics. The major focus of this research is the recognition of Telugu characters. Telugu OCR and the characteristics of printed and handwritten Telugu script are introduced to give a clear idea of the difficulties involved in developing Telugu OCR systems. The different types of documents and the need for their transcription through OCR systems are also described in this chapter to motivate the research objectives.

1.2 Objectives of the Research

Current OCR technologies for SIL scripts are efficient but expensive, and most are suited to machine printed scripts with a limited range of font styles and sizes. At present, OCR technologies can assure high recognition accuracy only under restrictions such as the absence of touching, broken or complex compound characters and good resolution of the document images. By comparison, techniques for handwritten character recognition are still maturing and need several additional steps to process text image data correctly. With regard to Telugu, existing research on Telugu Character Recognition Systems (TCRS) has succeeded only in recognizing a few machine printed fonts. Moreover, considerable revision is required before those techniques can handle the challenges of handwritten character recognition. The major research gaps identified while reviewing the literature are therefore the low recognition rates for handwritten Telugu characters and the very limited work on pre-printed Telugu documents. The primary aim of this research is to explore techniques that lead to an enhanced framework for processing Telugu handwritten and pre-printed documents. The proposed research work has the following objectives.

Document enhancement strongly influences the recognition rates achieved by OCR. Pre-printed documents in particular require more specific pre-processing than ordinary handwritten or printed manuscript documents. It is therefore necessary to investigate the pre-processing methodologies needed to make pre-printed Telugu documents suitable for subsequent processing by OCR.

Extraction of characters from the document image is another major challenge in an OCR system. Document image segmentation is the process of extracting characters by identifying the boundaries between one character and the next. A novel framework is required for the segmentation stage that suits both printed and handwritten characters and addresses the issues of touching and overlapping characters.



Feature extraction, classification and recognition are the decisive stages that determine the performance of the system. It is proposed to develop an efficient recognition technique that ensures clear discrimination of one character from another.



Resolving OCR errors after the recognition stage is a significant post-processing task. There is thus a need to explore a novel post-processing scheme for resolving recognition errors in Telugu text.

The implementation of OCR is associated with the core computer science discipline of Digital Image Processing (DIP). The significance of DIP and the applicability of its sub-disciplines in providing a linear processing model for OCR are discussed in detail in this chapter. The prerequisite knowledge needed to understand the research challenges involved in this work is also summarized.

1.3 Digital Image Processing

DIP is the manipulation of digital images with a computing device to analyze them and retrieve useful information. The techniques for processing digital images vary from one type of image to another, because digital images are produced from different sources across the electromagnetic spectrum [Gon 2004]. The techniques are categorized as low level, middle level and high level. Low level techniques are the operations that prepare an image for subsequent processing. Middle level processing is carried out on the pre-processed image and involves extracting the critical image components required for feature analysis and representation. High level processing is concerned with representing the recognized objects for further interpretation. In practice there are diverse needs for the analysis of digital images by computer, a few of which are:

Analysis of defective regions in medical images such as X-rays, ultrasound images, PET images and MRI images.



Visualization of critical regions in images captured by satellites or space probes.



Analysis of molecular structures in images of chemicals.



Detection of defective products in industrial inspection.



Detection of crimes through forensics.



Biometric image authentication.



Authentication of individuals through signature verification.



Surveillance monitoring for prediction of abnormal scenes.



Detection of varieties of items in voluminous data.



Automated surgery supervision.



Handwritten image analysis and recognition.



Character recognition for automatic reading.

The reasons for processing digital images are extensive, since the need arises from a variety of applications in different disciplines. Applications of digital image processing include medical image processing, biometric image processing, satellite image processing and document image processing. In this research work, investigations are directed towards the techniques available in DIP and Document Image Analysis (DIA). DIA belongs to the area of computer vision and pattern recognition. Computer vision is regarded as a continuation of DIP, and its techniques are analogous to the vision tasks performed by humans. Manual tasks that involve some intelligence are thus achieved artificially with the help of machines. The objective of computer vision techniques is to automate, through the optical vision of computers, tasks such as reading, understanding, visualization, interpretation and introspection that humans perform. Human beings acquire abilities such as reading, thinking and interpreting by learning from numerous examples day by day. Presenting such examples to machines through machine learning algorithms accomplishes the goal of pattern recognition. Pattern recognition deals with the detection, identification, classification and recognition of patterns in digital images, cost effectively and efficiently. Machine learning procedures form the basis for executing all these tasks together as an automated activity. Automation is warranted for tasks that demand considerable human effort and whose benefits are known beforehand. One such task is the machine-based reading of document images. The background needed to understand DIA and its association with OCR development through document processing is presented next.

1.4 Document Image Analysis

DIA deals with the application of computer algorithms to document images in order to detect, retrieve or recognize desired contents. Here, computer algorithms means the prototypes designed by combining the constructs or tools available for DIA, or devised heuristically to fit the specific processing requirements of document images. These prototypes process documents to detect or recognize either the textual or the graphical contents of the image. Accordingly, DIA algorithms are categorized into graphical data processing and textual data processing, as depicted in figure 1.1.

Figure 1.1: Document Image Analysis and its classifications

Graphical data processing comprises the analysis of objects such as tables, logos, emblems, photographs, artistic text and document layouts, whereas textual data processing covers the recognition of textual contents in printed or handwritten form. Textual data processing in documents is non-trivial and is performed by OCR. The procedures for graphical data processing come under page layout analysis. In practice, documents range from purely textual documents to documents with complex pre-defined graphical layouts, which leads to application-specific algorithms for document analysis. Examples include postal documents, bank cheques, pre-printed forms serving various purposes in Government and private organizations and educational institutions, historical documents, text inscribed on palm leaves, and bills and receipts. Textual documents are further classified by the type of script and the type of content, since OCR systems are developed to be language specific and content specific. Processing methods are devised according to whether the content is handwritten, printed or hybrid. Processing these documents benefits many businesses as well as commercial and Government organizations. Transforming paper documents into digital images and then converting those images into a machine editable document format brings many benefits, a few of which are as follows:

Document image understanding



Recognition of text images



Fast information retrieval as well as processing



Reduced storage requirements



Improved economic benefits through reduced manual intervention



Reduced time consumption



No need to periodically recreate hard copy documents to maintain the historical data of organizations



Fast transcription of contents with reduced errors

Document processing techniques help in transforming the digital images obtained from imaging devices such as scanners, cameras and smart phones. The techniques vary from one type of image to another, but they are compatible with any valid image format such as .jpg, .png, .bmp and .gif. The analysis of text document images is performed by software commonly termed OCR systems.

1.5 Optical Character Recognition

OCR is an intelligent reader of text document images that employs computer vision and pattern recognition techniques to transform textual document images into a machine editable document format. The editable format may be a word processor, notepad or script-based document format such as Baraha [Url 7], Sreelipi [Url 8] or Nudi [Url 9]. The input document image is usually acquired through a scanner or an imaging device such as a digital camera or smart phone. An overview of OCR processing is shown in figure 1.2.

Figure 1.2: Overview of document processing using OCR

The main aim of OCR is to detect, classify and recognize the characters in textual images. A variety of textual images exist in practice, and they differ in the type and nature of the script. The type of script is the language in which the text is composed, and the nature of the script is whether it is machine printed, typewritten, handwritten or hybrid. Figure 1.3 shows the classification of textual documents.

Figure 1.3: Classification of Textual Documents

Printed documents contain machine printed or typewritten text, whereas handwritten documents are composed by an individual's writing. Hybrid documents combine printed and handwritten text. In addition, textual documents are associated with various other attributes, as follows:

Type of script associated,



Nature of content as printed/handwritten/hybrid,



Font style and size of machine printed documents,



Handwriting type such as cursive/isolated/slant etc,



Type of documents from which text has to be recognized,



Context of data in textual documents,



Geographical locations in case of pre-printed documents.

Printed text is available in a variety of font styles, sizes and scripts, so an OCR system built to recognize one script is not compatible with other scripts. At present, however, an OCR system for a particular machine printed script can interpret and recognize a diverse range of font styles and sizes. OCR systems for handwritten text, on the other hand, are developed with many constraints, since handwritten text is produced in unconstrained environments. In purely handwritten documents especially, the writer is free in the use of space and in the style and type of handwriting. Finally, hybrid documents are composed of both printed and handwritten text; such combinations exist only in pre-printed documents. The layout of a pre-printed document is defined before the handwritten text is composed, and the attributes of its printed text are fixed, since the structure or layout depends on the Government or private organization or institution for which the document is designed to fulfil domain requirements. OCR systems for pre-printed documents are therefore specific to their domain needs and are tied to the context of the data used in a particular organization and geographical region. The context of data in a pre-printed document means that the data to be filled in is constrained to specific domain requirements. For example, a bank cheque requires only details such as account number, date, bank name and branch. The character set used to fill in a cheque is confined to digits, alphabets and a few special characters such as /, $ and Rs, which reduces the feature set an OCR system needs for processing bank cheques. Similarly, pre-printed application forms used for admission to educational institutions are constrained by data related to particular geographical locations and by context such as name, contact address, phone number, date of birth, previous degree, proposed degree, caste and nativity. OCR systems developed for pre-printed document processing are thus constrained to a restricted and validated character set.

Developing a generic OCR for textual documents with the full variety of attributes mentioned above results in huge computational effort and reduced accuracy. When the feature set of an OCR system spans a variety of scripts or natures of script, the degree of ambiguity during the recognition stage increases, producing erroneous outcomes. To reduce complexity and improve efficiency, OCR systems are therefore developed for the specific set of attributes that the target documents possess. The standard categorization of OCR systems is depicted in figure 1.4.

Figure 1.4: Generalization of OCR systems

Irrespective of the type of OCR and the attributes of the textual document, the linear processing model for textual images is the same. The text document passes through the stages of an OCR system shown in figure 1.5.

Figure 1.5: Linear processing model of OCR

1.5.1 Document Image Acquisition

Image acquisition is the process of taking text on paper as input and representing it in digital form. It is accomplished through imaging devices such as scanners, cameras and other optical devices. The textual image can be stored in any valid image format [Url 10] such as .jpg, .png, .gif or .bmp.

1.5.2 Preprocessing

In this stage, the input document is subjected to filtering operations for noise removal, restoration from blurring or illumination effects, and skew detection and correction [Mur 2006]. Spatial domain or frequency domain filtering techniques [Gon 2004] are employed to improve the quality of the image. Operations such as scaling and conversion of the image from RGB or grayscale to binary are also performed in this stage. The conversion is performed by determining an intensity (gray level) threshold with thresholding algorithms such as adaptive thresholding [Sha 2008], Otsu’s thresholding [Gat 2009] or local/global thresholding techniques [Gon 2004]. The pre-processing stage prepares the image in a form suitable for subsequent processing, and the techniques employed are normally application specific.
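To make the grayscale-to-binary conversion concrete, the following is a minimal sketch of global thresholding by Otsu's criterion (choosing the gray level that maximizes the between-class variance). It is the textbook formulation written in plain Python, not necessarily the variant used in this work; the function names `otsu_threshold` and `binarize`, the toy image and the convention that dark pixels are ink are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray levels in 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of gray levels at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of gray levels to 1 (ink, assumed dark) or 0 (paper)."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Toy 3x4 "scan": dark strokes (about 40) on bright paper (about 210).
image = [
    [210, 205, 40, 212],
    [208, 45, 38, 215],
    [211, 209, 42, 207],
]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
```

A single global threshold of this kind suits evenly lit scans; the adaptive and local methods named above are the usual alternatives when illumination varies across the page.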

1.5.3 Layout Analysis

Layout analysis is specific to the document images under consideration and combines various textual image pre-processing operations. It is applied to pre-printed document images to detect or remove the layout so that the textual regions can be identified. Its operations include table detection, horizontal and vertical line detection, emblem/logo detection, photograph detection, scratch mark detection, and authentication and seal detection. Analyzing the document layout is an important task because it provides a way to locate the textual regions in the image. Layout analysis leaves the image ready for extraction/segmentation of its textual components.

1.5.4 Text Segmentation

Text segmentation is the process of extracting the textual regions of the image. It decomposes the text region into lines, then into words and finally into individual characters. The segmented characters are then passed to the feature computation process.
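The line-then-character decomposition described above can be sketched with projection profiles, a common baseline that is assumed here for illustration and is not stated to be the method of this work. Rows with no ink separate text lines, and columns with no ink separate components within a line. The helper names and the toy page are hypothetical, and the sketch deliberately ignores touching and overlapping characters, which defeat simple profiles.

```python
def split_runs(profile):
    """Return (start, end) index pairs for each run of non-zero profile values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # the run ends at the first empty index
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary):
    """Split a binary page (1 = ink) into text lines using the row profile."""
    row_profile = [sum(row) for row in binary]
    return split_runs(row_profile)


def segment_columns(binary, top, bottom):
    """Split rows top..bottom-1 into ink components using the column profile."""
    width = len(binary[0])
    col_profile = [sum(binary[r][c] for r in range(top, bottom)) for c in range(width)]
    return split_runs(col_profile)


# Toy page with two text lines separated by an empty row.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
lines = segment_lines(page)
components = [segment_columns(page, top, bottom) for top, bottom in lines]
```

Each pair is a half-open interval, so a line `(top, bottom)` covers rows `top` up to but not including `bottom`. Word boundaries could be found in the same way by treating only wide gaps in the column profile as separators.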

1.5.5 Feature Extraction

Feature computation is the process of computing features from each character so that it can be recognized uniquely. Choosing the right feature extraction method is critical because it influences the accuracy of the classification stage. A variety of features can be computed from a character image, such as geometrical, topological, statistical and other global features. The choice of feature extraction technique depends on factors such as the type of image (binary or grayscale), the nature of the text (printed or handwritten) and the orientations of the characters.
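As one example of a statistical feature, the sketch below computes zoning densities: the glyph is divided into a grid and the fraction of ink pixels in each zone becomes one feature. Zoning is chosen here purely as an illustration of the statistical family and is an assumption, as are the function name `zoning_features` and the toy glyph.

```python
def zoning_features(glyph, zones=2):
    """Return ink density per zone for a binary glyph (1 = ink), row by row.

    Assumes the glyph has at least `zones` rows and `zones` columns,
    so that every zone contains at least one pixel.
    """
    height, width = len(glyph), len(glyph[0])
    features = []
    for zr in range(zones):
        r0, r1 = zr * height // zones, (zr + 1) * height // zones
        for zc in range(zones):
            c0, c1 = zc * width // zones, (zc + 1) * width // zones
            ink = sum(glyph[r][c] for r in range(r0, r1) for c in range(c0, c1))
            area = (r1 - r0) * (c1 - c0)
            features.append(ink / area)
    return features


# Toy 4x4 glyph split into a 2x2 grid of zones.
glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
features = zoning_features(glyph, zones=2)
```

The resulting vector has `zones * zones` entries in the range 0 to 1, which makes it independent of glyph size and therefore convenient as classifier input.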

1.5.6 Classification

The classification stage employs machine learning algorithms to map a test character image to its target class label. A classifier learns decision rules that discriminate one class of features from another. The features extracted in the previous stage are used to train the classifier and thus serve as the knowledge base for classification. The efficiency of the classifier is generally assessed with standard measures such as precision, recall, specificity, accuracy and F-measure [Lab 2012].
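The evaluation measures named above can be computed per character class in a one-vs-rest fashion, as in the following sketch. The standard definitions are used; the function names, the romanized class labels and the convention of returning 0.0 when a denominator is zero are assumptions for illustration.

```python
def one_vs_rest_counts(y_true, y_pred, target):
    """Count true/false positives and negatives for one target class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == target and truth == target:
            tp += 1
        elif pred == target:
            fp += 1
        elif truth == target:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classifier_metrics(tp, fp, tn, fn):
    """Return precision, recall, specificity, accuracy and F-measure."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical ground truth and predictions for six character images.
y_true = ["ka", "ka", "ga", "ga", "ma", "ka"]
y_pred = ["ka", "ga", "ga", "ga", "ma", "ka"]
counts = one_vs_rest_counts(y_true, y_pred, target="ka")
metrics = classifier_metrics(*counts)
```

In this toy run one "ka" image is misread as "ga", so recall for "ka" drops to 2/3 while precision stays at 1.0, which illustrates why confusing character classes are better judged by several measures than by accuracy alone.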

1.5.7 Recognition

Once the features of a test character image have been mapped to a target class, the corresponding Unicode/ASCII character is obtained in this stage. In South Indian scripts, where a character is a combination of one or more symbols, recognition is also concerned with ordering the classified symbols.

1.5.8 Post processing

Post processing is an essential and specific stage of OCR systems as well as speech to text recognition systems. In this stage the output of OCR is validated for error detection and correction. Error detection covers spelling or grammatical errors, re-ordering problems and conflicts between confusing character classes. Techniques based on statistical and language models [San 2013], keywords [Niw 1992], online dictionary suggestions [You 2012] and Unicode are employed extensively in the literature.

The processing requirements of all the above stages depend heavily on the script for which the OCR is developed. OCR systems for the various SIL scripts in particular need script-specific treatment to work efficiently. An overview of the important SIL scripts follows.
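As an illustration of the dictionary-based correction mentioned under post processing, the sketch below replaces an OCR output word with its nearest lexicon entry when the two differ by at most one edit. Levenshtein distance over Unicode code points, the three-word lexicon and the names `edit_distance` and `correct` are assumptions for illustration, not the scheme proposed in this work.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def correct(word, lexicon, max_distance=1):
    """Return the closest lexicon word if it is near enough, else the input."""
    if word in lexicon:
        return word
    best = min(lexicon, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else word


# Hypothetical lexicon; the OCR output below has lost its final vowel sign.
LEXICON = ("తెలుగు", "భాష", "అక్షరం")
noisy = LEXICON[0][:-1]
fixed = correct(noisy, LEXICON)
```

Because Telugu compound characters span several code points, a practical system would probably measure distance over recognized symbols or clusters rather than raw code points, and would weight substitutions between confusing character classes more cheaply.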

1.6 South Indian Languages

South Indian languages form the second largest language family and come under the Dravidian languages [Url 1]. The most prominent and widely spoken South Indian languages are Telugu, Kannada, Tamil and Malayalam. Developing SIL OCR systems is a challenging research initiative because of the varying structural diacritics in their character sets. OCR systems for South Indian languages are developed separately for printed and handwritten scripts [Rah 2009]. Printed OCR systems are developed to recognize the variety of font styles and sizes available for composing a script, and incorporate the ability to recognize all or a subset of those styles and sizes. Although OCR for printed scripts is still emerging, rising technological standards make it necessary to extend OCR to handwritten characters. The number of styles in which a character can be written is unconstrained, as each individual writes in his or her own style. Other factors that complicate handwriting recognition are non-uniform spatial characteristics of characters; touches, merges, cuts and overlaps between or within characters; slants in lines, words and characters; and the freedom writers take in using the writing area. All these factors increase the complexity of developing SIL OCR systems for Handwritten Character Recognition (HCR).

In addition to the above challenges, printed and handwritten SIL OCR systems share a few common barriers. Although all SIL scripts have vowels, consonants, consonant conjuncts and vowel diacritics in their character sets, the OCR specifications for each script differ, and so OCR systems are script dependent. Another reason for script dependent OCR systems is that the large and distinct character classes of each script would otherwise lead to expensive computation and erroneous classification. As each SIL is used in a distinct geographical region, it is feasible to have script dependent OCR systems that attain optimal efficiency.

Along with these operating characteristics, the structural characteristics of SIL characters also aggravate the complexity of recognition. The presence of consonant conjuncts and vowel diacritics along with the vowels and consonants in each character set generates more than 500 different characters, including compound characters. A compound character is the combination of one or more vowels/consonants with one or more vowel diacritics and conjunct consonants. Compound characters occur in almost every word of SIL scripts. Specialized strategies are required to separate the components of a compound character, and the separated components must then be reordered and grouped to reconstruct the editable form of the same compound character. A further challenge in the character sets of SIL scripts is the resolution of confusing character classes, that is, characters so similar in appearance that they cause ambiguity during classification or recognition.

Currently, there is immense demand for printed as well as handwritten SIL OCR, both as part of emerging technological innovations and for fast information processing. The main focus of this work is the development of an enhanced OCR framework for processing Telugu pre-printed documents. The Telugu character set and its structural characteristics are discussed next.

1.6.1 Telugu Language

Telugu is the native language of the states of Andhra Pradesh and Telangana, with more than 75 million speakers. It is an ancient Dravidian language of India with the third largest number of speakers [Url 2]. The significance of the Telugu script lies in its retention of many Sanskrit terms that have been lost in languages such as Hindi and Bengali. The Telugu script is a descendant of the Brahmi script and has a symmetrical, angular and monumental appearance [Dan 1996]. All the SIL scripts are descendants of the Brahmi script, as represented in figure 1.6. As the investigations in this thesis are directed towards the recognition of Telugu script, the details of the script and its alphabet are outlined next.

Figure 1.6: Brahmi scripts and its descendants

1.6.2 Telugu Alphabetical Set

Telugu script is composed of 16 vowels, 36 consonants and their modifiers. Combinations of vowels, consonants, consonant conjuncts and vowel diacritics give rise to more than five hundred characters [Url 3]. The vowels and consonants follow a canonical order in which the short form of a symbol precedes its long form, so that closely related, similar-looking characters are adjacent. The vowel and consonant set is depicted in figure 1.7. The vowel and consonant symbols are combined with vowel sound symbols (Maatras) and consonant sound symbols (Voththulu) [Url 4] to generate compound or connected characters. Figure 1.8 depicts the vowel and consonant sound symbols, and figure 1.9 shows how compound characters are formed from them. Combining a consonant with a vowel modifier produces the character formations termed single level vowel consonant clusters, as shown in figure 1.9.


Figure 1.7: Vowels and consonant set in Telugu script

Similarly, a consonant can be combined with consonant conjuncts or modifiers to form a new character called a consonant cluster or compound character. The formation of consonant clusters is presented in figure 1.10. A consonant can also be combined with a vowel modifier and one or more consonant conjuncts, forming multi-level vowel consonant clusters or complex compound characters. Some combinations of vowels, consonants and their modifiers are depicted in figure 1.11.

Figure 1.8: Vowel and consonant sound symbols (Maatras and Voththulu)

Figure 1.9: Single level vowel consonant clusters-Consonant + vowel modifiers


Figure 1.10: Consonant clusters- Consonant + Consonant conjuncts

Generating a multi-level vowel consonant cluster in a word processing file requires a ‘halant’ between the base characters that form the resultant character, with the vowel/consonant modifier placed at the end of the combination. The halant is denoted by the sign ్ (Unicode U+0C4D). Its inclusion in the formation of multi-level vowel consonant clusters is demonstrated in figure 1.12.
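The role of the halant can be illustrated at the level of Unicode code points. The sketch below joins base consonants with the Telugu virama (U+0C4D, the Unicode name for the halant) and appends a vowel sign at the end, which is the storage order described above. The code points are taken from the Unicode Telugu block; the helper name `compose_cluster` and the example syllable are assumptions for illustration.

```python
import unicodedata

HALANT = "\u0C4D"         # TELUGU SIGN VIRAMA
SA = "\u0C38"             # TELUGU LETTER SA
TA = "\u0C24"             # TELUGU LETTER TA
RA = "\u0C30"             # TELUGU LETTER RA
VOWEL_SIGN_II = "\u0C40"  # TELUGU VOWEL SIGN II


def compose_cluster(consonants, vowel_sign=""):
    """Join base consonants with a halant and append the vowel sign last."""
    return HALANT.join(consonants) + vowel_sign


# A multi-level cluster: SA + halant + TA + halant + RA + vowel sign II.
cluster = compose_cluster([SA, TA, RA], VOWEL_SIGN_II)
code_point_names = [unicodedata.name(ch) for ch in cluster]
```

The single rendered cluster is stored as six code points, which is why an OCR system has to reorder and regroup the separately recognized components before it can emit an editable compound character.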

Figure 1.11: Multi-level vowel consonant clusters

A large number of character combinations can be generated from the vowels, consonants and their modifiers, resulting in a wide character set. Telugu script has more than 400 distinct symbols, from which more than 500 different character combinations can be generated. In addition, Telugu script is available in a variety of machine printed font styles for representing characters in a word processor.


Figure 1.12: Formation of multi-level vowel consonant clusters

1.6.3 Machine Printed Font Styles

The Government of Andhra Pradesh has released various font styles, including nine Unicode fonts named after Sri Krishna Devaraya and his eight poets, along with six Unicode font styles from Silicon Andhra [Url 5]. The names of the various Telugu fonts are shown in figure 1.13. Three further Unicode fonts were later launched by Silicon Andhra. A few of the fonts closely resemble one another, whereas others vary considerably. When the same character differs greatly between two or more font styles, the classifier tends to recognize it as two different characters. The OCR must therefore learn each character in a variety of fonts, which increases the size of the knowledge base and the computational complexity. Recognition of printed documents by OCR assumes that the input is in one of the font styles the OCR can interpret. Font styles with isolated characters are simpler for OCR to process than fonts with touching, overlapping or slightly tilted characters. Fonts with touching, overlapping or tilted characters require a specific processing prototype to separate one character from another or to detect the touching or overlapping regions in words. Nevertheless, processing printed text is far simpler than processing handwritten text, since printed characters always have uniform spatial characteristics determined by font size and style.


Figure 1.13: Telugu machine printed font styles

1.6.4 Handwritten Script and its Complexity

Handwritten script refers to text composed by hand by an individual, and its form depends heavily on that individual's writing style. Several other factors also determine how complex a handwritten script is to process:

• The type of document (pre-printed, blank, postal document etc.)
• Clarity and visibility of the characters in the handwriting
• Amount of space allocated for writing (in pre-printed documents)
• Cursive or non-cursive text
• Ink bleeds and scratch marks produced while composing the text

As the handwritten script in this research is Telugu, all of the above factors are likely to occur in the handwritten documents. The Telugu script also contains many compound characters that carry vowel modifiers (superscripts) and conjunct consonants (subscripts), and the spatial characteristics of the modifiers in handwritten compound characters can vary considerably compared with printed characters. Processing multi-level vowel consonant clusters in handwritten script is very complex: tracking the segmentation boundaries is complicated when the modifiers touch one another, and the modifiers also overlap one another by default. Figure 1.14 depicts the separation of modifiers in a multi-level vowel consonant cluster.

Figure 1.14: Separation of modifiers in multi-level vowel consonant cluster

Skewed lines within the document are another vital challenge in handwritten documents. Skew is a common characteristic of handwritten text and can exist at the line level, the word level or the character level. Detecting and correcting skew is an additional procedure, and ignoring it may lead to erroneous recognition by OCR. Building the knowledge base for handwritten character recognition is another challenge, as the style of handwriting is unrestricted. Dynamic machine learning approaches are required to maintain the knowledge base, which is computationally expensive as well as challenging. Handwritten character recognition may fail where the handwriting is so cluttered that it is hard to recognize even with human vision and cognition. The neatness of handwriting also depends on the type of document. A completely blank document poses all the critical barriers mentioned above, whereas a document with a pre-defined layout, such as a bank cheque, an admission form or a job application, limits some of the difficulties that arise during the character recognition stage. Even though the visibility of characters is better in pre-printed documents, additional pre-processing procedures are required to detect the textual regions in them. The subsequent subsections provide insights into the variety of documents processed by OCR.
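Line-level skew detection, mentioned above, can be sketched in a few lines. This chapter does not state which skew estimation method the thesis uses, so the following is only one simple possibility under stated assumptions: the input is a binary image of a single text line (1 = ink, 0 = background), the lowest ink pixel in each column approximates the baseline, and a least-squares line through those samples gives the skew. The function names `estimate_skew_degrees` and `make_line` are hypothetical.

```python
import math


def estimate_skew_degrees(image):
    """Estimate the skew angle of one text line from a binary image.

    For every column that contains ink, the lowest ink pixel is taken as a
    baseline sample; a least-squares line through the samples gives the slope.
    A positive result means the line rises towards the right.
    """
    xs, ys = [], []
    for x in range(len(image[0])):
        ink_rows = [y for y in range(len(image)) if image[y][x]]
        if ink_rows:
            xs.append(x)
            ys.append(max(ink_rows))  # row index grows downwards
    if len(xs) < 2:
        return 0.0  # not enough samples to fit a line
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # Negate the slope because image rows are numbered from top to bottom.
    return math.degrees(math.atan(-slope))


def make_line(width, height, row_of):
    """Build a synthetic one-pixel-thick stroke for demonstration purposes."""
    image = [[0] * width for _ in range(height)]
    for x in range(width):
        image[row_of(x)][x] = 1
    return image


flat = make_line(10, 10, lambda x: 5)        # horizontal stroke
rising = make_line(10, 10, lambda x: 9 - x)  # stroke climbing one row per column
print(estimate_skew_degrees(flat), estimate_skew_degrees(rising))
```

Real handwriting has descenders and subscripts that disturb the baseline samples, so a practical system would need a more robust estimator than this plain least-squares fit.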

1.6.5 Bank Cheques

To simplify the execution of transactions and optimize the work flow, OCR is used in automated bank cheque processing systems. Important details such as the account number of an individual, the authentication of the cheque and other relevant details are extracted from the cheque image by the OCR for processing. Figure 1.15 depicts a typical bank cheque processing application.


Figure 1.15: Digital bank cheque processing applications

1.6.6 Postal Documents

An optimal work flow can be achieved only through faster processing of the tasks involved. Data processing is especially significant for optimizing the work flow in organizations such as post offices [Url 11]. Postal documents are processed digitally to read the addresses and preserve the data in safe repositories for further processing. An OCR system can interpret the postal codes and thereby speed up the work flow. A typical postal document is shown in figure 1.16.

Figure 1.16: Postal document images

1.6.7 Pre-printed Documents

A wide variety of documents can be recognized by OCR to serve the data processing needs of Government and private organizations. These documents have a pre-defined structure or layout that provides spaces for printed text and for details to be filled in by hand. Different tasks call for different layouts, and the contents to be specified depend on the requirements of the particular organization. Usually, the pre-printed documents designed for one application do not fit the requirements of another. Some general characteristics of a pre-printed form include: presence of vertical and horizontal grid lines, tables and cells, varying locations for photographs, graphical components such as logos, symbols and emblems, words scratched out because of incorrect data entry, ink bleeds in handwritten portions, printed text regions that do not vary for a particular requirement, handwritten text regions that vary for a particular requirement, impressions of seal marks, improper order of writing, authorized signatures, and overlap of text with horizontal grid lines or table cells. A pre-printed document may have a few or all of these characteristics. Handling documents in which all of these characteristics are present is a highly demanding task, and the outcome of such document processing will be more error-prone as well as more expensive. However, there are documents with considerably fewer of these characteristics that still need automatic data extraction and processing. Figures 1.17 through 1.19 present typical pre-printed documents.

Figure 1.17: School admission form

A few of the pre-printed forms are designed according to the geographical region and the linguistic characteristics of the individuals residing there. Although pre-printed forms exist in official languages such as English, forms also need to be designed in regional languages to meet the needs of individuals who use local languages.


Figure 1.18: Job application form

Figure 1.19: Govt. FMB application

The processing of pre-printed forms in English has been extensively investigated and successfully deployed in real-time applications such as computerized evaluation of entrance examination scripts and automated data extraction from forms. It is important to extend these investigations to other languages and to broaden the applicability of OCR systems to a variety of real-time application requirements. The processing of pre-printed forms in this research focuses on developing an OCR framework for recognition of the handwritten Telugu characters commonly found in pre-printed documents.

1.7 Telugu Character Recognition System

Even though OCR systems are currently available for recognition of printed Telugu characters and other SIL scripts, printed OCR systems are not adequate for recognition of handwritten Telugu characters. A Telugu Character Recognition System (TCRS) is defined as an automated reading and interpretation system that transforms Telugu character images into their equivalent Unicode/ASCII format. Efficient OCR performance can be attained only through language specific and context specific OCR, and it is therefore vital to have a TCRS suited to the processing of Telugu pre-printed forms.


1.8 Need for TCRS

Developing a TCRS will benefit society in the following ways.

• Educational institutions and private or Govt. organizations can quickly perform automatic text entry from pre-printed forms and process the data for the associated statistics
• Efficient document retrieval
• More economical and efficient maintenance of work flow through reduced time, manual labor, paper handling costs etc.
• Fewer errors than in the manual entry process
• Rapid access to data at any time, as there are no physical boundaries
• Simpler generation of the historical databases of an organization
• No need for employees proficient in typing Telugu script
• Quick automation of data entry from forms
• Visualization of the digitized data extracted from forms such as census, survey and order forms using different data visualization techniques
• Useful for publishing agencies, large or small, to convert handwritten or printed notes into machine editable text

1.9 Motivation

Attempts to develop OCR systems for recognition of many Indic scripts are well represented in the literature. Most of the works concentrate on the recognition of Devanagari, Gurumukhi, Gujarati and other Brahmi scripts, as well as Arabic. A few attempts are also reported for recognition of Telugu characters. The investigations on Telugu character recognition fall into printed character recognition and handwritten character recognition. Even though some successful research exists on printed Telugu character recognition, only limited work is reported on handwritten Telugu character recognition. In addition, most of the works are robust on isolated Telugu characters rather than on overlapping or touching characters in the text. The main barrier behind the reduced recognition rates of a TCRS is the presence of compound characters. Compound characters are very frequent in Telugu script, and some have similar topological structure, which makes character recognition by a TCRS very difficult. Also, the overlapping of one character with another due to subscripts and superscripts distinguishes Telugu character recognition from that of other Indic scripts. Handwritten text further aggravates the complexity of interpreting a compound character. It is also observed that no works in the literature focus specifically on segmentation of touching or overlapping characters in handwritten Telugu text. Many other issues remain unaddressed in the literature, such as pre-processing procedures for Telugu pre-printed documents, scratched word detection, classification of single and multi character text blocks in Telugu handwritten words, and post processing of the Telugu text produced by a TCRS. All the above research challenges motivated us to investigate “An enhanced framework for pre-processing and character recognition systems suitable for Telugu documents”.

1.10 Contributions

This research focuses on developing algorithms for each stage of OCR, suited to the processing requirements of both pre-printed documents and manuscript documents containing handwritten Telugu characters. The research challenges addressed are described as follows:

• A generic line elimination method for removal of horizontal lines in pre-printed documents using circular masks. The significance of this algorithm lies in removing horizontal lines without disrupting the text that overlaps them. The algorithm detects the regions where text overlaps the lines before the lines are removed.

• An unsupervised algorithm for detection of scratched and non-scratched words in pre-printed document images using three features: the number of connected components, the Euler number and the area covered by non-hole regions. The classification is accomplished through a discrimination coefficient called the scratch factor. The simplicity of the algorithm lies in its unsupervised classification process.

• A method for classification of printed words and handwritten words in pre-printed documents. The algorithm employs statistical features and performs an unsupervised classification through a dynamic threshold.

• An inference based technique for detection of text blocks with touching/overlapping characters. The algorithm employs an unsupervised procedure for detection of touching or overlapping character blocks.

• An iterative split analysis technique for segmentation of touching/overlapping character blocks through a recognition based method.

• A performance efficient technique for recognition of handwritten Telugu characters in document images. The importance of this algorithm lies in a caching technique that improves the efficiency of the system through a cache database and a main database.

• A model for representation of character classes using XML tags.

• A comparative study of the efficacy of various classifiers with Gabor features for classification and recognition of Telugu handwritten characters. The major emphasis of the algorithm is on classifying characters using only certain zones of the character. Recognition accuracy for the entire character is compared with that for the two proposed zone selections.

• A post processing methodology for error detection and correction of OCR text output. The novelty of this algorithm lies in employing Unicode Approximation Models (UAM) for dynamic error detection and correction using a mapper module.
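Two of the features named in the scratched word contribution, the number of connected components and the Euler number, can be computed as in the following minimal sketch. It is an illustration under stated assumptions, not the thesis implementation: the word image is assumed to be binary (1 = ink, 0 = background), ink is grouped with 8-connectivity and background with 4-connectivity, and the function names are hypothetical. The scratch factor itself is not reproduced, since this chapter does not define it.

```python
from collections import deque

EIGHT = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
FOUR = [(-1, 0), (0, -1), (0, 1), (1, 0)]


def count_components(grid, value, neighbours):
    """Count connected regions of cells equal to `value` by breadth-first flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in neighbours:
                    nr, nc = cr + dr, cc + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and not seen[nr][nc] and grid[nr][nc] == value:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return count


def word_features(image):
    """Return (components, holes, euler_number) for a binary word image."""
    cols = len(image[0])
    # Pad with background so that all outer background forms a single region.
    padded = [[0] * (cols + 2)]
    padded += [[0] + list(row) + [0] for row in image]
    padded += [[0] * (cols + 2)]
    components = count_components(padded, 1, EIGHT)
    holes = count_components(padded, 0, FOUR) - 1  # drop the outer background
    return components, holes, components - holes   # Euler number = components - holes


# A closed loop next to an isolated dot: two components, one hole.
sample = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0],
]
print(word_features(sample))
```

Strokes drawn across a word tend to change both the component count and the number of enclosed holes, which is presumably why these features are informative for separating scratched from non-scratched words.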

1.11 Organization of Thesis

The rest of the thesis is organized as follows. Chapter 2 provides an overview of the literature relevant to the research challenges, covering the pre-processing, segmentation of touching/overlapping characters, feature extraction and classification, recognition and post processing stages of OCR separately. Chapter 3, the first contributory chapter, discusses the implementation of a line detection/elimination algorithm for removal of lines with text crossings. Chapter 4 details the subsequent contributions on classification of textual components in pre-printed documents into scratched/non-scratched and printed/handwritten words. Chapter 5 describes the approach for segmentation of touching and overlapping character blocks in handwritten words. Chapter 6 proposes the recognition of Telugu characters using a customized template matching technique and a zone based Gabor feature technique, and explains the testing conducted with various classifiers along with the necessary post processing methodology. Chapter 7 presents the conclusion of the research work along with the scope for further research in this direction. Finally, the research papers published on the contributions and the references to related literature are presented at the end of the thesis. Figure 1.20 depicts the research challenges addressed in this thesis.

Figure 1.20: Research contributions in the thesis

1.12 Conclusion

In summary, this chapter has explained the significance of OCR systems, their applicability in various domains and the facets of OCR systems for recognition of SIL scripts. It has also given an overview of the OCR stages and of the processing of a variety of documents and scripts, to the extent needed to follow this thesis. The characteristics of printed and handwritten script and the complexity of processing SIL scripts have been discussed with special emphasis on Telugu script. The chapter also covers the types of documents processed by OCR and their applications, along with the research challenges and the contributions made during the research work.
