
2014 14th International Conference on Frontiers in Handwriting Recognition

A New Method for Writer Identification based on Histogram Symbolic Representation

Alireza Alaei

Partha Pratim Roy*

Laboratoire d’Informatique (LI EA6300) Université François-Rabelais de Tours, France [email protected]

Advanced Software Group Samsung India - Noida, UP, India [email protected]

* The contribution corresponding to Partha Pratim Roy was initiated during his Postdoc at LI, Tours, France.

2167-6445/14 $31.00 © 2014 IEEE DOI 10.1109/ICFHR.2014.44

Abstract: In this paper, a new model-based writer identification scheme using a histogram symbolic representation approach is proposed. In the proposed scheme, pre-processing techniques are first employed to enhance image quality and extract text-lines from each handwritten document image. For each extracted text-line, a set of 92 features is computed based on analysis of connected components, enclosed regions, lower and upper contours, fractal code, and Curvelet. From the extracted feature vectors, a histogram is created for each feature of every writer as histogram-valued symbolic data. This process results in a handwriting style model for each individual that consists of a set of histograms. To evaluate the proposed scheme, two handwritten datasets written in two different scripts (Kannada, an Indian script, and English) were used. The first dataset contains 228 pages written in Kannada by 57 people. The other is the dataset used in SigWiComp2013, composed of 330 document pages written in English by 55 individuals. The same criteria used in SigWiComp2013 were followed in our evaluation strategy. On the Kannada dataset, an F-measure of 92.79% was obtained when 114 documents were used in the learning stage and the rest (114) were used for testing. On the SigWiComp2013 dataset, an F-measure of 26.67% was obtained, which is fairly comparable to the best result reported in the literature.

Keywords: Writer Identification/Verification; Histogram Symbolic Representation; Similarity Measure; English and Kannada Handwritten Documents.

I. INTRODUCTION

Handwritten text commonly carries important information about the handwriting style, and even the personality, of an individual. There exists a certain degree of stability in the writing style of an individual, which makes it possible to identify the writer of a text when a sample of his/her handwriting has already been seen. This information has often been used for writer identification/verification [2-5]. The task of writer identification/verification is to recognize the writer of a handwritten text/signature or to confirm the identity of an individual based on his/her handwriting. Writer identification/verification has been an attractive research topic for a number of decades [1-3]. Applications of this field of research include biometric recognition, personalized handwriting recognition, automatic forensic document examination, classification of ancient manuscripts and smart meeting rooms [1-3, 8].

Different types of features based on connected component, enclosed region, lower and upper contours [4], fractal code [5], contour-based orientation and curvature [6], textural, edge-direction and edge-hinge [7], run-length [8], global and local information in a sliding window [9], geometrical data [12], and grapheme/stroke extraction [13] have been introduced in the past for writer identification/verification, and remarkable progress has been achieved [2-13]. Most of the methods in the related literature use a single type of feature, or a combination of different features, in conjunction with a nearest neighbor classifier for the identification of individual writers [4-8]. There are, however, some research works on handwriting identification that use machine learning approaches such as Support Vector Machine (SVM), Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), and Logistic Regression (LR) for modeling different handwriting styles based on linear/nonlinear principles [8-12]. The system proposed in [13], however, uses a vector space (a set of features) to represent a handwritten document instead of modeling an individual's handwriting style. Though the literature on writer identification shows some impressive progress in the field, the results reported in SigWiComp2013 indicate that writer identification/verification remains a challenging problem [1], which needs more investigation in terms of incorporating new model-based techniques.

In this paper we exploit the writing style of each individual using symbolic data representation. Symbolic data analysis is well suited to knowledge mining and model creation [14-15]. Furthermore, this concept has been efficiently adapted to many applications such as time series analysis, document analysis and graphical image understanding [16-17]. However, the use of symbolic data for knowledge/model extraction from handwritten text has not been explored in the field of writer identification/verification. In this research work, the handwriting style of each individual is modeled in such a way that only one prototype is introduced for each individual's handwriting style, instead of many raw feature vectors (prototypes) representing different handwritten text-lines or pages written by that individual. This is made possible by the histogram symbolic representation [14-17], which provides such a model-based representation for each individual's handwriting style. The idea of histogram symbolic representation for modeling the feature distribution may, in some respects, seem close to the GMM-based method [10]. However, in a GMM a few outliers can significantly change the mean and standard deviation parameters, which consequently changes the Gaussian model. In the case of histogram symbolic representation, outliers cannot substantially change the histogram distribution, since the mean and standard deviation are not involved in the creation of histogram symbolic data. The proposed model is also text-independent and is not restricted to any specific part of the text content for identifying the handwriting of different individuals. Instead, the proposed system analyzes the handwriting styles of individuals through a model-based scheme for the identification purpose.

To demonstrate the applicability of the proposed model, two datasets in two different scripts (English and Kannada) are considered for evaluating the proposed scheme, and promising results are obtained. The rest of the paper is organized as follows: Section II describes our proposed writer identification approach. Section III discusses the experimental results and comparative analysis. Finally, conclusions and future work are given in Section IV.

II. PROPOSED SCHEME

An overview of our proposed writer identification scheme is depicted in Fig. 1. It contains 4 main steps: a) pre-processing, b) feature extraction, c) creation of handwriting style model, and d) computing similarity values. Details of each step are explained in the subsequent subsections.

A. Pre-processing

Pre-processing is used as a preliminary step in many applications of document image analysis to enhance the image quality or to obtain particular parts of a document, such as text-lines, words or characters, for further processing. In this research work, the text-line segmentation algorithm presented in [18] is first employed to extract the text-lines from a document image. If fewer than 4 text-lines are extracted (Fig. 2), there are two options: either the input image is divided vertically into 2 equal parts and the same text-line segmentation method is applied to both parts to extract half text-lines, or the text-lines extracted from the original image are simply divided into two equal parts. The former solution is employed in this research work, since dividing the original image into stripes helps to obtain better text-line segmentation results in complex handwritten documents [18]. This process is employed to have enough information (text-lines) for the writer identification process. A comprehensive study on the effect of text-line size on writer identification has been reported in [13]. Each extracted text-line is then binarized, and very small connected components (noise and small dots) are filtered out using a threshold based on the average size of connected components in the extracted text-line. Here the threshold is fixed to 10 pixels, based on experimentation.

B. Feature extraction

In the literature, many feature extraction techniques have been used for writer identification/verification [4-13]. Since our objective in this research work is not to introduce a new set of features for writer identification and verification, a number of simple features [4, 5, 19] used in the literature are utilized to characterize the handwriting styles of different individuals. The features introduced in [4, 5, 19] can be extracted at the text-line level and are categorized into six main groups, as shown in Table I. The Connected Component (CC) based features, as the first group, are the average distance between consecutive bounding boxes of the CCs extracted from a text-line, the average gap between words, the average gap within words, the average, median, and standard deviation of the width of connected components, and the average number of foreground-to-background transitions for each connected component of a text-line [4].

Fig. 1. Overview of the proposed writer identification scheme. (Flowchart blocks: extracted features and models of handwritings; histogram symbolic creation and writers' handwriting styles modelling; handwriting models; computing similarity values between test image and the handwriting models; majority voting for page-level writer identification; identified/retrieved handwritten documents.)

Fig. 2. An image from the SigWiComp2013 dataset [1] having 3 text-lines.

TABLE I. TYPES AND NUMBER OF FEATURES UTILIZED IN THIS RESEARCH WORK TO CHARACTERIZE THE HANDWRITING STYLES.

Features extracted based on     Number of features
Connected component             7
Enclosed region                 3
Lower and upper contours        16
Fractal code                    57
Basic information               4
Curvelet                        5

Features based on enclosed regions are the average form factor of the blobs, the average roundness of the blobs, and the average size of the blobs extracted from every text-line [4]. To compute features based on lower and upper contours, the lower and upper contours of a text-line are first extracted using a simple profile projection technique. From both contours a number of features are computed, such as the skew of the lower and upper contours, the mean squared error between the regression lines and the original curves (lower and upper contours), the frequency of the local maxima and local minima on both contours, and the average value of local skews of the contours to the left and right of a local maximum (minimum) in both contours [4]. As a result, 16 features are extracted from the lower and upper contours of the extracted text-line, which mostly characterize the skew direction of different handwritings. For computing fractal features, a disk-shaped dilation kernel and different ellipsoidal dilation kernels are used. For each of these kernels an evolution graph is derived, and the skew of the three straight line segments of the evolution graph is computed [4], giving a total of 57 fractal features for each text-line. Basic features, corresponding to skew, slant, height of the main writing zone, and width of the writing of a text-line, constitute another 4 features [4, 5] used in this research work. The last 5 features are computed based on the Curvelet feature extraction method [19]. The standard deviations of the 5 coefficient matrices obtained from the binary document image are computed and used as texture information, as in the literature. Details of the Curvelet feature extraction technique can be found in [19]. The above-mentioned features are then used in the next step to create a specific handwriting style model for every writer.

C. Creation of handwriting style model

In the literature, most systems use classical data analysis, wherein the basic units under analysis are single objects rather than models [14]. Objects are described by a set of numerical and/or categorical variables called features, each of which takes a single value. The features extracted for different objects are organized in a data array, where each cell (i, j) contains the value of feature j for object i. However, this kind of model is too restricted and cannot take into account the variability and/or uncertainty that are often inherent in the features. To efficiently represent the variability and distribution of the values of a feature within a specific class, interval- and histogram-valued symbolic data/variables have been introduced in the domain of symbolic data analysis [14-17]. An object, or a set of similar objects, is described by histogram data, which is a classical histogram defined by a support composed of many intervals. Each interval is then weighted by an empirical density. The definition of a histogram variable is provided in the following.

Formally, let X be a continuous variable defined on a finite support $D = [\underline{x}, \overline{x}]$, where $\underline{x}$ and $\overline{x}$ are the minimum and maximum values of the variable domain, respectively. The variable X is divided into a set of t adjacent bins (intervals) $\{I_1, \ldots, I_p, \ldots, I_t\}$, where $I_p = [\underline{x}_p, \overline{x}_p)$. Given N observations of the variable X, each interval $I_p$ is associated with a random variable defined as:

$$\psi(I_p) = \sum_{u=1}^{N} \psi_{x_u}(I_p) \qquad (1)$$

$$\psi_{x_u}(I_p) = \begin{cases} 1 & \text{if } x_u \in I_p \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

It is also possible to associate with $I_p$ an empirical distribution $\pi_p = \psi(I_p)/N$, where $0 \le \pi_p \le 1$. A histogram of X is represented by a number of pairs $(I_p, \pi_p)$ for p = 1, …, t, where $I_p$ is a base interval along the horizontal axis and $\pi_p$ is its corresponding frequency probability along the vertical axis [15-17].

In this research work, using the above-mentioned concept, a histogram-valued symbolic data representation is created from the training data to model each feature of every individual's handwriting style. Consequently, each handwriting style is modeled by a number of histogram-valued data computed for all the features in the feature set. To make the creation of a histogram-valued symbolic object clear and to formulate this concept, detailed mathematical descriptions are provided in the following.

Let $S_j = \{s_j^1, s_j^2, \ldots, s_j^m\}$ be a set of m samples from a handwriting class $C_j$. $F_j^i = \langle f_{j1}^i, f_{j2}^i, f_{j3}^i, \ldots, f_{jn}^i \rangle$ represents a feature vector of size n extracted from the ith sample of $S_j$, say $s_j^i$. For every feature $f_{\cdot k}^{\cdot}$, the minimum $\underline{f}_{\cdot k}^{\cdot}$ and maximum $\overline{f}_{\cdot k}^{\cdot}$ values are computed (from the features extracted during training) and taken as the support domain $[\underline{f}_{\cdot k}^{\cdot}, \overline{f}_{\cdot k}^{\cdot}]$ of the kth feature, where k varies between 1 and n. For the kth feature of class j, say $f_{jk}^{\cdot}$, a histogram $H_{jk}$ is computed from the features extracted during the training stage as follows:

$$H_{jk} = \left\{ \langle [\underline{I}_{jk}^{1}, \overline{I}_{jk}^{1}); \pi_{jk}^{1} \rangle, \langle [\underline{I}_{jk}^{2}, \overline{I}_{jk}^{2}); \pi_{jk}^{2} \rangle, \ldots, \langle [\underline{I}_{jk}^{t}, \overline{I}_{jk}^{t}); \pi_{jk}^{t} \rangle \right\} \qquad (3)$$

where $[\underline{I}_{jk}^{p}, \overline{I}_{jk}^{p})$ is the bin limit or base interval and $\pi_{jk}^{p}$ is its corresponding frequency probability. It is worth mentioning that p is the index of a bin in the histogram $H_{jk}$ and varies between 1 and t, the number of bins. Consequently, the histogram symbolic representation (the so-called model-based representation) of class $C_j$ with n features is defined as:

$$SymC_j = \{H_{j1}, H_{j2}, \ldots, H_{jn}\} \qquad (4)$$
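The construction in Eqs. (1)-(4) can be illustrated with a minimal Python sketch. It is not the authors' implementation: equal-width bins, the function names `build_histogram` and `build_models`, the closing of the last bin on the right, and the toy feature values are all assumptions made here for illustration. Only the use of a per-feature support computed from the minimum and maximum over all training features, and the normalization of counts by N, follow the text.

```python
from typing import Dict, List, Tuple

# One bin of a histogram: (lower limit, upper limit, frequency probability pi).
Bin = Tuple[float, float, float]


def build_histogram(values: List[float], lo: float, hi: float, t: int) -> List[Bin]:
    """Histogram of `values` over the support [lo, hi] with t equal-width bins.

    Equal-width bins are an assumption; the paper only fixes the number of bins.
    """
    width = (hi - lo) / t
    counts = [0] * t
    for v in values:
        p = int((v - lo) / width) if width > 0 else 0
        # Assumption: the overall maximum is counted in the last bin.
        p = min(max(p, 0), t - 1)
        counts[p] += 1
    n_obs = len(values)
    # pi_p = psi(I_p) / N, as in Eqs. (1)-(2).
    return [(lo + p * width, lo + (p + 1) * width, counts[p] / n_obs) for p in range(t)]


def build_models(train: Dict[str, List[List[float]]], t: int) -> Dict[str, List[List[Bin]]]:
    """Return SymC_j for every writer j: one histogram per feature (Eqs. (3)-(4))."""
    n_features = len(next(iter(train.values()))[0])
    all_vectors = [vec for vectors in train.values() for vec in vectors]
    # Support domain of feature k: min and max over all training feature vectors.
    lo = [min(vec[k] for vec in all_vectors) for k in range(n_features)]
    hi = [max(vec[k] for vec in all_vectors) for k in range(n_features)]
    return {
        writer: [
            build_histogram([vec[k] for vec in vectors], lo[k], hi[k], t)
            for k in range(n_features)
        ]
        for writer, vectors in train.items()
    }


# Toy data (hypothetical): 2 writers, 2 text-lines each, 2 features per text-line.
train = {
    "w1": [[0.0, 1.0], [1.0, 3.0]],
    "w2": [[4.0, 1.0], [2.0, 2.0]],
}
models = build_models(train, t=4)
```

Each writer ends up with a single model (a list of n histograms), regardless of how many text-lines were available for training, which is the "one prototype per writer" property the text emphasizes.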

Considering q classes in a particular problem, the complete histogram symbolic representation of the problem is shown in Table II. A pictorial view of the proposed histogram symbolic representation of class $C_j$, called $SymC_j$, is shown in Fig. 3. Here it is assumed that only 3 features are extracted for characterizing handwriting styles and 4 bins are considered to build the histograms.


TABLE II. HISTOGRAM REPRESENTATION/MODEL OF A PROBLEM WITH q CLASSES AND n FEATURES BASED ON THE PROPOSED SYMBOLIC APPROACH.

Class \ Feature    hf1     hf2     …    hfk     …    hfn
SymC_1             H_11    H_12    …    H_1k    …    H_1n
SymC_2             H_21    H_22    …    H_2k    …    H_2n
SymC_j             H_j1    H_j2    …    H_jk    …    H_jn
SymC_q             H_q1    H_q2    …    H_qk    …    H_qn

Fig. 3. A pictorial illustration of the proposed histogram symbolic representation/model for a particular class $C_j$. (Panels: histograms H_j1, H_j2 and H_j3, which together form SymC_j.)

D. Computing similarity values

In the literature many distance measures have been proposed to compute the similarity/dissimilarity between two objects or two sets of features [15-17]. Since in this research work the representation model for each handwriting class is based on histogram-valued data, whereas the features extracted from each test sample (text-line) are numerical values, a specific measure is proposed to compute the similarity between the class models and a test/query sample. The similarity $Sim(F_T, SymC_j)$ between a test sample (text-line) T and the histogram symbolic reference $SymC_j$ of class j is computed as follows:

$$Sim(F_T, SymC_j) = \left( \sum_{l=1}^{n} \sum_{p=1}^{t} \rho_{jl}^{p} \right) / n \qquad (5)$$

$$\rho_{jl}^{p} = \begin{cases} \pi_{jl}^{p} & \text{if } \underline{I}_{jl}^{p} \le f_{tl} < \overline{I}_{jl}^{p} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

where $F_T$ is the set of features extracted from the test sample T (with $f_{tl}$ its lth feature), n is the number of features in the feature set, and t is the number of bins in the histogram-valued representation of every feature of the handwriting style models. The similarity value between a test sample and a histogram symbolic model always lies between 0 and 1. Since q histogram-valued symbolic models are defined, q similarity values are obtained from the test/query sample and the q class models. The label of the handwriting model having the maximum similarity value indicates the writer of the test sample. For retrieval, the sample(s) belonging to the model class with the highest similarity value are considered the most likely match(es) to the query sample and are retrieved. Since a ranked list of results based on the similarity values is readily available, the second, third, ... candidate solutions for writer identification/retrieval are also provided. Consequently, to obtain a writer identification result for a handwritten document page, the results obtained for all the text-lines extracted from that page are combined and a simple majority voting is applied to determine the writer of the page. Ties are broken based on the similarities obtained by the proposed similarity measure.

III. EXPERIMENTAL RESULTS AND COMPARISON

A. Datasets and metrics of evaluation

To evaluate the performance of the proposed scheme for writer identification, two different handwritten datasets are used in this research work. The first is the Kannada dataset, which contains 228 handwritten documents written by 57 native speakers of Kannada. Each individual has written 4 different documents [20]. The second is the SigWiComp2013 dataset [1] used for writer identification. It is composed of 330 handwritten documents written by 55 individuals. Each person has written 6 different documents [1]. Some statistics about both datasets are tabulated in Table III. For evaluation of the proposed writer identification scheme, the F-Measure (FM) is computed as the performance metric. To give a clear picture of the performance of the proposed histogram symbolic based approach, the top 1, 2, 3, 5, 10, and 15 results are provided.

TABLE III. SOME STATISTICS OF THE KANNADA DATASET [20] AND THE ONE USED IN SIGWICOMP2013 [1].

Dataset            Document images    Text-lines    Training documents    Testing documents    Writers
Kannada dataset    228                4860          114                   114                  57
SigWiComp2013      330                2317          165                   165                  55

B. Results and discussion

Concerning the Kannada dataset [20], the proposed system was trained with 114 (57×2) training document images (2 pages from each writer) and the remaining 114 documents were used for testing (2 other documents from each writer). The results obtained with the proposed histogram symbolic representation approach for writer identification/retrieval are shown in Tables IV and VI. The FM results are provided at both line- and page-levels for Top 1, Top 2, etc. A pictorial illustration of the results obtained on the Kannada dataset is also provided in Fig. 4. Experimental results for line- and page-level handwriting identification on the SigWiComp2013 dataset [1] are shown in Tables V and VI. The graphical representation of the results is plotted in Fig. 5.
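As a rough illustration of how the line-level similarity of Eqs. (5) and (6) and the page-level majority voting fit together, the following Python sketch may help. It is not the authors' code: the function names, the representation of a model as a list of `(lower, upper, pi)` bins per feature, and the toy models are assumptions. The paper states only that ties are broken "based on the similarities"; summing the line-level similarities per writer, as done here, is one possible reading and is labeled as such in the code.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Bin = Tuple[float, float, float]   # (lower limit, upper limit, pi)
Model = List[List[Bin]]            # one histogram (list of bins) per feature


def similarity(features: List[float], model: Model) -> float:
    """Eqs. (5)-(6): mean over features of pi for the bin containing the feature."""
    total = 0.0
    for f, histogram in zip(features, model):
        for lower, upper, pi in histogram:
            # Half-open test as written in Eq. (6); a value outside every bin adds 0.
            if lower <= f < upper:
                total += pi
                break  # bins do not overlap, so at most one can match
    return total / len(model)


def rank_writers(features: List[float], models: Dict[str, Model]) -> List[str]:
    """Writers sorted by decreasing similarity (Top-1, Top-2, ... candidates)."""
    return sorted(models, key=lambda w: similarity(features, models[w]), reverse=True)


def identify_page(lines: List[List[float]], models: Dict[str, Model]) -> str:
    """Majority vote over the text-lines of one page."""
    votes: Counter = Counter()
    sim_sum: Dict[str, float] = defaultdict(float)
    for features in lines:
        sims = {w: similarity(features, m) for w, m in models.items()}
        votes[max(sims, key=sims.get)] += 1
        for w, s in sims.items():
            sim_sum[w] += s
    # Assumption: ties in the vote count are broken by the summed similarity.
    return max(models, key=lambda w: (votes[w], sim_sum[w]))


# Hypothetical models for two writers, 2 features, 2 bins per feature.
models = {
    "A": [[(0.0, 1.0, 0.8), (1.0, 2.0, 0.2)], [(0.0, 1.0, 0.5), (1.0, 2.0, 0.5)]],
    "B": [[(0.0, 1.0, 0.1), (1.0, 2.0, 0.9)], [(0.0, 1.0, 1.0), (1.0, 2.0, 0.0)]],
}
page = [[0.5, 1.5], [0.5, 1.5], [1.5, 0.5]]  # three text-lines of one test page
writer = identify_page(page, models)
```

Because each feature contributes at most one bin frequency and the sum is divided by n, the value returned by `similarity` stays in [0, 1], in line with the statement in the text.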

TABLE IV. THE RESULTS OBTAINED BASED ON THE PROPOSED HISTOGRAM SYMBOLIC REPRESENTATION MODEL USING THE KANNADA DATASET [20].

Result level    Top 1 FM %    Top 2 FM %    Top 3 FM %    Top 5 FM %    Top 10 FM %    Top 15 FM %
Line-level      67.22         77.72         82.42         88.26         93.04          95.86
Page-level      92.79         93.69         95.50         96.40         99.10          99.10

TABLE V. THE RESULTS OBTAINED FROM THE PROPOSED HISTOGRAM SYMBOLIC REPRESENTATION MODEL ON THE SIGWICOMP2013 DATASET.

Result level    Top 1 FM %    Top 2 FM %    Top 3 FM %    Top 5 FM %    Top 10 FM %    Top 15 FM %
Line-level      16.05         23.72         30.93         38.50         55.09          65.92
Page-level      26.67         31.52         36.97         43.64         48.48          52.73

TABLE VI. THE RESULTS OBTAINED FROM THE PROPOSED APPROACH USING KANNADA AND SIGWICOMP2013 DATASETS.

FM for Top 1 results
Dataset                  Text-line level    Page level    Improvement
Kannada dataset          67.22%             92.79%        25.57%
SigWiComp2013 dataset    16.05%             26.67%        10.11%

From the experimental results on the Kannada dataset [20] shown in Table IV, it is noted that, using text-lines extracted from Kannada handwritten documents, a correct identification/retrieval of 67.22% was obtained at the line-level. The accuracy increased by more than 25% at the page-level, where the identification results obtained for all the text-lines extracted from a handwritten document page are combined by majority voting to obtain the final writer identification result. The FM for writer identification of Kannada handwritten documents is 92.79%, which is fairly good given the limited number of documents used for training (only 2 documents per individual) and the small number of features used for characterizing handwriting styles.

Concerning the identification results obtained for the SigWiComp2013 dataset shown in Table V, writer identification of 26.67% was achieved at the page-level, which is more than 10% better than the result obtained at the line-level. The level of improvement at the page-level on the SigWiComp2013 dataset is not the same as on the Kannada dataset. This is because the content and the number of text-lines in each document of the SigWiComp2013 dataset are considerably smaller than those of the Kannada handwritten documents, as shown in Table III. The effect of this small number of text-lines per document is apparent in Table VI, where the improvement at the page-level compared to the line-level is more than 25% for the Kannada dataset [20] but only about 10% for the SigWiComp2013 dataset.

In the proposed writer identification scheme, the number of bins is the only parameter of the histogram-valued symbolic creation that must be fixed. This parameter was tuned experimentally using the results obtained on the training data. To obtain the optimum performance, an iterative process was performed on the training data. In each iteration a histogram-valued symbolic representation with a different number of bins was created and writer identification results were obtained for that number of bins. These results were then traced from the smallest number of bins to the largest in order to find 3 consecutive increasing results of writer identification. The optimum number of bins was chosen as the last value in the 3 consecutive increasing results seen during the tracing process. The writer identification results obtained for both the training and testing parts of the Kannada dataset [20] using different numbers of bins for histogram creation are shown in Figs. 6 and 7. Since the first 3 consecutive increasing results on the training data were obtained when the number of bins was 19, the identification results with the same parameter were reported on the test data. The same procedure was employed on the SigWiComp2013 dataset [1], and the results are depicted in Figs. 8 and 9.
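The bin-selection rule described above can be sketched as follows. This is one possible reading of the procedure, not the authors' code: "3 consecutive increasing results" is interpreted here as three successive bin settings whose training scores are strictly increasing, and the function name `select_num_bins` and the training scores are hypothetical values invented purely for illustration.

```python
from typing import Dict, Optional


def select_num_bins(train_scores: Dict[int, float]) -> Optional[int]:
    """Return the number of bins at the end of the first run of three
    strictly increasing training results, scanning from few bins to many.

    `train_scores` maps a number of bins to the identification result
    obtained on the training data with that many bins.
    Returns None when no such run exists.
    """
    bins = sorted(train_scores)
    for i in range(2, len(bins)):
        first = train_scores[bins[i - 2]]
        second = train_scores[bins[i - 1]]
        third = train_scores[bins[i]]
        if first < second < third:
            return bins[i]  # last value of the increasing triple
    return None


# Hypothetical training results (number of bins -> score); not from the paper.
scores = {2: 50.0, 3: 49.0, 4: 52.0, 5: 51.0, 6: 53.0, 7: 55.0, 8: 54.0}
chosen = select_num_bins(scores)
```

Selecting the parameter on the training data only, as the text describes, keeps the test documents untouched until the final evaluation.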

Fig.4. The pictorial illustration of the writer identification results obtained from the Kannada dataset [20].

Fig.5. The graphical representation of the writer identification results obtained from the SigWiComp2013 dataset [1].

Fig.6. Writer identification results obtained using different number of bins for histogram creation on the training Kannada dataset [20].

Fig.7. Writer identification results obtained using different number of bins for histogram creation on the test Kannada dataset [20].

Fig.8. Writer identification results obtained using different number of bins for histogram creation on the training SigWiComp2013 dataset [1].


Fig. 9. Writer identification results obtained using different number of bins for histogram creation on the test SigWiComp2013 dataset [1].

C. Comparative analysis

To assess the performance of the proposed writer identification scheme, the results reported in SigWiComp2013 [1] are compared to the result obtained by the proposed system. The best result obtained by each group that participated in SigWiComp2013 [1] is included in Table VII. From Table VII, it is evident that the result obtained by the proposed writer identification scheme is better than the results reported in the competition [1]. The system proposed in [8], however, provided a better result. The reason is that different types of features with a very high dimension (2904) were incorporated in [8] for writer identification, and these features probably provided more information for discriminating between different handwriting styles. In our system, in contrast, only 92 features were used, far fewer than the 2904 features used in [8]. It is also worth mentioning that the main reasons for the lower identification results on the SigWiComp2013 dataset [1] compared to the Kannada dataset [20] are the smaller number of text-lines and the greater diversity of writing styles within the handwriting of an individual in most SigWiComp2013 documents compared to the Kannada documents.

TABLE VII. COMPARISON OF DIFFERENT FEATURES AND RESULTS

Method                       Types of features                                                              FM %
System 21 reported in [1]    Local binary patterns (LBP), Histogram of oriented gradients (HOG) features    19.39
Hassane et al. [12]          Geometrical features                                                           21.81
Djeddi et al. [8]            Run length, Edge-hinge, Directional                                            28.48
The proposed scheme          Structural and Texture features                                                26.67

IV. CONCLUSIONS

A new scheme for writer identification and retrieval of handwritten documents is proposed in this research work. The proposed scheme is a model-based approach that uses the histogram-valued symbolic representation for modeling handwriting styles. The proposed model was evaluated using two datasets of handwritten documents written in two different scripts. The results obtained on these datasets reveal that the method can be used in different contexts, providing fairly good results compared to state-of-the-art methods. Our future work will investigate, and aim to better understand, "the way that human experts learn from handwriting", so that it can be incorporated for better modeling of handwriting styles.

ACKNOWLEDGMENT

The authors would like to thank the organizers of SigWiComp2013 for providing the dataset.

REFERENCES

[1] M. I. Malik, M. Liwicki, L. Alewijnse, W. Ohyama, M. Blumenstein, B. Found, "ICDAR 2013 Competitions on Signature Verification and Writer Identification for On- and Offline Skilled Forgeries (SigWiComp 2013)," In Proc. of ICDAR, pp. 1477-1483, 2013.
[2] R. Plamondon, S. N. Srihari, "On-line and off-line handwriting recognition: a comprehensive survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), pp. 63-84, 2000.
[3] D. Impedovo, G. Pirlo, "Automatic signature verification: The state of the art," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(5), pp. 609-635, 2008.
[4] C. Hertel, H. Bunke, "A Set of Novel Features for Writer Identification," Audio- and Video-Based Biometric Person Authentication, LNCS, 2688, pp. 679-687, 2003.
[5] S.-H. Cha, S. Srihari, "Multiple feature integration for writer verification," In Proc. of 7th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), pp. 333-342, 2000.
[6] I. Siddiqi, N. Vincent, "Text independent writer recognition using redundant writing patterns with contour-based orientation and curvature features," Pattern Recognition, 43(11), pp. 3853-3865, 2010.
[7] M. Bulacu, L. Schomaker, "Text-independent writer identification and verification using textural and allographic features," IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4), pp. 701-717, 2007.
[8] C. Djeddi, I. Siddiqi, L. Souici-Meslati, A. Ennaji, "Text independent writer recognition using multi-script handwritten texts," Pattern Recognition Letters, 34, pp. 1196-1202, 2013.
[9] A. Schlapbach, H. Bunke, "Using HMM Based Recognizers for Writer Identification and Verification," In Proc. of 9th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), pp. 167-172, 2004.
[10] A. Schlapbach, H. Bunke, "Off-line writer identification using Gaussian mixture models," In Proc. of ICPR, pp. 992-995, 2006.
[11] M. Liwicki, A. Schlapbach, H. Bunke, S. Bengio, J. Mariéthoz, J. Richiardi, "Writer Identification for Smart Meeting Room Systems," In Proc. of the Document Analysis Systems, pp. 186-195, 2006.
[12] A. Hassane, S. Al-Maadeed, A. Bouridane, "A set of geometrical features for writer identification," Neural Information Processing, Springer Berlin Heidelberg, 2012.
[13] A. Bensefia, T. Paquet, L. Heutte, "A writer identification and verification system," Pattern Recognition Letters, 26(13), pp. 2080-2092, 2005.
[14] H. H. Bock, E. Diday, "Analysis of Symbolic Data: Exploratory methods for extracting statistical information from complex data," Studies in Classification, Data Analysis and Knowledge Organisation, Springer-Verlag, 2000.
[15] L. Billard, E. Diday, "Symbolic Data Analysis: Definitions and Examples," Technical Report, 62 pages, 2003.
[16] F. A. T. DeCarvalho, "Histograms in Symbolic Data Analysis," Annals of Operations Research, 55, pp. 299-322, 1995.
[17] P. Brito, M. Chavent, "Divisive Monothetic Clustering for Interval and Histogram-Valued Data," In Proc. of International Conference on Pattern Recognition Applications and Methods, pp. 229-234, 2012.
[18] A. Alaei, P. Nagabhushan, U. Pal, "Piece-wise Painting Technique for Line Segmentation of Unconstrained Handwritten Text: A Specific Study with Persian Text Documents," Pattern Analysis and Applications, 14(4), pp. 381-394, 2011.
[19] J. Starck, E. J. Candès, D. L. Donoho, "The Curvelet Transform for Image Denoising," IEEE Transactions on Image Processing, 11(6), pp. 670-684, 2002.
[20] A. Alaei, U. Pal, P. Nagabhushan, "Dataset and Ground Truth for Handwritten Text in Four Different Scripts," IJPRAI, 26(4), 2012.