
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 14, NO. 4, DECEMBER 2013

An Online Self-Learning Algorithm for License Plate Matching

Francisco Moraes Oliveira-Neto, Lee D. Han, and Myong Kee Jeong, Senior Member, IEEE

Abstract—License plate recognition (LPR) technology is a mature yet imperfect technology used for automated toll collection and speed enforcement. The portion of license plates that can be correctly recognized and matched at two separate stations is typically 35% or less. Existing methods for improving the matching of plates recognized by LPR units rely on intensive manual data reduction, in which misread plates are manually entered into the system. Recently, an advanced matching technique that combines Bayesian probability with Levenshtein text-mining techniques was proposed to increase the accuracy of automated license plate matching. The key component of this method is what we call the association matrix, which contains the conditional probabilities of observing one character at one station given the character observed at another station. However, estimating the association matrix relies on manually extracted ground truth for a large number of plates, which is a cumbersome and tedious process. To overcome this drawback, we propose in this study a novel self-learning algorithm that eliminates the need to extract ground truth manually. The automatically learned association matrices are found to match plates as correctly as those generated by the painstaking manual method, and they outperform their manual counterparts in reducing false matching rates. The automatic self-learning method is also cheaper and easier to implement and continues to improve and correct itself over time.

Index Terms—Edit distance (ED), license plate recognition (LPR), text mining, vehicle tracking.

Manuscript received September 28, 2012; revised March 23, 2013; accepted June 3, 2013. Date of publication August 1, 2013; date of current version November 26, 2013. This work was supported in part by the National Transportation Research Center, Inc. through the U.S. Department of Transportation (USDOT) Research and Innovative Technology Administration under Grant DTRT-06-0043-U09 and in part by the USDOT Federal Highway Administration through the Dwight David Eisenhower Graduate Scholarship Program under Grant DDEGRD-09-X-00407. The Associate Editor for this paper was Prof. S. Sun. F. M. Oliveira-Neto is with the Department of Civil and Environmental Engineering, The University of Tennessee, Knoxville, TN 37996 USA, and also with the Center for Transportation Analysis, Oak Ridge National Laboratory, Oak Ridge, TN 37831 USA. L. D. Han is with the Department of Civil and Environmental Engineering, The University of Tennessee, Knoxville, TN 37996 USA, and also with the School of Traffic and Transportation Engineering, Changsha University of Science and Technology, Changsha 410004, China. M. K. Jeong is with the Department of Industrial and Systems Engineering and the Rutgers Center for Operations Research (RUTCOR), Rutgers University, Piscataway, NJ 08854 USA. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITS.2013.2270107

I. INTRODUCTION

Fig. 1. Sample LPR character-reading error rates per plate for two units installed at two different locations.

LICENSE plate recognition (LPR) technology is a mature yet imperfect technology used for automated toll collection and speed enforcement. Because of LPR's limited accuracy, which is around or below 60% depending on the model, the installation, the variation of license plates in the traffic stream, and other factors, the portion of license plates that is correctly recognized and matched at two separate stations is typically 35% or less. This number deteriorates further if one tries to match the same plate through more than two sequential stations. Existing techniques for improving plate-matching accuracy consist of intensive manual data reduction and posterior training [1], [2]. By studying the characteristics of the errors made by LPR hardware, we found that, while a significant portion of plates were recognized incorrectly, many of these misread plates had only one, two, or three misread characters out of a string of six or so, as shown in Fig. 1. In other words, the correct recognition rate of individual characters is much higher than the correct recognition rate of entire plates. This is a simple yet powerful fact left unexplored by hardware manufacturers and by researchers in LPR technology and video image processing in general. Our idea takes advantage of this fact and evaluates the likelihood that two seemingly different license plate strings (sequences of alphanumeric characters) produced at two LPR stations are actually a match. It should be pointed out that license plate matching is far more challenging than a traditional template-matching problem because license plate strings typically 1) do not have a readily available dictionary to compare to, 2) do not have a context to help "guess" the meaning of the plate, and 3) include both alphabetical and numerical characters under many possible syntaxes.

In an automated license plate matching process, we do not know whether each plate string is recognized correctly at all. Nevertheless, we have to try to discern whether two strings, both of which could be incorrectly recognized, are a match. Even when two strings are identical, there is no guarantee that the license plate was correctly recognized at both stations. For example, the plate "SAMPLE," which was used in the testing process in Section V, was recognized incorrectly at both stations as "SAWPLE," yielding a perfect match nevertheless. These are some of the complications.

To tackle this problem, Oliveira-Neto et al. (2009) used the traditional Levenshtein edit distance (ED) technique [3], a text-mining method, to improve the matching technology [4], [5]. They subsequently applied a generalized ED (GED) technique combined with weight schemes based on statistical data [5]. Although this latter technique improved the matching rate, it relies heavily on the manually extracted ground truth of a large number of plates, typically in the thousands, and on the resultant truth matrices, which provide the probability that a character, e.g., "A," reported by the LPR machine is actually "A" or perhaps "4" or "6" or any other character.

Fig. 2. Sample truth matrix extracted from I-640 field data.

Fig. 2 shows an example of the sample truth matrix calculated from the I-640 field data in our case study in Section V. The element (i, j) of this matrix is the probability, in percent, that a character i reported by the LPR machine is actually the character j. That is, it represents the odds of an LPR machine reading a character correctly (diagonal elements) or misreading it as another one (off-diagonal elements). The rows of this matrix correspond to character readings, and the columns represent the truth characters. For example, the element in the 13th row and 8th column is the probability that the character "8" reported by an LPR machine is actually "B." These truth matrices can be different for each station and can change over time for a variety of reasons. These matrices are


expensive and time consuming to obtain and are, thus, highly valued by LPR manufacturers and users. Incorporating these truth matrices for different stations and using Bayesian probabilities, one can then derive an association matrix for every pair of stations, representing the likelihood that a character, e.g., "A," reported by the LPR machine at one station is reported as "A" or perhaps "4" or "6" or any other character by the LPR machine at another station. The association matrices, which look similar to the truth matrices but carry no inherent ground-truth information, are essential to plate matching, particularly when plates are read incorrectly.

In this paper, we propose a novel self-learning algorithm that can generate these important association matrices without the need to extract ground truth manually. These automatically learned association matrices are found to perform as well in plate-matching correctness as those generated by the painstaking manual method. Furthermore, they outperform their manual counterparts in reducing false matching rates. The automatic self-learning method is cheaper and easier to implement and continues to improve and correct itself over time.

II. REVIEW OF TEXT-MINING TECHNIQUES

A. Notation

Let Σ be a finite alphabet from A to Z, and let N be the numeric set formed by the symbols of the natural numbers from 0 to 9. Ω = Σ ∪ N is referred to as the alphanumeric set. λ is the null symbol, i.e., λ ∉ Ω. The set Ω∗ = Ω ∪ {λ} is the appended alphanumeric set. The symbol |·| denotes the number of characters in a string or the number of elements in a given list or set. For the English alphabet, |Ω∗| = |Σ| + |N| + 1 = 37.

A string X of the form X = x1, ..., xL, where each xi ∈ Ω, is said to be of length |X| = L. Its prefix of length i is written Xi = x1, ..., xi. The notation Xi is also used to denote the ith element of a given set of strings, e.g., G = {Xi; i = 1, ..., m}, where m is the number of strings in G, i.e., |G| = m. We denote by Xi...j the substring of X including the symbols from xi to xj, i ≤ j, j ≤ L. The length of such a substring is |Xi...j| = j − i + 1. If i > j, Xi...j is the null string λ, where |λ| = 0. Uppercase symbols represent strings with length greater than one, and lowercase symbols represent elements of the set Ω∗ under consideration.

B. String-to-String Comparison

In this paper, we deal with the problem of matching plate readings from a dual-LPR setup. In this context, two LPR units are located at two locations, referred to as stations g and h, to recognize the sequences of characters on the license plates of moving vehicles. Station h is located downstream of station g. Therefore, for any given plate read at station h, there are a number of candidate plates already read at station g for matching purposes. For example, let X = "ABC123" and Y = "480SI2B" be two sequences of characters read at stations g and h, respectively. The problem is to discern whether they come from the same true sequence of characters.


Fig. 3. Trace diagram.

Since there is no certainty about the veracity of these two outcomes, there is a multitude of strings from which they might have originated, and it can be undesirable to enumerate all the true possibilities for both X and Y. Instead, we approach this problem by searching for the most likely alignment between the two outcomes X and Y. Therefore, methods to compare the similarity between two sequences of characters are deployed. We briefly describe the concepts of trace and editing path, which are essential in the development of the proposed self-learning method. For more details, see the work of Marzal and Vidal (1993) [6].

One among thousands of possibilities to compare X and Y is shown in Fig. 3 [6]. This diagram represents a trace TX,Y from X to Y. TX,Y, or simply T, is a sequence of pairs of integers (i, j) satisfying the following constraints:
1) 1 ≤ i ≤ |X|; 1 ≤ j ≤ |Y|.
2) For every two distinct pairs (i, j) and (i′, j′) in TX,Y,
   a) i ≠ i′ and j ≠ j′;
   b) i < i′ if and only if j < j′.
Conditions 1 and 2 guarantee that each character position of either X or Y is joined by at most one line, that no two lines cross, and that the size of the set T does not exceed the lesser of the lengths of X and Y. As stated in [6], we can assign a cost to trace T by the following equation:

W(T) = Σ_{(i,j)∈T} γ(xi → yj) + Σ_{i∈I} γ(xi → λ) + Σ_{j∈J} γ(λ → yj)    (1)

where I and J are the sets of positions in X and Y, respectively, that are not touched by any line in T, and γ is an elementary weight function relating the corresponding pair of characters.

If the triangle inequality holds for the elementary weight function, i.e., γ(a → b) + γ(b → c) ≥ γ(a → c), where a, b, and c are strings of length 1, an important result of [7], stated in (2), relates a trace to a common metric for comparing the similarity between two strings called the ED. Given two strings X and Y, the ED δ(X, Y) calculates the least number of fundamental operations required to transform X into Y [8]. There are three types of elementary operations, termed substitutions, deletions, and insertions, which take the forms xi → yj, xi → λ, and λ → yj, respectively. We denote the minimum-weight trace from X to Y as Tδ(X, Y), or simply Tδ. Thus

δ(X, Y) = min{W(T) | T is a trace between X and Y}.    (2)

The original assignment for the cost functions, as proposed by Turner [1], was to set γ(xi → yj) = 0 if xi = yj, and γ(xi → yj) = 1 otherwise (xi and yj cannot both be λ). Wagner and Fischer [7] devised an efficient recursive method for calculating the ED, which relies on the result stated in (2).
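To make (2) concrete, the following is a minimal sketch of the Wagner-Fischer dynamic program with a backtracking pass (anticipating Oommen's algorithm [9], discussed after Fig. 4) that also recovers the minimum-weight trace Tδ and the untouched-position sets Iδ and Jδ used later in Section IV. The function name and the use of None for the null symbol λ are our own conventions, not the paper's.

```python
def edit_distance_with_trace(X, Y, gamma=None):
    """Compute the ED of (2) by the Wagner-Fischer recursion and backtrack
    to recover the minimum-weight trace.

    Returns (distance, trace, I, J): `trace` holds the substitution pairs
    (i, j); I and J hold the 1-indexed positions of X and Y left untouched
    by the trace (deletions and insertions, respectively)."""
    if gamma is None:
        # Classical Levenshtein assignment: 0 if the symbols agree, else 1.
        gamma = lambda a, b: 0 if a == b else 1
    m, n = len(X), len(Y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gamma(X[i - 1], None)        # x_i -> lambda
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gamma(None, Y[j - 1])        # lambda -> y_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + gamma(X[i - 1], Y[j - 1]),  # substitution
                D[i - 1][j] + gamma(X[i - 1], None),          # deletion
                D[i][j - 1] + gamma(None, Y[j - 1]),          # insertion
            )
    trace, I, J = set(), set(), set()
    i, j = m, n
    while i > 0 or j > 0:   # backtrack from (|X|, |Y|) to (0, 0)
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + gamma(X[i - 1], Y[j - 1]):
            trace.add((i, j)); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gamma(X[i - 1], None):
            I.add(i); i -= 1
        else:
            J.add(j); j -= 1
    return D[m][n], trace, I, J
```

With the default unit costs, edit_distance_with_trace("ABC123", "480SI2B") returns a distance of 6 for the running example above.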

Fig. 4. Editing path between the two strings in Fig. 3.

In the calculation of δ(X, Y), it is also possible to find the minimum-weight trace, i.e., Tδ(X, Y), as demonstrated by Oommen [9], who devised a backtracking algorithm to find the trace corresponding to the ED between two strings. In addition to Tδ, the corresponding sets Iδ and Jδ, representing deletions and insertions, are also determined.

Trace weights can be specified more compactly through an alternative construct, the editing path. As defined in [6], an editing path between X and Y, i.e., PX,Y or simply P, is a sequence of pairs of integers (ik, jk), 0 ≤ k ≤ n, satisfying the following:
a) 0 ≤ ik ≤ |X|; 0 ≤ jk ≤ |Y|, where (i0, j0) = (0, 0) and (in, jn) = (|X|, |Y|);
b) 0 ≤ ik − ik−1 ≤ 1; 0 ≤ jk − jk−1 ≤ 1, ∀k ≥ 1.
Fig. 4 shows an example of the editing path associated with the trace presented in Fig. 3 [6]. Every pair of successive points in a path corresponds to an elementary editing operation. Diagonal path segments correspond to substitutions, whereas horizontal and vertical path segments represent insertions and deletions, respectively. Thus, we can associate weights to paths as follows:

W(P) = Σ_{k=1}^{n} γ(X_{ik−1+1...ik} → Y_{jk−1+1...jk})    (3)

where P = (i0, j0), ..., (ik, jk), ..., (in, jn). As can be seen, a trace is directly related to an editing path. Hence, the path with minimum weight also gives the ED between X and Y:

δ(X, Y) = min{W(P) | P is an editing path between X and Y}.    (4)
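As a companion sketch (same caveats as before: the name and the None-for-λ convention are ours), the weight of a given editing path under (3) can be accumulated segment by segment, reading each elementary operation off the step direction:

```python
def path_weight(P, X, Y, gamma):
    """Weight W(P) of an editing path P = [(0, 0), ..., (|X|, |Y|)] per (3),
    restricted to elementary operations (segments of length one)."""
    w = 0.0
    for (i0, j0), (i1, j1) in zip(P, P[1:]):
        if i1 == i0 + 1 and j1 == j0 + 1:
            w += gamma(X[i1 - 1], Y[j1 - 1])   # diagonal: substitution x_i -> y_j
        elif i1 == i0 + 1:
            w += gamma(X[i1 - 1], None)        # vertical: deletion x_i -> lambda
        else:
            w += gamma(None, Y[j1 - 1])        # horizontal: insertion lambda -> y_j
    return w
```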

The elementary weight function can also depend on the particular characters involved. The ED with a symbol-based cost function is called the GED [10]. The symbol-based function is of practical importance for real-world applications, where some edit operations are more likely to happen than others, such as the situation under study.

It is worth noting that other extensions of the ED measure have been proposed in addition to symbol-based weight assignments. The idea of constraining the ED by the number and types of edit operations included in the optimal edit transformation can be found in [9]. Attempts to consider the length of the


patterns in question were formalized by Oliveira-Neto et al. [4]. New editing operations (i.e., merge, split, and pair substitution) were introduced in [11]. Most recently, a measure considering the local interactions among adjacent subpatterns was proposed by Wei [12]. We explore the symbol-based approach and propose a new weight function for comparing the similarity between a pair of characters.

Stochastic models have also been proposed for string comparison. A memoryless stochastic transduction, which defines the way a sequence of edit operations can occur in the problem analyzed, was proposed by Ristad and Yianilos [13]. It generates a probability function that assigns a probability value to the many possible ways of associating two string values. A similar idea was proposed by Bilenko and Mooney [14], who designed a stochastic transduction that is capable of modeling gaps between characters and is applicable to long strings. In both references, the weights for symbol-based edit operations are modeled according to joint probability distributions, i.e., p(a, b), rather than conditional probabilities. In addition, they propose to calculate the distance between a pair of strings as the probability of all possible ways of generating the two strings simultaneously. As stated by Ristad and Yianilos [13], this can be useful if a given pair of strings has many likely generation paths.

III. ASSOCIATION MATRIX

A. Definition

The association matrix C between two LPR stations (g, h) is a square matrix of size |Ω∗| × |Ω∗| whose elements are the conditional probabilities p(b|a) of observing a character reading b, b ∈ Ω∗, at station h given a character reading a, a ∈ Ω∗, at station g [5]. By convention, each row of C refers to a given character reading at station g, and each of its columns is associated with a reading at station h. This matrix is the key component in the calculation of the following weight function:

γ(a → b) = log(1/p(b|a)).    (5)

With this weight function, the cost of the trace T in (1) can be calculated as

W(T) = Σ_{(i,j)∈T} log(1/p(yj|xi)) + Σ_{i∈I} log(1/p(λ|xi)) + Σ_{j∈J} log(1/p(yj|λ)).    (6)
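A minimal sketch of the weight function in (5), assuming the association matrix is stored as a row-stochastic nested list indexed by a fixed character ordering. The "<lam>" token for λ, the ordering itself, and the probability floor (keeping log(1/p) finite for pairs never observed together) are our assumptions, not the paper's.

```python
import math

# Assumed ordering of Omega*: the null symbol lambda first, then 0-9 and
# A-Z (|Omega*| = 37, as in Section II-A).
ALPHABET = ["<lam>"] + list("0123456789") + [chr(c) for c in range(ord("A"), ord("Z") + 1)]
INDEX = {s: k for k, s in enumerate(ALPHABET)}

def make_gamma(C, floor=1e-6):
    """Turn an association matrix C (rows: readings at station g, columns:
    readings at station h) into the symbol-based weight of (5),
    gamma(a -> b) = log(1 / p(b|a))."""
    def gamma(a, b):
        # None stands for the null symbol lambda, as in the ED sketch above.
        p = C[INDEX[a if a is not None else "<lam>"]][INDEX[b if b is not None else "<lam>"]]
        return math.log(1.0 / max(p, floor))
    return gamma
```

Passing make_gamma(C) as the gamma argument of edit_distance_with_trace turns the plain ED into the GED whose trace cost is (6).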

B. Estimation

Oliveira-Neto et al. [5] proposed to estimate the elements of C by the following Bayesian expression:

p(b|a) = Σ_t p(b|t) · p(t|a),  t ∈ Ω∗    (7)

where t is the actual character, or the "truth" value, for both a and b.

In matrix form, (7) can be written as

C = Cg · Ch.    (8)
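In code, (8) is a single matrix product; a sketch with NumPy, where the row/column conventions follow the paragraph below (the function name and assertions are ours):

```python
import numpy as np

def association_from_truth(Cg: np.ndarray, Ch: np.ndarray) -> np.ndarray:
    """Combine the two truth (confusion) matrices into the association
    matrix via (8): C = Cg . Ch, i.e., (7) summed over all truth
    characters t."""
    assert Cg.shape == Ch.shape and Cg.shape[0] == Cg.shape[1]
    C = Cg @ Ch
    # The product of row-stochastic matrices is row-stochastic.
    assert np.allclose(C.sum(axis=1), 1.0, atol=1e-6)
    return C
```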

The matrices Cg and Ch, each of size |Ω∗| × |Ω∗| and with each row summing to 1, are called truth matrices or confusion matrices [5]. Their elements, i.e., p(t|a) and p(b|t), are the odds of the LPR machines reading characters correctly (diagonal elements) or misreading them (off-diagonal elements). To make the matrix multiplication in (8) possible, the rows of Cg correspond to the character readings at g, and its columns represent the truth characters; conversely, the rows of Ch represent the truth characters, whereas its columns represent the character readings at station h. However, estimating the association matrix with (8) requires the truth matrices and, hence, the extensive manual extraction, by visual inspection, of the ground truth of a large number of plate images.

IV. PLATE MATCHING AND SELF-LEARNING PROCEDURE

A. Motivation for Automated Learning

Estimating the association matrix from manual extraction is expensive and time consuming. Large sample sizes are usually required to compensate for the high imprecision of the data survey, and the process of inspecting plate images is exhausting, with the chance of mistakes increasing as surveying time passes. In addition, the association matrices are site and time dependent. The number of plate patterns in the U.S. is large and changes as new syntaxes are introduced over time; hence, the truth matrices are not only different for different locations, but they also change over time. Moreover, the accuracy of LPR equipment degrades over time, requiring periodic updates of the association matrix. It is, therefore, unrealistic to expect such periodic updating to be performed manually, particularly for systems with many LPR units [e.g., in the case of estimating origin-destination (O-D) trips for a given study area].

B. Association Matrix from Plate Matches

The association matrix estimated using (8) may not be representative of the plate patterns of vehicles traveling between two given stations. The variety of plate patterns in the U.S. (more than 3000 different syntaxes), combined with the trip patterns on the roadways, results in different truth matrices for different locations. In general, when two stations are located very far from each other, only a few vehicles may travel between them. Therefore, the plate patterns expected at each location may differ significantly from the patterns of the vehicles driving between the stations. This may lead to a considerable departure of the estimated matrix C from the expected association matrix between the stations.

A more consistent method for estimating C would rely on matched readings between stations rather than on separate sets of readings and their ground truths. Therefore, instead of estimating the association matrix from truth matrices, which requires a total of four sets of data (two sample sets of readings and two sample sets of ground truths), we propose to estimate


the association matrix directly from a set of matched pairs of readings. In other words, if we can somehow obtain the pairs of readings associated with vehicles detected at both stations, the association matrix can be calculated directly from these truth matches without calculating a truth matrix for each station. The association matrix estimated in such a way is, thus, more representative of the plate patterns of vehicles traveling between the two stations.

Truth Matches: Let G = {Xm; m = 1, ..., mg} be the set of mg readings captured at station g during a given survey period, where Xm denotes the plate reading for vehicle m. Similarly, let H = {Yn; n = 1, ..., nh} be the set of outcomes at station h for the same survey period, with nh vehicles detected and string readings denoted by Yn. Let G′ and H′ be the sets of strings representing the actual values associated with the readings in G and H, respectively. Therefore, G′ = {X′m; m = 1, ..., mg} and H′ = {Y′n; n = 1, ..., nh}, where X′m and Y′n correspond to the true reading, or truth, for the outcomes Xm and Yn, respectively. We define the set of true matches between the sets G and H as M = {(Xm, Yn); X′m = Y′n}.

Estimation of the Association Matrix from M: We propose to estimate the elements of the association matrix from a sample of matches as follows:

p(b|a) = ρab/ρa    (9)

where ρab is the number of times a character b, b ∈ Ω∗, is associated with a character a, a ∈ Ω∗, in the matched set M, and ρa is the number of times the character a appears in the set of readings G.

The frequencies ρab can be estimated by finding the most likely alignment for each match in M. This can be done using the traditional ED (i.e., the ED with elementary weight assignments of 0 and 1). Hence, the sets Tδ(Xm, Yn), Iδ(Xm, Yn), and Jδ(Xm, Yn) are calculated for each match in M, and the resultant alignments are used to compute the frequencies ρab. The portion of matrix C used for computing the cost of substitutions is estimated from the sets Tδ(Xm, Yn): when neither a nor b is the null character, ρab is calculated from those pairs (xmi, ynj), (i, j) ∈ Tδ(Xm, Yn), such that (xmi, ynj) = (a, b). The edges (row and column ends) of the matrix C, representing deletions and insertions, respectively, are calculated from the sets Iδ(Xm, Yn) and Jδ(Xm, Yn). Hence, when b = λ, ρab is calculated by counting the occurrences of type (xmi, λ), with i ∈ Iδ(Xm, Yn), such that xmi = a. Finally, when a = λ, ρab is equal to the number of occurrences of type (λ, ynj), with j ∈ Jδ(Xm, Yn), such that ynj = b. This estimation process is more representative of the plate patterns between the stations; a minimal counting sketch is given below.
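A counting sketch for (9), reusing edit_distance_with_trace and the "<lam>" convention from the earlier sketches. Note one simplification on our part: we normalize each row by its total count over the matched set, whereas the text takes ρa from the full reading set G.

```python
from collections import Counter

def association_from_matches(matches):
    """Estimate p(b|a) of (9) from matched reading pairs (X, Y), using the
    plain Levenshtein alignment to obtain T_delta, I_delta, and J_delta."""
    rho = Counter()                      # rho[(a, b)]: co-occurrence counts
    for X, Y in matches:
        _, trace, I, J = edit_distance_with_trace(X, Y)
        for (i, j) in trace:             # substitutions, identities included
            rho[(X[i - 1], Y[j - 1])] += 1
        for i in I:                      # deletions: a -> lambda
            rho[(X[i - 1], "<lam>")] += 1
        for j in J:                      # insertions: lambda -> b
            rho[("<lam>", Y[j - 1])] += 1
    row_total = Counter()
    for (a, _b), count in rho.items():
        row_total[a] += count
    return {(a, b): count / row_total[a] for (a, b), count in rho.items()}
```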

Plate Matching Algorithm: Suppose that a prior association matrix C is known. A matching set between a pair of stations (g, h) for a given period of operation, denoted ℳ to distinguish it from the truth-match set M, can be found by the method proposed by Oliveira-Neto et al. [5]. By matching set, we mean a list of reading pairs with a certain likelihood of being true matches. In essence, we say that a pair of readings (Xm, Yn) is very likely to be a match if δ(Xm, Yn) is very small and/or the estimated time to travel between stations lies within a certain time-window constraint. The proposed method is described in detail below, and a sketch follows at the end of this subsection.

Let U = {um; m = 1, ..., mg} be the corresponding list of time stamps, denoted by um, at station g for each outcome m in G. The list of time stamps for each outcome n in H is given by V = {vn; n = 1, ..., nh}. Let Γn = {m; jtl ≤ vn − um ≤ jtu} be the set of integers identifying the candidates at station g for matching the nth reading at station h, where jtl and jtu are the lower and upper bounds on the expected vehicle journey time between the two stations. The minimum ED (with editing weights calculated using the prior matrix C) between Yn and its matching candidates is ζn = minm δ(Xm, Yn), m ∈ Γn. We define a set of matches ℳ between the sets G and H as the pairs (Xm, Yn) satisfying either of the following constraints:
1) ζn ≤ τmin, with jtl = dgh/su and jtu = dgh/sl, where τmin is a minimum threshold for the ED, sl and su are fixed estimates of the lower and upper bounds of the vehicle speeds, and dgh is the physical distance between stations g and h.
2) τmin < ζn ≤ τmax, with jtl = μ − z(ζn)σ and jtu = μ + z(ζn)σ, where μ and σ are the mean and standard deviation of the vehicle journey times between the two stations, respectively, and z(ζn), a decreasing function of ζn, determines the size of the time window. In this case, the time-window constraint is narrower for higher values of ζn.
The matching set ℳ can then be used as input to update the prior association matrix and to find a matrix more representative of the currently observed plate patterns. In the next section, we circumvent the need for manual extraction by estimating C through a recursive procedure, in which a sequence of matrices Ck is estimated from a sequence of matching sets ℳk.
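The following sketch assembles the matching step just described, reusing edit_distance_with_trace from Section II. The paper leaves the shape of z(·) to calibration, so the linear form below is purely illustrative, as are the function and parameter names.

```python
def match_plates(G, U, H, V, gamma, d_gh, s_l, s_u, mu, sigma,
                 tau_min=5.0, tau_max=17.5,
                 z=lambda zeta: max(3.0 - 0.1 * zeta, 0.5)):
    """One pass of the matching procedure: returns index pairs (m, n).

    G, H are reading lists; U, V their time stamps; gamma the edit weights
    derived from the prior matrix C. tau_min and tau_max follow the
    case-study calibration of Section V; z(.) is an illustrative stand-in
    for the decreasing time-window function."""
    matches = []
    for n, (Yn, vn) in enumerate(zip(H, V)):
        # Candidates at g within the loosest (speed-bound) journey-time window.
        cand = [m for m, um in enumerate(U) if d_gh / s_u <= vn - um <= d_gh / s_l]
        if not cand:
            continue
        dist = {m: edit_distance_with_trace(G[m], Yn, gamma)[0] for m in cand}
        m_star = min(dist, key=dist.get)
        zeta = dist[m_star]
        if zeta <= tau_min:                       # constraint 1
            matches.append((m_star, n))
        elif zeta <= tau_max:                     # constraint 2: tighter window
            jt = vn - U[m_star]
            if mu - z(zeta) * sigma <= jt <= mu + z(zeta) * sigma:
                matches.append((m_star, n))
    return matches
```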

C. Self-Learning Algorithm

We devise a self-learning algorithm to eliminate the need for painstakingly deriving truth matrices, which relies on human verification of plate strings; the association matrices are instead learned automatically. The automated learning algorithm exploits the fact that, while the correct recognition rate of a plate string is relatively low, usually 30%–50% without calibration, the correct recognition rate of individual characters is much higher. That is, when a plate is misread, there is still quite a lot of useful and correct information embedded in the incorrect string. With this as the departure point, the algorithm starts with a clean association matrix, begins a self-learning process with every new plate string reported from an LPR unit, and continues to learn over time.

In our recursive procedure, once an initial estimate of C, i.e., C0, is found, we can progressively find better matrices Ck by repeatedly applying the plate-matching procedure. That is, at iteration k, a matrix Ck is obtained directly from a matching set ℳk that, in turn, is obtained from the previous matrix Ck−1. The algorithm keeps learning until a stopping criterion is reached. The initial association matrix C0 can be estimated from a set of matches ℳ0 obtained, for example, by running the plate-matching procedure with the traditional Levenshtein ED on the two sets of readings G and H. Note that the traditional ED compares a pair of strings by counting the integer number of edit operations needed to convert one string into the other rather than by calculating the likelihood that the two strings are similar. The proposed method is summarized by the following recursive equations (a sketch follows below):
1) k = k + 1;
2) ℳk = M(Ck−1);
3) Ck = C(ℳk);
4) stop if ‖Ck − Ck−1‖ < ε.
M(C) is a function that determines a set of matches ℳ from a matrix C using the matching procedure; conversely, C(ℳ) represents the module that calculates the association matrix from a set of matches ℳ. The algorithm stops when the difference between two successive estimates, measured by a suitable matrix norm ‖Ck − Ck−1‖, falls below a preassigned threshold ε. In the case study, we show through simulation that the estimated association matrix approaches the true association matrix as the iterations proceed, even when we start with a very poor initial matrix.
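A sketch of the recursion, with the two modules passed in as callables (e.g., thin wrappers around match_plates and association_from_matches above); the Frobenius norm is our choice for the "suitable matrix norm."

```python
import numpy as np

def self_learn(match_set, estimate_C, C0, eps=1e-3, max_iter=50):
    """Iterate M_k = M(C_{k-1}) and C_k = C(M_k) until the successive
    estimates satisfy ||C_k - C_{k-1}|| < eps.

    match_set(C) -> matching set and estimate_C(matches) -> matrix play
    the roles of M(.) and C(.) in the recursion above."""
    C_prev = np.asarray(C0, dtype=float)
    for k in range(1, max_iter + 1):
        Mk = match_set(C_prev)                        # step 2
        Ck = np.asarray(estimate_C(Mk), dtype=float)  # step 3
        if np.linalg.norm(Ck - C_prev) < eps:         # step 4 (Frobenius norm)
            return Ck, k
        C_prev = Ck
    return C_prev, max_iter
```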

V. CASE STUDY

In partnership with the Tennessee Department of Transportation (TDOT), private companies, and The University of Tennessee, two LPR units were mounted 3 mi apart on a stretch of the Interstate system, as shown in Fig. 5, to monitor truck plates 24/7 (on the rightmost lane) for over a year.

Fig. 5. LPR setup on the junction of I-640 with I-40 near Knoxville, TN, USA.

A. Data Sample

Two survey periods of the LPR system operation were selected to assess the performance of the proposed self-learning algorithm. For the first survey period (five complete days of operation in 2010, i.e., April 6 and 7 and May 25, 26, and 27), both the LPR readings and their ground truths were obtained. For the second survey period (41 days of operation in April and May of 2010), only the plate readings were obtained. The first data sample, containing the ground-truth sets, is the reference sample for validating the self-learning algorithm. The second sample is used to estimate a sequence of association matrices using the self-learning procedure.

TABLE I. LPR PERFORMANCE DURING THE FIRST SURVEY PERIOD

Table I summarizes the samples captured during the first survey period: the total number of plate images captured, the number of readable images, the number of truth matches, and both the plate and character recognition rates of each LPR unit. We consider readable those images of plate frames in which all characters can be precisely extracted by visual inspection. Note that the accuracy of each LPR unit is calculated from the sample of readable images. As shown in Table I, the recognition rate of a single character is higher than the recognition rate of an entire plate. Notice also that the relative fraction of vehicles detected at both stations is low: the number of matches is usually between 1/4 and 1/3 of the total readable images captured at station g and about 1/7 of those captured at station h. Under this pattern, the number of false matches resulting from a poor matching procedure can be quite high.

B. Estimation of an Association Matrix and Performance of the Self-Learning Algorithm

The self-learning algorithm is compared with the manual-reading method in [5] and the manual-reading method proposed in Section IV-B. In addition, the performance of the approach using the conditional probability function is compared with that of the joint probability function proposed in the literature.

Sample 1 is further divided into training sets and validation sets. The training sets were used both for parameter calibration of the matching procedure (i.e., the minimum and maximum ED thresholds) and for estimating the association matrix from truth matrices and truth matches. The validation sets were used to evaluate the performance of the self-learning algorithm against its manual counterparts. The separation into training and validation sets consisted of using as training sets the samples corresponding to combinations of three-day periods out of the total five, with the remaining two survey periods used as validation sets. Thus, a total of ten (5!/3!2!) training sets were obtained.

Association matrices were obtained using the method presented in [5], which is based on truth matrices (see Section III-B), and from truth matches, as described in Section IV-B. For estimation, each combination of three-day periods was accumulated in chronological order. Hence, from each training set, six confusion matrices, i.e., three matrices of type Cg and three matrices of type Ch, were estimated and used to compute three association matrices (one per reference day). Similarly, cumulative association matrices were estimated from the truth matches obtained from the ten combinations of training sets.

Fig. 6. Estimated association matrix using truth matrices Cg and Ch.

Fig. 7. Estimated association matrix using truth matches M.

Fig. 8. Estimated association matrix using the self-learning method.

Figs. 6–8 show the estimated association matrices using truth matrices, truth matches, and the self-learning algorithm, respectively.

As explained in Section III, the estimated association matrices in Figs. 6–8 contain the values of the conditional probabilities p(b|a), in percent. For example, the element in the 12th row and 4th column is the probability that the character "4" reported by the LPR machine at station h is reported as "A" by the LPR machine at station g. Each row of a matrix sums to 100.

Fig. 6 shows the association matrix estimated from truth matrices using the complete three-day period of the first training combination. Fig. 7 shows the association matrix estimated from the truth matches obtained from the same three-day training set. The two matrices differ slightly because most of the vehicles detected at each station are not detected at both stations. The matrix in Fig. 7 should better represent the pattern between the two stations: it is more likely that the two units recognize characters the same way, as shown by the higher percentages on the matrix diagonal in Fig. 7 compared with Fig. 6.

The self-learning algorithm was used to estimate a sequence of 41 association matrices, i.e., {C1, ..., Ci, ..., C41}, from sample 2. The ith matrix was estimated from the cumulative samples of readings G[1, ..., i] and H[1, ..., i], corresponding to a sample of i day periods, i = 1, ..., 41. In the matching step of the algorithm, we used the ED thresholds τmin = 5 and τmax = 17.5 (calibrated to yield no more than a 2% false matching rate), and the journey time variation was calculated in terms of the mean and standard deviation over moving time frames containing genuine matches, as described in [5]. Fig. 8 shows the matrix estimated using the entire data set, all 41 days of operation. Note the similarity of this matrix to the matrix estimated from truth matches in Fig. 7.

Fig. 9. Self-learning algorithm convergence.

Fig. 9 summarizes the convergence behavior of the self-learning algorithm. The graph shows the median, maximum, and minimum values of the threshold criterion (calculated over the 41 cumulative day samples) at each iteration. The curve range shows that the algorithm consistently converges between


eight and nine iterations. The narrow range of the curves shows that this convergence behavior is independent of the sample size (number of days or number of matched plates).

To assess how well the self-learning algorithm estimates the association matrix, we first demonstrate, through simulation, that the estimated matrix approaches the actual association matrix as the iterations proceed, and we then show that the matching performance increases as the sample of plates grows. For the algorithm estimation error, we performed a few additional experiments using data from sample 1 in which the iterative process was started with a poor initial matrix (i.e., a matrix with all conditional probabilities equally likely, p(b|a) = 1/37). The estimated association matrix obtained after each iteration is compared with a matrix estimated using the complete set of truth matches. The matching performance as the sample size increases is evaluated by using the 41 estimates from sample 2 to match the ten combinations of validation sets of readings from sample 1. This matching performance was compared against that obtained by matching the same validation sets using the association matrices derived from truth matrices and truth matches. We also modeled edit weight functions with joint probabilities, as proposed by Ristad and Yianilos [13] and Bilenko and Mooney [14].

The performance measures are the positive matching rate (pmr), the false matching rate (fmr), and their combination (mrate), as defined in (10)–(12); in each comparison, only the input association matrix is changed:

pmr = (|ℳ| − |F|)/|M|    (10)

fmr = |F|/|ℳ|    (11)

mrate = 2 · (1 − fmr) · pmr/(1 − fmr + pmr)    (12)

where F = {(Xm, Yn); (Xm, Yn) ∈ ℳ, X′m ≠ Y′n} is the set of false matches in ℳ, with M and ℳ as defined before. The mrate is the harmonic mean of the fraction of reported matches that are actual matches (1 − fmr) and the fraction of actual matches that are identified (pmr).
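A sketch of (10)–(12); the caller supplies |M| (the number of genuine matches in the ground truth) and a predicate flagging false matches, since both depend on manually extracted truth. The function name is ours.

```python
def matching_metrics(found, is_false, n_true):
    """pmr, fmr, and mrate of (10)-(12) for a returned matching set.

    found:    the matching set produced by the procedure (list of pairs);
    is_false: predicate for X'_m != Y'_n, i.e., membership in F;
    n_true:   |M|, the number of genuine matches in the ground truth."""
    F = sum(1 for pair in found if is_false(pair))
    pmr = (len(found) - F) / n_true                 # (10)
    fmr = F / len(found) if found else 0.0          # (11)
    denom = 1 - fmr + pmr
    mrate = 2 * (1 - fmr) * pmr / denom if denom > 0 else 0.0  # (12)
    return pmr, fmr, mrate
```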

Fig. 10. Self-learning algorithm error for sample 1.

Fig. 10 shows the self-learning algorithm error for sample 1 (computed as the root-mean-square difference between the cell values of the estimated matrices) at each iteration when the algorithm starts by assuming that the conditional probabilities are uniform (i.e., poor initial estimates). As expected, learning is slow compared with the case in which the Levenshtein ED is used to estimate the initial matrix, which required only five iterations. This happens because the initial matrix based on the traditional Levenshtein ED is estimated from a set of matches with a significant share of positive matches (pmr = 91%) and far fewer false matches (fmr = 7%), whereas the initial matrix with uniform probabilities generates a set of matches with pmr = 6% and fmr = 98%. Therefore, Fig. 10 demonstrates that, even when the initial matrix is poor, the algorithm still provides a good estimate of the association matrix. The presence of a few positive matches at the beginning helps identify more and more positive matches as the iterations pass, even if the proportion of false matches in the initial matching set is significantly high. What happens is that, along the iterations, the ED between a pair of positively matched characters decreases, whereas the ED between a pair of falsely matched characters increases, raising the proportion of matches relative to nonmatches. In other words, occurrences of positively matched characters fall in specific cells of the association matrix, whereas those of falsely matched characters are randomly located.

Fig. 11. Edit cost between identical characters. Note: the cost at "iteration zero" is log(37) = 3.61.

Fig. 11 shows that the average ED between pairs of identical characters indeed decreases over the iterations, showing that


the matrix approaches an identity matrix, up to a certain limit. Thus, the results show that, if the initial matrix is generated from a set of matches containing at least a few true matches, the algorithm tends to reduce the number of false matches and to increase the positive matches along the iterations, consequently improving the estimated matrix.

Fig. 12. Self-learning algorithm performance.

Fig. 13. Ground-truth-based performance.

Fig. 12 shows curves of the matching rate by sample size for the conditional and joint cost functions. The horizontal axis represents the number of days, or sample size, used for estimation. The performance of the self-learning algorithm is represented in terms of the minimum, maximum, and median values calculated over the ten validation sets. Similarly, Fig. 13 shows the performance of the matching algorithm when the association matrices are obtained manually; in Fig. 13, "T" stands for truth matrices, and "M" stands for truth matches. According to Fig. 12, at the very beginning, the matching rate of the self-learning algorithm is low compared with the ground-truth-based procedures shown in Fig. 13 because the matching algorithm is running with an association matrix estimated from a small sample of plates. However, the chart shows a steady improvement in the matching rate. After a day of learning, the pmr already reaches about 90%. After a week of learning, the pmr can be over 94%. Eventually, the pmr can reach over 95%, a performance comparable to that of the ground-truth-based methods. For the cases shown here, we only used vehicle plates in a single


freeway lane for each day. Had we used passenger car plates in all freeway lanes, it is arguable that a matching rate of 95% or so could be achieved within a single day of learning.

As shown in Fig. 12, the joint probability model is less suitable for the LPR matching problem than the conditional probability function, particularly for smaller sample sizes.1 It seems that bringing character-occurrence likelihoods into the model slightly increases the probability of classifying a match as genuine when it is actually false. This may happen because the travel time constraint diminishes the importance of the character likelihoods and emphasizes the LPR operation. The intuition is that the joint probabilities consider the likelihoods of the pair of characters over the pair of strings, whereas the conditional cost function explicitly considers the recognition rate of the LPR algorithm. When the objective is to match a certain pair of contemporaneous strings (the most recent strings detected) for online matching, the conditional cost function is more suitable.

Fig. 14. False matching rate.

More interestingly, Fig. 14 demonstrates that the efficiency of the learning method in reducing the false matching rate is comparable or superior to that of the method based on truth matches and always superior to that of the method based on truth matrices. The learning method achieves a false matching rate below 1% in most cases, whereas its manual counterpart based on truth matrices achieves no less than 1.5%. This is quite remarkable considering that obtaining the ground truth of plate images is expensive and time consuming. Even with computerized extraction tools (developed specifically to help with the extraction task), it takes about 15 s to extract a single sequence of characters from a plate image; an entire set of 10 000 images would take approximately 42 h of painstaking work. The learning method eliminates all of these burdens.

As a final point, we present the performance of the traditional Levenshtein ED when used to obtain the initial set ℳ0 for the self-learning algorithm. The positive matching rate over the validation sets was between 84.3% and 87.3%, and the false matching rate was between 5.4% and 7.3%. If we compare these results with those of the self-learning algorithm,

1 A paired t-test of means shows that the hypothesis of equal performance is rejected at a significance level below 1%.


i.e., pmr within 94%–96% and fmr within 0.5%–1.5% after a week of learning, we can conclude that the self-learning method considerably improves the matching performance, leading to a consistent estimate of the association matrix even with a small sample of plates. The traditional Levenshtein ED can still be used in situations where no sampled plates are available and high matching accuracy is not required.

In summary, the self-learning method is fast, transferable, and adaptive. The self-learning algorithm performed well in estimating a reliable association matrix within a few days of learning. Manual-reading types of learning are simply too costly and unsustainable if they must be performed frequently; the proposed self-learning and self-maintaining methodology is far superior in terms of effort and cost. For example, in a large-scale setup (e.g., 1000 locations), the self-learning algorithm would learn all combinations of association matrices automatically, whereas determining all these matrices with the manual-reading methods would be impractical. As the makeup of traffic (e.g., O-D patterns) changes and new plates enter service while old plates are retired, the self-learning algorithm will continue to learn and adapt.

VI. CONCLUSION

Without tinkering with the LPR hardware and the embedded video-image-processing algorithms, our proposed procedure improves the license plate matching rate significantly. The association matrices, which used to be prohibitively expensive to acquire and maintain for a large network with many LPR units installed, can be learned automatically via the self-learning algorithm in a matter of days. As the algorithm continues to run in the background, the association matrices are maintained, updated, and improved over time. When new license plate designs are introduced to the system, which can happen often given that there are over 3000 different license plate designs in the U.S., the algorithm automatically learns to match them quickly.

The algorithm can be applied in a system with LPR units from mixed vendors. Since the algorithm is entirely postprocessing based, any LPR unit that provides a plate string for comparison purposes can serve as input. Even when different LPR hardware producers deliver various levels of recognition performance, as long as the units can recognize a reasonable number of characters, the matching performance will be satisfactory. As shown in this paper, two LPR units, reading at less than 60% and less than 30%, respectively, can still yield matching rates in the range of 95%–96% and false matching rates of around 1%.

Tens of thousands of LPR units have already been deployed in this country and others, and many of them suffer from low reading rates. Yet, with the proposed algorithm, they are afforded added value and may find new uses in the future with minimal further investment in the hardware already in the field. Plate matching has been a desirable yet unsatisfactory functionality since the 1970s, when LPR was first used. Our method can now be applied to a host of real-time safety and security applications, including speed monitoring and enforcement, tracking of be-on-the-lookout (BOLO) vehicles, Amber Alert tracking, tagless/boothless tolling, freight and commodity flow monitoring, evacuation operation monitoring, special location entrance/exit tracking, and so on. In addition, expensive studies, such as O-D studies, can be implemented with within-hour detail, and other applications, such as travel time monitoring and truck travel distance tracking for fuel tax purposes, can be performed with ease.

As further research, we can use image processing techniques to eliminate potential false-positive matches. The idea is to include in the decision process a third measure, in addition to the ED and the time difference, based on image comparisons. This additional information could eliminate false-positive matches.

The self-learning method relies heavily on the extension of the Levenshtein ED to symbol-based edit operations. We adapted the similarity measure to the LPR application by incorporating the estimated association matrices into the weight functions in order to find the most likely alignment between a pair of strings. This paper represents an advancement of a first application of text mining to matching data reported by LPR systems; to the best of our knowledge, there is no similar application in the literature. Future research can also explore other formulations for the weight functions and other models for associating LPR-reported strings (e.g., transduction models), as well as classification methods. In particular, models that work well without the travel time constraint and that consider the plate syntaxes are desirable.

REFERENCES

[1] S. M. Turner, "Advanced techniques for travel time data collection," Transp. Res. Rec., vol. 1551, pp. 51–58, 1996.
[2] Travel Time Data Collection Handbook, Office of Highway Information Management, Federal Highway Admin., Washington, DC, USA, 1998. [Online]. Available: http://www.fhwa.dot.gov/ohim/start.pdf
[3] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.
[4] F. M. Oliveira-Neto, L. D. Han, and M. K. Jeong, "Tracking large trucks in real-time with license plate recognition and text-mining techniques," Transp. Res. Rec., vol. 2121, pp. 121–127, 2009.
[5] F. M. Oliveira-Neto, L. D. Han, and M. K. Jeong, "Online license plate matching procedures using license-plate recognition machines and new weighted edit distance," Transp. Res. Part C, Emerg. Technol., vol. 21, no. 1, pp. 306–320, Apr. 2012.
[6] A. Marzal and E. Vidal, "Computation of normalized edit distance and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 9, pp. 926–932, Sep. 1993.
[7] R. A. Wagner and M. J. Fischer, "The string-to-string correction problem," J. Assoc. Comput. Mach., vol. 21, no. 1, pp. 168–173, Jan. 1974.
[8] R. O. Duda, P. E. Hart, and D. G. Stork, "Recognition with strings," in Pattern Classification, 2nd ed. Hoboken, NJ, USA: Wiley-Interscience, 2000, ch. 8, pp. 413–420.
[9] B. J. Oommen, "Constrained string editing," Inf. Sci., vol. 40, no. 3, pp. 267–284, Dec. 1986.
[10] T. Okuda, E. Tanaka, and T. Kasai, "A method for correction of garbled words based on the Levenstein metric," IEEE Trans. Comput., vol. C-25, no. 2, pp. 172–178, Feb. 1976.
[11] G. Seni, V. Kripasundar, and R. Srihari, "Generalizing edit distance to incorporate domain information: Handwritten text recognition as a case study," Pattern Recognit., vol. 29, no. 3, pp. 405–414, Mar. 1996.
[12] J. Wei, "Markov edit distance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 3, pp. 311–321, Mar. 2004.
[13] E. S. Ristad and P. N. Yianilos, "Learning string-edit distance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 5, pp. 522–532, May 1998.
[14] M. Bilenko and R. J. Mooney, "Adaptive duplicate detection using learnable string similarity measures," in Proc. 9th ACM SIGKDD Int. Conf. KDD, 2003, pp. 39–48.


Francisco Moraes Oliveira-Neto received the Ph.D. degree in civil engineering with a concentration in transportation engineering from The University of Tennessee, Knoxville, TN, USA, in 2010. He previously held a traffic engineering position with the Advanced Urban Control Center of Fortaleza, Fortaleza, Brazil. He holds a Postdoctoral Research Associate position with the Center for Transportation Analysis, Oak Ridge National Laboratory, Oak Ridge, TN, USA, and is also with the Department of Civil and Environmental Engineering, The University of Tennessee, Knoxville. His research interests include computational transportation science, transportation modeling, applied statistics, and operations research. Dr. Oliveira-Neto serves as a Reviewer for the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS and the IEEE Intelligent Transportation Systems Magazine. He can be reached at [email protected].

Lee D. Han received the M.S. degree from Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, VA, USA, the B.S. degree from National Taiwan University, Taipei, Taiwan, and the Ph.D. degree from the University of California, Berkeley, Berkeley, CA, USA. He is a Professor with the Department of Civil and Environmental Engineering, The University of Tennessee, Knoxville, TN, USA, where he is the Coordinator of the Transportation Engineering Program. He is also with the School of Traffic and Transportation Engineering, Changsha University of Science and Technology, Changsha, China. His current research interests include microscopic traffic simulation algorithms, unsupervised real-time machine learning, mass evacuation modeling and optimization, near-optimal engineering solutions, nonequilibrium transportation planning, system sustainability and resilience, and human decisions/performance in driving. More details on Dr. Han’s work can be found at web.utk.edu/~lh.

Myong Kee Jeong (SM’10) received the Ph.D. degree in industrial and systems engineering from the Georgia Institute of Technology. He is an Associate Professor with the Department of Industrial and Systems Engineering and the Rutgers Center for Operations Research (RUTCOR), Rutgers University, New Brunswick, NJ, USA. His research interests include intelligent transportation systems, data mining, and stochastic processes. Dr. Jeong was a recipient of the Freund International Scholarship and the National Science Foundation CAREER Award in 2002 and 2007, respectively. He serves as an Associate Editor for the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING and the International Journal of Quality, Statistics and Reliability. Contact him at [email protected].