Distance measures based on the edit distance for ... - CiteSeerX

15 downloads 0 Views 116KB Size Report
ulation (Mauldin, 1984) or to locate a set of different solutions to a multi-modal problem such as in crowding. (Mahfoud, 1992) or fitness sharing (Goldberg, 1987) ...
Distance measures based on the edit distance for permutation-type representations

Kenneth S¨ orensen University of Antwerp Prinsstraat 13 B–2000 Antwerp

Abstract In this paper, we develop and discuss several distance measures for permutation type representations of solutions of combinatorial optimisation problems. The problems discussed include single-machine and multiple-machine scheduling problems, the travelling salesman problem, vehicle routing problems, and many others. The distance measures are based on the edit distance. We introduce several extensions to the simple edit distance for more complex problem representations and show how each of these can be calculated.

1

INTRODUCTION

Recent research points out that distance measures can contribute to the effective design of metaheuristics. The development of effective distance measures however is not a trivial task. A good distance measure should provide an accurate estimate of the difference— or similarity—between two given solutions to an optimisation problem. The use of distance measure for GA design is not new, and has been proposed for maintaining a diverse population (Mauldin, 1984) or to locate a set of different solutions to a multi-modal problem such as in crowding (Mahfoud, 1992) or fitness sharing (Goldberg, 1987). In this paper, we develop and discuss a number of distance measures for solutions that can be represented as a permutation—or a set of permutations—of a set of items. These problems are characterised by the fact that the order in which the items appear in the solution is important. We refer to this class of representations as permutation type representations. Many solutions to problems in combinatorial optimisation can be represented in this way, including solutions

to scheduling problems, routing problems, and many other problems. Previous work on the development of distance measures for permutation problems is due to e.g. Ronald (1997, 1998); Mart´ı et al. (to appear). A distance measure cannot be developed independent of the problem and the representation of a solution for this problem (we use the term representation to refer to the phenotypical encoding of the “real-world” equivalent of the solution, not to the transformation of phenotypes into genotypes). Often, solutions can be represented in several ways. The solution to a knapsack problem can be represented e.g. as a binary string in which a zero in position i indicates the absence of item i and a one the presence of this item. The same solution can be represented as a set of indices, where the presence of index i in this set indicates the presence of item i in the solution. For both representations, a different distance measure should be developed. It is obvious that a more “natural” representation of a solution provides a better basis for the development of distance measures. Although a definition of natural representation is difficult to provide, this representation should reflect the structure of a solution. A solution to a single-machine scheduling problem, for example, is best represented as a sequence (permutation) of jobs. The order of the jobs in the sequence determines the order in which the jobs are executed on the machine. Although the same solution can be represented as a binary string, it should be obvious that this representation is very likely to cloud the structure of the solution, resulting in a distance measure that does not accurately reflect the difference between two solutions. The distance measures developed in this chapter are based on the edit distance, also called Levenshtein distance. The edit distance is a measure of the difference between two strings of characters. We show that for a large number of combinatorial problems, an effective distance measure can be developed by adapting the edit distance measure to the specific requirements

of the problem representation. We provide formulae for the calculation of the distance between solutions of several scheduling and routing problems.

2

EDIT DISTANCE

The edit distance is based on—and sometimes used as a synonym for—the Levenshtein distance, which was first introduced by Levenshtein (1966) in the context of error-correcting codes. When binary strings are transmitted across a channel, three types of errors can occur (Λ denotes the absence of a character). Reversals: errors of the type 1 → 0 or 0 → 1

Definition. An ordered set of edit operations that transforms a string s1 into a string s2 is called an edit transformation of s1 into s2 . Elementary edit operations can be weighted by a weight function γ, that assigns a nonnegative real number γ(x, y) to each elementary edit operation (x, y). Using the function γ, a nonnegative real value can be assigned to any edit transformation E = [(x1 , y1 ), (x2 , y2 ), . . . , (xm , ym )]. ThePweight of edit m transformation E is equal to γ(E) = i=1 γ(xi , yi ). Edit distance Definition. The edit distance strings s1 and s2 is defined as

Deletions: errors of the type 1 → Λ or 0 → Λ

δ(s1 , s2 ) between

δ(s1 , s2 ) = min{γ(E)},

(1)

Insertions: errors of the type Λ → 1 or Λ → 1

where E is an edit transformation of s1 into s2 .

Definition. The edit distance between two binary strings b1 and b2 is the minimal number of edit operations (reversals, insertions, deletions) required to transform b1 into b2 .

If γ(x, y) = 1 for all x, y then δ(s1 , s2 ) is equal to the number of edits required to transform string s1 into s2 . Without loss of generality, the examples in the paper use a weighting function γ(x, y) ≡ 1.

Wagner and Fischer (1974) extend the work of Levenshtein (1966) in several ways. First, strings are allowed to be composed of any finite alphabet (and not just a binary one). Levenshtein’s reverse operation now becomes a substitution operation that occurs when a character is changed into another one. Second, different weights can be assigned to each individual edit operation. This turns Levenshtein’s distance measure into a real-valued function of the two strings involved. Third, Wagner and Fischer (1974) provide a dynamic programming algorithm to calculate their distance measure in time proportional to the product of the lengths of the strings. 2.1

DEFINITIONS

In the rest of this paper, we adhere to the following definitions. We assume that we are given a set S of strings composed of characters of a finite alphabet Σ. The length of a string s is written as |s|. The i-th character of string s is written as s(i). Λ is the null-character, signifying the absence of a character, i.e. |Λ| = 0. Definition. An elementary edit operation (x, y) 6= (Λ, Λ) is an ordered pair of characters from the set Σ ∪ Λ. An elementary edit operation is called a substitution iff x 6= Λ and y 6= Λ. It is called an insertion iff x = Λ and a deletion iff y = Λ.

Wagner and Fischer (1974) show that the concept of edit distance can be related to a structure called trace. Definition. A trace Ts1 ,s2 from s1 to s2 , is defined as a sequence of pairs of integers (i, j) for which the following is true: 1. 1 ≤ i ≤ |s1 |; 1 ≤ j ≤ |s1 | 2. ∀(i, j), (i0 , j 0 ) ∈ Ts1 ,s2 ; (i, j) 6= (i0 , j 0 ) : i < j ⇐⇒ i0 < j 0 The trace Ts1 ,s2 connects characters from s1 to characters from s2 that are either substituted (when they are different) or left unchanged (when they are equal). Characters in s1 and s2 that are not connected through Ts1 ,s2 are deleted or added respectively. An example of an edit transformation is given in figure 1. c→Λ

b→d

s1 = cdbaabca −→ dbaabca −→ ddaabca a→Λ Λ→a a→Λ −→ ddabca −→ ddabaca−→ ddabac = s2 Figure 1: Edit transformation of s1 = cdbaabca to s2 = ddabac 2.2

ALGORITHMS AND COMPLEXITY

For two strings s1 and s2 of length |s1 | respectively |s2 |, the time-complexity of the dynamic programming

algorithm proposed by Wagner and Fischer (1974) is O(|s1 |×|s2 |), i.e. O(n2 ) if the lengths of both strings is about n. The space complexity can be reduced to O(n) if only the value of the edit distance is needed (and not the edit sequence that this distance corresponds to). A more efficient algorithm is given by Ukkonen (1985). The worst case time complexity for this algorithm is O(n×d), the average complexity is O(n+2×d), where n is the length of the strings, and d is their edit distance. This is fast for similar strings where d is small, i.e. when d ¿ n. Other, even more efficient algorithms have been proposed, but for the purposes of this paper (i.e. the algorithms are used to determine the distance between solutions of combinatorial optimisation problems, that are of relatively small size) these algorithms are unnecessarily intricate. 2.3

PROPERTIES

Under the assumptions that γ(x, y) > 0 if x 6= y, γ(x, y) = γ(y, x), and γ(x, x) = 0 ∀x, y ∈ Σ ∪ {Λ}, the edit distance function δ : S × S → R is a metric. This means that it satisfies the following properties (∀si , sj , sk ). 1. δ(si , sj ) ≥ 0 (non-negativity) 2. δ(si , sj ) = 0 ⇒ si = sj (separation) 3. δ(si , sj ) = δ(sj , si ) (symmetry) 4. δ(si , sj )+δ(sj , sk ) ≥ δ(si , sk ) (triangle inequality) The first three properties are trivial and follow from the definition. Property 4 is proven by remarking that δ(si , sj ) + δ(sj , sk ) < δ(si , sk ) would imply that si can be changed into sk with less edit distance than the edit distance between si and sk (by changing si into sj first), which is a contradiction. In the remainder of this paper, we assume that the properties required to make δ a metric hold.

3

APPLICATION OF THE EDIT DISTANCE TO PERMUTATION TYPE REPRESENTATIONS

The edit distance measure cannot be applied without modification to all combinatorial optimisation problems. As mentioned, this measure needs to be adapted to (1) the problem and (2) the specific solution representation used. The edit distance measure works well

for problems of which the solutions are naturally represented as a permutation (with or without repetition) of items. We define three types of extensions of the unmodified edit distance measure, that can be combined to be used for more complex solution representations. The modified edit distance measures can be used for 1. representations in which the reversed string represents the same solution (reversal independence), 2. representation in which a cycling of the string represents the same solution (cyclic independence) and 3. representations with independent parts. The use of the standard edit distance and these extensions are discussed in the next sections. A final section discusses how the extensions can be combined. 3.1

PERMUTATIONAL REPRESENTATIONS

For problems for which a solution can be represented as a simple permutation with or without repetition of the elements of some set of problem attributes, the edit distance in its original form can be used. For single machine scheduling problems, a solution can be represented as a permutation (without repetition) of jobs. For this representation, the unmodified edit distance measure can be used. Example. A single machine scheduling problem requires the scheduling of 8 jobs labelled 1 to 8. The distance between a solution s1 = 52643187 and a solution s2 = 65321487 can be calculated as δ(s1 , s2 ) = δ(52643187, 65321487) = 5. To transform s1 into s2 in 5 edit operations, remove 3 and 1 from s1 , change 6 into 1 and then add 6 and 3 in the appropriate locations. 3.2

REPRESENTATIONS WITH REVERSAL INDEPENDENCE

Some solution representations exhibit reversal independence, meaning that the reverse string represents the same solution. Consider a shortest path problem the object of which is to find the edges that connect two nodes nstart and nfinal in such a way that the sum of edge weights for the path is minimal. Shortest path problems are not generally solved by a metaheuristic, as they can be solved in polynomial time. Additional constraints however

might render the problem N P hard and might require a metaheuristic to be used. A solution can be represented by an ordered set of edges e1 e2 . . . em . The distance between two solutions can be determined by using the edit distance (using the edge set as alphabet). However, a path using edges e1 e2 . . . em in this order and running from edge e1 to edge e2 is equivalent to the reverse path, represented as em . . . e2 e1 . Any distance measure comparing these two paths should return zero1 . Example. Given the graph in figure 2, two paths between nfirst and nfinal can be found. The first one is p1 = abd, the second one is p2 = cd.

a nfirst

resentations with reversal independence is δri (s1 , s2 ) = min(δ(s1 , s2 ), δ(s1 , s˜2 )) ≡ min(δ(s1 , s2 ), δ(˜ s1 , s2 )).

Like the original edit distance measure, this measure can be calculated in O(n2 ). The equivalence sign in equation 2 signifies that the edit distance between a string s1 and the reverse of a second string s2 is equal to the edit distance between the reverse of s1 and s2 . This is proven using the following lemmas. Lemma 1. Any edit transformation that can be used to transform a string s1 into a string s2 can also be used to transform s˜1 into s˜2 . Proof. This follows from the definitions of edit transformation and elementary edit operation.

b c

d

nfinal

Lemma 2. The edit distance between two strings does not change when both strings are reversed, i.e.

Figure 2: A shortest path problem It is clear that abd and dba are both valid representations of p1 . Also, cd and dc are both representations of p2 . The distance between p1 and p2 should reflect this. Table 1 shows the possible edit distances (assuming all weights are equal to one). Table 1: Possible edit distances for the shortest path problem cd dc

(2)

abd 2 3

dba 3 2

The “correct” edit distance is 2, the smallest possible distance in table 1. It holds that δ(abd, cd) = δ(dba, dc) and δ(abd, dc) = δ(dba, cd). This is a general property that simplifies the calculation of the edit distance in the case of reversal independence. Edit distance for representations with reversal independence Let the reverse string of a string s be given by s˜, then the modified edit distance for rep-

δ(s1 , s2 ) ≡ δ(˜ s1 , s˜2 ). Proof. From lemma 1, it follows that δ(˜ s1 , s˜2 ) ≤ δ(s1 , s2 ) (as exactly the same elementary edit operations can be used to transform s˜1 into s˜2 , the edit distance between these two strings cannot be greater than the edit distance between s1 into s2 ). Suppose now that δ(˜ s1 , s˜2 ) < δ(s1 , s2 ). According to lemma into s2 in less edits that its edit distance, which is a contradiction. As a result, δ(˜ s1 , s˜2 ) = δ(s1 , s2 ). Lemma 3. The edit distance between a string and the reverse of another string is equal to the edit distance between the reverse of the first string and the second string, formally δ(s1 , s˜2 ) ≡ δ(˜ s1 , s2 ).

(3)

Proof. This result follows from lemma 2 by setting s2 equal to s˜2 and remarking that s˜˜2 = s2 . This result simplifies the calculation of the edit distance in the case of reversal independence. It is not necessary to compare each string with the reverse of the other, it suffices to compare just one string with the reverse of the other.

1

Admittedly, in the case of the shortest path problem, it is relatively easy to solve this problem as all paths have a “natural” direction and it is possible to orient all path representations in the same direction (from the first node to the final node). However, this is not the case in all problems. Routing problems for example often do not have a “natural” direction of the routes and it is therefore impossible to orient all the routes in the “same” direction.

3.3

REPRESENTATIONS WITH CYCLIC INDEPENDENCE

Some solution representations exhibit the property of cyclic independence. For these representations, a solution can be encoded as a permutation, but only the

relative position is important. An example of a problem for which such a representation occurs naturally is the travelling salesman problem on a directed graph, for which solutions are encoded as a string of customers. For this problem, two strings that have the same order of customers but start at a different position, represent the same solution and should consequently have an edit distance of zero. E.g. for a travelling salesman problem with customers c1 to cm , strings c1 c2 . . . ci cj . . . cm and cj . . . cm c1 c2 . . . ci represent the same tour. If the travelling salesman problem is defined on an undirected graph, the representation exhibits reversal independence in addition. A commonly used distance measure for the travelling salesman problem is the number of common edges in two solutions. This measure has the disadvantage that it does not explicitly take into account the order in which the customers appear in the tour. As a result, relatively similar solutions can be found that have zero common edges, and for a given solution, the set of solutions that have a maximal distance is relatively large. Another advantage of the edit distance based measure is that a weighting function can be used to change the weights of replacing a customers in a route by another one. This way, lower weights can be associated with the replacement of customers that are located closer to each other. Example. Given a travelling salesman problem with 5 customers labelled a to e. Two different tours are found, represented respectively as t1 = abcde and t2 = adbce. The possible edit distances (again assuming all weights are equal to 1). Are found in table 2. Table 2: Possible edit distances for the travelling salesman problem abcde bcdea cdeab deabc eabcd adbce 2 4 4 4 3 dbcea 3 2 4 4 4 bcead 4 2 3 4 4 ceadb 4 4 2 3 4 eadbc 4 4 4 2 2

the i-th character of string s. si represents a cycling, starting from position i, i.e. si = s(i)s(i + 1) . . . s(n)s(1)s(2) . . . s(i − 1). The modified edit distance for representations with cyclic independence is δci (s1 , s2 ) = min δ(si1 , s2 ) ≡ min δ(s1 , sj2 ). i

j

(4)

If both strings are approximately of size n, this measure can be calculated by enumeration in O(n3 ). The equivalence sign in equation 4 indicates that the minimum distance between s1 and all cyclings of s2 is equal to the minimum distance between s2 and all cyclings of s1 . This can be proven using the following lemmas. Lemma 4. If the minimum distance between strings s1 and s2 that have cyclic independence is given by δ(s1 , s2 ) = δ(si1 , sj2 ) (i.e. no cycling exists that has a smaller edit distance), then a cycling sp2 of s2 exists p for which δ(si+1 1 , s2 ) = δ(s1 , s2 ) Proof. Let |s1 | = n and |s2 | = m. Let Tsi1 ,si2 be a minimum-cost trace between si1 and sj2 . Two possibilities arise: (1) si1 is not connected by Tsi1 ,si2 to a character of sj2 and (2) si1 is connected by Tsi1 ,si2 to a character of sj2 . In the first case, the edit distance between si+1 and 1 sj2 cannot be larger than that between si1 and sj2 because removing s1 (i) from both strings yields the same string. It cannot be smaller because that would violate our assumption that δ(si1 , sj2 ) is minimal. In this case, setting p = j proves the lemma. In the second case, assume that si1 is connected by the trace to sq2 (see figure 3). The edit transformation of si1 to sj2 will delete s2 (j) to s2 (q − 1) and change s1 (i) into s2 (q) (if they represent different characters). To transform si+1 into sq+1 , the same operations can 1 2 be used. Removing s2 (j) to s2 (q − 1) from both sj2 and sq+1 yields the same string. The edit distance be2 tween si+1 and sq+1 therefore cannot be larger than 1 2 δ(s1 , s2 ). It cannot be smaller, because this would violate the assumption that δ(si1 , sj2 ) is minimal. Thereq+1 fore δ(si+1 ) = δ(s1 , s2 ). In this case, setting 1 , s2 p = q + 1 proves the lemma. Figure 3 shows this.

From this table follows that the minimal number of edits required to transform a cycling of t1 into one of t2 is 2. Note that this minimal distance can be found at least once in each column and each row. This is a general property, that can be proven and that allows us to simplify considerably the calculation of the edit distance in the case of cyclic independence.

Lemma 5. If the minimum distance between strings s1 and s2 that have cyclic independence is given by δ(s1 , s2 ) = δ(si1 , sj2 ) (i.e. no cycling exists that has a smaller edit distance), then for all r(1 ≤ r ≤ |s1 |) a cycling sp2 of s2 exists for which δ(sr1 , sp2 ) = δ(s1 , s2 )

Edit distance for representations with cyclic independence Assume that |s| = n and let s(i) be

Proof. The proof of this lemma follows from the repeated application of lemma 4.

si1 = sj2 =

s2 (j) . . . s2 (q − 1)

s1 (i)

s1 (i + 1) . . . s1 (1) . . . s1 (i − 1)

s2 (q)

s2 (q + 1) . . . s2 (m)s2 (1) . . . s2 (j − 1) (a)

si+1 = 1

s1 (i + 1) . . . s1 (n)s1 (1) . . . s1 (i − 1)

sq+1 = 2

s2 (q + 1) . . . s2 (m)s2 (1) . . . s2 (j − 1)

s1 (i) s2 (j) . . . s2 (q − 1)

s2 (q)

(b)

Figure 3: Proof of lemma 4—case 2 A consequence of lemma 5 is that the edit distance between two strings that exhibit cyclic independence can be calculated by comparing any cycling of a string to all cyclings of the other strings. 3.4

REPRESENTATIONS WITH INDEPENDENT PARTS

Some problem representations consist of parts that are independent. Depending on the specific representation, different modifications of the edit distance are possible. 3.4.1

Representations with recognisable parts

In some cases, the parts of the solution represent different recognisable entities. Stated differently, each part of s1 has a “naturally” corresponding part in s2 . A natural modification of the edit distance measure for this representation is to measure the edit distance between the schedules for each machine separately and add the distances. Suppose solution s is split into parts σ 1 to σ m , then X δ(σ1i , σ2i ). (5) δip1 (s1 , s2 ) = i

Example. In multiple machine scheduling problems, the parts of the solution represent the jobs scheduled on each machine. Each solution has a sequence of jobs scheduled on each machine. The edit distance is the sum of the edit distances calculated for each machine independently. If mij represents the sequence of jobs scheduled on machine j for solution i then the edit distance between solutions s1 = {m11 , m12 , . . . , m1n } and s2 = {m21 , m22 , . . . , m2n } is given by δip1 (s1 , s2 ) =

n X j=1

δ(m1j , m2j ).

(6)

Example. Another problem with recognisable parts is the bandwidth reduction problem (BRP). In its matrix formulation, the object of this problem is to find a permutation of the rows and a permutation of the columns of a (sparse) matrix that minimises the maximal distance to the main diagonal of the non-zero elements of the matrix. The distance to the main diagonal of a non-zero element (i, j) is given by |i − j|. A solution to the problem can be represented as a set of two permutations, one for the rows and one for the columns. Let r1 and r2 represent the permutations of the rows for solution s1 and s2 respectively and let c1 and c2 represent the permutations of the columns, then the distance between solutions s1 and s2 is given by δip1 (s1 , s2 ) = δ(r1 , r2 ) + δ(c1 , c2 ).

3.4.2

(7)

Representations with unrecognisable parts

If the parts of the solution are not recognisably linked to a certain entity, it is impossible to determine in advance which part of solution 1 should correspond to which part of solution 2. The “ideal” correspondence between parts of s1 and parts of s2 lies in the fact that this correspondence minimises the total distance. This involves solving an assignment problem. Suppose solution s1 is split into parts σ11 , σ12 , . . . , σ1m1 and s2 is split into parts σ21 , σ22 , . . . , σ2m2 . The modified edit distance measure is then given by the following mathematical program.

δip2 (s1 , s2 ) = min

XX i

j

δ(σ1i , σ2j )xij ,

(8a)

s.t. X

xij ∈ 0, 1

∀i, j,

(8b)

xij = 1

∀i,

(8c)

xij = 1

∀j.

(8d)

j

X i

2. trips can be executed in any direction (forward or backward). An algorithm to calculate the distance is not given due to space restrictions.

4

CONCLUSIONS

Example. In the vehicle routing problem with time windows, a central depot is given, as well as a set of customers requiring service. The object of this problem is to build routes that start and end at a depot and optimise a number of performance measures (number of vehicles used, total travel time, etc.). Moreover, each customer has a certain time window, that determines the earliest and latest time the service at this customer can start.

In this paper, we develop several extensions of the edit distance measure, that can be used to measure the similarity or difference between solutions of combinatorial problems, that can be represented as (sets of) permutations. We show how each of these distance measures can be calculated.

A solution to a vehicle routing problem with time windows can be represented as a set of sequences. Each sequence represents a tour, starting and ending at the depot.

D. Goldberg. Genetic algorithms with sharing for multimodal function optimization. In Proceedings of the Second International Conference on Genetic Algorithms, pages 41–49, 1987.

For a vehicle routing problem with 10 customers, labelled a to j, 2 solutions are given. s1 = (abc)(defg)(hij) and s2 = (bcdef)(gjiha).

V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1966.

Table 3: Possible edit distances for the vehicle routing problem with time windows

S.W. Mahfoud. Crowding and preselection revisited. In R. Manner and B. Manderick, editors, Parallel Problem Solving from Nature, pages 27–36, Amsterdam, 1992. Elsevier.

bcdef gjiha Λ

abc 4 5 3

defg 3 5 4

hij 5 4 3

Solving an assignment problem with the matrix in table 3 as cost matrix assigns abc → Λ, defg → bcdef and hij → gjiha (indicated in bold in the table), yielding a total cost of 10. 3.5

COMBINING EXTENSIONS TO THE EDIT DISTANCE

The edit distance extensions defined in the previous sections can be easily combined, depending on the specific problem and solution representation used. An example is the vehicle routing problem on an undirected graph, a natural representation of which has independent parts, as well as reversal independence. Solutions of vehicle routing problems are typically represented as strings with trip delimiters. A distance measure between VRP solutions has to take into account that 1. trips are independent (with unrecognisable parts), and

References

R. Mart´ı, M. Laguna, and V. Campos. Scatter search vs. genetic algorithms: An experimental evaluation with permutation problems. In C. Rego and B. Alidaee, editors, Adaptive Memory and Evolution: Tabu Search and Scatter Search, to appear. M. Mauldin. Maintaining diversity in genetic search. In Proceedings of the National Conference on Artificial Intelligence, pages 247–250, 1984. S. Ronald. Distance functions for order-based encodings. In D. Fogel, editor, Proceedings of the IEEE Conference on Evolutionary Computation, pages 641–646, New York, 1997. IEEE Press. S. Ronald. More distance functions for order-based encodings. In Proceedings of the IEEE Conference on Evolutionary Computation, pages 558–563. IEEE Press, 1998. E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6:132–137, 1985. R.A. Wagner and M.J. Fischer. The string-to-string correction problem. Journal of the Association for Computing Machinery, 21:168–173, 1974.