
A Label-Free Similarity Measure between Workflow Nets

Haiping Zha

Jianmin Wang, Lijie Wen, Chaokun Wang

Dept. of Computer Sci. & Tech., Tsinghua University Beijing, P.R. China 100084 Institute of Standards and Specifications, Shanghai, P.R. China 200235 [email protected]

School of Software, Tsinghua University Beijing, P.R. China 100084 {jimwang, chaokun}@tsinghua.edu.cn, [email protected]

Abstract—Many activities in business process management, such as process search, process clustering, and process mining, need to determine the similarity between two process models. Although several approaches have recently been proposed to measure the behavioral similarity between business processes, all of them require that tasks in processes are properly labeled. According to these approaches, similarity between two given processes can be dramatically different under different task labeling schemes. In this paper, we consider process similarity measure from another view point, i.e., focusing on the control flow structures and ignoring the task labels. Thus, we propose a label-free similarity measure between process models based on transition adjacent relations (TARs) in the context of workflow nets (WF-nets), as well as an efficient algorithm. The experimental results involving comparison of different similarity measures on artificial processes and evaluation of the efficient algorithm on real-life processes are discussed.

I. INTRODUCTION

Nowadays, process-aware information systems have been widely adopted in industry [1]. A process-aware information system is driven by explicit process models. These process models, which represent the real business handling procedures of organizations, have become important intellectual assets. Many applications in business process management, such as process search, process clustering, and process mining, need to measure the similarity between process models.

Although a process model can be regarded as a graph in the context of WF-nets, graph similarity or graph distance measures [2], [3] cannot be used directly to measure process similarity. In business process management, the focus is usually on the behavior rather than on the topological structure. For example, a slight difference in the topological structure of a process may lead to a significant change in its behavior, while two processes with very different topological structures may exhibit similar (or even the same) behavior.

Recently, researchers have proposed different approaches to quantify the behavioral similarity between business processes [4], [5], [6], [7]. However, all of them require that process models are properly labeled. For example, Processes N3 and N4 can be regarded as the same WF-net but with different task labeling schemes, as shown in Figure 1. These approaches will

Fig. 1. Sample process models in WF-nets (N1, N2, and N3 over tasks A to D; N4 over tasks E to H).

conclude that the similarity between Processes N3 and N4 is low, ignoring the fact that the two WF-nets have the same control flow structure. In fact, there are two different view points on a low similarity between processes: one is that the two processes have similar task labeling schemes but different control flow structures (e.g., Processes N2 and N3); the other is that the two processes have similar control flow structures but different task labeling schemes (e.g., Processes N3 and N4).

In real applications, we sometimes need to compare processes coming from different organizations or different businesses (e.g., to mine frequent patterns from all available process models, or to discover similar handling procedures in different businesses). Since the processes come from different organizations or different businesses, their task labeling schemes may be very different. Therefore, similarity measures based on task labeling schemes may fail to deal with them. Instead, we have to focus on the control flow structures of processes and ignore the different labeling schemes.

In this paper, we take the second view point on process similarity, i.e., the similarity should focus on the control flow structures. We distinguish the observed behavior from the task labeling scheme of a process, and propose a new

similarity measure which focuses on the control flow structures of processes and is independent of the labeling schemes. Thus, it can measure similarity not only between properly labeled processes, but also between improperly labeled ones, which are common in industrial applications.

The remainder of this paper is organized as follows. Section II presents the label-free similarity measure based on the TAR set. Section III compares different similarity measures and evaluates the efficient algorithm by experiments. Section IV discusses the related work, and Section V concludes the paper and outlines future work.

II. LABEL-FREE SIMILARITY MEASURE

In this section, we first introduce the TAR similarity between WF-nets. On this foundation, an approach for label-free similarity measure is proposed. Some basic concepts, such as Petri nets and WF-nets, are used throughout the paper. For a detailed introduction to these concepts, we refer the reader to [8], [9].

A. Labeling scheme and observed behavior

A labeled WF-net is a WF-net where each transition is mapped to an illustrative label. Since the full set of firing sequences (i.e., the observed behavior) is a set of strings over these labels, the labeling scheme has a great impact on the observed behavior of a WF-net. The formal definitions of the labeling scheme and the observed behavior of a WF-net are as follows.

Definition 1 (Labeling scheme): Let T be the transition set of a WF-net N, and Γ be a finite alphabet. Then, a labeling scheme ϕ is a mapping function from T to Γ.

For example, Processes N3 and N4 in Figure 1 can be seen as the same WF-net with different labeling schemes where Γ3 = {A, B, C, D} and Γ4 = {E, F, G, H}. Note that the labels can also be strings; we use letters for convenience.

Definition 2 (Observed behavior): Let ϕ be the labeling scheme of a marked WF-net N, and Γ be the alphabet of labels. Then the observed behavior is the full set of firing sequences of N, which is a language on Γ.
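Definitions 1 and 2 can be illustrated with a minimal sketch. It assumes hypothetical transition identifiers t1 to t4 for the net shared by N3 and N4 (the paper does not name the transitions), and it takes the two firing sequences of that net as given rather than deriving them from a Petri net.

```python
from typing import Dict, Iterable, Set, Tuple

Trace = Tuple[str, ...]


def observed_behavior(traces: Iterable[Trace], phi: Dict[str, str]) -> Set[str]:
    """Apply the labeling scheme phi to every firing sequence of transitions."""
    return {"".join(phi[t] for t in trace) for trace in traces}


# Hypothetical transition ids; the two traces are the firing sequences
# of the net underlying both N3 and N4.
traces = [("t1", "t2", "t3", "t4"), ("t1", "t3", "t2", "t4")]
phi3 = {"t1": "A", "t2": "B", "t3": "C", "t4": "D"}  # labeling scheme of N3
phi4 = {"t1": "E", "t2": "F", "t3": "G", "t4": "H"}  # labeling scheme of N4

behavior_n3 = observed_behavior(traces, phi3)
behavior_n4 = observed_behavior(traces, phi4)
print(sorted(behavior_n3))        # ['ABCD', 'ACBD']
print(sorted(behavior_n4))        # ['EFGH', 'EGFH']
print(behavior_n3 & behavior_n4)  # set()
```

The same net yields two disjoint languages once the labels differ, which is why any measure computed directly on labeled firing sequences treats N3 and N4 as unrelated.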
According to the information-theoretic definition of similarity [10], we can directly construct similarity (or distance) between processes based on the full set of firing sequences. Here, the commonality of two processes is the intersection set of their full sets of firing sequences. The ratio between the intersection set and the union set of their full sets of firing sequences represents the similarity between them. The larger intersection set, the greater the similarity in observed behavior. However, when a WF-net has loop structures, the full set of firing sequences is not finite. For example, we cannot calculate similarity between Process N2 and the other three processes in Figure 1 because Process N2 does not have a finite full set of firing sequences. Moreover, the measure is too rigid [5], i.e., one different step in a sequence invalidates the entire sequence. For example, the full sets of firing sequences of Processes N1 and N3 are {ABD, ACD} and {ABCD, ACBD} respectively. Thus, similarity(N1 , N3 ) = 0, because they have no

common firing sequence, despite sharing the same subsequences.

B. TAR similarity measure

Since the full set of firing sequences of a process may be infinite, we look for a substitute measure based on the transition adjacent relation (TAR) set. A more detailed discussion of TAR similarity can be found in [7]. The concept of TAR is inspired by the definition of the event log relation $>_W$ presented in [11], which shows that a complete set of transition adjacent relations can specify the behavior of a process (e.g., for SWF-nets). This finding motivates us to measure the similarity between business processes based on the TAR set. The formal definitions of TAR and TAR similarity are as follows.

Definition 3 (Transition adjacent relation (TAR)): Let FS be the full set of observed firing sequences of a WF-net N = (P, T, F). Let a, b ∈ T: <a, b> is a transition adjacent relation of N if and only if there is a trace σ = t1 t2 t3 ... tn and i ∈ {1, 2, ..., n − 1} such that σ ∈ FS, ti = a and ti+1 = b. The complete set of TARs in FS is called the TAR set of N.

Definition 4 (TAR similarity): Let N1 and N2 be two WF-nets with initial markings M1 and M2, and let TS1 and TS2 be their TAR sets respectively. Then, the similarity between N1 and N2 is defined as follows:

$$\mathit{similarity}_T((N_1, M_1), (N_2, M_2)) = \frac{|TS_1 \cap TS_2|}{|TS_1 \cup TS_2|}$$

The advantage of using the TAR set is clear: there is always a finite TAR set for any process regardless of its control flow structure. We can therefore calculate similarity_T between all processes shown in Figure 1. The TAR sets of Processes N1, N2 and N3 are {AB, AC, BD, CD}, {AB, BD, BC, CB}, and {AB, AC, BC, CB, BD, CD} respectively. Thus, similarity_T(N1, N3) = 4/6 ≈ 0.67 shows that the two models are similar but not equivalent, and similarity_T(N1, N2) = 2/6 ≈ 0.33 shows that Processes N1 and N3 are more similar than Processes N1 and N2.

There are WF-nets beyond the range of SWF-nets for which the TAR set cannot specify the exact behavior. A well-known example of such WF-nets is non-free-choice nets [12]. Despite this, similarity_T is still valuable for assessing the similarity between processes.
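Definitions 3 and 4 can be sketched in a few lines. The sketch below assumes that firing sequences are given as strings of single-letter labels, and that for the looping Process N2 a finite sample of traces (here ABD and ABCBD, chosen for illustration) is enough to cover its stated TAR set. The helper names `tar_set` and `jaccard` are illustrative and not taken from the paper.

```python
from typing import Iterable, Set, Tuple

Tar = Tuple[str, str]


def tar_set(firing_sequences: Iterable[str]) -> Set[Tar]:
    """Collect every adjacent pair (t_i, t_{i+1}) over the given traces."""
    pairs: Set[Tar] = set()
    for trace in firing_sequences:
        pairs.update(zip(trace, trace[1:]))
    return pairs


def jaccard(a: set, b: set) -> float:
    """Ratio of intersection size to union size (1.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


fs_n1 = {"ABD", "ACD"}          # full set of firing sequences of N1
fs_n3 = {"ABCD", "ACBD"}        # full set of firing sequences of N3
fs_n2_sample = {"ABD", "ABCBD"}  # assumed finite sample of N2's infinite set

ts1, ts2, ts3 = tar_set(fs_n1), tar_set(fs_n2_sample), tar_set(fs_n3)

print(jaccard(fs_n1, fs_n3))        # 0.0: no common firing sequence
print(round(jaccard(ts1, ts3), 2))  # 0.67
print(round(jaccard(ts1, ts2), 2))  # 0.33
```

The first value reproduces the rigidity of the measure on full firing sequences, while the two TAR-based values agree with the corresponding entries of Table I.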

C. Label-free similarity

In the previous section we presented a similarity definition based on the TAR set, which depends on the task labeling scheme. For example, similarity_T(N3, N4) = 0 despite the identical control flow structure, as shown in Figure 1. Since we focus on measuring similarity between the control flow structures of processes, the task labeling schemes should not play the key role. For example, Processes N3 and N4 are the same WF-net except that they have completely different task labeling schemes. To reflect that fact, we need to perform label matching before measuring similarity.

Definition 5 (Label matching): Let N1 and N2 be two WF-nets. Let Γ1 and Γ2 be the sets of labels of N1 and N2

respectively, and sizeof(Γ1) ≤ sizeof(Γ2). Then, the label matching is a 1 to 0/1 mapping between Γ1 and Γ2.

Under each mapping scheme, we can measure the TAR similarity between the two processes. Thus, we define the label-free TAR similarity as the maximum TAR similarity.

Definition 6 (Label-free TAR similarity): Let N1 and N2 be two WF-nets with initial markings M1 and M2. Let Γ1 and Γ2 be the sets of labels of N1 and N2 respectively. Let the sizes of Γ1 and Γ2 be m and n respectively, with m ≤ n. Let I be the number of matching schemes between N1 and N2; then $I = P_n^m$ (the number of m-permutations of n labels, n!/(n − m)!). Let i be a mapping scheme between Γ1 and Γ2. For each i there is a TAR similarity $\mathit{similarity}_T^{\,i}$. Then, the similarity between N1 and N2 is defined as follows:

$$\mathit{similarity}_L((N_1, M_1), (N_2, M_2)) = \max_{i=1}^{I} \mathit{similarity}_T^{\,i}$$

For example, under the mapping scheme between Processes N3 and N4 where A → E, B → F, C → G and D → H, we get the maximum similarity_T = 1. Therefore, similarity_L(N3, N4) = 1, which shows that the two processes have the same observed behavior if they take proper task labeling schemes.

D. An efficient algorithm

A disadvantage of the label-free TAR similarity measure is its high complexity, which is O(n!) (where n is the size of the label set). The cost mainly comes from finding a proper matching between the labels of the two processes. Therefore, it is necessary to find an efficient algorithm for label matching between two processes.

In Section II-A, the behavior of a process is defined as a language on the label set Γ. Since each letter in a language has a frequency, we construct an approximate algorithm based on frequency analysis, which is an elementary approach in cryptanalysis [13]. We first define the label frequency as follows.

Definition 7 (Label frequency): Let Γ be the label set of a WF-net N, and TS be the TAR set of N. Let i be an item in Γ. Then, frequency(i) is the number of times i appears in TS.
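Definition 6 can be read as an exhaustive search over injective label mappings. The following is a minimal sketch of that reading, not the authors' implementation. It assumes that TAR sets are given as sets of label pairs and that every label occurs in at least one TAR, so the label sets can be recovered from the TAR sets; the helper names are illustrative.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Tar = Tuple[str, str]


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def labels_of(ts: Set[Tar]) -> List[str]:
    """Labels occurring in a TAR set (assumes every label occurs in some TAR)."""
    return sorted({label for pair in ts for label in pair})


def relabel(ts: Set[Tar], mapping: Dict[str, str]) -> Set[Tar]:
    return {(mapping[a], mapping[b]) for a, b in ts}


def label_free_similarity(ts_a: Set[Tar], ts_b: Set[Tar]) -> float:
    """Maximum TAR similarity over all injective mappings of the smaller
    label set into the larger one (P(n, m) candidates, hence O(n!))."""
    if len(labels_of(ts_a)) > len(labels_of(ts_b)):
        ts_a, ts_b = ts_b, ts_a  # ensure m <= n
    la, lb = labels_of(ts_a), labels_of(ts_b)
    best = 0.0
    for image in permutations(lb, len(la)):
        mapping = dict(zip(la, image))
        best = max(best, jaccard(relabel(ts_a, mapping), ts_b))
    return best


ts_n1 = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
ts_n3 = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "B"), ("B", "D"), ("C", "D")}
ts_n4 = {("E", "F"), ("E", "G"), ("F", "G"), ("G", "F"), ("F", "H"), ("G", "H")}

print(label_free_similarity(ts_n3, ts_n4))            # 1.0
print(round(label_free_similarity(ts_n1, ts_n4), 2))  # 0.67
```

With four labels on each side the search visits only 4! = 24 mappings, but the factorial growth is what motivates the approximate algorithm.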
For example, the label set of N1 is {A, B, C, D}, and the TAR set of N1 is {AB, AC, BD, CD}, as shown in Figure 1. Then, frequency(B) = frequency(C) = 2.

The algorithm includes four steps. First, the letters in each alphabet are ordered by their frequency in the TAR set. Then, each alphabet is partitioned into pieces according to an input parameter C. After that, the pieces are matched independently, i.e., high-frequency pieces are matched with high-frequency pieces and low-frequency pieces with low-frequency pieces. Finally, we calculate similarity_L according to Definition 6. The algorithm is an approximate algorithm, and its time complexity is $O(((n/C)!)^C)$. Pseudocode for the algorithm is as follows.

//** Input: Two marked WF-nets N1 and N2. Let the label sets be Γ1 = (l1, l2, ..., lm) and Γ2 = (r1, r2, ..., rn) with m ≤ n, where all letters are ordered by frequency in the TAR set, i.e., frequency(li) ≥ frequency(li+1) and frequency(ri) ≥ frequency(ri+1).
//** Output: The approximate similarity_L.
Algorithm (Approximate algorithm for the label-free TAR similarity)
 1. Read C;
 2. IF (C == 1) THEN
 3. {
 4.     M = 1, L = m, L' = n;
 5.     GOTO Step 25;
 6. }
 7. ELSE
 8. {
 9.     Lm = ⌊m/C⌋;
10.     Ln = ⌊n/C⌋;
11. }
//** Partition Γ1 and Γ2 into C pieces respectively
12. FOR i = 0 TO C − 2
13. {
14.     IF (i < m − Lm ∗ C − 1) THEN
15.         L = Lm + 1;
16.     ELSE
17.         L = Lm;
18.     IF (i < n − Ln ∗ C − 1) THEN
19.         L' = Ln + 1;
20.     ELSE
21.         L' = Ln;
22.     Γ1[i] = {l_(i∗L+1), l_(i∗L+2), ..., l_((i+1)∗L−2), l_((i+1)∗L−1)};
23.     Γ2[i] = {r_(i∗L'+1), r_(i∗L'+2), ..., r_((i+1)∗L'−2), r_((i+1)∗L'−1)};
24. }
25. Γ1[C − 1] = {l_((C−1)∗L+1), l_((C−1)∗L+2), ..., l_(m−1), l_m};
26. Γ2[C − 1] = {r_((C−1)∗L'+1), r_((C−1)∗L'+2), ..., r_(n−1), r_n};
//** Match Γ1[i] with Γ2[i]
27. MappingScheme = ∏_{i=1..C} (P_{L'}^{L});
//** Calculate similarity based on the TAR set
28. similarity_L = 0;
29. FOR EACH ms IN MappingScheme
30. {
31.     similarity_T = similarity_T(N1(Γ1 → Γ2), N2);
32.     IF (similarity_T > similarity_L) THEN
33.         similarity_L = similarity_T;
34. }
//** similarity_L is the label-free TAR similarity between N1 and N2

For example, we calculate similarity_L between Processes N3 and N4 in Figure 1. The TAR set of Process N3 is {AB, AC, BC, CB, BD, CD}, thus we get the label sequence with the frequency from high to low as {B(3), C(3), A(2), D(2)}. Similarly, the TAR set of Process N4 is {EF, EG, FG, GF, FH, GH} and the label sequence is {F(3), G(3), E(2), H(2)}. Here, let the input parameter C = 2, so each sequence is partitioned into two pieces. Then, we match {B(3), C(3)} with {F(3), G(3)}, and {A(2), D(2)} with {E(2), H(2)} respectively. The total number of matching schemes is $(2!)^2 = 4$, which is much less than the original 4! = 24. similarity_L is 1 in this case, with the matching scheme B → F, C → G, A → E, D → H.
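The four steps above can be sketched as follows. This is a simplified illustration rather than a transcription of the pseudocode. It assumes that pieces are as even as possible, with any remainder going to the first pieces, and that ties in frequency are broken alphabetically. It counts every occurrence of a label across TAR pairs, which is one reading of Definition 7; other readings give different counts but the same order for N3 and N4. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations, product
from typing import Dict, List, Set, Tuple

Tar = Tuple[str, str]


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def ordered_labels(ts: Set[Tar]) -> List[str]:
    """Labels sorted by descending frequency in the TAR set (ties: by name)."""
    freq = Counter(label for pair in ts for label in pair)
    return sorted(freq, key=lambda label: (-freq[label], label))


def split(labels: List[str], c: int) -> List[List[str]]:
    """Cut an ordered label list into c contiguous, nearly equal pieces."""
    base, extra = divmod(len(labels), c)
    pieces, start = [], 0
    for i in range(c):
        size = base + (1 if i < extra else 0)
        pieces.append(labels[start:start + size])
        start += size
    return pieces


def approx_label_free(ts_a: Set[Tar], ts_b: Set[Tar], c: int) -> Tuple[float, int]:
    """Return (approximate similarity_L, number of mapping schemes tried)."""
    if c < 1:
        raise ValueError("C must be at least 1")
    if len(ordered_labels(ts_a)) > len(ordered_labels(ts_b)):
        ts_a, ts_b = ts_b, ts_a  # ensure m <= n
    pieces_a = split(ordered_labels(ts_a), c)
    pieces_b = split(ordered_labels(ts_b), c)
    # Piece i of the smaller alphabet is matched only against piece i of the larger.
    options = [list(permutations(q, len(p))) for p, q in zip(pieces_a, pieces_b)]
    best, tried = 0.0, 0
    for choice in product(*options):
        mapping: Dict[str, str] = {}
        for piece, image in zip(pieces_a, choice):
            mapping.update(zip(piece, image))
        relabeled = {(mapping[a], mapping[b]) for a, b in ts_a}
        best = max(best, jaccard(relabeled, ts_b))
        tried += 1
    return best, tried


ts_n3 = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "B"), ("B", "D"), ("C", "D")}
ts_n4 = {("E", "F"), ("E", "G"), ("F", "G"), ("G", "F"), ("F", "H"), ("G", "H")}

print(approx_label_free(ts_n3, ts_n4, 2))  # (1.0, 4)
print(approx_label_free(ts_n3, ts_n4, 1))  # (1.0, 24)
```

With C = 1 the sketch degenerates to the exhaustive search of Definition 6, and with C = 2 it tries the four schemes of the worked example. Larger values of C shrink the search space further, at the risk of separating labels that should have been matched.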

III. EXPERIMENTAL EVALUATION

In this section, we first present a comparison of different approaches for similarity measure on a set of small artificial processes. Then, an evaluation of the efficient algorithm is presented.

A. Comparison of different measures

We compare the approaches mentioned in this paper by measuring the similarity between the processes shown in Figure 1. The result is shown in Table I. The comparison includes four similarity measure approaches in total. The TAR similarity and the label-free TAR similarity are proposed in this paper. The approach based on topology structure and the approach based on the full set of firing sequences are adapted from [5], [7].

From the result in Table I, we can see that the measure based on the topology structures of processes generally yields a similarity value that is not consistent with the results of the behavioral measures. For example, Processes N1 and N2 have a high structural similarity (0.78), but their behavioral similarity is low (0.33). The measure on full firing sequences cannot be applied between Process N2 and the other processes because the full set of firing sequences of Process N2 is infinite. Moreover, it is rigid. For example, the similarity between Processes N1 and N3 is 0 despite the shared subsequences in their firing sequences.

The TAR similarity can be calculated between all processes, including Process N2 despite its loop structure. Moreover, the result indicates that similar behavior exists between Processes N1 and N3, because there are identical subsequences in their firing sequences.

Note that all the above approaches give a very low similarity between Process N4 and the other processes. The reason is that Process N4 has a completely different task labeling scheme. Therefore, the above approaches are not label-free. In contrast, the label-free TAR similarity proposed in this paper gives an appropriate similarity between all these processes. For example, the similarity between Processes N3 and N4 is 1, which shows that the label-free TAR similarity focuses on control flow structures and is independent of task labeling schemes.

B. Algorithm evaluation

To illustrate and evaluate the approximate algorithm for the label-free similarity measure, we design the experiment as follows. We calculate similarity_L between a process and itself, assuming that the two copies take different labeling schemes, as shown in Figure 2. The process is adapted from a real-world business process from the TiPLM workflow system (http://www.thit.com.cn/TiPLM/TiPLM.htm) in DongFang Steam Turbine Works Co., Ltd (http://www.dfstw.com/). The process is a typical review process for development graphs, with 33 transitions in total. We rename the labels as T0 to T32 for convenience. The TAR set of the process includes 113 TARs. Thus, we can count the frequency of each label in the TAR set. With

different options of the input parameter C, we get different partitioning and matching schemes, as shown in Figure 3. For example, when C = 3, the transition set is partitioned into 3 pieces: {T13, T14, T16, T17, T18, T19, T20, T21, T1, T15, T22}, {T12, T23, T6, T26, T5, T7, T10, T11, T27, T2, T3} and {T4, T8, T9, T24, T25, T26, T29, T30, T31, T0, T32}. We calculate similarity_L under each partitioning and matching scheme. The result is listed in Table II.

Because labels with the same frequency can appear in any order, the algorithm can derive different values of similarity_L depending on the label order. We call the order that yields the maximum similarity_L the best case, and the order that yields the minimum similarity_L the worst case. The result shows that each scheme leads to a satisfactory result in the best case. In the worst case, however, a small C can lead to a high similarity_L but generates more matching schemes, whereas a large C generates fewer matching schemes but can lead to a low similarity_L. Therefore, in the worst case there is a trade-off between time cost and result quality.

IV. RELATED WORK

Our work is related to existing work on process-aware information systems [14], [1], e.g., business process modelling and business process analysis [9]. In [10], an information-theoretic definition of similarity is proposed which can be used in different application domains to construct domain-specific similarity measures. The similarity or the distance between business processes resembles the similarity or the distance in other domains, such as similarity between strings [15] and similarity between graphs [3], [2]. However, these approaches cannot be used directly in the context of business processes. The work in [16] formally addresses the issue of similarity definition with regard to process variant properties in multiple dimensions, such as structural similarity, behavioral similarity (in terms of execution sequences), and contextual similarity. We focus on behavioral similarity in this paper.

The foundation of behavioral similarity is process equivalence. Several equivalence notions have been proposed in the literature, such as trace equivalence, bisimulation, and branching bisimulation [17], [18], [19], [20]. All of these notions can only provide a true or false answer when they are applied to compare two process models. In contrast, the similarity notion proposed in this paper can tell not only whether two process models have the same behavior, but also how different their behaviors are when they are not equivalent.

Our work is in line with the recent efforts on quantifying similarity between business processes [4], [5], [6], [7]. All of them aim to quantify the difference and measure the similarity between business processes. In [4], [5], two concepts, precision and recall, are defined to describe the similarity between business processes. An approach based on observed behavior (i.e., event logs) is proposed to quantify the similarity between processes, which takes into account not only firing sequences but also their frequencies. Both the

Fig. 2. A real-life business process in WF-net.

Fig. 3. Label frequency and partitioned schemes.

TABLE I. DIFFERENT SIMILARITY MEASURES BETWEEN WF-NETS SHOWN IN FIGURE 1.

                                          Similarity between WF-nets
Similarity measure                        N1,N2   N1,N3   N1,N4   N2,N3   N2,N4   N3,N4
Measure based on topology structure       0.78    0.64    0.13    0.64    0.13    0.18
Measure based on full firing sequences    NA      0       0       NA      NA      0
Measure based on the TAR set              0.33    0.67    0       0.67    0       0
The label-free TAR similarity measure     0.33    0.67    0.67    0.67    0.67    1

TABLE II. RESULT OF DIFFERENT SOLUTIONS OF THE ALGORITHM.

Solution   similarity_L in best case   similarity_L in worst case   Number of mapping schemes
S1         1                           1                            33!
S2         1                           0.81                         (16!)^2
S3         1                           0.87                         (11!)^3
S4         1                           0.35                         (8!)^4
S5         1                           0.33                         (6!)^5

strengths and weaknesses of the approach come from event logs. If there are no such event logs, or if the event logs do not reflect the typical behavior well, the result may be questionable. In [6], a single indicator, sim, is proposed to measure the similarity between EPC processes. It uses causal footprints as the representation of the behavior captured by a process model instead of exploring the state space of the process. In [7], a similarity measure based on the TAR set is proposed. It also discusses how to obtain a complete TAR set of a process. The topic of label matching is discussed in [6], but it only addresses the matching problem between synonyms.

The label-free similarity is based on the work in [7], where the TAR similarity is defined and discussed in detail. All the above approaches require that process models are properly labeled. In this paper, we define a similarity which focuses on measuring the similarity between the control flow structures of two processes and is independent of the task labeling schemes.

V. CONCLUSION AND FUTURE WORK

In this paper, we propose a similarity measure between business processes which is independent of the task labeling schemes. The similarity distinguishes the task labeling scheme from the observed behavior of a business process, and focuses on the control flow structures of processes. The approach may find its advantage in real-life applications where the task labels of a process cannot satisfy the requirements of other similarity measure approaches.

According to the definition, the computation of similarity_L is an optimization problem. In the future, we will investigate whether other optimization techniques can be used to find a better solution.

ACKNOWLEDGEMENTS

The work is supported by the National Basic Research Program of China (No. 2009CB320700), the National High-Tech Development Program of China (No. 2008AA042301 and No. 2007AA040607), the Project of National Natural Science Foundation of China (No. 90718010), and the Program for New Century Excellent Talents in University of China.

REFERENCES

[1] M. Dumas, W. van der Aalst, and A. ter Hofstede, Process-Aware Information Systems: Bridging People and Software through Process Technology. Wiley & Sons, 2005.
[2] K. Zhang, J. Wang, and D. Shasha, “On the editing distance between undirected acyclic graphs,” International Journal of Foundations of Computer Science, vol. 7, no. 1, pp. 43–58, 1996.
[3] H. Bunke and K. Shearer, “A graph distance metric based on the maximal common subgraph,” Pattern Recognition Letters, vol. 19, pp. 255–259, 1998.
[4] W. van der Aalst, A. A. de Medeiros, and A. Weijters, “Process equivalence: Comparing two process models based on observed behavior,” in Proceedings of International Conference on Business Process Management (BPM 2006), Vienna, Austria, Sep. 2006, pp. 129–144.
[5] A. de Medeiros, W. van der Aalst, and A. Weijters, “Quantifying process equivalence based on observed behavior,” Data and Knowledge Engineering, vol. 64, no. 1, pp. 55–74, 2008.
[6] B. van Dongen, R. Dijkman, and J. Mendling, “Measuring similarity between business process models,” in Proceedings of CAiSE 2008, Montpellier, France, Jun. 2008, pp. 450–464.
[7] H. Zha, J. Wang, L. Wen, C. Wang, and J. Sun, “A workflow net similarity measure based on transition adjacency relations,” Technical Report, Tsinghua University, 2009.
[8] T. Murata, “Petri nets: Properties, analysis and applications,” Proceedings of the IEEE, vol. 77, pp. 541–580, 1989.
[9] W. van der Aalst, “The application of Petri nets to workflow management,” The Journal of Circuits, Systems and Computers, vol. 8, no. 1, pp. 21–66, 1998.
[10] D. Lin, “An information-theoretic definition of similarity,” in Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 296–304.
[11] W. van der Aalst, A. Weijters, and L. Maruster, “Workflow mining: Discovering process models from event logs,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1128–1142, 2004.
[12] L. Wen, W. van der Aalst, J. Wang, and J. Sun, “Mining process models with non-free-choice constructs,” Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 145–180, 2007.
[13] I. Al-Kadi, “The origins of cryptology: The Arab contributions,” Cryptologia, vol. 16, no. 2, pp. 97–126, 1992.
[14] W. van der Aalst and K. van Hee, Workflow Management: Models, Methods, and Systems. Cambridge, MA: MIT Press, 2002.
[15] E. Ristad and P. Yianilos, “Learning string-edit distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 5, pp. 522–532, 1998.
[16] R. Lu, S. Sadiq, and G. Governatori, “On managing business processes variants,” Data and Knowledge Engineering, vol. 68, no. 7, pp. 642–664, 2009.
[17] L. Pomello, G. Rozenberg, and C. Simone, “A survey of equivalence notions for net based systems,” Advances in Petri Nets, LNCS, vol. 609, pp. 410–472, 1992.
[18] R. van Glabbeek and W. Weijland, “Branching time and abstraction in bisimulation semantics,” Journal of the ACM, vol. 43, no. 3, pp. 555–600, 1996.
[19] R. Milner, “A calculus of communicating systems,” Lecture Notes in Computer Science, vol. 92, 1980.
[20] J. Hidders, M. Dumas, W. van der Aalst, A. ter Hofstede, and J. Verelst, “When are two workflows the same?” in Proceedings of the 11th Australasian Theory Symposium, vol. 41, Newcastle, Australia, 2005.