Indexing Business Processes based on annotated Finite State Automata

Bendick Mahleko
Fraunhofer Institute
Integrated Publication and Information Systems
D-64293 Darmstadt, Germany
[email protected]

Andreas Wombacher
Center for Telematics and Information Technology
University of Twente
500 AE Enschede, The Netherlands
[email protected]

Abstract

The existing service discovery infrastructure, with UDDI as the de facto standard, is limited in that it does not support more complex searching based on matching business processes. Two business processes match if they agree on their simple services, their processing order, as well as any mandatory or optional requirements for the service. This matching semantics can be formalized by modelling business processes as annotated finite state automata (aFSAs) and deciding emptiness of the intersection aFSA. Computing the intersection of aFSAs and deciding emptiness are computationally expensive, being more than quadratic in the number of states and transitions, and thus do not scale to large service repositories. This paper presents an approach for indexing and matching business processes modeled as aFSAs, for the purpose of service discovery. Evaluation of this approach shows a performance gain of several orders of magnitude over sequential matching and a linear complexity with regard to the data set size.

1 Introduction

The current Web service discovery technology, which is based on the Universal Description, Discovery and Integration (UDDI) standard, offers a simple API to search for simple services and their providers based on attribute/value queries. In particular, attributes like business name, service name, key-ids, category-name, etc., are used as parameters by which discovery of matching service providers is performed. This approach to service discovery is limited in a number of respects. For example, more complex descriptions of services, such as process aspects, Quality of Service (QoS) aspects, semantics, etc., are not supported. It is not possible using the current infrastructure to find service providers with matching business processes, because process semantics is not deployed. In the last couple of years, more elaborate descriptions of services including process information have been developed, for example the Business Process Execution Language for Web Services (BPEL) [1]. In particular, service providers want to be able to express optional and mandatory requirements within their business processes. This means, for example, that a buyer may wish to find a seller that supports its business process in a complementary way, e.g., where cancellation and payment activities are mandatory, meaning a matching seller must support these messages. Thus the buyer wants to express his service description in order to find services that complementarily support his own business process, where all mandatory requirements are covered. Such semantics are not easily expressible using the current descriptions and standards.

Figure 1. Example Business Processes

Figure 1 is a simplified but illustrative example showing five business processes, depicted as annotated finite state automata [14]. The messages used are RosettaNet messages, where i is a purchase order request (PIP3A4), a is a purchase order confirmation (PIP3A4), b is a purchase order cancellation request (PIP3A9), while s and s′ are used for order status request and response respectively (PIP3A5) [13]. In Figure 1, states depicted as circles represent business states. A state with an incoming transition with no source state is a start state. States with concentric circles represent final states, meaning that a sequence from the start state to this state represents an execution sequence that constitutes a valid business process interaction. Arcs connecting two states are transitions, representing a change in business state triggered by a business event, such as receiving a purchase order. Arcs are labeled with messages, representing a message that is sent or received.

The business process A1 represents the following scenario: the business process is triggered by a purchase order request message i. After this message, a purchase order confirmation message a and a purchase order cancellation request message b must be supported. This semantics is denoted through the annotation a AND b, showing that the owner of the business process insists that these messages must be supported by a trading partner at this point. We say a and b are mandatory at this state of the execution. The business process A2 is structurally similar to A1, except that at state 2 the annotation a AND b is missing. Thus the transitions leaving state 2 are not mandatory but optional, which is the default semantics of a choice in automata. Thus either a or b must be fulfilled for A2 to be matched, while with A1 both a and b must be fulfilled for a match to occur. We thus distinguish between mandatory and optional messages within annotated finite state automata. A3, A4 and A5 can be explained similarly. The business process A5 comprises a cycle, caused by the purchase order status request and response, which can occur any number of times once a purchase order has been made. This implies that the number of message sequences is infinite.

The aFSA approach supports this matching definition by means of an emptiness test on the intersection result. However, the computational complexity is more than quadratic, thus the approach does not scale. Therefore, an indexing technique has to deal with infinite sequences and with the evaluation of annotations denoting optional and mandatory messages at various states to determine a match. In this paper we present a novel indexing technique for matching business processes, addressing the outlined problems.
The rest of the paper is structured as follows: Section 2 discusses the matchmaking of business processes and Section 3 presents an abstraction for representing business processes. Section 4 presents our indexing approach and Section 5 describes the evaluation. Related work is presented in Section 6 and conclusions and future work are presented in Section 7.

2 Matching Business Processes

Annotated finite state automata (aFSAs) are used to describe the matchmaking of business processes [14].¹ Since we do not expect that processes are originally specified as aFSAs, mappings can be defined from process specification languages like BPEL to aFSAs (see [15]). An aFSA as introduced above consists of a set of states, a start state, a set of final states, and a set of transitions, which are labeled with a business event, i.e. sending or receiving a message. Further, states are associated with logical expressions using messages as variables to express mandatory and optional messages. These annotations are evaluated during the emptiness test, where a variable is true in case the annotation of the target state, following the transition labeled equally to the variable, evaluates to true. Two aFSAs match if their intersection is not empty, i.e., if there exists at least one common path (message sequence) between the start and final states, where all mandatory and optional messages are fulfilled. The matchmaking of business processes using aFSAs has been described in [14]; therefore, only the definition of an aFSA itself is presented here.

¹ Other workflow models, for example Petri Nets, could also have been applied. Since these approaches are more expressive, the computational complexity of operations is also higher. We decided to keep it simple first.

Definition 2.1 (annotated FSA (aFSA)) An annotated FSA A is represented as a tuple A = (Q, Σ, ∆, q0, F, QA) where
– Q is a finite set of states,
– Σ is a finite set of messages,
– ∆ ⊆ Q × Σ × Q represents transitions,
– q0 ∈ Q is the start state,
– F ⊆ Q is the set of final states, and
– QA ⊆ Q × E is a finite relation of states and logical terms within the set E of propositional logic terms.

As an example, process A1 in Figure 1 insists that both purchase order confirmation a and cancellation b must be supported. This is expressed by the logical term a AND b on state 2 of the aFSA. The business process matchmaking problem can thus be stated as follows: given a collection D of aFSAs representing business processes with D := {A1, . . . , AN}, find the subset R of the collection D containing all aFSAs Ai with a non-empty intersection with the query aFSA A; thus R := {Ai ∈ D | L(Ai) ∩ L(A) ≠ ∅}, where L(Ai) and L(A) are the languages of the aFSAs Ai and A respectively. The detailed description of the intersection and emptiness computation (see [14]) is omitted due to lack of space.

We approach the aim of improving the performance by applying an abstraction φ to the aFSA matchmaking problem. An abstraction always introduces a loss of information, which influences the search result. Therefore, we state optimization goals to be met by an abstraction. First, we must ensure that no false misses are introduced. In particular, the abstraction must guarantee that if the languages of aFSAs A1 and A2, denoted L(A1) and L(A2) respectively, match, the abstracted representations φ(L(A1)) and φ(L(A2)) also match. The second goal is to minimize the false match rate, i.e., cases where the abstractions match while the original aFSAs do not.
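To illustrate why sequential matchmaking is expensive, the non-empty-intersection test on plain finite state automata can be sketched as a breadth-first search over the product automaton. This is our own minimal sketch, not the authors' implementation; it ignores the annotations, which the full aFSA emptiness test additionally evaluates:

```python
from collections import deque

def intersection_nonempty(fsa_a, fsa_b):
    """Test L(A) ∩ L(B) ≠ ∅ by BFS over the product automaton.

    Each FSA is (transitions, start, finals), where transitions is a set of
    (source, label, target) triples. Annotations are not considered here.
    """
    (ta, qa, fa), (tb, qb, fb) = fsa_a, fsa_b
    seen = {(qa, qb)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in fa and q in fb:   # a common word leads to final states in both
            return True
        for (s1, m1, t1) in ta:
            if s1 != p:
                continue
            for (s2, m2, t2) in tb:
                # the product automaton synchronizes on equal message labels
                if s2 == q and m2 == m1 and (t1, t2) not in seen:
                    seen.add((t1, t2))
                    queue.append((t1, t2))
    return False

# A3 from Figure 1 (i then a), intersected with a simplified variant of A2
a3 = ({(1, "i", 2), (2, "a", 3)}, 1, {3})
a2 = ({(1, "i", 2), (2, "a", 3), (2, "b", 4)}, 1, {3, 4})
print(intersection_nonempty(a3, a2))  # True: both accept the sequence <i, a>
```

The product construction visits up to |Q_A| · |Q_B| state pairs per comparison, which is the quadratic cost the indexing approach of this paper avoids.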

3 Abstraction for Indexing aFSAs

The traditional approach used in databases to speed up search operations is to use external indexes to reduce the number of aFSAs from the collection that have to be compared. However, the matching of aFSAs is based on the language rather than the structure of an aFSA, which rules out the application of traditional database indexes. In [10] we presented an approach to indexing the infinite languages of finite state automata (FSAs), which is based on constructing a finite representation of the corresponding language. There, too, the matchmaking criterion was non-empty intersection. We came up with an abstraction in which repeating parts of a single word are removed, so that non-empty intersection can be tested by checking whether the abstracted FSAs have a single word in common. However, false matches do occur due to the loss of information. Therefore the concept of look-back has been introduced to minimize the occurrence of false matches. Look-back, which is similar to the look-ahead operator concept in language theory, allows history information about previously visited states to be recorded and used during the matchmaking operation. The history information is context information about the particular aFSA state. However, this approach does not support the differentiation of mandatory and optional messages: only complete sequences can be compared, and there are no logical expressions to be evaluated at each state of an execution sequence.

3.1 Finite Representation of aFSAs

In [10] the abstraction generated a finite set of words from the infinite language of the corresponding FSA. In this paper we propose an abstraction of aFSAs on the level of states, where each state is represented by its context information denoted as an n-gram, i.e. the concatenation of the transition labels leading to this particular state. Be aware that the state sets of two aFSAs are not comparable, because state names are arbitrary identifiers. However, by adding context information, i.e. the transition labels that have been passed to reach a certain state, the state representation becomes comparable. In the following we introduce the construction of an n-gram based state representation, called an n-gram set, via the n-gram representation of an execution sequence, called an n-gram list.

3.1.1 N-Gram Representation

N-grams [2] are subsequences contained in a message sequence; they have long been used in text indexing approaches, in particular for substring matching [2]. The general idea behind n-grams is to represent a single long string by several strings of fixed length n. The length n controls the complexity of all operations performed on the strings and also influences the precision of the operation result. The definition of n-grams of an aFSA language L(A) for A = (Q, Σ, ∆, q0, F, QA) is as follows:

Definition 3.1 (N-Grams) Let A = (Q, Σ, ∆, q0, F, QA) be an aFSA with language L(A). A message sequence ω ∈ L(A) is transformed to an n-gram representation such that, if ω = < b1 · · · bM > with bi ∈ Σ, then ϕn(ω) = < ℓ0, · · · , ℓN >, where each ℓi = [ai,1 · · · ai,n] is an n-gram for i = 0 · · · N with N = M + 1, and

ai,k :=  $         if i + k ≤ n
        #         if i = N and k = n
        bk+i−n    otherwise

N-grams use two special characters, $ and #, which are not in the input alphabet of the aFSA. They designate the start and end of a message sequence respectively. The special character $ also serves as a placeholder for n-gram positions that are not yet occupied. As an example, if < p > is a valid message sequence of an aFSA, it is represented as the 2-gram list < [$$], [$p], [p#] >. The substrings [$$], [$p] and [p#] are called 2-grams, where the two denotes the number of terminals used in each substring. Each n-gram list represents a message sequence. For a look-back of n, the language derived from an aFSA consists of message sequences, where each message sequence is represented by a list of n+1-grams. With regard to aFSA A5 in Figure 1, the message sequence < iss′ss′b > can be represented as the following 2-gram list: < [$$], [$i], [is], [ss′], [s′s], [ss′], [s′b], [b#] >. An algorithm deriving the n-gram representation from finite state automata has been presented in [10] and cannot be restated here due to lack of space. Formally, the language consisting of n+1-gram lists derived from an aFSA A with context information based on a look-back of n can be represented as Φn(L(A)), where

Φn(L(A)) := ⋃_{ω ∈ L(A)} {ϕn(ω)}    (1)

The language derived from the abstracted aFSA is potentially infinite. We want to ensure that the language is finite, so that it can be used to create an index. The next definition removes duplicate n-grams from the language Φn(L(A)) of the abstracted aFSA to make it finite, and thus usable in a database index.

Definition 3.2 (Duplicate Removal) Let A = (Q, Σ, ∆, q0, F, QA) be an aFSA with language L(A), and let Φn(L(A)) := {< ℓ0, · · · , ℓk >} be the potentially infinite language of A's abstracted aFSA, computed with a look-back of n. A finite language Ψ(Φn(L(A))) is computed from Φn(L(A)) by removing duplicates:

Ψ(Φn(L(A))) := {< ℓ′0, · · · , ℓ′k′ >} with k′ ≤ k, where ℓ′j := ℓj if there is no i < j with ℓi = ℓj, and ℓj is omitted otherwise.

If duplicates exist in the abstracted language Φn(L(A)), they are removed. With regard to the example 2-gram list < [$$], [$i], [is], [ss′], [s′s], [ss′], [s′b], [b#] > derived from aFSA A5 in Figure 1, the second occurrence of the 2-gram [ss′] is removed, resulting in the following 2-gram list: < [$$], [$i], [is], [ss′], [s′s], [s′b], [b#] >. Operationally, duplicates can be removed during the construction of the n-gram lists, without first creating a potentially infinite language.
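Definitions 3.1 and 3.2 can be sketched operationally as follows. This is our own illustrative sketch; messages are modeled as strings so that multi-character labels like s′ work:

```python
def ngram_list(sequence, n):
    """ϕ_n: pad with n '$' markers, close with '#', slide a window of size n."""
    padded = ["$"] * n + list(sequence) + ["#"]
    return ["".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]

def remove_duplicates(ngrams):
    """Ψ: drop every n-gram that already occurred earlier in the list."""
    seen = set()
    out = []
    for g in ngrams:
        if g not in seen:
            seen.add(g)
            out.append(g)
    return out

# The message sequence <i s s' s s' b> of aFSA A5 with a look-back of 1 (2-grams)
grams = ngram_list(["i", "s", "s'", "s", "s'", "b"], 2)
# grams == ['$$', '$i', 'is', "ss'", "s's", "ss'", "s'b", 'b#']
print(remove_duplicates(grams))
# ['$$', '$i', 'is', "ss'", "s's", "s'b", 'b#']  (second "ss'" removed)
```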

3.1.2 N-Gram Sets

We have so far described an aFSA language abstraction based on n-gram lists. As stated at the beginning of this section, the evaluation of annotations associated with a state requires an abstraction based on states rather than complete execution sequences. However, an n-gram list represents an execution sequence, where each n-gram is related to a particular state. Therefore we introduce a set representation of n-gram lists, being an abstraction of the set of states passed in an execution sequence. The next definition introduces the set abstraction of a language.

Definition 3.3 (N-Gram Set Abstraction of a Language) Let A = (Q, Σ, ∆, q0, F, QA) be an aFSA and Ψ(Φn(L(A))) := {< ℓ′0, · · · , ℓ′k′ >} the abstracted aFSA language with duplicates removed. The n-gram set abstraction of the language is τ(Ψ(Φn(L(A)))), where:

τ(Ψ(Φn(L(A)))) := ⋃_{< ℓ′0, ··· , ℓ′k′ > ∈ Ψ(Φn(L(A)))} ⋃_{i=0}^{k′} {ℓ′i}

Definition 3.3 converts the finite language obtained after the removal of duplicate n-grams into a set of n-grams, thus ignoring the order in which the individual n-grams appear. The resulting language is therefore a set of n-grams.
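A minimal sketch of the set abstraction τ of Definition 3.3, assuming the n-gram lists are given as Python lists:

```python
def ngram_set(ngram_lists):
    """τ: union of all n-grams over all n-gram lists, discarding their order."""
    result = set()
    for lst in ngram_lists:
        result.update(lst)
    return result

# Two execution sequences (e.g. the two branches of A1) collapse into one set
print(sorted(ngram_set([["$$", "$i", "ia", "a#"], ["$$", "$i", "ib", "b#"]])))
# ['$$', '$i', 'a#', 'b#', 'ia', 'ib']
```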

3.1.3 Complex Cycles

Above we motivated the usage of n-grams as a way to represent the infinite language of an aFSA by a finite set of n-gram lists. However, there are different types of cycles: simple cycles, i.e. cycles that do not share a single state with any other cycle contained in an aFSA, and complex cycles, where at least two cycles contained in an aFSA share a state. An example of a complex cycle is depicted in Figure 2: two self-loops labeled a and b share the only available state. It turns out that for increasing look-back n the number of n-gram lists explodes even for this very simple example². Therefore, in the following we focus on aFSAs with simple cycles and will consider complex cycles in future work.

Figure 2. Example aFSA with Complex Cycles

Be aware that most real-life business process specifications are based on a combination of descriptions with no cycles as well as those with simple cycles, as exemplified by business process descriptions from RosettaNet [13]. We believe that business processes with complex cycles are highly unlikely within the problem domain addressed in this paper.

3.2 Representing Annotations

Logical expressions representing annotations of states must be represented in the repository by associating them with n-grams as a representation of states. Thus a relation between an aFSA state and the set of n-grams leading into this state has to be constructed. When traversing an aFSA graph, at every state the current n-grams related to that state can be determined. The look-back information is the context information supporting the differentiation of states. We formally describe a function θ for mapping aFSA states to sets of n-grams below.

Definition 3.4 (Mapping N-Grams to aFSA States) Let A = (Q, Σ, ∆, q0, F, QA) be an aFSA. The n-grams derived from A via the abstraction τ(Ψ(Φn(L(A)))) are related to the states Q by a function θ, where:

θ(q) := { [a1 . . . an] | [a1 . . . an] ∈ τ(Ψ(Φn(L(A)))) ∧ q1 −a1→ q2, . . . , qn −an→ q ∈ ∆ }    (2)

with q, q1, . . . , qn ∈ Q and ai ∈ Σ. Definition 3.4 implies that we can derive an n-gram from any target state of a transition. Based on this relation between states and n-grams, annotations can be represented in the n-gram abstraction of an aFSA.

We have illustrated that, using the series of abstractions φn(L(A)) ≡ τ(Ψ(Φn(L(A)))) on an aFSA represented by its language L(A), the resulting language is finite. For the rest of the paper, we will use φn as an abbreviation of this abstraction function. We can also show that, using the abstractions, no false misses are introduced. In addition, we can illustrate that false matches can be kept minimal by increasing the look-back. The discussion of these properties will be contained in the journal version of this paper. Next, we introduce the indexing approach based on this abstraction.

² For n = 0, 1, 2 there can be 5, 53 and 2555 n-gram lists constructed.
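The mapping θ from states to their n-gram contexts (Definition 3.4) can be sketched as a fixpoint propagation over the transition relation. This is our own sketch, assuming single-character labels so that string slicing yields the look-back window; it omits the end-marker (#) n-grams, which belong to the sequence abstraction rather than to real states:

```python
from collections import deque

def theta(transitions, start, n):
    """Map each state to the set of n-grams (label windows) leading into it."""
    context = {start: {"$" * n}}   # the start state has only padding context
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for (source, label, target) in transitions:
            if source != q:
                continue
            # shift the label into each context window, keep the last n symbols
            new = {(c + label)[-n:] for c in context[q]}
            if not new <= context.get(target, set()):
                context.setdefault(target, set()).update(new)
                queue.append(target)   # re-process until a fixpoint is reached
    return context

# aFSA A1 from Figure 1 (i, then mandatory a and b), look-back of 1 (2-grams)
a1 = {(1, "i", 2), (2, "a", 3), (2, "b", 4)}
ctx = theta(a1, 1, 2)
print(ctx[2], ctx[3], ctx[4])  # {'$i'} {'ia'} {'ib'}
```

The subset check before re-queueing makes the propagation terminate on simple cycles, since the context sets saturate.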

4 Indexing aFSAs

The indexing mechanism for aFSAs is based on an n-gram index, relating an aFSA to the collection of n-grams derived from it, and an annotation index, relating an aFSA and its derived n-grams to the annotations of each particular n-gram. The two indexes are used independently by the index-based matching algorithm. In particular, the algorithm is based on a depth-first traversal of the query aFSA, which terminates when a final state is reached. As an initialization of the backtracking, all aFSAs in the collection are considered matching. During the backtracking, in each query aFSA state, aFSAs are removed from the intermediary result set if (i) they do not support the n-gram related to the current state, or (ii) they support the n-gram but have non-matching annotations. This calculation makes use of the two indexes. The intermediary results of the states reachable from the current state are used to evaluate the annotations of the current state of the query aFSA. The final result of the query is the intermediary result calculated at the start state of the query aFSA.

Figure 3. Example aFSAs Based on 2-grams

In the following the two indexes are briefly introduced based on an example, and finally an example run of the search algorithm is discussed. Figure 3 shows a collection of four aFSAs, A1 to A4, being the 2-gram representation of the aFSAs depicted in Figure 1. Each aFSA has a 2-gram associated with each state, representing a look-back of one. As already explained, the default semantics of a choice in automata is a disjunction, so disjunctions are not explicitly represented in aFSAs. For example, the difference between aFSA A1 and aFSA A2 (Figure 3) is that in state 2 both transition choices must be followed in A1, while in A2 the two transitions leaving state 2 represent optional choices. In terms of matchmaking, aFSA A1 matches aFSA A2, because all mandatory transitions of A1 are supported by A2, leading to final states. However, A1 does not match A3, because A3 does not support the mandatory transition labeled b at state 2 (Figure 3).

4.1 N-Gram Index

The n-gram index relates the n-grams computed using φn to the corresponding aFSA. Thus, the index consists of a classical relational database table containing an aFSA identifier and an n-gram. This can be formally stated as:

Definition 4.1 (N-Gram Index) Let the database be a collection D of aFSAs with D := {A1, . . . , AN}. The n-gram index IM is:

IM := ⋃_{Ai ∈ D} ⋃_{ℓ ∈ φn(L(Ai))} {(Ai, ℓ)}    (3)

The query used in the algorithm is based on the n-gram ℓ′ representing the current state and can be denoted in relational algebra as

R1(ℓ′) := Π_{aFSA}(σ_{ℓ = ℓ′}(IM))

where aFSA is an attribute of IM representing an aFSA and ℓ is an attribute representing an n-gram in table IM. Table 2 shows the 2-gram index for the aFSAs given in Figure 3. The table shows, for example, that all aFSAs in Figure 3 contain the 2-gram $$ as well as $i.

4.2 Annotation Index

We will assume in this paper that logical expressions, whose expressiveness is that of propositional logic, are in disjunctive normal form (DNF). This assumption is reasonable because, in logic, every statement can be expressed in disjunctive normal form [11]. For each disjunct of an expression in DNF, a set comprising all conjuncts is computed. As an example, for an expression in DNF such as (a∧b∧c)∨(b∧c)∨(a), the sets {a, b, c}, {b, c} and {a} will be constructed respectively from (a ∧ b ∧ c), (b ∧ c) and (a). However, the sets of annotations cannot be used as an index directly, since set intersection and complement are not supported in relational databases. Thus, we encode each message as a unique bit vector, which allows a set of messages to be represented as a bit vector constructed by the disjunction of

Table 1. Message Mapping to Bit Vectors

message name    bit vector
a               0001
b               0010
i               0100
#               1000

the individual message bit vectors. For the example above, messages can be mapped to bit vectors of fixed length four, as shown in Table 1 (which maps the messages a, b, i and # used in Figure 3). The length of the bit vectors can be increased if required by introducing an extra byte (or more if necessary) and reorganizing the index. The annotation {a, b} is mapped to the bit vector 0011 using the bitwise or operator (0001 or 0010 = 0011), where a ↦ 0001 and b ↦ 0010 as shown in Table 1. Based on this mapping the annotation index can be defined formally:

Definition 4.2 (Annotation Index) Let the database be a collection D of completely annotated aFSAs with D := {A1, . . . , AN} and Ai = (Qi, Σi, ∆i, q0,i, Fi, QAi). The annotation index IA is:

IA := ⋃_{Ai ∈ D} ⋃_{q ∈ Qi} ⋃_{ℓ ∈ θ(q)} ⋃_{i=1}^{m} {(Ai, ℓ, di)}    (4)

where (q, e) ∈ QAi and e := d1 ∨ · · · ∨ dm, with di being the bit vector representations of the disjuncts of the expression e in DNF.

The query used in the algorithm is based on the n-gram ℓ′ representing the current state and the bit vector representation d′ of a disjunct of the annotation of the current state. The query can be denoted in relational algebra as

R2(ℓ′, d′) := Π_{aFSA}(σ_{(ℓ = ℓ′) ∧ (d ∩ d′ ≠ ∅) ∧ (d ∩ d̄′ ≠ ∅)}(IA))

where d̄′ is the negation of d′, and aFSA, ℓ, and d are attributes of the table representing the IA index. The selection operator has three predicates, joined by a logical AND. The predicate ℓ = ℓ′ selects aFSAs from the collection with the same n-gram as the query aFSA. The predicate d ∩ d′ ≠ ∅ selects aFSAs with an annotation sharing a variable with the corresponding query annotation. An example of such annotations are a AND c and a AND b, mapping to the sets {a, c} and {a, b} respectively, where one of the annotations belongs to an aFSA from the collection and the other to the query aFSA. Based on the sets {a, c} and {a, b}, the variable a is shared by both aFSAs, so the predicate evaluates to a non-empty set. The predicate d ∩ d̄′ ≠ ∅ is needed to ensure that only aFSAs whose annotations completely match are added to the result set. This is achieved by checking that, for a conjunction, a matching annotation contains the same conjuncts. For example, a AND c and a AND b must not match: though they share the common variable a, the other conjuncts differ, so they are not considered a solution of the matchmaking. In other words, a service finder who insists on support for messages a and b (a AND b) is not satisfied to receive a and c (a AND c), because even though a is supported, b is not. The effect of the three predicates together is to eliminate aFSAs that have matching n-grams but whose corresponding annotations do not match. Table 3 shows the association of annotations with aFSAs and n-grams for the aFSAs given in Figure 3. The table contains the set representation of the annotations as well as the bit vector encoding based on the mapping in Table 1. For example, state 1 of aFSA A1 in Figure 3 is associated with annotation i, i.e., the label of the only transition leaving state 1; this is represented by the first row of the annotations table (Table 3). State 2 of aFSA A1 is associated with annotation a AND b, represented as the set {a, b} and depicted in row five. The n-grams a# and b# are associated with no outgoing transitions, and thus are not represented in the annotations index.
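The two bit-vector predicates of R2 can be sketched with plain integer bitwise operations. This is our own sketch; it uses the mapping of Table 1 plus a hypothetical extra bit for the message c from the example above:

```python
# Bit vectors per Table 1; 'c' is an extra, hypothetical message for the example
BV = {"a": 0b00001, "b": 0b00010, "i": 0b00100, "#": 0b01000, "c": 0b10000}
MASK = 0b11111  # all known message bits

def encode(messages):
    """Map a conjunct set such as {a, b} to its bit vector via bitwise or."""
    vector = 0
    for m in messages:
        vector |= BV[m]
    return vector

def r2_predicates(d, d_query):
    """The two annotation predicates of R2 on an indexed disjunct d."""
    shares_variable = (d & d_query) != 0        # d ∩ d' ≠ ∅
    has_extra = (d & ~d_query & MASK) != 0      # d ∩ d̄' ≠ ∅
    return shares_variable and has_extra        # True → annotations mismatch

# a AND c vs. a AND b: variable a is shared, but c is an unmatched conjunct
print(r2_predicates(encode({"a", "c"}), encode({"a", "b"})))  # True → no match
# a vs. a AND b: a is shared and nothing extra remains, so R2 does not select it
print(r2_predicates(encode({"a"}), encode({"a", "b"})))       # False
```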

4.3 Example

We now illustrate the algorithm by considering aFSA A3 as the query applied to the remaining aFSAs (A1, A2, A4) depicted in Figure 3. As stated in the introduction of this section, a depth-first traversal is performed, initiated at the start state 1, passing the states 2 and 3, and reaching the final state 4. Since a final state terminates the recursion, the intermediate result is derived by querying IM with the 2-gram representing this state, i.e. a#. The result set of the query contains all remaining aFSAs of the collection. Annotations do not have to be considered, since a final state has no outgoing transitions. The intermediary result is added as the first step of the backtracking ((1) in Table 4). In the next step of the backtracking (state 3), the result of state 4 is used as a variable assignment for message #, connecting states 3 and 4 ((2) in Table 4). The new intermediary result of state 3 (ia) is calculated by int := v(#) ∩ (R1(ia) \ R2(ia, #)), resulting in {A1, A2, A4} as the new intermediary result ((3) in Table 4).

Table 2. 2-gram table (IM)

aFSA  2-gram      aFSA  2-gram
A1    $$          A1    ib
A2    $$          A2    ib
A3    $$          A4    ab
A4    $$          A4    ii
A1    $i          A1    a#
A2    $i          A1    b#
A3    $i          A2    a#
A4    $i          A2    b#
A1    ia          A3    a#
A2    ia          A4    a#
A3    ia          A4    b#
                  A4    ia

Table 3. Annotations table (IA)

aFSA  2-gram  annotation           aFSA  2-gram  annotation
A1    $$      {i} (0100)           A4    ia      {b} (0010)
A2    $$      {i} (0100)           A4    ii      {a} (0001)
A3    $$      {i} (0100)           A1    ia      {#} (1000)
A4    $$      {i} (0100)           A2    ia      {#} (1000)
A1    $i      {a, b} (0011)        A3    ia      {#} (1000)
A2    $i      {a} (0001)           A4    ia      {#} (1000)
A2    $i      {b} (0010)           A1    ib      {#} (1000)
A3    $i      {a} (0001)           A2    ib      {#} (1000)
A4    $i      {a} (0001)           A4    ab      {#} (1000)
A4    $i      {i} (0100)

In the next step of the backtracking (state 2), the result of state 3 is used as a variable assignment for message a, connecting states 2 and 3 ((4) in Table 4). The new intermediary result of state 2 ($i) is calculated by int := v(a) ∩ (R1($i) \ R2($i, a)), resulting in {A2, A4} as the new intermediary result ((5) in Table 4). The aFSA A1 is excluded since R2($i, a) = {A1}. In the remaining step of the backtracking no further change happens, so we skip its explanation. The final result is the intermediate result of the initial state, i.e. {A2, A4}, indicating the matching aFSAs. Be aware that A2 is a correct match while A4 is a false match; however, increasing the look-back by one resolves this false match.
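The complete query evaluation of Table 4 can be reproduced with a small script over the two index tables. This is our own sketch of the algorithm, with Tables 1–3 encoded as Python literals and the query aFSA A3 excluded from the indexed collection:

```python
# Bit vectors for the messages of Table 1: a=0001, b=0010, i=0100, #=1000
BV = {"a": 0b0001, "b": 0b0010, "i": 0b0100, "#": 0b1000}
MASK = 0b1111

# N-gram index IM (Table 2), with the query aFSA A3 left out of the collection
IM = {("A1", "$$"), ("A2", "$$"), ("A4", "$$"),
      ("A1", "$i"), ("A2", "$i"), ("A4", "$i"),
      ("A1", "ia"), ("A2", "ia"), ("A4", "ia"),
      ("A1", "ib"), ("A2", "ib"), ("A4", "ab"), ("A4", "ii"),
      ("A1", "a#"), ("A1", "b#"), ("A2", "a#"), ("A2", "b#"),
      ("A4", "a#"), ("A4", "b#")}

# Annotation index IA (Table 3): (aFSA, 2-gram, disjunct bit vector)
IA = {("A1", "$$", BV["i"]), ("A2", "$$", BV["i"]), ("A4", "$$", BV["i"]),
      ("A1", "$i", BV["a"] | BV["b"]), ("A2", "$i", BV["a"]),
      ("A2", "$i", BV["b"]), ("A4", "$i", BV["a"]), ("A4", "$i", BV["i"]),
      ("A4", "ia", BV["b"]), ("A4", "ii", BV["a"]),
      ("A1", "ia", BV["#"]), ("A2", "ia", BV["#"]), ("A4", "ia", BV["#"]),
      ("A1", "ib", BV["#"]), ("A2", "ib", BV["#"]), ("A4", "ab", BV["#"])}

def R1(g):
    """aFSAs supporting the n-gram g."""
    return {A for (A, l) in IM if l == g}

def R2(g, d_query):
    """aFSAs whose annotation for g shares a variable but has extra conjuncts."""
    return {A for (A, l, d) in IA
            if l == g and d & d_query and d & ~d_query & MASK}

# Backtracking over the query A3: final state 4 (a#), then 3 (ia), 2 ($i), 1 ($$)
result = R1("a#")                          # step (1): {A1, A2, A4}
result &= R1("ia") - R2("ia", BV["#"])     # state 3, steps (2)-(3)
result &= R1("$i") - R2("$i", BV["a"])     # state 2, steps (4)-(5): A1 drops out
result &= R1("$$") - R2("$$", BV["i"])     # start state, steps (6)-(7)
print(sorted(result))  # ['A2', 'A4'] — A2 a correct match, A4 a false match
```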

Table 4. Query Evaluation Example

state  result               variable mapping
1      (7) {A2, A4}         (6) <i → {A2, A4}>
2      (5) {A2, A4}         (4) <a → {A1, A2, A4}>
3      (3) {A1, A2, A4}     (2) <# → {A1, A2, A4}>
4      (1) {A1, A2, A4}     ∅

5 Evaluation in Web Services Domain

The data used for the experiments was business process data derived from RosettaNet PIP atomic processes. The processes were generated from RosettaNet PIPs based on RosettaNet process generation rules. We implemented a configurable tool which allowed variation of the input parameters used to generate process data. The number of processes used was between 100 and 600. The goal of the experimental evaluation was to characterize indexed search complexity, performance, and the quality of search results.

5.1 Environment of the Experiments

The indexing engine was implemented based on J2EE container-managed persistence (CMP) and hosted in a JBoss 3.2.3 application server. The experiments were conducted on a Dell machine with a Pentium 4 processor (2.00 GHz clock speed), 1000 MB RAM and 74 GB total disk space. MySQL server version 4.0 was used as the database engine. The machine ran the Windows XP operating system. No buffering/caching of query results was allowed, and all tests were run under cold-start conditions.

5.2 Evaluation Results

This section presents the experimental results and their analysis. The results show a significant performance gain over sequential scanning, and the performance factor of the index over sequential scanning increased as the data set size grew. The results also show that the index search time varies linearly with the data set size, indicating that the index scales with the data set size. The number of false matches was very low and, as anticipated, no false misses were reported in any of the experiments that were carried out. The results are discussed in more detail in the following subsections.

As anticipated, no false misses were reported for the index search approach. This result was established by comparing the output of sequential scanning with that of the index search on the same data sets: for every input aFSA, all sequential scan results were also returned by the index search. The performance results in Figure 4 show that the index search performs much better than sequential scanning, with performance factors over sequential scanning ranging from a median of 34 to 246. The best median performance factor of 246 occurred for the largest data set of 600 aFSAs and a look-back of 3. The explanation is that the effort to perform a sequential scan rises at least linearly with the database size (the complexity of computing an intersection is the product of the query size and the database size). For the index search, querying the database once is logarithmic in the size of the database; however, the database has to be queried several times, and the additional overhead of merging intermediary results is added. Thus, as the database size increases, the time needed for sequential scanning grows much faster than the time needed for the index search.

Figure 4. Performance Gain vs Data Set Size

Figure 4 also shows how the median performance factors changed with the data set size for different look-backs. The look-back of 0 starts off with a performance gain of 77 for the 100 aFSA data set, rises to 98 for the 200 aFSA data set, and then falls to 93 and 76 for the 400 and 600 aFSA data sets respectively. The same behavior can be observed for the look-back of 1, which starts off with a performance factor of 78 for the 100 aFSA data set, peaks at 129 for the 400 aFSA data set, and then falls to 116 for the largest data set of 600 aFSAs. The look-back of 2 shows overall increasing performance factors of 78, 133, 214 and 218 for the 100, 200, 400 and 600 aFSA data sets respectively. The peak-and-fall behavior of the curves for look-backs 0 and 1 (Figure 4) can be explained as follows: for these look-backs, the context information used in matchmaking operations is minimal, resulting in a high number of false matches (see Figure 5). This means that the intermediary results are large due to the high false match rate, which significantly affects the overall search performance.

Since the data collection was increased by adding copies of the same data, to keep the complexity of the collection constant, the number of false matches was multiplied as the collection grew, thus multiplying the search space by the same factor. The result is a deterioration of the performance factor when the false matches keep multiplying, as shown in Figure 4 for look-backs of 0 and 1. The search space is reduced by increasing the look-back; in our experiments, look-backs above 3 showed an increasing performance gain, as shown in the same figure.

Figure 5. False Match Rate
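The index search queries the database once per query n-gram and merges the intermediary candidate sets, which is where the overhead discussed above arises. A simplified sketch of that lookup loop, assuming an inverted index mapping 2-grams to identifiers of the stored aFSAs that contain them (a deliberate simplification of the paper's actual index structures), for a single abstracted query sequence:

```python
def candidate_matches(query_grams, inverted_index):
    """Query the index once per n-gram of one query sequence and
    intersect the intermediary results.  A stored process can only
    match this sequence if it contains all of its n-grams, so the
    outcome is a superset of the true matches: false matches are
    possible, false misses are not."""
    candidates = None
    for gram in set(query_grams):
        ids = inverted_index.get(gram, set())
        candidates = ids if candidates is None else candidates & ids
        if not candidates:          # early exit: no stored process can match
            break
    return candidates or set()

# Hypothetical index: process ids 1-3 with the 2-grams they contain.
index = {("i", "a"): {1, 2, 3}, ("a", "b"): {2, 3}, ("b", "c"): {3}}
print(candidate_matches([("i", "a"), ("a", "b")], index))  # {2, 3}
```

For a query with several abstracted sequences, the per-sequence candidate sets would be combined by union, and the surviving candidates verified with the exact aFSA intersection test.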

Figure 5 shows how the false match rate varies with the look-back. With a look-back of 0, the false match rate was rather high at 831%. When the look-back was increased to 1, the false match rate dropped to 108%; for a look-back of 2 it dropped to 39%, before stabilizing at the 5% level for look-backs of 3 and 4. In the data collection used for the experiments, the best performance factor of 246 thus coincides with a good false match rate of 5%. These results are consistent with the theoretical approach: when the look-back is zero, no context information is taken into account, which means that individual messages are compared during query evaluation, so the chances that a match is found are high. Increasing the look-back reduces the ambiguity of which sequences match, because more context information is taken into account during matching.
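A false match rate above 100%, as reported for look-back 0, indicates that false matches are counted relative to the number of true matches rather than all candidates. A plausible sketch of that computation (the paper does not restate its exact definition, so this is an assumption; the numbers below are illustrative only):

```python
def false_match_rate(index_results, true_matches):
    """False matches are index candidates that the exact matcher
    rejects.  Expressed relative to the number of true matches
    (assumed non-empty), the rate can exceed 100%."""
    false_matches = index_results - true_matches
    return 100.0 * len(false_matches) / len(true_matches)

# Illustrative only: 5 true matches among 12 index candidates.
print(false_match_rate(set(range(12)), set(range(5))))  # 140.0
```

Under this reading, the 831% rate at look-back 0 means the index returned more than eight false candidates for every true match, each of which had to be eliminated by the expensive exact test.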

6 Related Work

This paper presented an indexing approach for efficient matchmaking of business processes, where business processes are modeled as aFSAs. In [10] we present an indexing approach for finite state automata which cannot differentiate mandatory and optional transitions. Further, the approach presented in this paper is based on n-gram sets, while the approach in [10] is based on n-gram lists. The main contribution of this paper is the handling of annotations and their combination with n-gram sets in an efficient and scalable way.

Table 5 summarizes additional related work in this domain. The columns denote the following properties: C1 - finite language, C2 - infinite language, C3 - intersection operation, C4 - query evaluation complexity, and C5 - the number of false matches. A check mark (✓) means that the index supports the feature; a cross (✕) means that it does not.

Table 5. Indexing Techniques Assessment

  Property\approach    C1   C2   C3   C4     C5
  B+-tree, Hashing     ✓    ✕    ✕    N/A    N/A
  OODB indexes         ✓    ✕    ✕    N/A    N/A
  GraphGrep            ✕    ✓    ✕    N/A    N/A
  DataGuides           ✓    ✕    ✕    N/A    N/A
  1-2-T-Index          ✓    ✕    ✕    N/A    N/A
  Set Indexing         ✓    ✓    ✓    High   High
  Patricia trie        ✓    ✕    ✕    N/A    N/A
  RE-trees             ✓    ✓    ✓    High   None

As Table 5 shows, nearly all of the indexing techniques can express finite languages, but only set indexing with signature trees [9] and RE-trees [4] are both capable of expressing the infinite languages that can occur in business process specifications and able to support intersection queries. However, both exhibit poor query evaluation performance for intersection-type queries, owing to the high computational complexity of evaluating such queries. In addition, signature trees exhibit a high number of false matches due to the information lost when encoding messages as fixed-length bit vectors. Table 5 also shows that Patricia tries [5], DataGuides [7], the 1-Index, 2-Index and T-Index [12], GraphGrep [6], object-oriented database indexes [3] and B+-tree/hash indexes [8] do not support the intersection and emptiness test operations that are necessary for the problem motivated in this paper.
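The intersection and emptiness test underlying the matching semantics can be pictured on plain finite state automata, ignoring the annotations for brevity. A minimal sketch, assuming each automaton is given as a (start state, accepting states, transition dict) triple; the buyer/seller processes below are hypothetical examples, not taken from the paper's data:

```python
from collections import deque

def intersection_nonempty(fsa1, fsa2):
    """Decide whether two FSAs accept a common word via breadth-first
    search over the lazily built product automaton.  The product has
    up to |Q1| * |Q2| states, which is why this test is expensive
    when applied sequentially to a large repository."""
    (s1, acc1, d1), (s2, acc2, d2) = fsa1, fsa2
    seen, queue = {(s1, s2)}, deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        if p in acc1 and q in acc2:
            return True          # both accept: a common word exists
        for sym, p2 in d1.get(p, {}).items():
            q2 = d2.get(q, {}).get(sym)   # joint step on the same symbol
            if q2 is not None and (p2, q2) not in seen:
                seen.add((p2, q2))
                queue.append((p2, q2))
    return False

# Hypothetical processes: the seller additionally supports "cancel".
buyer  = (0, {2}, {0: {"order": 1}, 1: {"pay": 2}})
seller = (0, {2}, {0: {"order": 1}, 1: {"pay": 2, "cancel": 2}})
print(intersection_nonempty(buyer, seller))  # True
```

The full aFSA test additionally checks the logical annotations on each reached product state, which is exactly the per-state work the bit-vector index is designed to avoid repeating for every repository entry.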

7 Conclusions and Future Work

This paper presented an approach for indexing business processes for efficient matching in Web service infrastructures. The approach is based on (i) an abstraction mechanism that reduces the number of message sequences to be compared to a finite number which can be indexed, and (ii) a bit-vector-based index mechanism for indexing and evaluating the logical expressions annotating aFSA states. The input query is a business process modeled as an aFSA, and the repository is a collection of business processes to be matched, also modeled as aFSAs. The paper described the formal structures needed for storing and evaluating queries, as well as an experimental evaluation of the approach. The evaluation results show a search performance improvement of several orders of magnitude over sequential searching for the best, average and worst cases, and they also show that the indexing mechanism scales well with increasing data set sizes. Future work will explore other models for representing business processes for matchmaking and their efficient implementation, as well as models based on similarity measures and mechanisms for ranking matchmaking results. The indexing of annotated finite state automata with complex cycles has not been addressed in this paper and will also be part of future work.

References

[1] T. Andrews, F. Curbera, H. Dholakia, Y. Goland, J. Klein, F. Leymann, K. Liu, D. Roller, D. Smith, S. Thatte, I. Trickovic, and S. Weerawarana. Business Process Execution Language for Web Services, version 1.1, May 2003.
[2] R. A. Baeza-Yates. Text retrieval: Theory and practice. In J. van Leeuwen, editor, Proceedings of the 12th IFIP World Computer Congress, pages 465-476, Madrid, Spain, 1992. North-Holland.
[3] E. Bertino and P. Foscoli. Index organizations for object-oriented database systems. IEEE Transactions on Knowledge and Data Engineering, 7(2):193-209, 1995.
[4] C. Chan, M. Garofalakis, and R. Rastogi. RE-tree: An efficient index structure for regular expressions. The VLDB Journal, 12(2):102-119, 2003.
[5] B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. A fast index for semistructured data. In The VLDB Conference, pages 341-350, 2001.
[6] R. Giugno and D. Shasha. GraphGrep: A fast and universal method for querying graphs. In 16th International Conference on Pattern Recognition (ICPR), Quebec, Canada, August 2002. IEEE Computer Society.
[7] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997.
[8] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Mateo, California, 1993.
[9] S. Helmer and G. Moerkotte. A performance study of four index structures for set-valued attributes of low cardinality. The VLDB Journal, 12(3):244-261, 2003.
[10] B. Mahleko, A. Wombacher, and P. Fankhauser. A grammar-based index for matching business processes. In The IEEE International Conference on Web Services (ICWS), pages 21-30, Los Alamitos, California, 2005. IEEE Computer Society.
[11] E. Mendelson. Introduction to Mathematical Logic. Chapman and Hall, 1997.
[12] T. Milo and D. Suciu. Index structures for path expressions. Lecture Notes in Computer Science, 1540:277-295, 1999.
[13] RosettaNet. http://www.rosettanet.org, last visited: 15 November 2004.
[14] A. Wombacher, P. Fankhauser, B. Mahleko, and E. Neuhold. Matchmaking for business processes based on choreographies. International Journal of Web Services, 1(4):14-32, 2004.
[15] A. Wombacher, P. Fankhauser, and E. Neuhold. Transforming BPEL into annotated deterministic finite state automata for service discovery. In IEEE International Conference on Web Services (ICWS 2004), Los Alamitos, California, 2004. IEEE Computer Society.