Developing a Concurrent Service Orchestration Engine in CCR

Wei Lu, Thilina Gunarathne, Dennis Gannon

Indiana University, Bloomington, IN 47405

ABSTRACT

As Grid application models move towards Web services and service-oriented architecture (SOA), service orchestration is becoming the key to building large-scale systems. WS-BPEL has attracted significant attention and is widely adopted as the standard Web service orchestration language. As a concurrent workflow language, WS-BPEL introduces a set of complex and sophisticated concurrency and coordination semantics. Meanwhile, the centralized architecture makes the orchestration engine an inherent candidate for the performance bottleneck. Implementing a correct and highly concurrent WS-BPEL engine therefore presents a significant challenge. The conventional thread-based concurrent programming model is inadequate here. Instead, we believe that an alternative model, namely the event-driven programming model aided by high-level coordination constructs such as join patterns, is more suitable, from the perspective of both system performance and programmability. In this paper we present the implementation of a high-performance WS-BPEL engine prototype built upon the event-driven architecture and join patterns provided by the Microsoft Concurrency and Coordination Runtime (CCR). We focus on how to interpret the concurrency semantics of WS-BPEL using events and join patterns, and how to drive the execution of a workflow in a reactive manner. Our experience also shows that the event-driven architecture enables the orchestration engine to handle massive concurrency efficiently on a multicore machine.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]

General Terms
Performance, Design

Keywords
Service Orchestration, WS-BPEL, Event-driven, CCR

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICSE'08, May 10–18, 2008, Leipzig, Germany. Copyright 2008 ACM 978-1-60558-079-1/08/05 ...$5.00.

1. INTRODUCTION

Service-oriented architecture (SOA) is gaining popularity due to its standards-based and highly customizable architecture. As a result, the service is becoming a major part of modern software architecture. Web service technology is the mainstream way to implement SOA. By wrapping existing systems with Web services, we make those functional components accessible over standard Internet protocols, without dependency on platforms or programming languages. However, there are concerns about the performance of Web-service-based SOA. In particular, the emergence of multicore processors raises the issue of how SOA can benefit from the hardware enhancement.

One critical characteristic of SOA is its composability, which means that we can build complex services out of simpler ones. However, compared with traditional component models (e.g., objects), services have no calls to each other embedded in them. Instead, standard messaging protocols define how one or more services can talk to each other. This architecture then relies on domain experts to link services together, in a process known as orchestration, to build a new service. Usually the composition can be depicted as a workflow graph, either implicitly or explicitly, and a centralized orchestration engine manages the execution of the workflow. Accordingly, WS-BPEL[14] was introduced, with significant attention, to describe orchestration-based Web service workflows.

In fact, a great many large-scale distributed systems can be viewed as use cases of this orchestration paradigm, even though they were designed for different purposes. For example, Google MapReduce[5] solves a number of distributed data-parallel problems by dividing the tasks into two sets (i.e., mapping and reducing) and connecting the two sets in a simple N-by-M topology; Dryad[10], mainly designed for large-scale data mining applications, provides a more general-purpose engine by allowing an acyclic graph of connections among the tasks; similarly, Condor DAGMan[18], a distributed batch job manager, describes the dependencies within a batch job as a directed acyclic graph upon which the job execution is scheduled; LEAD[8] performs large-scale weather forecast computations by building Web-service workflow graphs over distributed resources across a nationwide area. In all these systems, the flow graph essentially acts as an abstraction layer that separates the underlying concurrency and parallelism mechanisms from the users and guides the users towards an appropriate level of granularity. The explicitly encoded dependencies in the flow graph also help the runtime exploit the potential concurrency and parallelism. The burden, however, is shifted from the users to the orchestration engine, which now needs to handle massive concurrency, complex synchronization and coordination efficiently during the orchestration. Moreover, as the bottleneck of the entire composition, the central engine should be capable of handling hundreds of thousands of concurrent processes with good throughput. Implementing a correct and high-performance orchestration engine therefore presents a significant challenge.

We believe the conventional thread/lock-based concurrent programming model is inadequate from the perspective of both system performance and programmability. Instead, the event-driven model aided by high-level coordination constructs such as join patterns[6, 7] should be the right programming model for this complex workflow orchestration engine. In this paper we present an event-driven implementation of a WS-BPEL workflow engine built upon the event architecture and join patterns provided by the Microsoft CCR library. The WS-BPEL engine is chosen as our object of study not only because WS-BPEL is the standard way to build workflows over Web services, but also because WS-BPEL allows very sophisticated and complex concurrency semantics, which usually require formal methods to describe and analyze[15, 12]. We study how to use the event model together with join patterns to represent the concurrency semantics of WS-BPEL.

2. BACKGROUND

2.1 BPEL

WS-BPEL (BPEL for short), standardized by OASIS, is the current de facto standard for the orchestration of Web services. It enables users to specify how existing Web services should be chained together in various ways to design an executable workflow, which is itself a Web service. By automating the execution of distributed workflows, BPEL greatly facilitates distributed resource sharing and collaboration. The strength of BPEL lies not only in its seamless support for Web services, but also in its rich set of control structures (i.e., the activities). Its basic activities include the <receive> activity, which receives the request message, the <reply> activity, which sends back the reply message, and <invoke>, which invokes a remote Web service. On top of those basic activities, BPEL offers a set of structured activities that describe how to compose the basic activities into complex structures. The widely used structured activities include ordinary sequential composition (<sequence>), branching (<if>), loops (<while>), non-deterministic choice (<pick>) and parallel composition (<flow>). In addition, BPEL has good support for isolation (i.e., <scope>), exception handling, and reversing changes by using compensation.

2.2 Event vs. Thread

The performance of server applications, particularly HTTP-based static web servers, has been well studied in the past decade. Usually, processing an HTTP request involves a number of steps, some of which are I/O bound while others are CPU bound. To obtain high performance, a server should overlap computation with I/O operations as much as possible so that multiple requests can be processed concurrently. Two competing concurrent programming models, the thread-based model and the event-driven model, are widely used to achieve this kind of overlapping. Threaded programs typically process each request in a separate thread; when one thread blocks waiting for I/O, other threads can be scheduled to run. In contrast, a program in the event-driven model is structured as a collection of callback functions, each registering interest in a particular event such as the completion of an I/O operation. Internally the program is driven by a loop that keeps polling for events and executes the registered callback once the corresponding event is received. In this manner the overlapping is achieved by serially executing callbacks that belong to different requests.

Threads provide an intuitive sequential programming model by automatically maintaining a stack per thread across blocking I/O operations. However, as most thread libraries rely on kernel resources of the operating system, threads are usually considered heavy in terms of performance overhead and memory usage. Moreover, most modern threads are preemptive in pursuit of fairness and responsiveness, which requires exclusive access to shared state by using locks even on a uni-core processor. With locks, programming becomes more error-prone (e.g., deadlock or livelock).

In contrast, an event is much more lightweight. The scheduling of the callbacks is controlled by the application without any involvement of the kernel. The scheduling is also cooperative[1]: a context switch only happens when a callback voluntarily yields the CPU. This implies that on a uni-core processor the callbacks are executed serially and locks are unnecessary. The major downside of events, however, is the unintuitive programming style, in which the program has to be "ripped" into multiple callbacks[1] and the developer has to maintain state explicitly across the callbacks. As a result, event-driven code is harder to understand, maintain and debug. As far as server performance is concerned, the event-driven model has been shown to be more appropriate for static HTTP web servers than the thread model[16, 19], particularly when coping with massive concurrency.
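The event-driven structure described above can be sketched in a few lines. The following is a minimal Python illustration, not code from the paper or from any server: the names `EventLoop`, `on`, `emit` and `serve` are hypothetical, and the "I/O completion" is simulated by posting a second event. It shows how the handling of one request is ripped into two callbacks and how the request id has to be carried explicitly between them.

```python
from collections import deque

class EventLoop:
    """Hypothetical single-threaded loop: pops events, runs the registered callback."""
    def __init__(self):
        self.queue = deque()      # pending (event_name, payload) pairs
        self.handlers = {}        # event_name -> callback

    def on(self, name, callback):
        self.handlers[name] = callback

    def emit(self, name, payload):
        self.queue.append((name, payload))

    def run(self):
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](payload)

def serve(loop, log):
    # The request logic is split into two callbacks; the request id is the
    # state that must be passed by hand from the first to the second.
    def on_request(req_id):
        log.append(f"parse {req_id}")
        loop.emit("io_done", req_id)   # stands in for an I/O completion event

    def on_io_done(req_id):
        log.append(f"reply {req_id}")

    loop.on("request", on_request)
    loop.on("io_done", on_io_done)

log = []
loop = EventLoop()
serve(loop, log)
loop.emit("request", 1)
loop.emit("request", 2)
loop.run()
print(log)  # ['parse 1', 'parse 2', 'reply 1', 'reply 2']
```

The two requests are interleaved on a single thread without any lock, because each callback runs to completion before the next one starts.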

2.3 Concurrency and Coordination Runtime

To ease concurrent programming for robotic applications, Microsoft Robotics Studio introduces a .NET library, the Concurrency and Coordination Runtime (CCR)[3], which offers a typical event-driven programming environment. Unlike ordinary event-driven libraries, which are usually built directly upon asynchronous I/O operations, CCR constructs an abstraction layer formed by two concepts: Port and Arbiter. A port is similar to the event channel concept in the Pi-Calculus[13]. The sender can post an event to a port by calling

    port.Post(message);

However, there is no rendezvous receiver waiting on the other end of the port. Instead, it is the arbiter that waits for the event and decides which registered callback should be executed to consume the message. This indirection enables CCR to provide join patterns[7] and other high-level coordination primitives. For example, the pseudo code below illustrates how to use the Choice arbiter to wait for a message from either of two ports.

    Arbiter.Choice(
        Arbiter.Receive(false, port_A, delegate(Message) {
            // when response is received from port A, ...
        }),
        Arbiter.Receive(false, port_B, delegate(Message) {
            // when response is received from port B, ...
        }));

Notice that in C# a callback can be represented as an anonymous delegate (i.e., delegate {...} in the code), which is rather like a closure in functional languages but is automatically transformed by the C# compiler. Thus, with CCR ports and anonymous delegates, event-driven programming is much like continuation passing style (CPS). We briefly list the arbiters that are used in our BPEL engine implementation. For more detailed information please refer to the CCR documents[17].

The single item receiver registers with the port a callback that takes a single message. If the persist parameter is true, the callback keeps its registration after receiving a message; otherwise the receiver atomically un-registers itself from the port after receiving a message.

    Arbiter.Receive(persist, port, delegate(...) {...});

The joined receiver registers a callback that attempts to receive messages from multiple ports. The callback is fired when there is a message on each port.

    Arbiter.JoinReceive(persist, port1, port2, delegate(...) {...});

[Figure 1 diagram labels: Port, Arbiter, ReceiverTasks, DispatcherQueue, Dispatcher, CPU ... CPU]

The multiple item receiver registers a callback that attempts to receive multiple messages from a single port. It essentially acts as the gather operation in scatter/gather scenarios.

    // N is the number of messages to be received
    Arbiter.MultipleItemReceive(persist, port, N, delegate(...) {...});

The Choice arbiter is a compound arbiter that contains receiver tasks as its branches. As soon as one of its branches is ready to fire, all the other receiver tasks are atomically removed from their respective ports.

    Arbiter.Choice(ReceiveTask[] receivers);

As another compound arbiter, the Interleave arbiter consists of three groups of receivers: ConcurrentReceiverGroup, ExclusiveReceiverGroup, and TeardownReceiverGroup. The receivers in the ConcurrentReceiverGroup can run concurrently, while a receiver in the ExclusiveReceiverGroup has to run exclusively: while it runs, no other receiver, whichever group it belongs to, can be scheduled. Conceptually this is similar to the classical reader/writer lock paradigm. Moreover, the receivers in the ConcurrentReceiverGroup and ExclusiveReceiverGroup are persistent until a receiver in the TeardownReceiverGroup receives a message. At that moment all the receivers are removed from the ports atomically.

    Arbiter.Interleave(TeardownReceiverGroup,
                       ExclusiveReceiverGroup,
                       ConcurrentReceiverGroup);

Inside the CCR, as shown in Fig. 1, a dispatcher creates one or two threads for every CPU/core. All the threads in the dispatcher share one or more dispatcher queues, from which they pick ready callbacks for execution. During the execution of a callback, new callbacks may be generated and registered with the arbiter, or events may be posted to ports, causing some registered callbacks to become ready. The arbiter then moves the ready callbacks into the dispatcher queues.
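To make the port/arbiter indirection concrete, here is a small single-threaded Python model of it. It is only an illustration under stated assumptions and is not the CCR API: `Port`, `receive`, `choice`, `run` and the global `ready` queue are hypothetical names, there is one queue and no thread pool, and only the single item receiver and the Choice arbiter are modelled.

```python
from collections import deque

ready = deque()   # hypothetical stand-in for a dispatcher queue of ready callbacks

class Port:
    """Hypothetical port: queues messages until a live receiver can take them."""
    def __init__(self):
        self.items = deque()     # posted messages not yet consumed
        self.receivers = []      # registered receiver records

    def post(self, item):
        self.items.append(item)
        self._match()

    def register(self, receiver):
        self.receivers.append(receiver)
        self._match()

    def _match(self):
        # Arbiter role: pair queued messages with registered receivers.
        while self.items and self.receivers:
            r = self.receivers[0]
            group = r["group"]
            if group is not None and group["done"]:
                self.receivers.pop(0)        # a sibling Choice branch already fired
                continue
            if not r["persist"]:
                self.receivers.pop(0)        # one-shot receiver un-registers itself
            if group is not None:
                group["done"] = True
            ready.append((r["callback"], self.items.popleft()))

def receive(persist, port, callback):
    port.register({"persist": persist, "callback": callback, "group": None})

def choice(*branches):
    group = {"done": False}                  # shared by all branches of this Choice
    for port, callback in branches:
        port.register({"persist": False, "callback": callback, "group": group})

def run():
    while ready:
        callback, item = ready.popleft()
        callback(item)

log = []
port_a, port_b, port_c = Port(), Port(), Port()
choice((port_a, lambda m: log.append(("A", m))),
       (port_b, lambda m: log.append(("B", m))))
receive(True, port_c, lambda m: log.append(("C", m)))
port_b.post("reply")    # branch B fires, so branch A is dead from now on
port_a.post("late")     # stays queued on port_a: no live receiver remains
port_c.post(1)
port_c.post(2)          # the persistent receiver fires for every message
run()
print(log)  # [('B', 'reply'), ('C', 1), ('C', 2)]
```

Posting never runs a callback directly; it only moves a ready (callback, message) pair to the queue, which is the decoupling the paragraph above attributes to the arbiter and the dispatcher queues.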

3. MOTIVATION AND OUR APPROACH

As mentioned earlier, the event-driven model has been shown to be more efficient than the thread-based model for developing highly concurrent static HTTP web servers. It is reasonable to conjecture that the event-driven model is also a good one for developing a BPEL orchestration engine. However, a BPEL orchestration engine has its own characteristics.

Figure 1: CCR

First, a BPEL orchestration engine may experience more concurrency than a static web server. As in a static web server, the primary source of concurrency is the set of concurrent workflow instances initiated by independent requests. A BPEL workflow instance, however, may involve more I/O operations because of its I/O-bound messaging activities. Thus, when there is a large number of concurrent workflow instances, the system may exhibit more overlapping of I/O operations and CPU computation. The second source of concurrency, which is not present in a static web server, comes from the execution of a single workflow instance. As BPEL supports several concurrent constructs, such as <flow>, there may be substantial parallelism even within a single workflow instance. This highly concurrent situation indicates that, in terms of service performance, the event-driven model is more appealing for the implementation of a BPEL orchestration engine.

Secondly, the behavior of a static web server is relatively simple and static, namely a sequence of I/O operations (e.g., disk or networking) mixed with some necessary computation, so that server processing can be modeled as a specific pattern for which special optimizations can be designed. A good example is the staged event-driven architecture[19], in which the processing of an HTTP request is divided into stages connected via events. For Web service orchestration, however, the situation is more complicated, dynamic and customized. A BPEL workflow can be as simple as the processing in a static web server, but it can also involve very complicated concurrent processing. The ordinary event-driven architecture, which may be suitable for a static web server, does not meet the requirements of the orchestration engine, let alone the conventional thread/lock-based model. We believe that the ports and arbiters of CCR provide the appropriate "nuts and bolts" for the concurrent programming of service orchestration.
Our approach is to map the concurrency semantics of BPEL onto CCR ports and coordination constructs, and then to orchestrate the workflow execution on CCR in an event-driven manner. We use Windows Communication Foundation (WCF), the major platform for building Web services on .NET, as the communication model and the service hosting environment. For the sake of brevity, in the rest of the paper we skip implementation details such as the data model and the process model, and focus on how to map the BPEL concurrency semantics onto CCR primitives.

[Figure 2: Activity Instance. Diagram labels: Activity Instance, Completion Port, Exception Port, Abort Port.]

4. MAPPING BPEL TO CCR

In our prototype every activity instance is represented as an object equipped with at least three ports (Fig. 2):

    abstract class ActivityInstance {
        Port<bool> completionPort;
        Port<Exception> exceptionPort;
        Port abortPort;
        abstract public bool Execute();
    }

Executing an activity instance simply means invoking the instance method Execute(). The execution, however, can be either asynchronous or synchronous; the caller determines which by the return value of the method call. If the call is asynchronous, the caller can detect the completion of the execution by checking the completion port, or a fault by checking the exception port. The boolean value posted to the completion port indicates whether the execution completed in the normal mode or in the abort mode. The abort port, on the other hand, is posted to by the caller to tell the activity to abort its execution immediately.

4.1 Basic Activities

The <receive> activity waits for messages at the specified endpoint address, where it expects the partner to invoke it. As illustrated in the pseudo code below (see footnote 1), the implementation of Execute() for the <receive> activity simply issues the asynchronous message accept operation and then registers a Choice arbiter that reactively listens either for the incoming message or for the abort signal from the abort port.

    // asynchronously waits on the URI;
    // once a message is received, the requestPort is signaled
    BPELDispatcher.BeginReceiveRequest(uri, requestPort);
    Arbiter.Choice(
        Arbiter.Receive(false, requestPort, delegate(msg) {
            inputVar = msg;   // save the request
            // "true" means the execution succeeds
            completionPort.Post(true);
        }),
        Arbiter.Receive(false, abortPort, delegate() {
            // "false" means the execution is aborted
            completionPort.Post(false);
        }));

Footnote 1: For brevity, we omit the Arbiter.Activate() call, which performs the registration, in all the pseudo code.

Similarly, the <pick> activity will Choice-listen on the ports of multiple endpoints as well as on the abort port. The <invoke> activity sends a SOAP request to the remote partner service and waits for the response. Its implementation begins by sending the SOAP message asynchronously, and then registers the receive tasks (i.e., callbacks) for the response, the exception and the abort signal respectively.

    // send the SOAP request asynchronously;
    // the response will be posted to the responsePort
    BPELDispatcher.Request(request, responsePort, lExpPort);

    // register the receive tasks
    // for the response, exception or abort
    Arbiter.Choice(
        Arbiter.Receive(false, responsePort, delegate(response) {
            outputVar = response;   // save the response
            completionPort.Post(true);
        }),
        Arbiter.Receive(false, lExpPort, delegate(exp) {
            exceptionPort.Post(exp);
        }),
        Arbiter.Receive(false, abortPort, delegate() {
            completionPort.Post(false);
        }));

4.2 Structured Activities

A <sequence> activity contains one or more activities that are executed sequentially in the order in which they are listed.

    <sequence standard-attributes>
        activity+
    </sequence>

Its implementation is in a typical recursive continuation passing style: the continuation for the completion of one activity invokes the next activity, recursively.

    void RecursiveExec(int i) {
        // check the abort signal at each iteration
        if (abortPort.Test()) {
            completionPort.Post(false);
            return;
        }
        if (i < acts.Count) {
            // execute the current activity asynchronously
            acts[i].Execute();
            Arbiter.Choice(
                Arbiter.Receive(false, acts[i].completionPort, delegate(bool) {
                    // when this activity is done, move to the next one
                    // (i + 1, not i++, which would re-run activity i)
                    RecursiveExec(i + 1);
                }),
                Arbiter.Receive(false, acts[i].exceptionPort, delegate(exp) {
                    exceptionPort.Post(exp);
                }),
                Arbiter.Receive(false, this.abortPort, delegate() {
                    // abort the currently running activity
                    acts[i].abortPort.Post();
                    // wait for the termination of the current activity
                    Arbiter.Receive(false, acts[i].completionPort, delegate(bool) {
                        this.completionPort.Post(false);
                    });
                }));
        } else
            // all activities in the sequence are done
            this.completionPort.Post(true);
    }
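The recursive continuation passing structure of the sequence can be tried out in isolation. The sketch below is a simplified Python analogue under stated assumptions, not the engine's code: `run_sequence`, `make_activity` and the `state` dictionary are hypothetical, each activity is a function that takes its continuation, and the abort check is a plain flag rather than a port. The continuation advances with `i + 1`, and the abort flag is tested before every step.

```python
def run_sequence(activities, abort_requested, on_done):
    """Run activities one after another in continuation passing style.

    on_done(True) reports normal completion, on_done(False) an abort.
    """
    def step(i):
        if abort_requested():          # checked at the start of every iteration
            on_done(False)
            return
        if i == len(activities):       # every activity has completed
            on_done(True)
            return
        # Hand the activity its continuation. A real engine would schedule the
        # continuation through its dispatcher instead of calling it directly,
        # which keeps the call stack from growing with the sequence length.
        activities[i](lambda _ok: step(i + 1))
    step(0)

trace = []
state = {"abort": False}

def make_activity(name, abort_after=False):
    # Hypothetical activity factory: records its name, optionally raises the
    # abort flag, then invokes the continuation it was given.
    def activity(continuation):
        trace.append(name)
        if abort_after:
            state["abort"] = True
        continuation(True)
    return activity

results = []
run_sequence(
    [make_activity("a"), make_activity("b", abort_after=True), make_activity("c")],
    lambda: state["abort"],
    results.append,
)
print(trace, results)  # ['a', 'b'] [False]
```

Activity "c" never runs because the abort flag is seen at the start of the third iteration, which mirrors the per-iteration abort check in the pseudo code.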

Notice that, by the BPEL definition, the abort port needs to be checked at the beginning of every iteration to determine whether the execution should be stopped. Also notice that when a structured activity is going to be aborted, it first has to abort all of its running enclosed activities and wait for their termination. That is why there is a nested receive task after the abort signal is posted to the enclosed activity. Other sequential structured activities can be implemented in a similar manner.

Several BPEL activities allow parallel composition, in which a set of activities run in parallel. Some of them follow the simple fork/join parallel paradigm, where there is no dependency between the activities. For example, in the <forEach> activity with the "parallel" attribute, the enclosed activities run in parallel and join at the end of the execution. For those concurrent activities the implementation is straightforward with the help of the MultipleItemReceive arbiter. The caller assigns the same completion port to all the enclosed activities, and then registers on that completion port a callback that is fired when the expected number of messages has been received from the port.

    // execute all enclosed activities in parallel
    for (int i = 0; i < acts.Count; i++) {
        acts[i].completionPort = lCompPort;
        acts[i].exceptionPort = lExpPort;
        acts[i].Execute();
    }
    // join-waits for their completion
    Arbiter.Choice(
        Arbiter.MultipleItemReceive(false, lCompPort, acts.Count,
            delegate(bool[] responses) {
                completionPort.Post(true);
            }),
        Arbiter.Receive(false, lExpPort, delegate(Exception e) {
            exceptionPort.Post(e);
        }),
        Arbiter.Receive(false, this.abortPort, delegate() {
            // terminate all enclosed activities
            foreach (activity in acts)
                activity.abortPort.Post();
            Arbiter.MultipleItemReceive(false, lCompPort, acts.Count,
                delegate(bool[] responses) {
                    completionPort.Post(false);
                });
        }));

Also notice that the abortion of the structured activity has to wait for the termination messages from all its enclosed activities.

4.3 Control Link & Flow

The most flexible concurrent activity in BPEL is <flow>, which allows control dependencies between the enclosed activities to be defined via <link>.

    <flow standard-attributes>
        <links>?
            <link name="NCName">+
        </links>
        activity+
    </flow>

A <link> requires that its source activity be executed prior to its target activity. Thus we can express the workflow as a directed acyclic graph (DAG), in which each node represents an activity and each edge represents a link. While it is challenging to execute the DAG in parallel with threads, it is fairly straightforward with CCR primitives: each link in the flow can simply be represented as a port, which is listened on by the target activity and posted to when the source activity completes its execution. Since an activity may be the source of multiple outgoing links, its completion signal should be broadcast to all the outgoing ports. Meanwhile, an activity with multiple incoming links needs to join-wait on all the incoming ports. Thus each activity in the flow is wrapped in a proxy activity (Fig. 3), whose Execute() pseudo code is listed below.

    Arbiter.Choice(false,
        Arbiter.JoinReceive(false, incomingPorts, delegate(bool[] inputs) {
            if (EvaluateJoinCondition(inputs)) {
                // perform the activity here
                act.Execute();
                // continuation for the completion of the activity
                Arbiter.Choice(false,
                    Arbiter.Receive(false, act.completionPort, delegate(bool) {
                        // then fire all the outgoing ports
                        foreach (port in outgoingPorts)
                            port.Post(true);
                    }),
                    Arbiter.Receive(false, act.exceptionPort, delegate(exp) {
                        exceptionPort.Post(exp);
                    }),
                    Arbiter.Receive(false, this.abortPort, delegate() {
                        act.abortPort.Post();
                        foreach (port in outgoingPorts)
                            port.Post(false);
                    }));
            } else {
                // join condition evaluation fails
                if (suppressJoinFailure) {
                    foreach (port in outgoingPorts) {
                        port.Post(false);
                    }
                } else {
                    exceptionPort.Post(new Exception("JoinFailure"));
                }
            }
        }),
        Arbiter.Receive(false, this.abortPort, delegate() {
            foreach (port in outgoingPorts)
                port.Post(false);
        }));

Notice that BPEL specifies that when all the incoming links of a target activity have been signaled, their values need to be evaluated against a join condition expression before the execution. If the join condition evaluates to true, the target activity starts as normal. Otherwise, a join failure fault occurs. However, if the "suppressJoinFailure" attribute is set, the activity should propagate the failure to all its outgoing links and complete as if normally, so that the entire flow activity can continue. In this case false is posted to all the outgoing ports.

The execution of the <flow> activity simply starts the execution of each enclosed activity asynchronously and then join-waits for the completion of all the activities that have no outgoing link.

[Figure 3: Activity in the Flow. Diagram labels: Incoming ports, JoinCondition, Wrapped Activity, Completion Port, Exception Port, Abort Port, Outgoing ports.]

[Figure 4: Scope Instance. Diagram labels: Event Ports, Root activity, Completion Port, Fault Port, Compensation Port.]
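The link-as-port idea can be illustrated with Python's asyncio, using one future per link in place of a CCR port. This is a sketch under stated assumptions rather than the paper's implementation: `run_flow` and `proxy` are hypothetical names, every activity is a coroutine whose boolean result is posted on all of its outgoing links, the join condition is taken to be "at least one incoming link is true", and a failed join is always suppressed, so the activity is skipped and false is propagated. Abort and fault handling are left out.

```python
import asyncio

async def run_flow(activities, links):
    """activities: name -> coroutine function returning a bool link status.
    links: list of (source, target) pairs forming a DAG."""
    loop = asyncio.get_running_loop()
    port = {link: loop.create_future() for link in links}   # one "port" per link
    order = []

    async def proxy(name):
        incoming = [port[l] for l in links if l[1] == name]
        outgoing = [port[l] for l in links if l[0] == name]
        inputs = [await f for f in incoming]     # join-wait on every incoming link
        if not incoming or any(inputs):          # assumed join condition
            status = await activities[name]()
            order.append(name)
        else:
            status = False                       # suppressed join failure: skip
        for f in outgoing:
            f.set_result(status)                 # broadcast to all outgoing links

    # Start every proxy at once; the links alone decide the execution order.
    await asyncio.gather(*(proxy(n) for n in activities))
    return order

async def yes():
    await asyncio.sleep(0)
    return True

async def no():
    await asyncio.sleep(0)
    return False

acts = {"a": yes, "b": no, "c": yes, "d": yes}
links = [("a", "b"), ("a", "c"), ("b", "d")]
order = asyncio.run(run_flow(acts, links))
print(order)
```

Here "a" always runs first, "b" and "c" run once "a" has posted to their links, and "d" is skipped because its only incoming link carries false.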

4.4 Scope A is the basic isolation unit in the BPEL. It consists of a root activity, a set of variables and a set of handlers(e.g. event handler and fault handler). All the enclosed activities share those resource. < v a r i a b l e s >? ... < e v e n t H a n d l e r s >? < f a u l t H a n d l e r s >? < c o m p e n s a t i o n H a n d l e r >? ... activity

As depicted in Fig.4, a scope can have multiple event handlers, each associating a handling activity with an incoming message. During the execution of a scope instance, once the message is received, the activity in the corresponding event handler is triggered and runs concurrently with the main activity of the scope. Intuitively, each event handler can be represented as a CCR Port paired with a receive task for its handling activity. However, BPEL requires that an event handler can be triggered multiple times if the expected message event occurs multiple times, and that all the event handlers are disabled when the execution of the scope completes or a fault occurs. The event handler therefore has to be persistent, but only for the lifetime of the scope instance. This requirement makes Arbiter.Interleave the right operation. In this scheme each event handler is represented as a persistent receive task, and all the event handlers are grouped into the ConcurrentReceiverGroup of the Interleave arbiter, while the TeardownReceiverGroup contains a disable port; when a message is posted to that port, the whole set of event handlers is torn down from the arbiter.

Whenever a fault is thrown during the execution of the scope, it is caught either by a defined fault handler that matches the fault or by the default fault handler. As soon as a fault is caught, all the running activities in the scope should be terminated immediately, and all the event handlers and the other fault handlers are disabled. It is therefore never possible to run more than one fault handler for the same scope. So, unlike the mapping of the event handlers, the set of fault handlers is represented as a single port paired with a receive task that dispatches the fault to the matching fault handling activity. The receive task must not be persistent, so that after receiving a fault message it is atomically removed from the arbiter, automatically disabling any further fault handling in this scope. In a scope instance, the fault port is actually connected to the exceptionPort of its root activity. Whenever the exceptionPort is signaled by the root activity, the scope automatically enters the fault handling mode. Before the fault is dispatched, the abort signal is sent to each enclosed activity as well as to the disable port of the event handlers. Note that we do not terminate a running event handler, since a running event handler is allowed to finish during the fault handling. Also, a fault handler cannot start its handling activity until the root activity has completed. Hence, after sending the abort signals, the fault handler must wait for the completion signal of the root activity before executing the fault handling activity.

A scope can have a compensation handler, which contains the activity to be performed when the work of the scope needs to be compensated. By the BPEL definition, a scope that handled a signaled fault has not terminated normally, and consequently its compensation handler will not be installed. Thus the scope instance is equipped with a compensation port, but the receive task for the compensation handler is registered to the port only after the "true" completion message of the root activity is received. For brevity, we only list the pseudo code of the scope instantiation below.

```csharp
// install the event handlers
Arbiter.Interleave(
    new TeardownReceiverGroup(
        Arbiter.Receive(false, this.disablePort, delegate() { })),
    new ExclusiveReceiverGroup(),
    new ConcurrentReceiverGroup(
        Arbiter.Receive(true, event1Port, delegate(message) {
            // processing the event1
            event1_activity.Execute();
        }),
        Arbiter.Receive(true, event2Port, delegate(message) {
            // processing the event2
            event2_activity.Execute();
        }),
        ...
    )
);

// install the fault handlers
Arbiter.Receive(false, faultPort, delegate(exp) {
    // abort the root activity
    rootAct.abortPort.Post();
    // disable the event handlers first
    this.disablePort.Post();
    // wait for the termination of the root activity
    Arbiter.Receive(false, rootAct.completionPort, delegate(bool) {
        // dispatch the fault to the matching fault handler
        switch (exp) {
            case fault1:
            case fault2:
            ...
        }
    });
});

// execute the root activity
rootAct.Execute();

// install the compensation handler
Arbiter.Receive(false, rootAct.completionPort, delegate(bool result) {
    if (result) {
        Arbiter.Receive(false, compensationPort, delegate() {
            // execute the compensation activity
            ...
        });
    }
});
...
```

Figure 5: The test workflow
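The persistent-until-teardown behaviour required of the event handlers can be illustrated with a small model. The sketch below is plain Python, not CCR: the class `InterleaveModel`, its `post` method, and the port names are hypothetical stand-ins chosen for illustration. It also runs handlers sequentially, so it shows only the registration lifetime, not the concurrency of the ConcurrentReceiverGroup.

```python
class InterleaveModel:
    """Toy model: persistent handlers plus a single teardown port."""

    def __init__(self, teardown_port, handlers):
        self.teardown_port = teardown_port
        self.handlers = dict(handlers)  # port name -> callback
        self.active = True

    def post(self, port, message=None):
        """Deliver a message; return True if a receiver consumed it."""
        if not self.active:
            return False                # handlers were already torn down
        if port == self.teardown_port:
            self.active = False         # one message disables every handler
            return True
        handler = self.handlers.get(port)
        if handler is None:
            return False
        handler(message)                # persistent: stays registered
        return True


def demo():
    log = []
    scope = InterleaveModel(
        teardown_port="disable",
        handlers={
            "event1": lambda m: log.append(("event1", m)),
            "event2": lambda m: log.append(("event2", m)),
        },
    )
    scope.post("event1", "a")
    scope.post("event1", "b")   # the same handler fires again
    scope.post("event2", "c")
    scope.post("disable")       # scope completed or a fault occurred
    scope.post("event1", "d")   # ignored: the handlers are gone
    return log
```

In this model the fourth `event1` message is dropped because the teardown message arrived first, which corresponds to the requirement that event handlers live only as long as the scope instance.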

As we have seen, a scope actually introduces a number of subtle concurrency semantics. In particular, the fault handling mechanism requires transfer of control and abortion of execution. While those semantics are challenging to program with threads, they are quite intuitive to describe using CCR events, ports, and coordination constructs. Furthermore, the event-based implementation is robust and less error-prone, whereas a thread-based implementation must deal very carefully with execution being aborted at arbitrary points.
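The fault-handling order described for the scope (send the abort signal, wait for the root activity to complete, then dispatch the fault exactly once) can be sketched with an event loop. The example below uses Python's asyncio as a stand-in for CCR ports; `run_scope`, `faulty`, `long_running`, and the fault name `"fault1"` are hypothetical and exist only for illustration. The returned flag models whether the compensation handler would be installed.

```python
import asyncio


async def run_scope(activities, fault_handlers, default_handler):
    """Run the root activity; on a fault: abort, wait, then dispatch once."""
    trace = []
    abort = asyncio.Event()        # stands in for the abort port
    fault_port = asyncio.Queue()   # stands in for the fault port

    async def root():
        # Root activity: run the children in parallel and join on them.
        await asyncio.gather(*(a(abort, fault_port, trace) for a in activities))

    root_task = asyncio.create_task(root())
    fault_task = asyncio.create_task(fault_port.get())  # one-shot receive

    done, _ = await asyncio.wait(
        {root_task, fault_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if fault_task in done:
        fault = fault_task.result()
        abort.set()                # 1. abort the enclosed activities
        await root_task            # 2. wait for the root activity to finish
        trace.append("root finished")
        handler = fault_handlers.get(fault) or default_handler
        handler(trace)             # 3. dispatch to the matching handler
        return trace, False        # faulted: no compensation handler
    fault_task.cancel()            # normal completion: drop the fault receiver
    return trace, True             # compensation handler may be installed


async def faulty(abort, fault_port, trace):
    trace.append("child A faults")
    fault_port.put_nowait("fault1")


async def long_running(abort, fault_port, trace):
    await abort.wait()             # cooperative abort point
    trace.append("child B aborted")


def demo():
    handlers = {"fault1": lambda t: t.append("handled fault1")}
    default = lambda t: t.append("handled by default")
    return asyncio.run(run_scope([faulty, long_running], handlers, default))
```

Because the fault receiver is a single one-shot task, a second fault posted to the queue is never dispatched, which mirrors the rule that at most one fault handler runs per scope.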

5. PERFORMANCE

To show the performance benefit brought by the event-driven model, we performed a simple experiment on our prototype implementation. As a reference, a simple thread-based workflow engine was also implemented. The thread-based implementation supports only the basic BPEL activities and leaves the complex concurrent BPEL constructs, such as flow links and scope handlers, unimplemented. It launches one thread per workflow instance, and each activity in the workflow is executed synchronously. We performed the experiments on a 4-core machine with 2GB of memory and two Intel Xeon 5150 processors, each having two cores. The operating system is Windows XP and the version of the .NET framework is 3.5. The experiments are performed by submitting a given number of concurrent workflow instances and calculating the throughput of the engine. The sample workflow is a simple workflow containing six independent activities, each of them invoking a Web service in parallel (Fig.5). This workflow is chosen because it represents the basic parallel structure, namely the fork/join pattern, found in most scientific workflows. It is also simple enough that the thread-based implementation can handle it. It should be noted that, in order to put the stress on the engine only and minimize the effect of the back-end Web services, we “dry-run” the engine by simulating the Web-service invocation using a timeout event together with cached response messages.

Figure 6: The performance of the CCR-based BPEL engine (throughput in trans/sec versus the number of concurrent workflow instances, for the CCR-based, Thread-based, and ThreadPool-based implementations)

From Fig.6 we can see that the thread-based implementation performs poorly and is unable to sustain a large number of concurrent instances. When there are more than 2000 concurrent workflow instances, the thread-based implementation crashes with an “Out-of-Memory” exception caused by too many threads being created in a short period. In contrast, the CCR-based implementation achieves good throughput and is robust to load, with little degradation. As the number of instances increases, the engine throughput increases until it saturates at 1000 instances. Thereafter the throughput stays nearly constant, and the total number of threads is about 22 even with a very large number of concurrent instances. We also tested the thread-based implementation with a thread pool configuration. The thread pool does help to improve the robustness of the engine, since it limits the number of threads and the created threads are reused for new requests. However, the cost is a significant loss of system throughput, as the concurrent requests may be queued up by the thread pool, thus limiting the potential concurrency.
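The shape of the dry-run experiment can be sketched as follows. This is a minimal Python asyncio model and not the CCR engine: a timer stands in for each Web-service invocation, each instance forks six invocations and joins on them, and the whole batch runs on one event-loop thread. The names `invoke_service`, `workflow_instance`, `run_batch`, and the 10 ms latency are assumptions made for illustration; the paper does not state the timeout it used, and the throughput printed here is not comparable to Fig.6.

```python
import asyncio
import threading
import time

ACTIVITIES_PER_INSTANCE = 6   # the test workflow has six parallel activities
SIMULATED_LATENCY = 0.01      # assumed value, not taken from the paper


async def invoke_service(cached_response):
    # A timer event replaces the real Web-service call ("dry run").
    await asyncio.sleep(SIMULATED_LATENCY)
    return cached_response


async def workflow_instance():
    # Fork six independent invocations, then join on all of them.
    replies = await asyncio.gather(
        *(invoke_service("ok") for _ in range(ACTIVITIES_PER_INSTANCE))
    )
    return len(replies)


async def run_batch(n_instances):
    """Run n concurrent instances; return results, instances/sec, thread count."""
    started = time.perf_counter()
    results = await asyncio.gather(
        *(workflow_instance() for _ in range(n_instances))
    )
    elapsed = time.perf_counter() - started
    return results, n_instances / elapsed, threading.active_count()


if __name__ == "__main__":
    _, throughput, threads = asyncio.run(run_batch(2000))
    print(f"{throughput:.0f} instances/sec using {threads} thread(s)")
```

The point of the sketch is that the number of threads does not grow with the number of concurrent instances, because a waiting invocation holds only a pending timer rather than a blocked thread.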

6. RELATED WORK

The formal semantics of BPEL has been studied using various formal methods [15, 12]. The work in [15], which substantially influenced our work, comprehensively describes the BPEL concurrency semantics in Petri net structures. During the development of our prototype, we realized that it is fairly natural to map the major Petri net structures onto CCR primitives. Although there are a number of production-quality BPEL-enabled workflow engines from industry, few of them have published the internals of their implementations. There are a few prototype implementations of BPEL engines from academic efforts. BPWS4J, a full implementation of BPEL4WS 1.1, is introduced in [4]. BPEL-Mora, a light-weight WS-BPEL implementation designed for embedded environments, is introduced in [9]. Both describe the overall design and architecture of their implementation, and to some extent both adopt the event-driven architecture, but neither provides details on how the concurrency semantics of BPEL were implemented. The BPEL implementation introduced in [2] is probably the most closely related work. It is based on ReSpecT tuple centres extended from LINDA; the BPEL semantics are translated into logic tuple rules, and the workflow is executed as a sequence of reactions. Essentially, the concurrency model of CCR is equivalent to the reaction model of ReSpecT. However, building upon a mainstream platform allows our implementation to take advantage of a rich set of libraries, such as WCF.

7. CONCLUSION

It is well known that, compared with threads, the event-driven concurrent programming model helps improve the performance of server applications, but at the cost of programmability. However, our experience of implementing the BPEL orchestration engine based on CCR shows that, with the help of high-level coordination constructs and a language-supported continuation programming style, the event-driven model substantially eases concurrent programming. The complex and sophisticated concurrency semantics in BPEL, which are challenging to implement with threads, can be described and implemented in an elegant and robust manner. Moreover, our measurements show that the event-driven architecture enables the orchestration engine to handle massive concurrency with good performance on a multicore machine, while the conventional thread-based implementation simply fails under the same stress.

Acknowledgment

We wish to thank George Chrysanthakopoulos from Microsoft for his important feedback on early drafts of this paper.

8. REFERENCES

[1] A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur. Cooperative task management without manual stack management. In Proceedings of the General Track: 2002 USENIX Annual Technical Conference, Berkeley, CA, USA, 2002. USENIX Association.
[2] M. Cabano, E. Denti, A. Ricci, and M. Viroli. Designing a bpel orchestration engine based on respect tuple centres. In Proceedings of the 4th International Workshop on the Foundations of Coordination Languages and Software Architectures, 2006.
[3] G. Chrysanthakopoulos and S. Singh. An asynchronous messaging library for c#. In SCOOL Conference Proceedings, 2005.
[4] F. Curbera, R. Khalaf, W. A. Nagy, and S. Weerawarana. Implementing bpel4ws: the architecture of a bpel4ws implementation: Research articles. Concurr. Comput. : Pract. Exper., 18(10), 2006.
[5] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004.
[6] C. Fournet and G. Gonthier. The reflexive cham and the join-calculus. In POPL ’96: Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 372–385, New York, NY, USA, 1996. ACM.
[7] C. Fournet and G. Gonthier. The join calculus: A language for distributed mobile programming. In Applied Semantics, International Summer School, APPSEM 2000, pages 268–332, London, UK, 2002. Springer-Verlag.
[8] D. Gannon, B. Plale, S. Marru, G. Kandaswamy, Y. Simmhan, and S. Shirasuna. In Workflows for eScience: Scientific Workflows for Grids, chapter Dynamic, Adaptive Workflow for Mesoscale Meteorology. Springer Verlag, 2007.
[9] T. Gunarathne, D. Premalal, T. Wijethilake, I. Kumara, and A. Kumar. Emerging Web Services Technology, chapter BPEL-Mora: Lightweight Embeddable Extensible BPEL Engine. 2007.
[10] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev., 41(3), 2007.
[11] H. C. Lauer and R. M. Needham. On the duality of operating system structures. SIGOPS Oper. Syst. Rev., 1979.
[12] R. Lucchi and M. Mazzara. A pi-calculus based semantics for ws-bpel. In Journal of Logic and Algebraic Programming, January 2007.
[13] R. Milner. The polyadic pi-calculus: A tutorial. Logic and Algebra of Specification, 1993.
[14] OASIS. Web services business process execution language v2.0. http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0OS.html, 2007.
[15] C. Ouyang, E. Verbeek, W. M. P. van der Aalst, S. Breutel, M. Dumas, and A. H. M. ter Hofstede. Formal semantics and analysis of control flow in ws-bpel. Sci. Comput. Program., 67(2-3):162–198, 2007.
[16] V. S. Pai, P. Druschel, and W. Zwaenepoel. Flash: An efficient and portable Web server. In Proceedings of the USENIX 1999 Annual Technical Conference, 1999.
[17] J. Richter. Concurrency and coordination runtime, 2006.
[18] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the condor experience. Concurrency Practice and Experience, 17(2-4):323–356, 2005.
[19] M. Welsh, D. E. Culler, and E. A. Brewer. SEDA: An architecture for well-conditioned, scalable internet services. In Symposium on Operating Systems Principles, pages 230–243, 2001.