Distributed Network Protocols - Computer Science

12 downloads 0 Views 2MB Size Report
application in mind, several examples of problems for which DNP's have been ..... Protocol CT3: Suppose we use protocol CTI, except that for each node we ...
IEEE TRANSACTIONS

ON INFORMATION

THEORY,

VOL.

IT-29, NO.

1, JANUARY

23

1983

Distributed Network Protocols ADRIAN SEGALL. SENIOR MEMBER. IEEE

Abstract-A unified approach to the formal description and validation of several distributed protocols is presented. After introducing two basic protocols, a series of known and new protocols for connectivity test, shortest path, and path updating are described and validated. All protocols are extended to networks with changing topology.

I.

INTRODUCTION

ONSIDER the situation where a number of physically distinct computation units work on a common problem, while their operation is coordinated via comm u n ication channels connecting some of these units. Each computation unit has certain processing and memory capability and is preprogrammed to perform its part of the computation, as well as to receive and send control messages over the communication channels. The program residing in each node will be referred to as the node algorithm and the ensemble of all algorithms providing the solution to the common problem is named a distributed protocol. For the purpose of this paper it is convenient to regard the computation units as nodes in a network whose links are the connecting communication channels. The specific protocols considered here will be collectively called distributed network protocols (DNP) to indicate the fact that the common problem that has to be solved is connected with the network topology. Many of the “classical” graph algorithms have their distributed version and, in addition, several new distributed network protocols appear from practical problems. The m a in application considered so far for DNP’s is in data or voice communication networks. In such networks, geographically dispersed devices must transmit information to one another and must somehow coordinate this transmission. W ith the advances of m iniand m icrocomputers, it is certainly feasible that nodes will have their own processing and memory unit and will serve as communication processors or as switches. In principle, the common goal of these units is to efficiently transmit the required information to achieve certain performance goals, like m inimum delay or maximum throughput. W ith this application in m ind, several examples of problems for which DNP’s have been proposed or are currently under

C

Manuscript received March 1980; revised March 2, 1982. This research was conducted on a consulting agreement with the Laboratory for Information and Decision Systems at Massachusetts Institute of Technology, Cambridge, MA, with partial support provided by the Advanced Research Project Agency of the U.S. Department of Defense (monitored by ONR) under Contract N00014-75-C-1183. The author is with IBM Thomas J. Watson Research Center, Box 218, Yorktown Heights, NY 10598, on leave from the Department of Electrical Engineering, Israel Institute of Technology, Technion City, Haifa 32000, Israel.

investigation are routing of information, network resynchronization, shortest path, m inimum weight spanning tree, common channel coordination, information broadcast, and others. The m a in purpose of this paper is to give a formal description and rigorous validation to a number of DNP’s, some of which have been presented previously in an intuitive way and some of which are proposed here for the first time. W e m a inly consider DNP’s for the purpose of connectivity tests, shortest path in terms of number of links, and routing-path updating. In addition, we give a unifying approach to the validation of the protocols by presenting several basic simple DNP’s that provide building blocks to the presented protocols. The presented protocols have one additional important feature. Since nodes and links may fail and be added asynchronously to the network, the protocols must be able to work under arbitrarily changing network topology. Although we first consider DNP’s for networks with fixed topology, in Section VII we extend those protocols to incorporate casesof changing topology. It is useful to emphasize at this point the importance of formal specification and validation of DNP’s. The design of DNP’s has two m a in stages.F irst, the construction of its general idea and then specification of the node algorithms intended to implement it. As with many computer programs, it is normally hard to make sure that the node algorithms indeed perform their intended purpose under all circumstances and the validation proofs attempt to provide this confidence, as well as deeper understanding of the algorithms. The more formal these proofs are, the more confidence one can have in the correctness of the algorithm. Although intuitive presentation and validation of DNP’s is important, it is certainly not sufficient, and the present paper tries to emphasize this point by introducing formal methods for description and validation of DNP’s. In spite of this, because of restrictions on the length of papers published in this TRANSACTIONS, it was inevitable to delete or just outline several validation proofs of the presented protocols or the specification of the node algorithms. The complete details appear in [8]. II.

THEGENERALMODEL

In this section we give the general m o d e l and assumptions used in all presented DNP’s. Consider a network (V,E)whereVisasetofnodesandEC I/X Visasetof links. For the first part of this paper, we assume that the network has fixed topology. W e shall use the following assumptions.

O O lS-9448,‘83,‘0100-0023$01.0001983 IEEE

24

IEEE TRANSACTIONS

a) Each link is bidirectional; the link connecting node i with node j considered in the direction from i to j is denoted (i, j). b) All messages referred to in this paper are control

ON INFORMATION

THEORY,

VOL.

IT-29,

NO.

1, JANUARY

1983

first such message, it sends a similar message to its own neighbors. All other messagesreceived at k are disregarded. We shall now formally present the algorithm for each node. PROTOCOL

PI:

Variables of the Algorithm

at Node i.

messages.

m, shows whether node i has already entered the algoc) Associated with each link in each direction, say (i, j), rithm (values 0,l). there is a link protocol whose purpose is to ensure Messages sent and received by the algorithm at node i: that each message sent by node i will arrive correctly within finite nonzero undetermined time, and all mesMSG message sent by node i, sages are received at node j in the same order as sent MSG( I) message received from neighbor I, by i. (Observe that we do not preclude channel errors, message received from the outside world. START provided that there exists a proper detection/retransIt is assumed that each message MSG carries the piece of mission or correction algorithm on each link.) information that has to be propagated. d) All messages received at a node i are stamped with Algorithm for node i: (Assumption: just before entering the identification of the link from which they came algorithm, node i has mj = 0.) and then transferred into a common queue; each node uses one processor for the purpose of the algo(1) For START’ or MSG(I) rithm; the processor extracts the control message at if m, = 0, then mi +- 1, send messages to all (2) the head of the queue, proceeds to process it and neighbors. discards the message when processing is completed; The properties of the protocol appear in the following no other operation related to the protocol is perTheorem whose proof is quite straightforward and appears formed by the processor while a message is being in [8]. processed; consequently we may assume that the processing of each message takes zero time. Theorem PI-I: Suppose a node j receives START. Then e) Each node has an identification; before the protocol we have the following properties of the protocol. starts, each node knows the identity of all nodes that a) All nodes i connected to j (i.e., that are in the are potentially in the network; it knows nothing connected network containing j) will set mi +- 1 in about the topology of the network and in particular finite time. about what nodes actually belong to the network. We b) During the execution of the protocol, exactly one denote by 1,2; . . ,I v] the nodes that are potentially MSG is being sent on each link in each direction. in the network and, when needed, by 1,2; . . , ] VI the c) The propagation of information is the fastest possible nodes actually belonging to the network. in the following sense: for a node i, let pi be the node f) Each node knows its adjacent links, but not necesfrom which node i receives the first MSG (see line sarily the identity of its neighbors, i.e., the nodes at (2) of the algorithm).2 For a link (i, I) let the weight the other end of the links; however, in our algorithms wi, of that link be the time it took for MSG to travel it will be convenient to use expressions such as “send from i to I, i.e., from the time i sends MSG on (i, Z) messages to all neighbors,” meaning “send messages until the time the processor at I starts operating on over all adjacent links.” The collection of all neighthe MSG (this includes propagation and queuing bors of node i will be denoted by Sj. time). Then the collection of links {(pi, i), for all i in g) Unless otherwise stated, the protocol can be started the network} forms the tree of shortest distances by any node or by several nodes asynchronously; a from j to all nodes. node starts the algorithm by receiving a special message “START” from the outside world; a standing Before proceeding to the second algorithm, we may note assumption is that, once a node has entered the that the protocol will work correctly even if “START” is algorithm, it cannot receive “START.” delivered to several nodes at arbitrary times, provided that each of these nodes has not entered the algorithm before III. BASICPROTOCOLS receiving “START.” Properties a) and b) still hold and the The two basic DNP’s presented in this section provide a propagation is still the fastest possible.

way for broadcasting information in the network. A. Propagation of Information

(PI)

Suppose that node j receives from the outside world a piece of information that has to be transmitted to all nodes in the network. The simplest procedure to accomplish this is “flooding” the network, namely node j transmits a message containing this information to all its neighbors and each other node k in the network, when it receives the

B. Propagation of Information

with Feedback (PIF)

Sometimes a node s that receives START and propagates information may want to be positively informed when the information has indeed reached all connected ‘The statement “For . ” means “the actions taken by the processor when receiving .” 2The notation (. ) will always denote the corresponding line in the algorithm under consideration.

SEGALL:

DISTRIBUTED

NETWORK

25

PROTOCOLS

In order to analyze the protocol, we shall need the nodes. Here of course the assumption is that only one node can receive START. The following protocol can be used following notation: for this purpose. W h e n receiving START, node s sends (a ),--the event of node i performing line (. ) of its MSG” to all neighbors3 W h e n receiving any MSG”, an algorithm; whenever the corresponding line conarbitrary node i marks the link from which it was received. tains an if operation, the notation refers only to W h e n receiving the first MSG” from neighbor I say, a node the caseswhen the condition indeed holds. i denotes this neighbor with a special mark pf, and sends t(*)-time when event * happens. MSG” to all neighbors except to pf. W h e n it observes that Theorem PIF-I: Supposenode s receives START. Then it has received MSG from all neighbors, a node i other than s sends MSG” top;. It is shown below that receipt of 4 all connected nodes i will perform the event (3), in MSG” from all neighbors at node s can be interpreted as finite tim e and exactly once; after this happens, the the signal that the information has indeed reached all links connected nodes. In this way, the propagation of MSG’s {(i, pf ) for all connected i} occurs in two waves: a) from node s into the network for purposes of propagating information, and b) from the will form a directed spanning tree rooted at s; in network back to node s for the purpose of acknowledgaddition, for all i ment. The formal description of the protocol follows. wherej = pf (3.1) W i) ’ tW,L PROTOCOL PIF: The algorithm for node s that receives START is different from the algorithm for all (Note: some nodes may perform (4) before all nodes other nodes. W e shall first give the algorithm for an have performed (3) .); arbitrary node i other than s and then for node s. b) node s and all connected nodes i will perform (4) in Variables of the Algorithm at Node i # s: finite tim e and exactly once; moreover rnf shows if node i is currently participating in the wherej = p,! ; t((3)i) 5 t(t4)i) < t((4)j), protocol (values 0,l); (3.2) N;“(Z) marks receipt of MSG” from neighbor 1 (values 0, l), 1 E si; also, when node s performs (4), all connected nodes p; is the neighbor from which MSG” was received first. will have completed the algorithm, i.e., performed (4); Messages Sent and Received by the Algorithm at Node exactly one MSG travels on each link in each direci # s: MSG” and MSG”(I) with the same m e a n ing as tion. MSG in PI. Algorithm for Node i # s: (Assumption: just before enProof: a) and c) follow from Theorem PI-l. To prove tering algorithm, node i has rns = 0, pT = nil, N:(I) = 0 for b) let k be a leaf of the tree referred to in a), i.e., $1 such all I E $.) that p; = k. Then all neighbors n of k will send MSG to k whenever they perform (3),. Node k will receive all these (1) For MSG”(Z) messagesand will be able to perform (4)k. At that tim e it N;‘(l) + 1; (2) will send MSG to pi. The same will be true for all leaves. if rnf = 0, then rns +- 1; pf + 1; send MSG” to Now nodes that are on the last-but-one level in the tree will (3) all neighbors except pj. be able to perform (4) and the procedure will continue if VI’ E Si holds N;(Z’) = 1, then send MSG” to downtree all the way to node s. This argument also proves (4) pf; rnf * 0; set N,“(I’) +- 0 VI’ E gi. (3.2) and completes the proof of the theorem. Algorithm for Node s: For node s, the variables are mz, Nl( I’) for all I’ E gs, the messagesare MSG”( Z) and START and the algorithm is as follows.

(3) (3) (4)

For START rn: + 1; send MSG” to all neighbors. For MSG”( I) N,“(I) +- 1; if Ns’(I’) = 1 holds, Vl’ E gX, then rni t 0; set Ns’(I’) +- 0 VI’ E $.

Note: The lines in the algorithm for s have been numbered to denote similar operations as in the algorithm for an arbitrary node i. 3 W e use superscript s throughout the description of the present protocol to explicitly indicate the node that propagates the information.

IV.

CONNECTIVITY

TEST PROTOCOLS

The purpose of this class of DNP’s is to allow each node to learn what nodes are connected to it. The simplest protocol to accomplish this, referred to as CTl, is to use protocol PI repeatedly: first to inform all nodes that the protocol is in progress, and then for each node to propagate its own identity. Every node (or several nodes) can start the protocol by receiving START. A node enters the protocol whenever it receives either START or the first control messagefrom any of its neighbors. The first action taken by a node when entering the protocol is to send a control message containing its own identity to all its neighbors, thereby starting propagation of this identity. In addition, whenever a node i receives the first control message MSGj with the identity of some other nodej, it marks

26

IEEE TRANSACTIONS

j as connected and sends a message MSGj with the identity of j to all neighbors. All further messages with the identity of j are discarded with no action taken. The main properties of the protocol are given in the following theorem. Its formal specification and validation appear in [8]. Theorem CT1 :

a) If node j is connected to i and START is delivered to any node connected to j (or to j itself), then the identity of i will be known at j in finite time and, conversely, if i and j belong to disconnected networks, then the identity of i will never be known at j. b) With protocol CTl, there is no way for node j to know for sure what nodes are disconnected from it. In other words, there is no way for j to know when the algorithm is completed, except for the case when all nodes are connected. Communication Cost: The number of bits transmitted on each link in each direction is 1 V 1log, ] v) . This is because every identity travels exactly once on each link in each direction, there are 1 V ] identities and it takes log, 1 v] bits to describe an‘identity. The total number of bits in the network is 2 I E ) ] VI log, 1 VI , where E is the number of bidirectional links.

The rest of this section is devoted to the presentation of several protocols that solve the problem raised in Theorem CTl-b), namely how to allow nodes to positively know that the protocol has indeed been completed. We shall say then that the protocol has the termination property. Protocol CT2 achieves the property by employing the basic protocol PIF, while the others use a different idea. Protocol CT2: The protocol is started and entered by nodes in the same way as in CTl. Whenever a node i receives the first message MSGJ with the identity of j from neighbor 1, say, a node i denotes this neighbor (as in PIF) with a special mark p{, and sends MSGj to all neighbors except p/. When it observes that it has received MSGj (for any j # i) from all neighbors, node i sends MSGj to p{. It is shown in Theorem CT2-1 that after having received MSG’ from all neighbors, node i can distinguish between the set of nodes to which it is connected and the nodes that are disconnected from it. Consequently, the termination property holds for Protocol CT2. Variables of the Algorithm Mi

d/ WC4 P:’

at,Node i:

WORK while i is participating in the protocol and NORMAL after completing the protocol; designates knowledge at i about connectivity to j (values 0,l) for all j; shows whether MSGj has been received already from neighbor 1 (values 0,l) for all j and I E gi; neighbor from which MSGj has been received first, for all j # i.

Messages Received and Sent by the Algorithm

Same as in CTl.

at Node i:

ON INFORhtATION

THEORY,

VOL.

IT-29,

NO.

1, JANUARY

1983

Algorithm for Node i: (Assumption: just before node i enters the algorithm, it has d/ = 0, N:(Z) = 0 for all j and all f E Si.)

(1) For START or MSGj(l), j # i, if MSG, then N;‘(I) + 1; (la> if di = 0, then M, + WORK, d; + 1, send (2) MSG’ to all neighbors; if MSG and d,j = 0, then d/ + 1; p,j +- I; send (3) MSGJ to all neighbors except p,!; if N;‘(I’) = 1 holds vl’ E Gi, then send MSGj (4) to p:‘; set Ni(1’) + 0 V’1’E gi. (5) For MSG’( I) Nii(l) + 1; (5a> if Ni’(Z’) = 1 holds VI’ E gi, then J4, + (6) NORMAL; set Nii(I’) +- 0 VZ’E gi. In order to analyze the protocol, we shall need the following notation (see also notations just before Theorem PIF-1): (. )/--the event of node i performing line (. )’ of its algorithm regarding node j (i.e., reacting to receipt of MSG’). The properties of the algorithm are given in the following. Theorem CT2-I: Suppose START is delivered to any node connected to a given node j (or to j itself). Then you have the following.

a> If node j is connected to i and START is delivered to any node connected to j (or to j itself), then dj will become one in finite time, and if i and j belong to disconnected networks, then dj will remain zero forever. Node j will perform (6)j in finite time and exactly b) once, and when this happens, it will have dJ! = 1 for all connected nodes k, and dJk = 0 for all disconnected nodes k. In other words, it will positively know at that time what nodes are connected, resolving the problem raised in Theorem CTl-b). Proof: The event Mk + WORK propagates as in PI and hence will happen in finite time at all nodes k connected to the node that received START. For a given node i, after di becomes one, the event di t 1 (i.e., (3yk in the present protocol) propagates in the same way as (3)k in PIF and hence (cf. Theorem PIF-1) it will happen in finite time at node j, completing the proof of a). The proof of b) proceeds by induction on the nodes of the tree rooted at j referred to in Theorem PIF-1. The details appear in [8]. Communication Cost: Observe that by Theorem CT2-1, ,the communication requirements of CT2 are the same as those of CTl, namely 1 VI log, ) 71 bits per link in each direction. Observe however, that the storage and processing requirements, as well as the required execution time are larger than in CTl. Protocols CT3-CT5 use a different idea for achieving the termination property. CT3 is quite wasteful in terms of communication requirements, but it is convenient in order

SEGALL:

DISTRIBUTED

NETWORK

to illustrate the idea and to be used as a basis for developing the more efficient versions CT4 and CT5. In addition, it can be used for different purposes, like learning the network topology. Protocol CT3: Supposewe use protocol CTI, except that for each node we propagate not only the identity of the node, but also of its neighbors. In other words MSGj of CT1 will now carry the identity of j as well as of all its neighbors, i.e., will have the format MSGj (X,), where XJ contains the identities of all neighbors of j. The termination property is achieved using the fact that, if a node k receives MSGj (X,), it will eventually receive MSG’ (Xi) as well, for any i E XJ, and the termination signal will occur when node k will have heard from all these nodes. The algorithm at each node has two stages, where in the first one the node learns the identity of its neighbors and in the second it proceeds with the protocol as described before. The description and validation of the protocol appear in [8]. Communication Cost: O n each link in each direction we need log, ( v] bits for the messagesbelonging to the first stage, and I V ] (D + 1) log, I p( bits for the second stage, where D is the average degree of the nodes (average number of neighbors). Clearly D = 2 ] E I/] VI and-hence the communication cost is (2 I E ) + 1 VI + 1) log, I V 1 bits per link in each direction. As mentioned before, protocol CT3 e m p loys too much communication and its performance can be considerably improved. One way is to use the position of a variable in a vector to indicate the identity of a node, instead of explicitly mentioning it. This idea was used in a protocol by F inn [3], and we present here an improved version of that protocol. Protocol CT4: Variables of the Algorithm at Node i. M, df

27

PROTOCOLS

same m e a n ing as in CT2; 0 before entering algorithm,

1 while looking for identity of neighbors, 2 while looking for all connected nodes; d,j 0 when i knows nothing about j (for j # i), 1 while i knows j only as a neighbor of another node, 2 while i knows j directly; Ni(l) shows if WAKE has been received from neighbor 1 (values 0, 1).

{0,0;~~,0,1,0;~~,0}, N,(l) +- 1; if d: = 0, then same as (la);

then

Pa) (2b) PC) W) (3) Pa)

ifD(l)=

(4)

l,O ;..,O} and M, = WORK, then if dli # 2, then set d,? - max {dk, dk} Vk else, if 3 j such that dj = 2 > d/, then set d,! t max{dl”, dk} Vk; send Di to all neighbors. if dj = 2 or 0 holds Vj, then M, + NORMAL.

(da) (5) (5a) (6)

d;+ I; if Ni(l’) = 1 holds Vl’ E $, then dr +- 2; set N,(1’) t 0 Vl’ E $; send Di to

all neighbors; if

D(1) # {O,O;..,O,

Observe that messages of type (0, 0, . . . , 0, 1, 0, . . . , 0} replace the messagesof the first stage in protocol CT3. Note also that F inn’s [3] protocol requires a node to send messagesevery tim e its table is updated, while here messages are sent only when relevant new information is received (see (4a), (5)). In this sense,the present version is more efficient than [3]. The properties of the protocol are summarized below. Theorem CT4: Suppose START is delivered to some node. Then we have the following.

4

Exactly one message{ 0, 0, . . . , 0, 1, 0, . . *, 0} traverses each link in each direction and this is the first message on each link. b) (3) happens at all connected nodes exactly once and then df Z- 1 for all neighbors 1 of i. c>No more than ) VI messages with format # {O,O; * .) 1,o;. .) 0} traverse each link in each direction. 4 For each node j in the connected network, d{ is increased to 2 exactly once at each node i and after this happens, holds dk 2 1 for all neighbors k of j. 4 Every node i will perform (6) in finite tim e and when this happens, it will have d/ = 2 for all connected nodesj and d{ = 0 for all disconnected nodesj.

Proof: Lines (l), (2c), and the fact that the propagation is as in PI imply a) and b). From the algorithm it is clear that d/ can only increase and from (3a) and (5a) follows that a message# { 0, 0, . . . , 1, 0, . . . , 0) can be sent by i only when some d/ is set from 0 or 1 to 2, and this can Messages Sent and Received by the Algorithm at Node i: happen once for eachj. Hence c). For a given node j, the increaseof d/ to 2 is triggered by START; d{ + 2 and propagates as PI, except that it is stopped at Di = {d;, d;; . ., dj’i}, messagesent; nodes i for which dl # 2 (cf. (5a), (4a)) and the propagaD(1) messagereceived from neighbor I; we denote its tion continues as soon as d/ + 2 (cf. (3a)). This proves the contents by {d’, d*; . .,di’l}. first part of d). To prove the second part, observe from Algorithm for Node i: (Assumption: just before node i (2d), (4a), (5a) that after a messagearrives at node i say, d,! 2 dk holds for any k (where d,! is an entry in the vector enters algorithm, it has all dJ’= 0 and Ni(l) = 0 for all j D, and dk the corresponding entry in the arriving message). and all 1 E 9,.) Also b) shows that whenever dz! t 2, we have d,! 2 1 for all (1) For START neighbors k of i. Consequently d/ = 2 implies dzk 2 1 for (la) df c 1; M, t WORK; send 0, to all neighbors. all neighbors k of j. This completes the proof of d). Since (2) For D(l) dj must remain 0 for all disconnected nodes, (6) will occur

28

IEEE TRANSACTIONS

in finite time with dj’ = 2 for all connected nodes, and dj = 0 for all disconnected ones. It remains to show that (6) cannot occur at node i before d/, becomes 2 for all connected nodes j. Suppose the contrary. Then there must exist a set of nodes ?I containing i, where 9 is not the entire network, and d,!’= 2 for n E ?!4 and d: = 0 for n @ ?I. From d) it then follows that d,! I 1 for all k that are neighbors of any node in 94, contradicting dk = 0 for all k @ ?I. This completes the proof of e) and of the theorem. Communication Cost: Each message contains 2 I VI bits and hence at most 2 ] v( (( V ( + 1) bits will travel on each link in each direction. Protocol CT3 can be improved in another way, resulting in a more efficient protocol CT5. Protocol CT5: Consider protocol CT3 with the following variation. Whenever receiving MSGj(T,), a node i consults its table containing {dlk}. If dj = 2, the M S G is discarded, since such a M S G has been previously received and forwarded to all neighbors. (This part is the same as in CT3.) If dj c.2, then di + 2 and the M S G is sent to all neighbors, but now, before sending MSGj( X,), the following pruning operation is performed. For all k E xj, if dzh 2 1, then k is deleted from 3c,; otherwise k is not deleted from Xj and the variable dk receives value 1. Then MSGJ(X,) is sent to all neighbors. Node k can indeed be deleted when d,k > 1 because in this case k has been sent before by i to neighbors, either as a neighbor of some node, in which case, d,k = 1 or in MSGk, in which case d,!‘= 2. One way or the other, there is no need to send k again. All properties of CT3 hold here as well, but the pruning operation assures that the identity of each node k travels no more than twice on each link in each direction: once as a neighbor of some node and once in MSGk. Hence the communication cost is bounded by 2 ) VI log, I VI bits per link in each direction.

V.

MINIMUM-HOP-PATHPROTOCOLS

The problem considered next is to obtain the paths with smallest number of links (hops) from each node to each other node. As before, at the beginning of the algorithm a node knows only its own identity and the. adjacent links. When the algorithm is completed at a node i, we want the node to know its distance d/‘ in terms of number of links to all other nodes to which it is connected and a preferred neighbor p,? through which it has the m inimum-hop path to k. Observe that we do not require nodes to know the entire m inimum-hop path. If the travel time of control messageswere identical on all links, then we could have accomplished the m inimumhop path by using protocol PI (see Theorem PI-l c)). However, as stated before, such an assumption is not practical, and the problem is to design a DNP where nodes will receive the first messagewith a given identity from the neighbor providing the shortest path, even if link delays are arbitrary. Such a protocol has been proposed by Gallager [l] and here we give its formal description and validation.

ON INFORMATION

THEORY,

VOL.

IT-29, NO.

1, JANUARY

1983

Protocol MH: A node enters the algorithm in the same way as in the CT protocols, namely when receiving START or the first control message,and at that time it sends its own identity to all neighbors. After having received the identity of all neighbors, node i knows all nodes that are at distance 1 from it. Node i keeps this information, sends it to all neighbors and then waits to receive the lists of all nodes that are at distance 1 from each of its neighbors. The union of these lists~minus the set of nodes already known to’i, i.e., those that are at distance 0 or 1 from it, is exactly the set of nodes that is at distance 2 from i:This information is kept again at i and also distributed to neighbors, and the procedure is repeated. If at some level, the union of the lists received from all neighbors contains no nodes that are unknown to i, then node i has completed the algorithm. It sends to all neighbors a messagesaying that it has no new node identities to send and stops. Any further messageit may receive is disregarded. Variables of the Algorithm at Node i: 4

P” ‘i

M,

Ni(l)

distance from i to k, set initially to I VI for all k (values 0, 1; . .,I VI); preferred neighbor from i to k, for all k; state of node i showing distance covered by the protocol up to now (values - 1, 0, 1,. . . , 171 -1); shows if node i is currently participating in the protocol (values NORMAL, WORK); level of last messagereceived on link (i, 1) (values - 1,O; . . ,I VI -l), for I E Si.

Messages Sent and Received at Node i:

MSG(LISTi)-message sent by node i, M S G (I, LIST) = MSG(LIST) received on link (i, l), START. Algorithm for Node i: (Assumption: just before node i enters algorithm, it has pf = nil, dlk = I VI for all k, Zi = q(m) ,= - 1 for all m E gz.)

(1) (2)

For START or MSG(1, LIST) if&= -1, thend!+O, M ,+-WORK,Z,+O, LIST, = {i}, send MSG(LIST,) to all neighbors; if M S G and M, = WORK, then (3) &(I) +- NJZ) + 1; (4) Vk E LIST, then (5) if d,! > Ni(Z) + 1, then dLf + Ni(l) + 1, (5a>

(6)

(6a) (7)

pk t 1; if Z, I Ni(m) holds V m E 9,, then Z, t Z, + 1, LISTi = {k I dk = Z,},

sent

MSG(LIS’I;) to all neighbors; if LISTi = +, then M, + NORMAL.

Note: Observe that the variable pk is not needed by the algorithm, and only designates the neighbor corresponding to the M H path to k. The following definition and theorem summarize the main properties of the protocol. The proof is given in the Appendix. Definition: The number of links on the m inimum-hop path from i to k is called the hop-distance from i to k.

SEGALL:

DISTRIBUTED

NETWORK

Theorem MH-1: SupposeSTART is delivered to a node (or several nodes). Then every connected node i a) will enter the protocol (i.e., perform (2)) in finite time; b) will complete the protocol (i.e., perform (7)) in finite time, with d,!, p,” corresponding to the m inimum-hop path from i to k for all connected nodes k and with dz! = 1 v] , pf = nil for all disconnected nodes k. Communication Cost: From the proof of Theorem MH-1 in the Appendix follows that the identity of every node travels exactly once on each link, and hence we need 1 V 1log, ] v 1 bits on each link in each direction.

VI.

29

PROTOCOLS

PATH-UPDATINGPROTOCOLS

In the protocol of [2], [4] each node m a intains a path to each other node in the network and updating “cycles” allow these paths to be changed so that they are improved in each,cycle and, in addition, the collection of paths to any given node form at any given tim e a loop-free pattern (i.e., a tree). Here we present first the fixed-topology part of the path-updating protocol and then show that protocol CT2 can be used to initialize it. The validation of both is based on the PIF basic protocol. Protocol PU: The protocol updates,paths from all nodes in the network to a given node s and can be repeated independently to update paths to each of the other nodes. Therefore, we can present only the protocol for a given “destination” node s. The protocol is very similar to the PIF protocol, except for two features: first, a tree is initially available and the protocol moves first up and then down on that tree, and second, when moving downtree, the protocol updates the initial tree, so that the resulting paths provide an improvement over the old ones.

N,“(P) = 1, rnf +- 1, send MSG”(d;) to all neighbors, except pf; if N,“(/‘) = 1 holds VZ’E Sj, then send MSG”(df) to p;, pf + k* that achieves m in D;‘(F) over 1’E gi, rns + 0, set Ni’(/‘) + 0 Vl’ E gj.

(4)

The algorithm for s is the same as in PIF except that all messagessent by s have format MSG”(0). Theorem PU-1: Suppose the Assumptions a) and b) given in the protocol hold. Then

a>theorem PIF-1 holds, where pf refers (only in this part) to the initial preferred neighbors; b) the collection of links {(i, pf)} forms at all times a tree rooted at s with the following properties:

i) rnf d$;

c>for

each link (i, Z) the “distance” d;, is measured exactly once by node i; at the end of the protocol, all nodes will have paths to s that are no longer than before the protocol starts, where the length of a path is the sum of the weights of the links; if initially the tree defined by {pf} is not identical to the shortest path tree in terms of the measured {d,,}, then there is a nonempty set of nodes that will strictly improve their paths.

Proof: Observe that the present protocol is identical to PIF, except that (3) is performed by a node i only when MSG is received frompJ (and not as soon as the first MSG is received, as in PIF), the new quantities d;, D,!(l), di, are introduced and the preferred neighbor pf is changed in (4). Protocol PU: Variables of the Algorithms at Node i # s. Now, if we denote by Pl” the initial, tree, (3) and (4) N,“(I), same as in protocol PIF; propagate here exactly as in PIF, provided that in that d,,-distance from node i to neighbor I as measured at protocol a MSG traversesany link in Pl” much faster than the tim e it is needed by the algorithm; can be time-vary- any other link. Since Theorem PIF-1 holds for arbitrary ing (values: any strictly positive real number), 1 E gi; link travel times, assertion a) follows. In order to prove b), df-estimated distance from i to s on the preferred path; suppose the assertionshold up to tim e t - throughout the pf-“preferred” neighbor of i for s; network, and we want to show that if (3) or (4) happens D:(l)-storage for d/ + d,,, for I E Si; at tim e t at some node i, the assertion continues to hold. rnf = 1 after performing (3) and before performing (4); Observe that if (3) happens at node i at tim e t, thenp; is 0 otherwise. not changed and hence the tree property continues to hold. Also, b) ii) is not affected by (3) and hence we only have Assumption: just before START is delivered to s, all to check that b) i) continues to hold. Since4 ms(t - ) = 0, connected nodes i have we have by the induction hypothesis mf( t) = 0 for any j a) p;, df with the property that the collection of links for which p,(t) = i and hence b) i) continues to hold for j (i, pf) form a directed tree rooted at s and also and i after tim e t. O n the other hand, when performing df > d[ where k = pf, i.e., di is strictly decreasing (3), node i receives MSG” from p;, so that pf must have while moving downtree; performed (3) before t and has not performed (4) yet, b) rns = 0, N;‘(l) = 0 for 1 E 9,. implying that rni:( t) = 1 and, since mf( t + ) = 1, assertion b) i) continues to hold after t for i and pf as well. Messages Received and Sent by the Algorithm at i: Now suppose (4) happens at some node i at tim e t. MSG”(df)-message sent; Observe that at that time, i has already received MSG from MSG”(1, d”)-message received. all neighbors and it performs rnf t 0. Consider first any node j such that p,“( t) = i. If p,“( to) = i, where to is the tim e Algorithm for Node i # s: the protocol started, then receipt of MSG at i from j means (1) For MSG”(1, d”) 4 W e write the time in parentheses to indicate the value of a parameter N,“(Z) + 1, Ds( r) + d” + di,; as a specific time. Also, t - and t + denote the time just before and after (2) time 1, respectively. if I = pf, then d,! + m inDf(1’) over I’ such that (3)

30

IEEE TRANSACTIONS

that j has performed (4) before time t. If p,“( to) # i, thenj has changed ps before time t and again this shows that it has performed (4) before time t. Consequently, in either case holds m:(t) = 0 and hence b) i) continues to hold after time t for j and i. Also, from the way dJ”is calculated and p,f is chosen follows that d,” 2 D;(i)

= ds + dji > df,

(6.1)

where the last inequality follows from the assumption dji > 0 (see definition of di,). Consequently b) ii) continues to hold after time t. Now, consider the pair i and k* = pf( t + ). The inequality in assertion b) i) holds trivially after t for i and k* since rnf + 0, while assertion b) ii) holds by the same argument as in (6.1). Now (i, k*) cannot close a loop since by b) i), all nodes 1 in such a loop must have rns = 0, and going around the loop this would imply by b) ii) that ds > df. The proof of c) is quite simple and will be deleted here. The reader is referred to similar proofs that appear in [4, sec. 41 and [6, appendix, lemma 11.

ON INFORMATION

THEORY,

VOL.

IT-29, NO.

1, JANUARY

1983

2) whenever a link fails or recovers, both ends are notified in finite time, but not necessarily at the same time; 3) failure or recovery of a node is considered as failure/recovery of all adjacent links. To make 2) above precise, let I;;(Z) denote a flag indicating the status of link (i, Z) as seen from node i, taking values DOWN or UP if i considers link (i, Z) as down or up, respectively. Then we assume a) if Z$(Z) = F,(i) = UP and e(Z) +- DOWN, then F,(i) becomes DOWN in finite time and before E;(Z) t UP; b) if Fi(Z) = F,(i) = DOWN and q(Z) t UP, then in finite time holds either F,(i) +- UP or c(Z) = F,(i) = DOWN.

Now the idea for extending DNP’s described in the previous sections to account for topological changes is the following: the cycles of the protocol will be labeled with increasing numbers R, every node i remembers the highest cycle number known to it so far, denoted by Ri, and each Communication Cost: Clearly there is exactly one mes- of the cycles corresponds now to the original (nonexsage on each link in each direction. Its size in bits depends tended) protocol. When a node wants to trigger a new cycle as a result of detecting a topological change in an adjacent on the number of bits assigned to d;. Protocol P UI (Path- Updating Initialization): In order to link, it resets its variables, increments the cycle number Ri allow proper evolution of the PU protocol, it is necessary and acts as if it has received START for a new cycle with to initialize it in the sense of building the initial trees this number. Here “reseting variables” means to adjust the {(i, pi’)} for all “d est’ma t’ions” j in the network. In order to appropriate variables to their required initial value as save space, we shall not present the protocol here explicitly, stated in the corresponding assumption in each of the but only mention that this can be done by using protocol algorithms (e.g., in MH, pz! +- nil, d/‘ + ] v] for all k and CT2 with additions of variables like dj’ and D/(Z) with the Z, - - 1, Ni(m) + - 1 for all adjacent m). The number of the new cycle will be carried by all messagesbelonging to same meaning as in protocol PU. For details, see [8]. this cycle and now, any node receiving a message with VII. TOPOLOGICALCHANGES cycle numbers lower than the one known to it so far discards this message.A node i receiving a message with The protocols presented so far assume fixed topology of higher cycle number R than the highest known to it, R;, the network. As such, the CT and M H protocols may be resets its own variables, increases its registered maximal performed only once and similarly, the path-updating procycle number R, accordingly, and acts as if it enters the tocol may be initialized only once (the PU protocol itself protocol now (i.e., the corresponding cycle of the extended should be repeated periodically to account for load variaprotocol). In this way the cycle with higher number will tions). In this section, we present extensions to the above “cover” the lower number cycles, in the sense that when a protocols that take into consideration failures and addihigher cycle reaches any node, the node will forget the tions of links and nodes, the main idea being that whenever previous knowledge and will participate only in the most a topological change is sensed at some node, a new “cycle” “recent” cycle. Observe that several nodes may start the of the protocol is triggered to inform the network of the same new cycle independently because of multiple toponew situation. Since we are working with a distributed logical changes, but the protocol allows this situation to network, we can make no a priori assumption regarding the happen, considering it in the same way as if several nodes number, sequence,or tim ing of topological changes, and as receive START in the nonextended protocol. such the extended protocols must work for all circumThere is a question, whether it is indeed necessary for all stances. nodes to forget their entire previous knowledge, as opposed With topological changes occurring in the network, the to protocols where only the information affected by the assumptions of Section II should be changed accordingly. topological change is discarded, while the rest of the netIn particular, assumptions a) and c) of Section II will be work adapts smoothly to the new situations. For the PU changed now to protocol, such a protocol appears in [2], [4], [7], but for a’) link (i, j) fails/recovers at the same time that link others this is still an open question. (j, i) fails/recovers, so that (i, j) belongs to the The protocol described so far will be referred to as network if and only if (j, i) belongs to the network; version A of the extended protocol. As an example, we c’) 1) each messagesent by node i on link (i, j) arrives shall consider here the Extended M H protocol. To save correctly in finite nonzero undetermined time or space we do not indicate here the explicit algorithm at each the link fails in finite time; node, but we may indicate that it is identical to the

SEGALL:

DISTRIBUTED

NETWORK

31

PROTOCOLS

algorithm for version B except that lines (2)- (2a’“) are replaced by Ri + R, + 1 for both failure and recovery, and line (Id) is deleted (see [8] for details). The properties of version A are summarized below. Consider an arbitrary finite sequence of topological changes with arbitrary tim ing and location. W ithin finite tim e after the sequence is completed, all nodes i in the final connected network will have J4, = NORMAL with the same cycle number R,, with correct_dlk and pk for all connected nodes k and with df = 1 T/I , pk = nil for all disconnected nodes k. Theorem EMH-A-l:

Proof: Each topological change increments the cycle counter Ri at nodes i adjacent to the change, every change in a link status affects two nodes, and every change in a node status affects a finite number of nodes. Let {in} be the collection of nodes that register change of status of an adjacent link, including those due to status changes of the node at the other end of a link, and let {tn} be the corresponding collection of times when the status change is registered. Since there is a finite number of topological changes, the collections {i,}, {t,} are finite. Let R = max{Ri,(tn+)} over all n. Then R is the highest cycle number ever known in the network and the cycle with number E is started by (one or more) nodes i E {i,} that increment their R, to R as a result of sensing a topological change. These nodes can be considered as if they receive START in the MH protocol and, indeed, the network covered by the cycle with number F registers no more topological changes, since no counter number Ri is ever increased to (E + 1). Consequently, the evolution of this cycle is the same as in protocol MH and therefore Theorem MH-1 holds here, completing the proof. In version A of EMH as presented above, as well as all other similar extended protocols, there is the problem that the cycle counter numbers {R,} increase without bound and hence the question of how many bits are enough to represent this number. In [l], [3] the authors propose a m o d ified version of the extended protocols that ensure bounded counter numbers and here we give a new proof for its validation. The procedure is illustrated as before on Protocol MH, but similarly can be implemented on CT and PUI. In order to provide a framework for understanding version C, it is convenient to first present and validate a nondistributed version B that will introduce the important features of the procedure and then to explain the equivalence between versions B and C.

Algorithm for Node i: (Definition: “reset variables” means pz!‘+ nil, dk + ] VI for all k, Zi + - 1, Ni(Z) + - 1 for all I for which c(Z) = UP.)

(1)

(la> (lb) (lc) (14 (2) Pa) (2’) (20 (2a”)

(2a”‘) (2a’“) Pb) PC> (3) Pa) W) PC)

(4)-P)

Node i becomes operational (Note: Node i becoming operational forces all operating links (i, 1’) with I’ operating, to become operational for I’); E;(I’) * UP for all operating adjacent links (i, 1’) with 1’operating: E;;(Z’)+ DOWN for all nonoperating adjacent links (i, I’) or those links (i, I’) with I’ nonoperating; reset variables; Ri + 0; M i +- NORMAL; for all I’ for .which q(1’) = UP, proceed as in (2a’). Adjacent link (i, 1) fails; Ri +- Ri ---,1 and proceed to (2b). Adjacent link (i, I) becomes operational; wait until &ii = M, = NORMAL and then if Ri < R,, then R, + R, + (R, - R,) for all nodes r that are connected to i and, furthermore, in all messagesthat have been sent by such node r and not received yet, increaseR by (R, Ri); if R, < R,, proceed as (2a”) for nodes connected to I with increments ( Ri 4); R, * Ri + 1 and proceed to (2b).

&(I) + DOWN or UP according to new status; reset variables; M, t WORK; d; +- 0; Z, + 0; LIST, = {i}; send MSG(R,, LISTi) to all m for which E].(m) = UP. For MSG(I, R, LIST); if R 1 R,, then if R > Ri, then Ri + R; same as (2~); if M, = WORK, then same as (4)-(7) in MH, except that MSG has format MSG( Ri, LIST).

Note that (2) and (3b) here correspond to (2) in MH, while (3~) corresponds to (3) in MH. Clearly, similar extended protocols can be given for the CT and also for the PUI protocols. The m a in property of version B is given in Theorem Protocol EMH (extended MH)- Version B: Variables EMH-B-l, but first we need several definitions. Used by the Algorithm at Node i. Same as in MH with the Definitions: A link (i, I) is said to operating if c(l) = addition F1(i) = UP. Two nodes are said to be connected if there is a highest cycle number known to i (values: 0, 1, . . . ), set of operating nodes and links connecting them. A set of Ri E;;(I) status of link (i, I) as known by i (DOWN, UP). nodes is said to be connected if every pair of nodes in the set is connected. A set of nodes s is said to be at level R if Messages Sent And Received at Node i: m in Ri = R for i E S. A set of nodes s (connected or not) MSG( Ri, LISTi)-sent, is said to be synchronized if either a) or b) below holds. MSG( I, R, LIST) = MSG( R, LIST) received on link a) All nodes i E S have M, = WORK. (4 0.

32

IEEE TRANSACTIONS

ON INFORMATION

THEORY,

VOL.

IT-29,

NO.

1, JANUARt

1983

b) There is at least one node i E S with M, = NORMAL and i) Rj = Ri holds Vj E S with Mj = NORMAL and ii) Rj 2 Ri holds Vj E S with M, = WORK. Theorem EMH-B-l: In EMH-version B, if at any time t a set of nodes S is connected, then it is also synchronized. Furthermore, if the set is at level R at time t and if any node j will be connected at any future time t’ > t to any node i E S, then it will have RJt’) > R. (Note: the first property is the important one; the second is only helpful in the proof.) Proof: We proceed by induction on events happening in the network. Suppose both properties above hold up to time t - . Explicitly, every set of nodes S’ that was connected at any time r < t was also synchronized at that time and every node j that was connected to any node in S’ at any time between r and t - had Rj 2 level of S’ at time r. The events that can happen at time t and affect the properties of the theorem are a) a node becomes operational, b) a link fails, c) a link is brought up, and d) M, or R, is changed. We proceed to check each of the possibilities. First, a node that becomes operational will not connect to the rest of the network until (Id) holds, so that this case reduces to c). Second, if the set was synchronized at t - and a link fails, it will remain synchronized just after the failure, except that one has to take into consideration that the failure causes changes of Mi and Ri at the adjacent nodes. However, these changes are treated in d). Observe next that c) can happen only if M,(t - ) = M,(t - ) = NORMAL, where (i, I) is the link under consideration. Suppose first that i and I do not belong to two disconnected sets at time t - . Then, since the set under consideration is synchronized at time t - by the induction hypothesis, it follows that Ri(t - ) = R,(t - ). Hence (2a”) and (2a”‘) do not apply, and therefore the only relevant variables that are changed are Mj, Ri, M,, R, (lines (2a’“) and on), and this again reduces to d). Suppose now that the new link (i, 1) does connect two previously disconnected sets. If R,(t - ) = R,(t - ), the same argument as before applies. If for example, Ri(t - ) < R,(t - ), let t’ be the time just after execution of (2a”), but before execution of (2a’“). Recall that Mi(t - ) = M,(t - ) = NORMAL and since each of the sets are synchronized at t - , we have R,(t - ) 2 Rj( t - ) for all r connected to i and R,(t - ) I R,(t - ) for all r connected to I, with equality in both cases for those nodes r that have M,( t - ) = NORMAL. Now, from t - to t’ all nodes r connected to i raise their cycle number R, by R,(t - ) - Ri(t - ) and so do all messages in transit, and hence, the new combined set remains synchronized at t’. Now the transition from t’ to t + is the execution of (2a’“) and on, and again this reduces to case d), which is treated next. Observe that Rj is increased if and only if M, is WORK or becomes WORK, and clearly if at t - the set was synchronized, it will remain so at t + . Therefore the only situation that remains to be treated is M, + NORMAL, in which case Ri is not changed. Let R be the value of Rj at time t - (and t + ). We must show

Fig. 1.

that for all nodes j E S,, where S, is the set of nodes connected to i at time t, we have Rj( t) L R, with equality if Mj( t) = NORMAL. But since S, is synchronized at t by the induction hypothesis, the condition M,(t) = NORMAL requires RJt) I R for allj E S,, and therefore it is sufficient to show that Rj(t) 2 R for all j E S,. At time t, node i performs Mj t NORMAL and let P be the set of nodes k for which dk(t ) -=c1 v( . Nodes k E P certainly have Rk(t) 2 R. Now take any node (Y such that (Y E S,, but LYti! P. We want to show that R,(t) I R. Observe that there must exist a node /3 E S,, ,8 4 P such that p is at time t a neighbor of a node y E S, fl P (see Fig. 1). Since /? 65 P, node p was disconnected from y at some time after R, + R. Let S,Y_be the connected set containing y at time 7 - , where 7 < t is the time when link (p, y) was brought up. At that time M, was NORMAL and R, 2 R and by the induction hypothesis, S,Y_was at level 2 R. In addition, by the second assertion of the Theorem that holds at time 7 - because of the induction hypothesis, the fact that cx is connected at time t > 7 to y E S,Y_implies R,(t) 2 level of S,Y_at 7 - 1 R. This completes the proof of case d) and shows that the connected set remains synchronized. It remains to prove that any node that will become connected to any node in the considered connected set S, will have at that time counter number 1 R. At time t, every node i E S, has Ri 2 R and R, never decreases. Let t’ be the first time after t when a node j’ becomes connected to any node i E 5,. Since until that time all connected sets are synchronized, it must hold that RjI(t’) 2 R by the same argument as in c) above. Consequently, after t’ all sets remain synchronized and the same argument shows that the property remains true for all future connections, completing the proof of the theorem. Having proved the main properties of version B, we can now make a few observations about this (nondistributed) version that will allow us to introduce an equivalent distributed version. Lemma EMH-B-I: In EMH-version B, the following properties hold a) If (2a’) holds at time t and nodes i and 1 are connected at time t - , then Ri(t - ) = R,(t - ) and hence (,a”)-(,,“‘) are not performed. b) Theorem EMH-A-I holds for version B as well. c) For any node r, R, is nondecreasing and, unless R, is increased by (2a”) or (2a”‘), it has increments of

SEGALL:

DISTRIBUTED

NETWORK

+ 1; if a node i sends two consecutive messageson a given link with counter numbers R’ and R” and if the second messageis not related to the performing of (2a”) or (2a”‘), then R” = R’ or R” = R’ + 1. d) If node i receivesMSG(I, R, LIST) and R > R,, then LIST = {I} and this message was sent by 1 while increasing R,, i.e., either in (2~) or (3b), but not in (6b). e) The values of R and R, are not necessary for the algorithm; we only need to know if R < R,, R = R, or R > R,. Proof: a) follows from Theorem EMH-B-l. b) can be

proved exactly as Theorem EMH-A-1. For c), the fact that R, and the counter numbers in consecutive messagescan

only increase is obvious from (2a), (3b), (2a”), (2a”‘). The rest can be proved by a common induction as follows: suppose both properties hold until tim e t - . The counter R, can be increased at tim e t only in (2a) and (3b) and in both casesonly by + 1. Furthermore, the messagewith R” can be sent by i only in (2a) while incrementing R, by 1, in (3b) while incrementing R, to R which is exactly Ri + 1 by the induction hypothesis, or in (6b) while m a intaining R, to the previous level. This proves both c) and d). F inally, e) is clear from the algorithm. Protocol EMH-version C: Variables Used by the Algorithm at Node i. Same as in MH and in addition c.(I) as in version A and Q ,(l) = R, - XR,(l) where XR,(l) is the largest counter number received from neighbor 1 in version

B. Messages Sent and Received at Node i:

MSG( A Ri, LIST;)-sent, where A Ri has the m e a n ing of the difference between Ri in version B and last R; sent on this link; MSG( I, AR, LIST)-received. Algorithm for Node i: Definition: “reset variables” has the same m e a n ing as in version A.

(1) (la-lb)

cc> (Id) (2) Pa) (2’) Pa’) (2b) PC)

33

PROTOCOLS

Node i becomes operational (same Note as in version A); same as in versions A and B; reset variables; Q i(/‘) + 0 for all 1’ for which E;( 1’) = UP; M, + NORMAL; if there is an operational link (i, 1’) for which M,, = NORMAL, proceed as in (2a’); else wait until this happens and then proceed as in (2a’). Adjacent link (i, 1) fails Q;(l’) t Q i(l’) + 1 ‘dl’ # 1 for which &(I’) = UP; proceed to (2b); Adjacent link (i, 1) is operational and q.(l) = F,(i) = DOWN and M, = M, = NORMAL; Q ;(l) t 1; Q i( 1’) + Qi(l’) + 1 VI’ for which &(I’) = UP; Q,(Z) (note: i.e., AR = 1, Q i( I) = 0), then Q i(l’) t Q ;(F) + 1 VI’ for which lql’) = UP; same as (2c); Q ;(l) + 0; if M, = WORK, then same as in (4)-(7) in MH, except that MSG has format MSG(0, LIST.).

Note: Observe that knowledge by i of the variables F,(i) and MI requires a local protocol on every link (i, 1). W e have numbered the lines in version C to correspond to the appropriate lines in version B. The note appearing in (3b) holds because of L e m m a EMH-B-l c). Observe that (2a’) in version C is equivalent to (2a”), (2a”‘) of version B. This is because if i and 1 are connected at tim e t - , where t is the tim e of the event occurrence, then in version B we have R,(t - ) = R,(t - ) from L e m m a EMH-B-1 a) and (2a’) in version C says exactly the same thing. If, on the other hand, i and 1 are disconnected at tim e t - , the effect of bringing Ri(t - ) and R,(t - ) to the same level while raising accordingly all appropriate counter numbers is equivalent to (2a’) of version C. This implies that versions B and C are equivalent. Furthermore, version C is distributed and the counter numbers are bounded as shown below. Theorem MHE-C-l:

a) The counter numbers AR in MSG take values 0 and 1 only. b) For every i and I, the variable Q i(l) 5 1E 1 , where 1E 1 is the number of links in the network. Proof: All messagessent in the algorithm have AR = 0 or 1 and this proves part a). To see that b) holds, observe that Q i( 1) can increaseonly if, while link (i, I) is operating, node i keeps sending MSG( 1, LISTi) to 1, but I does not respond. After the cycle corresponding to the first of these messagescovers the entire network (or is covered by another cycle), no link can be brought up, since lack of response from 1 does not allow any other node k to return to Mk = NORMAL. Therefore the worst case is when all links fail one after the other in such a way that each increments Q,(Z) and the total number can be no higher than 1E 1 - 1 (for all links except (i, 1)) plus 1 for the case when (i, 1) just came up.

VIII.

CONCLUSION

W e have addressed the problem of providing formal description and validation to a number of distributed

34

IEEE TRANSACTIONS

network protocols. After introducing two simple basic protocols in Section III that form building blocks and a unifying framework for the more complex ones, we introduced three classes of DNP’s-connectivity test, m inimum-hop paths, and path-updating. For each we provided the algorithm for the nodes participating in the protocol and formal proof of its validation, extensively using the properties of the basic protocol on which it is based. Finally, we presented a unified way to extend those protocols to the case of changes in the network topology.

4

ii)

iii) iv)

N,(l)(t-)=Z1 and 3mfl with Ni(m)(t-)=Z - 1; then N,(l)(t + ) = Zi(t + ) = Z and all other N(.) do not change, hence d) continues to hold at time t + ; N,(l)(t -) = Z - 1 and $m #I with N,(m)(t - ) = Z - 1; then N,(l)(t + ) = Z and Z,(t + ) = Z + 1, since (6) holds at t, and d) continued to hold at t + ; N,(l)(t -) = Z, in which case N,(I)(t +) = Z + 1 and Z,( t + ) = Z, hence d) continues to hold at time t + ; we claim that N,(I)( t - ) cannot be Z + 1. Suppose N,(l)(t - ) = Z + 1. Then Ni(!)(t + ) = Z + 2, and from b) it follows that at time tl < t, node 1 has sent MSG(LIST,) while Z, = Z + 2. From (6), (6a) we have Z,(tl - ) = Z + 1 and N,(i)(tl + ) Z Z + 1. This means that 3t2 < tl when i has sent MSG(LIST) to I, while

IT-29, NO.

1, JANUARY

1983

Lemma MH-2: Recall the definition of the term hop-distance just before Theorem MH-1. Under the same conditions as in Lemma MH-1, the following properties hold.

b)

If a node i has nodes at hop-distance r, then it sets Z, + r in finite time and then sends MSG(LIST,), where LIST, contains exactly all nodes k that are at hop-distance r; for all those nodes holds d,k = r and pf = first link on minimumhop path to k and these dk, p” are final Let S, be the largest hop-distance from node i in the network, i.e., node i does have nodes at hop-distance S,, but not at hop-distance (S, + 1); then node i will set Z, + (S, + 1) in finite time, at which time it sends MSG(LIST,) with LIST, = 0 and performs (7); node i will not increase Z, any further. Proof:

at the same time; if MSG is sent by i while Z, = Z, receipt of the MSG at neighbor 1 will cause N,(i) + Z; Z, and N,(m) for each m E 9, change only by increments of iI; for each m E gi, holds N,(m) = Z, or Z, * 1 and there is at least one m for which N;(m) = Z, - 1 (note: this implies Z, = min,Ni(m) + 1); no message can arrive on links (i, m) for which N,(m) = 2, + 1; if Z, is incremented at time t, then for all m E 9, holds N,(m)(t + ) = Z,(t + ) or Z,(t + ) - 1.

Proof: a) holds since propagation of (2) happens as in PI. Assertion b) holds since Z, is incremented whenever MSG is sent ((2), (6)), N,( 1) is incremented whenever MSG is received from [(cf. (4)) and both are initialized to - 1. In addition, c) follows from (2), (4), and (6a). Property d) is true immediately after the time node i enters the algorithm, at which time either Z, = 0 and min, N;(m) = - 1, or Z i = 1 and min, N,(m) = 0, the latter if i has only one neighbor and enters the algorithm by receiving MSG from it. Suppose now that the property is true at node i up to time t - and we want to show that it will hold at time t + as well. The variables N,( .) or Z, can change at time t only if a MSG is received, from neighbor I say. Let Z,( t - ) = Z. We have several cases:

VOL.

This completes the proof of d). Observe now that e) is exactly case iv) in d). Finally, observe that scanning cases i)-iv) of d), we see that Z, is incremented only in case ii) and f) clearly holds in this case, completing the proof of the Lemma.

4 i will enter the protocol in finite time; b) messages are sent by node i if and only if Z, is incremented cl

THEORY,

Z,(t2 + ) = Z + 1. But the latter and Z,(r - ) = Z contradicts the monotonicity of Z, (see c)).

APPENDIX The proof of Theorem MH-1 is given in the following two lemmas. The first one indicates several preliminary properties of Protocol MH connected to message exchanges and variable updates, while in the second we use Lemma MH-I to validate the basic properties of the protocol. Lemma MH-I: Suppose START is delivered to a node (or several nodes). Then for any connected node i the following properties hold.

ON INFORMATION

b)

Setting of Z, +- 0 while sending MSG(LIST,) with LIST, = {i} propagates as in PI and hence will happen at all nodes in finite time. Now suppose a) holds for all nodes that have nodes at hop-distance (r - 1). Consider a node i that has nodes at hop-distance r. Then it and all its neighbors m have nodes at hop-distance (r - 1) and, by the induction hypothesis, they set Z,,, + (r - 1) and send MSG(LIST,). When such a message arrives at i, it sets N,(m) + (r - 1) and after all such messages arrive, (6) will hold with Z, = (r - 1). This causes Z, + r. At this time we have from Lemma MH-1, N,(m) = r or (r - 1) for all m. Now suppose k is at hop-distance r from i. Then there is a neighbor m of i such that k is at hop distance (r - 1) from m and there is no neighbor m of i such that k is at hop-distance strictly less than (r - 1) from m. By the induction hypothesis, k was sent by m in MSG(LIST,) while Z, t (r - 1) and hence was received at i while N,(m) +- (r - l), but was sent by no neighbor m ’ while Z,, + Z < (r - 1). Hence at the time Z, + r we have df = r, and therefore k is sent in MSG(LIST,). From (5a) it is clear that this d,k and the corresponding p,” are final and correct. A similar argument shows that nodes at hop-distance other than r cannot be included in the LIST, considered above. First consider a node i such that S, = min{S,} where the min is over all nodes in the network. All its neighbors m have nodes at distance Si and by a) they send MSG(LIST,,,) while Z,, + S,. When all these messages arrive to i, Z, will become S, + 1, but since i has no nodes at hop-distance S, + 1, holds LIST, = 0 and hence i performs (7). Now suppose by induction that b) holds for all nodes i for which S, 5 S - 1. Consider a node j with S, = S. Node j has a node k at hop-distance S and k is included in LIST, when j sends MSG(LIST,) while Z, + S. For an arbitrary neighbor m ofj, node k is at hop-distance (S - l), S or (S + 1) from m and hence S,,, 2 S - 1. If S,, > S, then a) implies that Z, will become S in finite time. If S, = S - 1, then Z, will become S in finite time from the induction hypothesis. Hence from Lemma MH-I b), N,(m) will become S in finite

IEEE TRANSACTIONS

ON INFORMATION

THEORY,

VOL.

IT-29,

NO.

1, JANUARY

tim e for all neighborsm of j and hence Z j will become (S + 1). Sincej has no nodesat hop-distance(S + l), (7) will hold and this completesthe proof of the lemma. L e m m a MH-1 a) and L e m m a MH-2 a), b) are exactly TheoremMH-1 and this completesthe proof of the theorem.

1983

[3] [4] [5] [6]

REFERENCES

[7] [ 1] [2]

R. G. Gallager, “A shortest path routing algorithm with automatic resynch,” unpublished note, March 1976. A. Segall, P. M. Merlin, and R. G. Gallager, “A recoverable protocol

[8]

35

for loop-free distributed routing,” Proc. ICC, June 1978. S. G. Finn, “Resynch procedures and a failsafe network protocol IEEE Trans. Comm., vol. COM-27, no. 6, pp. 840-846, June 1979. P. M. Merlin and A. Segall, ‘A failsafe distributed routing protocol IEEE Trans. &mm.. vol. COM-27. no. 9. DD. 1280-1288. Sent. 1979. P. M. Merlin, “Specification and validaiion of protocols:’ IEEE Trans. Comm., vol. COM-27, no. 11, pp. 1671-1681, Nov. 1979. A. Segall, “Optimal distributed routing for virtual line-switched data networks, IEEE Trans. Comm., vol. COM-27, no. 1, pp. 201-209, Jan. 1979. -, “Advances in verifiable failsafe routing procedures, IEEE Trans. Comm., vol. COM-29, no. 4, pp. 491-497, Apr. 198 1. -, “Distributed network protocols,” EE Publ. 414, Dept. Elec. Eng., Technion-I.I.T., July 1981; also MIT Rep. LIDS-P-1015.

On Secret Sharing Systems EHUD D. KARNIN,

JONATHAN W . GREENE, MARTIN E. HELLMAN, FELLOW, IEEE

STUDENT MEMBER, IEEE,

Abstract-A “secret sharing system” permits a secret to be shared among n trustees in such a way that any k of them can recover the secret, but any k - 1 have complete uncertainty about it. A linear coding scheme for secret sharing is exhibited which subsumes the polynomial interpolation method proposed by Shamir and can also he viewed as a deterministic version of Blakley’s probabilistic method. Bounds on the maximum value of n for a given k and secret size are derived for any system, linear or nonlinear. The proposed scheme achieves the lower bound which, for practical purposes, differs insignificantly from the upper bound. The scheme may be extended to protect several secrets. Methods to protect against deliberate tampering by any of the trustees are also presented.

1.

INTRODUCTION

RYPTOGRAPHY is extremely useful for making data files unintelligible to anyone who does not possessthe secret key in which they were enciphered. But what happens if the legitimate owner of the file loses the key or is himself lost through incapacity or death? There is a clear need for providing a backup copy of the key to protect against these eventualities. A safe deposit box can easily store a backup copy of the key on a punch card or similar data storage m e d ium since most keys will be between 50 and 1000 bits long. But even a safe deposit box is vulnerable (e.g., to the “silverfish threat,” named for

C

Manuscript received February 13, 1981; revised February 25, 1982. This work was supported in part by the Joint Services Electronics Program under Contract DAAG29-81-0057 and in part by the National Science Foundation under Contract ENG-10173. This paper was presented at the International Symposium on Information Theory, Santa Monica, CA, February 9- 12, 198 1. The authors are with the Department of Electrical Engineering, Stanford University, Stanford, CA, 94305.

STUDENT MEMBER, IEEE, AND

an insect which eats punch cards), so it may be advantageous to provide m u ltiple backup copies. To guard against simultaneous destruction, these copies should be stored in physically separated safe deposit boxes. Letting oi denote the information stored in the ith safe deposit box and letting s denote the secret key, VI =v2 = . . . =vn=s. The secret s can be recovered even if n - 1 pieces have been destroyed, but theft of even one piece compromises the secret. A different approach protects against the threat of theft, but aggravatesthe “silverfish threat.” Divide the secret key s into n pieces v,, v2; .*, v,, in a manner such that no information about s is learned from any n - 1 pieces. This can be accomplished by letting v, to v,- , be independent random variables, uniformly distributed over S, the set of all possible secret keys, and letting v, = s + (v, + v2 + . . . +vn-,)(mod q), where q = 1S 1 is the cardinality of S. As a small example, when q = 2, v, is the exclusive or of s with v, through v~- r. W h ile a l-bit key is of no value, the technique can be applied to successivebits of the key. The advantage of this method is also a disadvantage: if even one of the vi is destroyed, the legitimate owner is unable to reconstruct the key from the remaining backup information. Motivated by a desire to protect against both threats, several researchershave investigated the following secret sharing problem.

001%9448,‘83/0100-0035$01.00 01983 IEEE