Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication  Marcos Kawazoe Aguilera Wei Chen Sam Toueg Department of Computer Science Upson Hall, Cornell University Ithaca, NY 14853-7501, USA. aguilera,weichen,sam@cs.cornell.edu May 30, 1997 Abstract We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this problem without failure detectors. We then show how to solve it using a new failure detector, called heartbeat. In contrast to previous failure detectors that have been used to circumvent impossibility results, the heartbeat failure detector is implementable, and its implementation does not use timeouts. These results have wide applicability: they can be used to transform many existing algorithms that tolerate only process crashes into This can be applied to consensus, aqtuoimesiccebnrtoaaldgcoarsitth, m¡ -ssetht aatgtroeleemraetnet,baottohmpircocceosmsmcriatmsheenst,aentdc.message losses. The heartbeat failure detector is novel: besides being implementable without timeouts, it does not output lists of suspects as typical failure detectors do. If we restrict failure detectors to output only lists of suspects, quiescent reliable communication requires ¢¤£ [ACT97a], which is not implementable. Combined with the results of this paper, this shows that traditional failure detectors that output only lists of suspects have fundamental limitations. 1 Motivation This paper introduces heartbeat, a failure detector that can be implemented without timeouts, and shows how it can be used to solve the problem of quiescent reliable communication in asynchronous message-passing systems with process crashes and lossy links. To illustrate this problem consider two processes, a sender ¥ and a receiver ¦ , connected by an asynchronous bidirectional link. Process ¥ wishes to send some message § to ¦ . Suppose first that no process may crash, but the link between ¥ and ¦ may lose messages (in both directions). If we put no restrictions on message losses it is obviously impossible to ensure that ¦ receives § . An assumption commonly made to circumvent this problem is that the link is fair: if a message is sent infinitely often then it is received infinitely often. With such a link, ¥ could repeatedly send copies of § forever, and ¦ is guaranteed to eventually receive § . This is impractical, since ¥ never stops sending messages. The obvious fix is the following protocol: (a) ¥ sends a copy of § repeatedly until it receives ¨©§ from ¦ , and (b) upon each receipt of § , ¦ sends ¨©§ back to ¥ . Note that this protocol is quiescent: eventually no process sends or receives messages. Research partially supported by NSF grant CCR-9402896, by ARPA/ONR grant N00014-96-1-1014, and by an Olin Fellowship. 1 The situation changes if, in addition to message losses, process crashes may also occur. The protocol above still works, but it is not quiescent anymore: for example, if ¦ crashes before sending ¨!©"#§ , then ¥ will send copies of § forever. Is there a quiescent protocol ensuring that if neither ¥ nor ¦ crashes then ¦ eventually receives § ? It turns out that the answer is no, even if one assumes that the link can only lose a finite number of messages. Since process crashes and message losses are common types of failures, this negative result is an obstacle to the design of fault-tolerant distributed systems. In this paper, we explore the use of unreliable failure detectors to circumvent this obstacle. Roughly speaking, unreliable failure detectors provide (possibly erroneous) hints on the operational status of processes. Each process can query a local failure detector module that provides some information about which processes have crashed. This information is typically given in the form of a list of suspects. In general, failure detectors can make mistakes: a process that has crashed is not necessarily suspected and a process may be suspected even though it has not crashed. Moreover, the local lists of suspects dynamically change and lists of different processes do not have to agree (or even eventually agree). Introduced in [CT96], the abstraction of unreliable failure detectors has been used to solve several important problems such as consensus, atomic broadcast, group membership, non-blocking atomic commitment, and leader election [BDM97, DFKM96, Gue95, LH94, SM95]. Our goal is to use unreliable failure detectors to achieve quiescence, but before we do so we must address the following important question. Note that any reasonable implementation of a failure detector in a message-passing system is itself not quiescent: A process being monitored by a failure detector must periodically send a message to indicate that it is still alive, and it must do so forever (if it stops sending messages it cannot be distinguished from a process that has crashed). Given that failure detectors are not quiescent, does it still make sense to use them as a tool to achieve quiescent applications (such as quiescent reliable broadcast, consensus, or group membership)? The answer is yes, for two reasons. First, a failure detector is intended to be a basic system service that is shared by many applications during the lifetime of the system, and so its cost is amortized over all these applications. Second, failure detection is a service that needs to be active forever — and so it is natural that it sends messages forever. In contrast, many applications (such as a single RPC call or the reliable broadcast of a single message) should not send messages forever, i.e., they should be quiescent. Thus, there is no conflict between the goal of achieving quiescent applications and the use of a (non-quiescent) failure detection service as a tool to achieve this goal. How can we use an unreliable failure detector to achieve quiescent reliable communication in the presence of process and link failures? Consider the Eventually Perfect failure detector $&% [CT96]. Intuitively, $&% satisfies the following two properties: (a) if a process crashes then there is a time after which it is permanently suspected, and (b) if a process does not crash then there is a time after which it is never suspected. Using $'% , the following obvious algorithm solves our sender/receiver example: (a) while ¥ has not received ¨©§ from ¦ , it periodically does the following: ¥ queries $&% and sends a copy of § to ¦ if ¦ is not currently suspected; (b) upon each receipt of § , ¦ sends ¨!©§ back to ¥ . Note that this algorithm is quiescent: eventually no process sends or receives messages. In [ACT97a], Aguilera et al. show that among all failure detectors that output lists of suspects, $&% is the weakest one that can be used to solve the above problem.1 Unfortunately, $'% is not implementable in asynchronous systems with process crashes (this would violate a known impossibility result [FLP85, CT96]). Thus, at a first glance, it seems that achieving quiescent reliable communication requires a failure detector that cannot be implemented. In this paper we show that this is not so. 1See [CHT96] for the concept of “weakest” for failure detectors. 2 2 The Heartbeat Failure Detector We will show that quiescent reliable communication can be achieved with a failure detector that can be implemented without timeouts in systems with process crashes and lossy links. This failure detector, called heartbeat and denoted (0) , is very simple. Roughly speaking, the failure detector module of (1) at a process 2 outputs a vector of counters, one for each neighbor 3 of 2 . If neighbor 3 does not crash, its counter increases with no bound. If 3 crashes, its counter eventually stops increasing. The basic idea behind an implementation of (0) is the obvious one: each process periodically sends an I-am-alive message (a “heartbeat”) and every process receiving a heartbeat increases the corresponding counter.2 Note that (1) does not use timeouts on the heartbeats of a process in order to determine whether this process has failed or not. (1) just counts the total number of heartbeats received from each process, and outputs these “raw” counters without any further processing or interpretation. Thus, (1) should not be confused with existing implementations of failure detectors (some of which, such as those in Ensemble and Phoenix, have modules that are also called heartbeat [vR97, Cha97]). Even though existing failure detectors are also based on the repeated sending of a heartbeat, they use timeouts on heartbeats in order to derive lists of processes considered to be up or down; applications can only see these lists. In contrast, (0) simply counts heartbeats, and shows these counts to applications. A remark is now in order regarding the practicality of (1) . As we mentioned above, (0) outputs a vector of unbounded counters. In practice, these unbounded counters are not a problem for the following reasons. First, they are in local memory and not in messages — our (1) implementations use bounded messages (which are actually quite short). Second, if we bound each local counter to 64 bits, and assume a rate of one heartbeat per nanosecond, which is orders of magnitude higher than currently used in practice, then (0) will work for more than 500 years. (0) can be used to solve the problem of quiescent reliable communication and it is implementable, but its counters are unbounded. Can we solve this problem using a failure detector that is both implementable and has bounded output? [ACT97a] proves that the answer is no: The weakest failure detector with bounded output that can be used to solve quiescent reliable communication is $54 . Thus, the difference between (0) , whose output is unbounded, and existing failure detectors, whose output is bounded, is more than “skin deep”. The results in this paper combined with those of [ACT97a], show that failure detectors with bounded output (including those that output lists of processes) are restricted in power and/or applicability. 3 Outline of the Results We focus on two types of reliable communication mechanisms: quasi reliable send/receive and reliable broadcast. Roughly speaking, a pair of send/receive primitives is quasi reliable if it satisfies the following property: if processes ¥ and ¦ are correct (i.e., they do not crash), then ¦ receives a message from ¥ exactly as many times as ¥ sent that message to ¦ . Reliable broadcast [HT94] ensures that if a correct process broadcasts a message § then all correct processes deliver § ; moreover, all correct processes deliver the same set of messages. We first show that there is no quiescent implementation of quasi reliable send/receive or of reliable broadcast in a network with process crashes and message losses. This holds even if we assume that links can lose only a finite number of messages. We then show how to use failure detectors to circumvent the above impossibility result. We describe failure 2As we will see, however, in some types of networks the actual implementation is not entirely trivial. 3 p sq t p q s r r (a) simple network case (b) general network case Figure 1: Examples of the simple and general network cases link is fair link is not fair detector (1) , and show that it is strong enough to achieve quiescent reliable communication, but weak enough to be implementable, in each one of the following two types of communication networks. In both types of networks, we assume that each correct process is connected to every other correct process through a fair path, i.e., a path containing only fair links and correct processes.3 In the first type, all links are bidirectional and fair (Fig. 1a). In the second one, some links are unidirectional, and some links have no restrictions on message losses, i.e., they are not fair (Fig. 1b). Examples of such networks are networks that contain several unidirectional rings that intersect. For each network type, we first describe quiescent protocols that use (0) to solve quasi reliable send/receive and reliable broadcast, and then show how to implement (1) . For the first type of networks, a common one in practice, the implementation of (1) and the reliable communication protocols are very simple and efficient. The algorithms for the second type are significantly more complex. We also briefly consider two stronger types of communication primitives, namely, reliable send and receive, and uniform reliable broadcast, and give quiescent implementations that use (0) . These implementations assume that a majority of processes are correct (a result in [BCBT96] shows that this assumption is necessary). We then explain how (0) can be used to easily transform many existing algorithms that tolerate process crashes into quiescent algorithms that tolerate both process crashes and message losses (fair links). This transformation can be applied to the algorithms for consensus in [Ben83, Rab83, BT85, CMS89, FM90, AT96, CT96], for atomic broadcast in [CT96], for  -set agreement in [Cha93], for atomic commitment in [Gue95], for approximate agreement in [DLP6 86], etc. Finally, we show that (1) can be used to extend the work in [BCBT96] to obtain the following result. Let 4 be a problem. Suppose 4 is correct-restricted (i.e., its specification refers only to the behavior of correct processes) or a majority of processes are correct. If 4 is solvable with a quiescent protocol that tolerates only process crashes, then 4 is also solvable with a quiescent protocol that tolerates process crashes and message losses.4 To summarize, the main contributions of this paper are: 3This assumption precludes permanent network partitioning. 4The link failure model in [BCBT96] is slightly different from the one used here (cf. Section 11). 4 1. This is the first work that explores the use of unreliable failure detectors to achieve quiescent reliable communication in the presence of process crashes and lossy links — a problem that cannot be solved without failure detection. 2. We describe a simple and implementable failure detector (0) that can be used to solve this problem. 3. (1) can be used to extend existing algorithms for many fundamental problems (e.g., consensus, atomic broadcast,  -set agreement, atomic commitment, approximate agreement) to tolerate message losses. It can also be used to extend the results of [BCBT96]. 4. (1) is novel: it is implementable without timeouts, and it does not output lists of suspects as typical failure detectors do [BDM97, CT96, Gue95, GLS95, LH94, SM95]. The results of this paper, combined with those in [ACT97a], show that lists of suspects is not always the best failure detector output.5 Reliable communication is a fundamental problem that has been extensively studied, especially in the context of data link protocols (see Chapter 22 of [Lyn96] for a compendium). Our work differs from previous results by focusing on the use of unreliable failure detectors to achieve quiescent reliable communication in the presence of process crashes and link failures. The work by Basu et al. in [BCBT96] is the closest to ours, but their protocols do not use failure detectors and are not quiescent. In Section 11, we use (1) to extend the results of [BCBT96] and obtain quiescent protocols. The paper is organized as follows. Our model is given in Section 4. Section 5 defines the reliable communication primitives that we focus on. In Section 6, we show that, without failure detectors, quiescent reliable communication is impossible. To overcome this problem, we define heartbeat failure detectors in Section 7, we show how to use them to achieve quiescent reliable communication in Section 8, and show how to implement them in Section 9. In Section 10, we consider two stronger types of communication primitives. In Section 11, we explain how to use heartbeat failure detectors to extend several previous results. In Section 12, we mention a generalization of our results for the case where the network may partition. A brief discussion of protocol quiescence versus protocol termination concludes the paper. 4 Model We consider asynchronous message-passing distributed systems in which there are no timing assumptions. In particular, we make no assumptions on the time it takes to deliver a message, or on relative process speeds. Processes can communicate with each other by sending messages through the network. We do not assume that the network is completely connected or that the links are bidirectional. The system can experience both process failures and link failures. Processes can fail by crashing, and links can fail by dropping messages. To simplify the presentation of our model, we assume the existence of a discrete global clock. This is merely a fictional device: the processes do not have access to it. We take the range 7 of the clock’s ticks to be the set of natural numbers. 4.1 Processes and Process Failures The system consists of a set of 8 processes, Π 9A@ 1BDCECECDBF8HG . Processes can fail by crashing, i.e., by prematurely halting. A failure pattern I is a function from 7 to 2Π, where IPQR denotes the set of processes that have crashed through time Q . Once a process crashes, it does not “recover”, i.e., STQ : IUQVXWYIUQa` 1 . We define 5The authors of [CHT96] anticipated this possibility: they put no restrictions on the output of unreliable failure detectors when they determine the weakest one necessary to solve consensus. 5 ibncedgI fehpaienq drIsift29vˆ u&beƒ„wyxc€‚cVieIPbE… rQRIs  and we beƒ„ccViebE… rIsa9 Π † say 2 is correct in bced„feh!ieq I. ‡I5 . If 2‰ˆ bDcVd„fehpieq rI5 we say 2 crashes (or is faulty) 4.2 Links and Link Failures Some pairs of processes in the network are connected through unidirectional links. If there is a link from process 2 to process of neighbors 3 o,fw2 eisdedneontoetetdhibsylin• ki–˜b—„yhd™R2ƒ„c ‘e2#3 , . and if, in addition, 3“9”’ 2 we say that 3 is a neighbor of 2 . The set With every link 2“f3 we associate two primitives: sendggh ig§ and receivei"h gp§ . We say that process 2 sends message § to process 3 if 2 invokes sendggh ig§ . We assume that if 2 is correct, it eventually returns from this invocation. We allow process 2 to send the same message § more than once through the same link. We say that process 3 receives message § from process 2 if 3 returns from the execution of receiveieh gd§ . We describe a link 2“f3 by the properties that its sendggh i and receivei"h g primitives satisfy. We assume that links do not create messages, i.e., every link 20j3 in the network satisfies: k Integrity: For all 0l 1, if 3 receives § from 2‚ times, then 2 previously sent § to 3 at least  times. A lossy link can fail by dropping messages. A link 2mn3 is fair if sendggh i and receiveieh g satisfy Integrity and: k Fairness: If 3 is correct and 2 sends § to 3 an infinite number of times, then 3 receives § from 2 an infinite number of times. 4.3 Network Connectivity A path e2 1 BDCECDCEB2po„ is fair if processes 2 1 BECDCECBq2ro are correct and links 2 1 f2 2, CDCEC , 2roEs 1 f2ro are fair. We assume that every pair of distinct correct processes is connected through a fair path. Without loss of generality, we can assume that this path is simple (i.e., no process appears twice in that path). 4.4 Failure Detectors Each process has access to a local failure detector module that provides (possibly incorrect) information about the failure pattern that occurs in an execution. A process can query its local failure detector module at any time. A failure detector history t with range u is a function from Π v‰7 to u . twe2HBxQV is the output value of the failure detector module of process 2 at time Q . A failure detector y is a function that maps each failure pattern I to a set of failure detector histories with range u‚z (where u{z denotes the range of failure detector outputs of y ). ymrIs denotes the set of possible failure detector histories permitted by y for the failure pattern I . We stress that the output of a failure detector depends only on the failure pattern I ; it cannot depend on the behavior of applications. This means that failure detectors can neither obtain feedback from applications nor be used by applications to transmit information in any manner. As an example, consider a Strong failure detector y [CT96]. Each failure detector module of y outputs a set of processes that are suspected to have crashed, i.e., u“z|9 2Π. y satisfies the following two properties: k Strong Completeness: Eventually every process that crashes is permanently suspected by every correct process. More precisely: SI}BxSt~ˆ{yrI5DBVQ€ˆ07{B‡S2ˆ bcedgfehpieq rIsB S#3sˆ beƒ„ccViebD… rI5DB‡STQx‚pƒ„Q : 2ˆtwr3BFQ‡‚q 6 k Weak Accuracy: Some correct process is never suspected. More precisely: SI}B‡S#t…ˆ0ymrIsBe20ˆ beƒ„ccViebE… rIsB STQ€ˆm7†BxS3Xˆ Π †‡IPQV : 2‰ˆmtwr3BxQV The class of all failure detectors that satisfy the above two properties is denoted Š . Let ‹ be a class of failure detectors. An algorithm solves a problem using ‹ if it can solve this problem using any yŒˆ‚‹ . An algorithm implements ‹ if it implements some yˆ“‹ . 5 Quiescent Reliable Communication In this paper, we focus on quasi reliable send and receive and reliable broadcast, because these communication primitives are sufficient to solve many problems (see Section 11.1). We also briefly consider stronger types of communication primitives — reliable send and receive, and uniform reliable broadcast — in Section 10. 5.1 Quasi Reliable Send and Receive Consider any two distinct processes ¥ and ¦ . We define quasi reliable send and receive from ¥ to ¦ in terms of two primitives, sendŽRh  and receiveh Ž , that must satisfy Integrity and the following property: k Quasi No Loss6: For all ‚l 1, if both ¥ and ¦ are correct and ¥ sends § to ¦ exactly  times, then ¦ receives § from ¥ at least  times. Note that Quasi No Loss together with Integrity implies that for all ‚l 0, if both ¥ and ¦ are correct and ¥ sends § to ¦ exactly  times, then ¦ receives § from ¥ exactly  times. We want to implement quasi reliable send/receive primitives using the (lossy) send/receive primitives that are provided by the network. In order to differentiate between these two, the first set of primitives is henceforth denoted by SEND/RECEIVE, and the second one, by send/receive. Informally, an implementation of SENDŽRh  and RECEIVEh Ž is quiescent if a finite number of invocations of SENDŽVh  cause only a finite number of invocations of sends throughout the network. 5.2 Reliable Broadcast Reliable broadcast [BT85] is defined in terms of two primitives: broadcast§ and deliver§ . We say that process 2 broadcasts message § if the following fields: the identity 2ofinitvsoskeensdberro, addencoatsetd§fDi . • Wq‘iDec assume that § , and a every broadcast message § sequence number, denoted ifDniec’ lu§de s. These fields make every message unique. We say that 3 delivers message § if 3 returns from the invocation of deliver§ . Primitives broadcast and deliver satisfy the following properties[HT94]: k Validity: If a correct process broadcasts a message § , then it eventually delivers § . k Agreement: If a correct process delivers a message § , then all correct processes eventually deliver § . k Uniform Integrity: For previously broadcast by fDeiv• erqyic message § . § , every process delivers § at most once, and only if § was 6A stronger property, called No Loss, is used in Section 10.1 to define reliable send and receive. 7 1 For every process 2 : 2 3 To execute broadcast§ : 4 5 deliver § for all 3Xˆ‰• iD–˜—gh“™Rƒ„c e2# do SENDggh i § 6 return 7 8 upon RECEIVEggh iE§ do 9 if 2 has not previously executed deliver§ then 10 11 deliver § for all 3Xˆ”• i–q—gh“™Rƒ„c •2 do SENDggh i§ Figure 2: Quiescent implementation of reliable broadcast using a quiescent implementation of SEND and RECEIVE primitives between neighbors We want to implement reliable broadcast using the (lossy) send and receive primitives that are provided by the network. Informally, an implementation of reliable broadcast is quiescent if a finite number of invocations of broadcast cause only a finite number of invocations of sends throughout the network. 5.3 Relation between Reliable Broadcast and Quasi Reliable Send and Receive From a quiescent implementation of quasi reliable send and receive one can easily obtain a quiescent implementation of reliable broadcast, and vice versa. Remark 1 From any quiescent implementation of reliable broadcast, we can obtain a quiescent implementation of the quasi reliable primitives SENDggh i and RECEIVEi"h g for every pair of processes 2 and 3 . The implementation is trivial: to SEND a using the given quiescent implementation message § of reliable btoroa3 d, c2 assitm, wplhyerberofDaid• cqasiDtc sr–˜the™m9šes2 saagned– fDie’9—r–˜§w™Bq2H9›Be3 BV,Ta sequence number that 2 has not used before. Upon the delivery of §”B2œBe3Be“ , a process ¦ RECEIVEs § from 2 if ¦™93 , and discards § otherwise. This implementation of SENDggh i and RECEIVEi"h g is clearly correct and quiescent. Remark 2 Suppose that every pair of correct processes is connected through a path of correct processes. If we have a quiescent implementation of quasi reliable primitives SENDggh i and RECEIVEieh g for all processes 2 and 3Xˆ neighbor e2# , then we can obtain a quiescent implementation of reliable broadcast. The implementation of reliable broadcast, a simple flooding algorithm taken from [HT94], is given in Figure 2 (the code consisting of lines 9 and 10 is executed atomically7). It is clear that this implementation is quiescent. Indeed, for every message § , an invocation of broadcast§ can cause at most 8ž† 1 invocations of SEND per process. Moreover, since the implementation of SEND is quiescent, each invocation of SEND causes only a finite number of invocations of sends. Thus, a finite number of invocations of broadcast causes a finite number of invocations of sends. Ÿ Ÿ7A process executes a region of code atomically if at any time there is at most one thread of in this region. 8 6 Impossibility of Quiescent Reliable Communication Quiescent reliable communication cannot be achieved in a network with process crashes and message losses. This holds even if the network is completely connected, only a finite number of messages can be lost, and processes have access to a Strong failure detector. Theorem 1 Consider a network where every pair of processes is connected by a fair link and at most one process may crash. Let ¥ and ¦ be any two distinct processes. There is no quiescent implementation of quasi reliable send and receive from ¥ to ¦ . This holds even if we assume that only a finite number of messages can be lost, and the implementation can use Š . Proof (Sketch). Assume, by contradiction, that there exists a quiescent implementation   of quasi reliable SENDŽRh  and RECEIVEh Ž using Š . We now construct three runs of   , namely, ¡ 0, ¡ 1 and ¡ 2, in which only ¥ may SEND a message – to ¦ and no other process invokes any SEND. In run ¡ 0, ¥ SENDs no messages, all processes are correct, all messages are received one time unit after they are sent, and the failure detector behaves perfectly (i.e., no process suspects any other process). Since   is quiescent, there is a time Q 0 after which no messages are sent or received. By the Integrity property of SEND and RECEIVE, process ¦ never RECEIVEs any message. Run ¡ 1 is identical to run ¡ 0 up to time Q 0; at time Q 0 ` 1, ¥ SENDs – to ¦ , and ¦ crashes; after time Q 0 ` 1, no processes crash, and all messages are received one time unit after they are sent; at all times, the failure detector behaves perfectly (i.e., ¦ is suspected by all processes from time Q 0 ` 1 on, and there are no other suspicions). Since   is quiescent, there is a time Q 1 ƒ„Q 0 after which no messages are sent or received. In run ¡ 2, ¦ and its failure detector module behave exactly as in run ¡ 0 (in particular, ¦ does not crash and ¦ receives a message § in ¡ 2 whenever it receives § in ¡ 0); all other processes and their failure detector modules behave exactly as in run ¡ 1 (in particular, a process 2‡9¢’ ¦ receives a message § in ¡ 2 whenever it receives § in ¡ 1). Note that, in ¡ 2, if messages are sent to or from ¦ after time Q 0, then they are never received. We now show that in ¡ 2 the send and receive primitives satisfy the Integrity property. Assume that for some ”l 1, some process 3 receives § from some process 2ž times. There are several cases. (1) If 3“9Y¦ then ¦ receives § from 2ž times in ¡ 0 (since ¦ behaves in the same way in ¡ 0 and ¡ 2). In ¡ 0, by the Integrity property of send and receive, 2 sent § to ¦ at least  times. This happens by time Q 0, since there are no sends in ¡ 0 after time Q 0. Note that by time Q 0, 2 behaves exactly in the same way in ¡ 0 Be¡ 1 and ¡ 2. Thus 2 sent § to ¦ at least  times by time Q 0 in ¡ 2. (2) If 319v’ ¦ and 2ž9v¦ , then 3 receives § from ¦X times in ¡ 1 (since 3 behaves in the same way in ¡ 1 and ¡ 2). In ¡ 1, by the Integrity property of send and receive, ¦ sent § to 3 at least  times. This happens by time Q 0, since ¦ crashes at time Q 0 ` 1 in ¡ 1. By time Q 0, ¦ behaves exactly in the same way in ¡ 0 BV¡ 1 and ¡ 2. Thus ¦ sent § to 3 at least  times by time Q 0 in ¡ 2. (3) If 3“9v’ ¦ and 2£9¤’ ¦ , then 3 receives § from 21 times in ¡ 1 (since 3 behaves in the same way in ¡ 1 and ¡ 2). By the Integrity property of send and receive in ¡ 1, 2 sent § to 3 at least  times. Note that 2 behaves exactly in the same way in ¡ 1 and ¡ 2. Thus 2 sent § to 3 at least  times in ¡ 2. Therefore, the send and receive primitives in ¡ 2 satisfy the Integrity property. We next show that in ¡ 2 the send and receive primitives satisfy the Fairness property, and in fact only a finite number of messages are lost. Note that ¦ sends only a finite number of messages in ¡ 0 (since it does not send messages after time Q 0), and every process 2„9¢’ ¦ sends only a finite number of messages in ¡ 1 (since it does not send messages after time Q 1). So, by construction of ¡ 2, all processes send only a finite number of messages in ¡ 2. Therefore, only a finite number of messages are lost, and the send and receive primitives satisfy the Fairness property. Finally, we show that in ¡ 2 the failure detector satisfies the properties of a Strong failure detector. Indeed, there are no crashes and therefore Strong Completeness holds vacuously; also there exists a process, namely ¥ , which 9 is never suspected by any process, and so Weak Accuracy holds. We conclude that ¡ 2 is a possible run of   using Š in a network with fair links that lose only a finite number of messages. Note that in ¡ 2: (a) both ¥ and ¦ are correct; (b) ¥ SENDs – to ¦ ; and (c) ¦ does not RECEIVE – . This violates the Quasi No Loss property of SENDŽRh  and RECEIVEh Ž , and so   is not an implementation of SENDŽRh  and RECEIVEh Ž — a contradiction. ¥ Theorem 1 and Remark 1 immediately imply: Corollary 2 There is no quiescent implementation of reliable broadcast, even if the implementation can use Š . To overcome these impossibility results, we now introduce the heartbeat failure detector. 7 Definition of ¦š§ A heartbeat failure detector y has the following features. The output of y at each process 2 is a list e2 1BF8 1DBEe2 2BF8 2BECDCECBEe2roBx8#o¨ , where 2 1B2 2 BDCECDCEB2po are the neighbors of 2 , and each 8p© is a nonnegative integer. Intuitively, 8 © increases while 2 © has not crashed, and stops increasing if 2 © crashes. We say that 8 © is the heartbeat value of 2 © at 2 . The output of y at 2 at time Q , namely twe2HBxQV , will be regarded as a vector indexed by the set @F2 1 B2 2 BDCECDCEB2po‘G . Thus, twe2HBxQVDª 2 ©V« is 8 © . The heartbeat sequence of 2 © at 2 is the sequence of the heartbeat values of 2 © at 2 as time increases. y satisfies the following properties: k (1) -Completeness: At each correct process, the heartbeat sequence of every faulty neighbor is bounded. Formally: SI}B‡St~ˆ‚ymrIsBxS‘2mˆ beƒ„ccViebE… rI5DB‡S3sˆ bced„feh!ieq rIs#¬ž• i–q—gh“™Rƒ„c •2DBV“­~ˆ1®}B‡S¯Qsˆ17 : twe2HBxQVDª°3 «± ­ k (1) -Accuracy: – At each process, the heartbeat sequence of every neighbor is nondecreasing. Formally: SI}BxSt~ˆ“ymrIsBxS‘2mˆ ΠB S3Xˆ”• i–˜—„h“™Rƒgc e2#BxSTQ²ˆm7 : twe2HBFQRDª°3 «³± twe2HBxQ` 1Dª°3 « – At each correct process, the heartbeat sequence of every correct neighbor is unbounded. Formally: SI}B‡S#t…ˆ‚ymrIsBxS‘2mˆ beƒ„ccViebE… rIsB S3Xˆ beƒ„ccViebE… x´5“¬‰• i–˜—„h“™Rƒgc e2#B S­µˆ0®}BeQsˆ17 : twe2HBFQREª¶3 « ƒ¢­ The class of all heartbeat failure detectors is denoted (1) . By a slight abuse of notation, we sometimes use (1) to refer to an arbitrary member of that class. It is easy to generalize the definition of (1) so that the failure detector module at each process 2 outputs the heartbeat of every process in the system [ACT97b], rather than just the heartbeats of the neighbors of 2 , but we do not need this generality here. 10 8 Quiescent Reliable Communication Using ¦š§ The communication networks that we consider are not necessarily completely connected, but we assume that every pair of correct processes is connected through a fair path. We first consider a simple type of such networks, in which every link is assumed to be bidirectional8 and fair (Fig. 1a). This assumption, a common one in practice, allows us to give efficient and simple algorithms. We then drop this assumption and treat a more general type of networks, in which some links may be unidirectional and/or not fair (Fig. 1b). For both network types, we give quiescent reliable communication algorithms that use (1) . Our algorithms have the following feature: processes do not need to know the entire network topology or the number of processes in the system; they only need to know the identity of their neighbors. In our algorithms, y5g denotes the current output of the failure detector y at process 2 . 8.1 The Simple Network Case Wof¦ r,foepqmaruosatschsueiimssrseSel¥ iEtahfibNarltseDatS.lflToElraiNnkskskDstrŽVhienh epteathaanestdknsReertenEwpdCoera¦„EktBxI§waVsreEeB nfDbdieh iŽ ’d¦„fiBFor,§”erwcttBhhifDieociench’ aa lrsuwaennh¦Psdeˆ”irfneai•trshiDe(e–˜Fq—gbih“igas™R.cƒ„a1kcagf)rrr.¥eosWuh(nsedsefie,eqrrFseutipeggen.iacv3teee).dnalTuqymousiSbeeesEnrc,dNeanDRnt·¹diam¸¯tmhpºaeleeBFns§”msiateB gnrfDeetieat§’ utironttnoos ¦ , where ·P¸pº is a tag. This send occurs heartbeat value of ¦ has increased. The task from ¦ . the first This time iatcrkencoewivleedsgRe·¹m¸¯enºatBF§”is BsfDeien’ t by . every time ¥ queries its repeat send terminates if ¦ every time it receives fR¥ a·Prileu¸¯crºteeBFivd§”eetBsefDacient’oa r.cmkPnoroodwcuellseesdag¦ nedRmEneonCttiEcye»²IVs¼aEt½¤hsaB §tfDieth’aet The code consisting of lines 7 and 8 is executed atomically, as well as the code consisting of lines 23 and 24. If there are several concurrent executions of the repeat send task (lines 11–18), then each execution must have its own private copy of all the local variables in this task, namely, ¦ , § , seq, hb and prev hb r. We now show that When ambiguities the algorithm of Fig. 3 is correct and quiescent. Note that may arise, a variable local to process 2 is subscripted by 2al,lev.agr.i,ah“b™leg siasrtehelolcoaclatloveaarciahbpleroh“ce™ sosf. process 2 . Lemma 3 (Integrity) For all ‚l 1, if ¦ RECEIVEs § from ¥} times, then ¥ SENT § to ¦ at least  times. PfbSFreioronformocoerefea¥.¦c¦ hfNRor¿0oEretcCeˆweaeEtci@ hvhI1aVeBEt¿”sE,CDCEfsˆÀRoCD·¹§¾rBe@ ¯¸¯a1GnºtBD,ytCEBFiT§”CDmfaCÁ•seBBVk,spf •¦tGrh©e.oep nnBeflarytyohtmetRshreeEen¥ Ca.dInrEeTt¦„eIhBF gV§”irsdEiitBsfsyffe•e§ pnr© redonfprcctoeaavmrnnatyloou¥ nneoollsfyynfsbo•teehc1necfBEduoCDfirrCEarkCDisnenBtddf tl•piirdmneo uecerseiu1niic6tvghreoeat,fnhc¥Taeitanisv¦sveekornsecrdaecRstpe·PieoiRv¸¯an·¹etºtos¸¯sBFfeºa§”RSn·PBFdB§”E¸¯fN¦„•pºtB BFfDBF§”• §”fŽR©rBh o¨B f mf •t§•o© ©¥ ¦ .  . , and each such invocation forks only one task repeat send. Therefore, ¥ SENT § to ¦ at least  times. ¥ Let § ¿{l e9|t’9 f ¿ • 1 © be a BDCECECDBe message , we can and ¤l associate 1, and consider a a sequence number rfu• n© in which with the ¿ ¥ invokes SENDŽRh  § exactly  -th invocation of SENDŽRh g§ as be then tfh•¯eÃ}v9’ aluf e• © of the global variable seq after , since during each invocation line 7 is executed during the ¿ -th invocation.9 of SENDŽVh  the global variable seq is increased times. For follows: we Note that if by one, and it can never be decreased. Ÿ&ÄvÅ ÅaÄ|Ÿ Å¤Æ ÇeŸ¨È8In our model, this means that link is in the network if and only if link is in the network. In other words, neighbor if and only if ŸÉÆ neighborǘÅÈ . Ê Ë Ì9If crashes during the -th invocation before executing line 7, we let sn be equal to one plus the value of the global variable seq at the time of the invocation. 11 1 For process ¥ : 2 3 4 InitiafDliiez’ÉatÍ ion0: @ seq is the current sequence number G 5 6 7 8 To exfDeiec’ÉutÍÎe SfDEie’N` DŽV1h g§ : fork task repeat send¦„BF§”B fDie’  9 return 10 11 12 task ÏrecepieÐ ath“s™ enc5dÍ ¦„BF§”B †1 fDie’  : 13 14 15 16 17 18 ruenpteilaih“rfte™&ÏpcÍceeeirsÏ iÐiveyÑcVoeindh“gЎVdi™ hc¨ŽRh“ayc™h ¨»²™llRÒÓy¼a·Pcs½¤hd¸¯ÍÎB™ºtfDª ¦BFieh“§”«’ ™ tª hB¦ ffD«erieon’m ¦ @ query the heartbeat failure detector G 19 20 For process ¦ : 21 22 23 24 upon receiveh Ž R·P¸pºaBx§wB fDie’  do isfetnhdisih Žsyt»ah¼²e ½¤firB sfDtiet’ im e ¦ receives R·P¸¯ºtBF§”B fDie’  from ¥ then RECEIVEh Ž § F¦¹igˆ‰u•reiD–˜3—g:h“™Rƒ„Sc imr¥p le network case — quiescent implementation of SENDŽVh  and RECEIVEh Ž using (0) for Lemma 4 (Quasi No Loss) For all ‚l 1, if both ¥ and ¦ are correct and ¥ SENDs § to ¦ exactly  times, then ¦ RECEIVEs § from ¥ at least  times. afbnPi¦ srs.eerosvsoWmtooehocmrefhi¥.aseseteeneSe¿‚vqndu¥eudˆwprweisypnn@iovyctt1s»²hioemeBEk¼ana,CDeeu½¤CEbsuCDmiyBntSBefibr¯cqE•eeoGu©NrcneseatDurtsimaovcsŽRdheoeh ¥¨iscs.ctsih§tfaBaiarotgoytenemdfV,t,oh·Pwt¦ re¥h¸¯itastIahºthtneemt¦BFnteh¿§”degRe-srstB sih¿Etfay-yC•t»agtih©pemE¼ario½ÕIenoiVf,vpsB Eioetfnthrc•sefteayo©v§tfreiookororfsnfmrnrseaooleycmftRnae·¹Sisdfi¥vkE¸¯iaetlNºarendersdBFDeps§”bcerŽVyteaehhBe„tc¦iÔ¨ave§nsfeeirtvsonhe.mdaVt,Sti·Ptm¥ioi¥ nt¸¯.nercdºtseeSei.vpdBFiaene§”NlnraclooteBrteehttf dei•ecrcale©eeycf c ih• t.vseh© eesiaW’vesnstqeeda¦yu»²rbceeRRe¼aon·¹dfEnc½¤oi¸¯cesCrBltºteunifE.n•dBFuIc§”Lem©Vt,etbEB thtfefhrsaf•roet•§©mri¦e©s troec¦ e. iTvehiss×t»aas¼ak½ÕwB if l•l be © referred to as task Ö from ¦ . Therefore, . Task Ö the loop never in lines terminates, because it can only do so if ¥ crashes or if ¥ 13–18 of task Ö is repeated infinitely often. Moreover, since ¦ is correct, the (1) -Accuracy property guarantees that the heartbeat sequence of ¦ at ¥ is nondecreasing and usR·Pnebn¸podºtusnBxR§wd·PeBd¸¯f,º²•pa©ÁBxn§w dftBrhof um•ds©t¥ htaeotc¦ loeanandstiitnoifionnncieitne—lniunamec1bo5enrterovafadtliiucmatiteoesns..(BinyttahsekFÖ a)irtnoetsrsuepraonpeinrtfiynoitfesneunmdbaenrdofreticmeeivse. ,T¦ hreerecfeoirvee¥ ¥s We now show that the implementation in Fig. 3 is quiescent. In order to do so, we focus on a single invocation of SEND and show that it causes only a finite number of invocations of sends. This immediately implies that a finite number of invocations of SENDs cause only a finite number of invocations of sends. So consider one 12 p¥R·Padrot¸piecºtsuBxln§waortB ifcn•rvaoschfar,ot imocna¥  u,soietfsiSinnvEvoNokcDeastŽRihosgne§sndo ,fhasŽgneyd»an¼alde½ÕŽRth ¨BsnRf ·P•b ¸pe. ºtthSBxe§wosB efq•rmueaniynceatalnssoukmcrebapueersaeatsisnsoevncoidcaat¦„etiBxdo§wnwsB iftoh•#f  . It is clear that, if .seWnhdenh Žgy¦ »²r¼aec½ÕeB if v•re s, and it sendŽVh i¨sV·Pcle¸¯aºtrBFt§”haB tf   • dooressennodt ch aŽguy»²se¼aa½ÕnB yf other •r . invocations of sends. Therefore, a send caused by   is either We next show that ¥ sends V·P¸¯ºaBF§”B f • to ¦ only a finite number of times, and that ¦ sends y»²¼a½¤B f • to ¥ only a finite number of times. This implies that   causes only a finite number of sends. Lemma 5 ¥ sends  MSGBF§”B sn to ¦ an infinite number of times if and only if ¦ sends  ACKB sn to ¥ an infinite number of times. Proof. If ¥ sends V·P¸¯ºtBF§”B f • to ¦ an infinite number of times, then ¥ is correct, and the condition in line 15 ei(0nvfi)anl-uiCtaeotemnsuptmolebtteerunreeosfasnt,i¦miniesfisnc.oiStrerinenccute.m¦ Tbsheeernnodfbsytiymt»ahee¼²sF.½¤aBTirf hn•pee rsetsofop¥rreoe,paetchrhteythiomefaesrteibtnerdeatcaseneidvqeureescnRec·Peiv¸poeºaf, Bx¦¦ §wraetB cf¥ e•#iisv,eu¦ snsbRe·Ponu¸pdnºtdseBxyd§w»a.¼²B Sf½¤•oB , f by an • to ¥ an infinite number of times. Conversely, if ¦ sends y»a¼a½ÕB f •p to ¥ an infinite number of times, then ¦ number of times, so, by the Integrity property of send and receive, ¥ sends V·Pre¸¯cºteBFiv§”eBsf R·P¸¯ºtBF§”B f •p an infinite • to ¦ an infinite number of times . ¥ Corollary 6 ¥ sends  MSGBF§”B sn to ¦ only a finite number of times. tPeyr»avureo¼²eno.½¤tfuT.B afhFl•leoyr rerftaeoocrce¥eo,niavttnheraesdinti×acfi»astnik¼aoit½Õnree,B pnsf euu•ramp tpbfsoreeosrnemdothf¦ ¦„a.tBFti§”mT¥ heB ssuf e•sbn, ydtehsvLeeRec·Pnmotu¸pnmaºtdalilBxty§wi5o.tneB rf Bim•rnyilnittnaohteee¦ s1Fa8aanni(ridonnfefissotnasisittpkersnroeeupnpmeedrbatsyet Rrso·Peofnf¸psdtºtiem¦„Bxn§wBxed§wsB .afB nT•rf d•³h e tr)noebc¦¦ eecsoivoenemlny,deas¥s finite number of times — a contradiction. ¥ Corollary 7 ¦ sends  ACKB sn to ¥ only a finite number of times. Proof. From Lemma 5 and Corollary 6. ¥ Lemma 8 The algorithm of Fig. 3 is quiescent. Proof. From Corollaries 6 and 7, and the remarks before Lemma 5. From Lemmata 3, 4, and 8 we have: ¥ Theorem 9 For the simple network case and any ¦¹ˆ neighbor ‡¥ , Fig. 3 is a quiescent implementation of quasi reliable SENDŽVh  and RECEIVEh Ž that uses (1) . From this theorem, and Remarks 1 and 2, we have: Corollary 10 In the simple network case, quasi reliable send and receive between every pair of processes and reliable broadcast can both be implemented with quiescent algorithms that use (0) . 13 8.2 The General Network Case In this case (Fig. 1b), some links may be unidirectional, e.g., the network may contain several unidirectional rings that intersect with each other. Moreover, some links may not be fair (and processes do not know which ones are fair). Achieving quiescent reliable communication in this type of network is significantly more complex than before. For instance, suppose that we seek a quiescent implementation of quasi reliable send and receive. In order for the sender ¥ to SEND a message § to the receiver ¦ , it has to use a diffusion mechanism, even if ¦ is a neighbor of ¥ (since link ¥Ø‘¦ may be unfair). Because of intermittent message losses, this diffusion mechanism needs to ensure that § is repeatedly sent over fair links. But when should this repeated send stop? One possibility is to use an acknowledgement mechanism. Unfortunately, the link in the reverse direction may not be fair (or may not even be part of the network), and so the acknowledgement itself has to be “reliably” diffused — a chicken and egg problem. Figure 4 shows a quiescent implementation of reliable broadcast (by Remark 1 it can be used to obtain quasi 2rtmthoerlaaiu@Ftia2Ùnnb3 tsG laheiiannansssedthndafedeovlbraaikvarnsiecdarktbeargldesrecko§ e—‘uq„iƒ¨nv.– …dÚ}eIg .nÛpbª §IfDeonitr«wdtch§eeoiresn nt;ttoaafieisbvnnkreai,onrl2lyagydppa2ceaasrirseireotttoduaofirfcmnppasrelrolfosycrscoeaecsmgshseseeetsch§ s)ke..s,IiF2 ninfovt,fiurofrieocstaiartvctdseihooelymnlmi,voeeaefsnrpsbseraoirg§ gocehae;b§sdtoshcret3ahn3†sias2ttˆ’ ii§nisn—‘bi—ƒ¨ t.rƒ¨…ioag …Tlagª ih§dzª §ece« as,t«astvitshf,ake2erihq„aahce– bÚ}ahalsreÛpptefDbr—‘vioeƒ¨iacd…§tegesonª s§fca2 e3 t« at to 2 has those increased, and if who are already sino,—‘2 ƒ¨…sg eª n§ d« s.1a0 message The task containing terminates § whteonaallllnneeigighhbboorsrswohfo2 searheeacrotnbteaaitneindcirnea—‘sƒ¨ed… g — ª§ even «. aAdaÏ nped¨ldlpl…ihevmaaeiprerspesasdeatanm§gsdeesoqsasuintstesdeone,nnclicftefebtnooiyofntÏt,pÏhd¨ired¨t…oh…dach .ele.glsiFosverieinsrt.ahslml§Uy,paa2ornendfootffhrowetrhkaresredftcosaerstmikhpetq„–o§”nÚ}feÛpBws—‘fDuiƒ¨cm… h§eÜPsa s.fRam— TgeBxehÏsesd¨na§”…gh2 eB ,a—‘wdpƒ¨dhr… oseÜPcrteehfRseg—socBF2 otÏ nd¨mfit…ersh sng ttsitcsohoaefacsl—‘lkeƒ¨sti… tosiÜPffnpifRetr— iohgtcahoesbs—‘oasƒ¨relsr…seg taaªh§dnaydt« The code consisting of lines 18 through 26 is executed atomically. Each concurrent execution of the diffuse task (lines 9 to 16) has its own private copy of all the local variables in this task, namely § , hb, and prev hb. We now show that this implementation is correct and quiescent. Lemma 11 (Uniform Integrity) For every message § , every process delivers message § at most once, and only if § was previously broadcast by sender§ . Proof (Sketch). Let § be a message and 2 be a process. Line 19 guarantees that 2 delivers § at most once. Now suppose process and clearly 2ž9 2fDi d• eqliivc er§s § . . It can do that either In the second case, 2 in line 4 or must have line 20. In the first case, 2 previously broadcast § received a message of the form §wBVÔ!BeÔ¨ . By the Integrity property of send and receive, a shows that this can only happen if § was message of the form previously broadcast b§”y BefDÔ!i BV• Ô¨q‘ iDwc a§s p .reviously sent. An easy induction ¥ Lemma 12 (Validity) If a correct process broadcasts a message § , then it eventually delivers § . Proof. If a correct process broadcasts § then it eventually executes line 4 and delivers § . ¥ iWnietianleixzat tsiohnowofth—‘aƒ¨t… g every process in ª § « takes place. —ƒ¨… g ª § We do « delivered not assume —‘§ ƒ¨…. g But first, we should be concerned about when the ª § « is initialized to the empty set at start-up. Doing so would be impractical since the set of all possible messages § that can ever be broadcast can be infinite. Instead, Ÿ Ý„Þ ß&à10It may appear that does not need to send this message to processes in got , since they already got it! The reader should verify that this “optimization” would make the algorithm fail. 14 1 For every process 2 : 2 3 To execute broadcast§ : 4 5 d—‘eƒ¨…liªv§ e«rÍ § @F2³G 6 fork task diffuse§ 7 return 8 9 10 task diffuse§ : for all 3Xˆ‰• iD–˜—gh“™Rƒ„c e2# do Ï cViDÐ h“™ ª°3 « Í †1 11 12 13 14 15 16 urenpteilaih“f•t™&fiDpo͖˜e—grrfÏh“soiyÑcVoo™Rriƒ„dmgÐac ieclel2#h“a3X3X™€ltlˆ”yÍáWˆ‰• •—‘h“iƒ¨i–˜™ –˜…—„—„ª h“§ hd™R™R«ƒgƒ„c c e2# , 3¹ˆ’ —‘ƒ¨… ª § e2# such that Ï « cVaiDÐndhdϙ ceiÐ ª°3 « Òvh“h“™ @ ª¶™ 3 qª¶«3 u«Òvedryoh“™tshª°e3e«nhtdehggaehrigntb§”eaB t—‘fƒ¨a…ilª §ure« B2#de tector G 17 18 upon receiveg„h ig§wB —ƒ„… ÜPfR— BFÏ d¨…h  do 19 if 2 has not previously executed deliver§ then 20 21 d—eƒ„…liªv§ e«rÍ § @F2ÙG 22 23 24 25 26 fϗoƒ¨d¨r……hUªa§ fslÍloe« r3 nÍjkÏsdud¨tgga—c…h ighPhsƒ¨k… §”ãtªh2§daBitf«d—‘f3Øu⃨s… ˆ”e—‘ª § ƒ¨•§…« iDBFÜPϖ˜—gd¨fRh“…— ™Rh ƒ„ c e2# and 3 appears at most once in Ï d„…h do Figure 4: General network case — quiescent implementation of broadcast and deliver using (1) each process 2 initializes —‘ƒ¨… g ª § l—‘inƒ¨e… g 21). ª§ «, This guarantees that but one should always —‘«ƒ¨…eg itª h§ e«r when it broadcasts § is never used before (see line 5), or when it first “hears” about § (see it is initialized. We next establish invariants for keep in mind that these invariants only hold after initialization has occurred. Lemma 13 For any processes 2 and 3 , (1) if at some time Q , 3Uˆ gotg ª § « then 3Uˆ gotg ª § « at every time Q ‚ l£Q ; (2) When gotg ª § « is initialized, 2ˆ gotg ª § « ; (3) if 3Xˆ gotg ª § « then 3 delivered § . Proof (Sketch). (1) and (2) are clear from the algorithm and (3) follows from the Integrity property of send and receive. ¥ Lemma 14 For every § and path, there is a finite number of distinct messages of the form §wBVÔ!B path . Proof. Any message of the form §”BeÔ!BxÏ d¨…h  is equal to §wBxäpBFÏ d¨…h  for some ä‚W Π, where Π is a finite set. ¥ Lemma 15 Suppose link 2„ 3 is fair, and 2 and 3 are correct processes. If 2 delivers a message § , then 3 eventually delivers § . Proof. it forks Stauspkpoq„s– Ú}e,ÛpbfDyi contradiction, that § . Since 3 does 2 delivers § not deliver § and 3 , by never delivers § Lemma 13 part . Since 2 delivers § (3) 3 never belongs toa—‘nƒ¨d… it gª § is « correct, . Since 15 2 is correct, this implies that 2 executes the loop in lines 11–16 an infinite number of times. Since 3 is a correct neighbor of 2 , the (1) -Accuracy property guarantees that the heartbeat sequence of 3 at 2 is nondecreasing and unbounded. Thus, the condition in line 13 evaluates to true an infinite number of times. Therefore, 2 executes line 14 an times. By infinite Lemma 1n4u,mthbeerreoefxtiismtseas,suanbdsesto—¨2å sends a message of the form §wBVÔ!B2# to 3 an infinite number of W Π such that 2 sends message §”BFä 0 Bq2# infinitely often to 3 . So, by the Fairness property of send and receive, 3 eventually receives §”BFä 0 Bq2# . Therefore, 3 delivers § . This contradicts the assumption that 3 does not deliver § . ¥ Lemma 16 (Agreement) If a correct process delivers a message § , then all correct processes eventually deliver §. Proof. Suppose that some correct process 2 delivers § . For every correct process 3 , there is a simple fair path e2 1 BECDCECDB2rog from 2 to 3 with 2 1 9æ2 and 2ro†9¾3 . By successive applications of Lemma 15, we conclude that 2 2 Bq2 3 BECDCECEBq2ro eventually deliver § . Therefore 3Ñ9w2o eventually delivers § . ¥ We now show that the implementation of Fig. 4 is quiescent. In order to do so, we focus on a single invocation of broadcast and show that it causes only a finite number of invocations of sends. This implies that a finite number of invocations of broadcast cause only a finite number of invocations of sends. Let § be a message and consider an invocation of broadcast§ . This invocation can only cause the sending of messages of form §wBVÔ!BeÔ¨ . Thus, all we need to show is that every process eventually stops sending messages of this form. Lemma 17 Let 2 be a correct process and 3 be a correct neighbor of 2 . If 2 forks task diffuse § , then eventually condition 3Uˆ gotg ª § « holds forever. Proof. By Lemma contradiction, that 3 13 part (1), we never belongs to —‘oƒ¨n… lgyª and 2 o 4 à 9ée2 çÑ9è3 1 B2 2 . Let BDCECECDB2 à e2 . o ç×Bq2 o ç Note t6 h1aBDtCEfCDoCErB12po„±  b eÒ § need to show that eventually 3 belongs to « . Let e2 1B2 2BDCECECDB2roeç˜ be a simple fair path a simple fair path from 3 to 2 with 2#o‰9¾2  , process 2 à 6 1 appears at most once in 4 à . —‘ƒ¨… g ª § from 2 . For «. to 1 3 ± Suppose, by w ithÒ 2 1 9w2  , let We claim that for each ¿™9 1BECDCECEBVՆ 1, there is a set ä © containing @F2 1 Bq2 2 BECDCECBq2 © G such that 2 © sends §”BFä © Be4 ©  mrftoeacec2 tse©sti6ahvga1eeta,3inm2 nimenavdfieedndrsiitabettheenlelouyncmogimbnsetpterolniote—sfsƒ„to… tigfhmªa䁧 etoEs2«s..ož1Fto9šor i¿‚2ts9éevva¹erina† tbula1el,ly—thƒ¨ir…seg ccª § leaii«vm.eSstiong§wceetBxhä‘äeoEoErss with the Fairness property of send and 1 BV4³oEs 1 . Upon the receipt of such a 1 contains 2roeça9š3 , this contradicts the We show the claim by induction on ¿ . For the base case note that, since 3 never belongs to —‘ƒ¨… g ª § « and 3 is a neighbor of 2 1 9v2 , then 2 1 executes the loop in lines 11–16 an infinite number of times. Since 3 is a correct neighbor of 2 1, the (0) -Accuracy property guarantees that the heartbeat sequence of 3 at 2 1 is nondecreasing and unbounded. Thus, the condition in line 13 evaluates to true an infinite number of times. So 2 1 executes line 14 infinitely often. Since 2 2 is a correct neighbor of 2 1, its heartbeat sequence is nondecreasing and unbounded, and so 2 1 sends a message of the form §wBVÔ!B2 1 to 2 2 an infinite number of times. By Lemma 14, there is some ä 1 such that 2 1 sends §”BFä 1 B2 1 to 2 2 an infinite number of times. Note that Lemma 13 parts (1) and (2) implies that 2 1 ˆ“ä 1. This shows the base case. 2msFtfhro©ooaem6 rmts1setä ha2r© äege6© ©eci1anecincdsoiovuinnnect§”fiattsaiinoBFniinäpn§witnseBVsg4Bx@Fntäe2©u@F©p61m2 Be,B114 2bs B©ue22 fprBDo2,pCEorBDiCEoftCEsCDsCEtosBeiC"me2mBtn©2hee6 da©sä1t.sG G .fStaohaiBrmanntyc¿deecst2 Òoh2s© nae©6 g™t6 aF1e2i†asnoiiessrf1nnab,tedh2onsests©ehifpsgo§wä rehro© mnbBxpaäoden©rsr6d§”toy1f§w2 BeBeo2©Ô!4 fBx6© BV©ä6 1s46© .1e©1Bea6nB4 n1d©tydo aLta2tnoope© dmp262 er©2©mea66 arca21nse.1aaiiInvn4ttfie,miin,nstohifi2 etseen©atr6iesonty1enuecmrtxnoeeiubscimsteneesrbei4vä oet©e©rfh66 stao11itfm,§”eWteeiaamsBFccΠ.äheh© ssstBe,iuu4 mfcco© ¥ hher 16 Corollary 18 If a correct process 2 forks task diffuse§ , then eventually 2 stops sending messages in task diffuse§ . Proof. For every neighbor 3 of 2 , there are two cases. If 3 is correct then eventually condition 3Xˆ“ädêDQgdª § « holds oafofftree3 rvawetrh2 bicyishLbtehomeungmduaeadr1d,7a.innIdflisn3 oeise1v3faeiunslttuayal,wltlhyayecnsoftnahdlesite(0i.o)nH-eÏ CncVociDemÐ , p2 hdle™ egtveª¶en3 n«etsul saplh“lry™ogdpsª°te3or« ptyhsogsluedasnrdfaoinnrtegevemesre.thsTsaathgteherseefihonereata,rsttbkheeq„art–eÚ}siÛpesqfDaui et§inmc ee. ¥ Lemma 19 If some process sends a message of the form §”BeÔ!B path , then no process appears more than twice in path. Proof (Sketch). By line 25 of the algorithm, a process sends a message §wBxäpBFÏ d¨…h  to a process 3 only if 3 appears at most once in path. The result follows by an easy induction that uses this fact and the Integrity property of send and receive. ¥ Lemma 20 (Quiescence) Eventually every process stops sending messages of the form §wBVÔ!BeÔ¨ . Proof. Suppose for a contradiction that some process 2 never stops sending messages of the form §wBVÔ!BeÔ¨ . fipNBrnyooitctLeeeetsshmseat2mto2 iafnmvv1oa4ulk,suetfeosbsr.esseTconohmrdereggreehciäfto.§wWrBe,BxyΠäpfoLBF,Ïr2ed¨mss…ohmemnaead1nÏs9d¨ia,n…nthfih,nien2 ittfiehsnienritdnuemdcnosbumeamrnpbooienfnrfietionnmftiteomesf.neasusmmagbeesessraog§”femoBFä¯feBFsthÏsaed¨g…feh osr.mofSot§”h,efBeofÔ!orBVrsÔ¨mo mra§”engpBeerÔ!osBFcÏoevd¨se…shr3  a . , fmTrncaoeanuherncdmemsersolbeieacvetg§wcaereÏursesorBVd¨fÔ!a…tniwnhBFiemnÏ ioi9¾vtd¨ethsoc…sehacsetr2‚qsaaaeigts1nisakeBEo.nCDlnioq„CEFnsif–nCEieÚ}roBqfits2rh1Ûpfnteog4,fDistiieenf,foonw§Ïnrrdmud¨higgm…en.hh rigbel§wSiie§”snirnBVeeocBFÔ!ml2äpefBF6ÏpBxt21i.Ï td¨my.d¨F…ohe…Cwonhs‚ rlo.eytrwBhiooimecnylhlcvmeasturoehreeryckesdoeÏ1Iinsaan8d¨tdtt…seeslh celgihy‚nanroesi9…rdwteey2,asi6ps•nc2 tu.rhho1ptaaEBDapptsCEeaocktCDrcshotCEheyenq„Br2pt–tsoÚ}erhuoEafaÛpsicdstshfD1ieÏcia tind¨.int§o…divhSmnoaocecnsoo2aidanrntficrsrioteeeinenscrctalseeciwnsiaiovvheenfeeinca2o,sdht6tchal,c2weeutmarhisrsetethetoonisospnesnsamlaayengnspepwetiiynnrnohofififdpecnnnaietiinhsttt2hgeees sneumndbser§wofBxsä e‚ BFnÏ dd¨s…h of ‚ a message of this form to 2 . to 2 . Therefore, there exists By Lemma 14, for a correct process 2 ‚ some such ä‚W that Π, there is an sendg ç h g §”BFä ‚ BFinÏ fid¨n…hit‚ e taainnmdiencsfio, nrwriethecetnrepurmÏ od¨bc…eeh srësooeìfsit2³simëeomeì sap.ntByd.y2#rë eoEps e1aì tsiuncghththisatarsgeunmdegEníïîxtðyXh gDíñ†îRò 1 1ð more §wBxädë toimì BFeÏ sd¨…wh eë ocì o niscliundveoktheadt there exist an infinite number of is ärë ionì vWokΠed, number of ¥ From Lemmata 11, 12, 16, and 20 we have: Theorem 21 For the general network case, Fig. 4 is a quiescent implementation of reliable broadcast that uses (0) . From this theorem and Remark 1 we have: Corollary 22 In the general network case, quasi reliable send and receive between every pair of processes can be implemented with a quiescent algorithm that uses (1) . 9 Implementations of ¦Y§ We now give implementations of (1) for the two types of communication networks that we considered in the previous sections. These implementations do not use timeouts. 17 1 For every process 2 : 2 3 4 Initialization: for all 3Xˆ‰• iD–˜—gh“™Rƒ„c e2# do y g ª°3 « Í 0 5 6 cobegin 7 óïó Task 1: 8 9 repeafot rpearllio3Xdˆ‰ica• lily–˜—„hd™Rƒ„c e2# do sendggh i¨ô}õ!»}ör÷&øaõp»¯÷} 10 11 óñó Task 2: 12 13 uponyÑreg!ª°c3 e« iÍveggyÑh iggdª°ô}3 «õp`»}öp1÷'øaõ!»¯÷' do 14 coend Figure 5: Simple network case — implementation of (1) 9.1 The Simple Network Case We assume all links in the network are bidirectional and fair (Fig. 1a). In this case, the implementation is obvious. Every process 2 executes two concurrent tasks (Fig. 5). In the first one, 2 periodically sends message ô€õp»}öp÷'øaõp»¯÷ to all its neighbors. The second task handles the receipt of ô€õp»}öp÷'øaõp»¯÷ messages. Upon the receipt of such a message from process 3 , 2 increases the heartbeat value of 3 . We now prove that the implementation is correct. Lemma 23 ((1) -Completeness) At each correct process, the heartbeat sequence of every faulty neighbor is bounded. Proof. Obvious. ¥ Lemma 24 At each process 2 , the heartbeat sequence of every neighbor 3 is nondecreasing. Proof. This is clear since y5gpª°3 « can only be changed in line 13. ¥ Lemma 25 At each correct process 2 , the heartbeat sequence of every correct neighbor 3 is unbounded. Proof. Since 3“ˆ¤• i–q—gh“™Rƒ„c •2 and all links are bidirectional, we have 2|ˆÀ• iD–˜—gh“™Rƒ„c r3g . Moreover, since 3 is correct, its Task 1 executes forever. Therefore, 3 sends an infinite number of HEARTBEAT messages to 2 . By the Fairness property of send and receive, 2 receives an infinite number of HEARTBEAT messages from 3 . Every time 2 receives HEARTBEAT from 3 , it increments ysgpª¶3 « in line 13. So, 2 increments y5gpª°3 « an infinite number of times. Moreover, by Lemma 24, ysgpª¶3 « can never be decremented. So, the heartbeat sequence of 3 at 2 is unbounded. ¥ Corollary 26 ((1) -Accuracy) At each process the heartbeat sequence of every neighbor is nondecreasing, and at each correct process the heartbeat sequence of every correct neighbor is unbounded. Proof. From Lemmata 24 and 25. ¥ From Lemma 23 and the above corollary, we have: Theorem 27 For the simple network case, Fig. 5 implements (1) . 18 1 For every process 2 : 2 3 4 Initialization: for all 3Xˆ‰• iD–˜—gh“™Rƒ„c e2# do y g ª°3 « Í 0 5 6 cobegin 7 óïó Task 1: 8 9 repeafot rpearllio3Xdˆ‰ica• lily–˜—„hd™Rƒ„c e2# do sendggh i¨ô}õ!»}ör÷&øaõp»¯÷ùBq2# 10 11 12 13 14 15 16 17 óñó Tasukp2o:nffÏ oord¨err…ch¹aaeyÑsllÍllievgp33 neª°Ï3ssdgguu«d¨hggigcc…Íh h™ighhô}ãttô€yùõphh2 õp»}aagptt»}ª°öp33X3Xöp÷'« ÷'ˆwˆw`øaøaõ!••1õp»¯iDiD»¯÷5–˜–˜—g—g÷ÑBxh“h“Ï BF™R™RÏd¨ƒ„ƒ„…ccd¨h …ee2#2#h daaonndd 3 3 appears in Ï d¨…h do does not appear in Ï d„…h do 18 coend Figure 6: General network case — implementation of (0) 9.2 The General Network Case In this case some links may be unidirectional and/or not fair (Fig. 1b). The implementation is more complex than before because each HEARTBEAT has to be diffused, and this introduces the following problem: when a process 2 receives a HEARTBEAT message it has to relay it even if this is not the first time 2 receives such a message. This is because this message could be a new “heartbeat” from the originating process. But this could also be an “old” heartbeat that cycled around the network and came back, and 2 must avoid relaying such heartbeats. The implementation is given in Fig. 6. Every process 2 executes two concurrent tasks. In the first task, 2 mphô€eeeraõpsirost»}badöpgeicae÷'atsløalvoyõpafl»¯tsuh÷ÑeeesnBFfÏ dood¨sfr…mhaml letsô€iotssõpaag»}nlleeöpiitg÷'Rshô€øanbõpeõpoi»}r»¯gsöph÷Ñbt÷'BFhoÏ øaard¨tsõp…at»¯h hp÷É ap.teBU2#daropoitnnonoÏttahad¨llep…hpirt.eesacTrneheiinipegntphoa2bftohasr.psu.pchenTmdhseesisstsaeegcloefnftdrootmÏ ad„sp…kh rohacanendsdslfeo3 s,rw2 thaierndcrsreemcaeseeispsstatghoeef We now proceed to prove the correctness of the implementation. Lemma 28 At every process 2 , the heartbeat sequence of every neighbor 3 is nondecreasing. Proof. This is clear since y5gpª°3 « can only be changed in line 14. ¥ Lemma 29 At each correct process, the heartbeat sequence of every correct neighbor is unbounded. Proof. Let 2 be a correct process, and 3 be a correct neighbor of 2 . Let 4ú9e2 1 BDCECECDB2rog be a simple fair path from 3 to 2 with 2 1 9š3 and 2ro†9¢2 . For ¿19 1BDCECECDBe , let 4 © 9—e2 1BECECDCEBq2 ©  . For each ¿19 1BDCECDCEBe†† 1, we claim that 2 © sends ô}õp»}öp÷'øaõ!»¯÷5BV4 For the base case (¿‰9 1), note that 2 © 1 to 2 9Œ3 © i6s 1 an infinite number correct, so its Task of times. We show this by induction on ¿ . 1 executes forever and therefore 2 1 sends ô€õp»}öp÷'øaõp»¯÷ÑB2 1 to all its neighbors, and thus to 2 2, an infinite number of times. For the induction step, let 19 ¿ Ò †† 1 and assume that 2 2nscoue© 6rmnr2edbcdsetoraeô€onsfdõpnt»}itomhtöpeea÷'lspiøa.pneTõpka»¯h2 ri÷Ñ©sinBeûs4 h4 ©o©2 6w6 © 1s16  1tat©honiesds2 efc©2 al6na©ir2d6i,msi22 n.©isl6ô}ina1õpern»}e1ecöp7ige÷'.hiTvøabehoõ!esr»¯roRe÷5ô€ffoBV2 õp4r©e»}© 6 , öp12 ,t÷'© os6 øao21õpe© s»¯6aec÷É1nhBead4tnsim© i nô}eafi2õ!nn©»}ii6tnöre1fi÷&nnreøauitcmeõpe»¯bniv÷5euremBVso4bf©eRô€6rti1oõpm f»}ettosöpim.2÷'e© Søa6si.õpn2»¯Mcae÷Ñno2Beirne4© fio6© vn1,eitiries,t For ¿£9èm† 1 this claim shows that 2oEs 1 sends ô€õp»}öp÷'øaõp»¯÷ÉBe4ÙoEs 1 to 2ro an infinite number of times. tPimroecse.ssN2roo teisthcaotrr3mecˆvt a• niDd–˜—glih“n™Rkƒ„c 2oEs 1 e2o ‘2po (since is fair, 2ro‚9v2 so 2o ) and receives ô€õp»}öp÷'øaõp»¯÷ÑBe4³oEs 1 an infinite number of 31ˆÀ4³oEs 1 (since 2 1 9Œ3 ). So every time 2ro receives ô€õp»}öp÷'øaõp»¯÷ÑBe4³oEs that, by Lemma 28, 1 yÑg , î it ª°3 increments « can never y5g be î ª°3 « in line 14. decremented. So y5g î So, the ª°3 « is incremented an infinite number heartbeat sequence of 3 at 2ÙoÑ9w2 is of times. Note unbounded. ¥ Corollary 30 ((1) -Accuracy) At each process the heartbeat sequence of every neighbor is nondecreasing, and at each correct process the heartbeat sequence of every correct neighbor is unbounded. Proof. From Lemmata 28 and 29. ¥ Lemma 31 If some process 2 sends  HEARTBEATB path then (1) 2 is the last process in path and (2) no process appears twice in path. Proof (Sketch). This follows from lines 9, 15 and 16, and a simple induction that uses the Integrity property of send and receive. ¥ Lemma 32 Let  HEARTBEATB 2 ,3 path bã e3g processes, an infinite and path be a non-empty sequence of processes. If 2 receives message number of times, then 3 receives message  HEARTBEATB path an infinite number of times. Proof. Let – be the message ô}õ!»}ör÷&øaõp»¯÷5BxÏ d„…hmã 3„ and let – 0 be the message ô}õp»}öp÷'øaõ!»¯÷5BxÏ d¨…h  . Suppose 2 2 ‚ sends – receives – an infinite number of to 2 an infinite number of times. times. By the Integrity property of send By Lemma 31 part (1), we have 3Ñ9w2 ‚ . and receive, some Since the length of pÏ rd„o…chseãss3 is at least two, 3 can only send – in line 17. So 3 only sends – if it receives – 0. Therefore 3 receives – 0 an infinite number of times. ¥ Lemma 33 ((1) -Completeness) At each correct process, the heartbeat sequence of every faulty neighbor is bounded. Proof (Sketch). Let 2 be a correct process and let 3 be a faulty neighbor of 2 . Suppose that the heartbeat sequence of 3 at 2 is not bounded. Then 2 increments y5gpª°3 « an infinite number of times. So, for an infinite number of times, 2 receives messages of the form ô}õ!»}ör÷&øaõp»¯÷5BVÔ¨ with a second component that contains 3 . By Lemma 31 part (2), the second Thus there exists a Ï cd¨o…h mcpoonnteanintionfg a 3 message of such that 2 the form receives ô€ô}õpõ!»}»}öpör÷'÷&øaøaõpõp»¯»¯÷Ñ÷5BFBVÏ Ô¨d¨ …rh a nagnesinofivneirteanfiunmitbeesreotfotfimveaslu. es. arcLenoecdntetÏrriaevd¨dce…ihesctimvs9üeeths•aes2 na1fdgaBDecCEbtCEyCDtô}hLBõ!2raet»}om3 ör.mi÷&saTøafha3õpeu1»¯nlt÷5,pya.fBEroeIt2rf(1s¿1BEo)CE,ÒmCD3 Ce Bs2p¿ te©ÁhnR±edna,s nb,yiô€2 nr©õpfie»}np9~ieöptea÷'3tne.øaudõpmIaf»¯pb÷Ñ¿wperlBFi9ýÏcoad¨ft…tihiomtnhesteosno.2,fTLbahyneemrtiehnmfefioanrIein3t,te2eb,gynwruitmtehyecbpoIenrnrotceoplgfuerdrtititeymytephosraf.otsp2 Tee© hrn6tidys1 of send and receive and Lemma 31 part (1), 2T© sends ô}õp»}öp÷'øaõ!»¯÷ùBe2 of times. Since 2d©É9¤3 , this contradicts the fact that 3 is faulty. 1 BECDCECEBq2d©R to 2d© 6 1 an infinite number ¥ By Corollary 30 and the above lemma, we have: Theorem 34 For the general network case, Fig. 6 implements (1) . 20 1 For process ¥ : 2 3 4 InitiafDliiez’ÉatÍ ion0: @ seq is the current sequence number G 5 6 7 8 9 10 To exbfDþefDiercieo’Éu’ÑatÍÎe͑dcRfDafD-iesSie’ ’tE` §”N1DBþ fDŽVieh g’ Be§¥¨BF ¦:  wait until RECEIVEd y»²¼a½ÕBEþ fDie’  from Q#` 1 processes 11 return 12 13 For every process 2 : 14 15 upon deliver§wBEþ fDie’ BV¥‘Bx¦ do 16 SENDggh ŽEy»a¼a½ÕBRÿy¥¡ Á3g 17 if 209¢¦ then R-RECEIVEh ŽD§ Figure 7: Quiescent implementation of R-SENDŽRh  and R-RECEIVEh Ž for 8‡ƒ 2Q 10 Stronger Communication Primitives Quasi reliable send and receive and reliable broadcast are sufficient to solve many problems (see Section 11.1). However, stronger types of communication primitives, namely, reliable send and receive, and uniform reliable broadcast, are sometimes needed. We now give quiescent implementations of these primitives for systems with process crashes and message losses. Let Q be the number of processes that may crash. [BCBT96] shows that if QXlÓ8و 2 (i.e., half of the processes may crash) these primitives cannot be implemented, even if we assume that links may lose only a finite number of messages and we do not require that the implementation be quiescent. We now show that if Q Ò 8و 2 then there are quiescent implementations of these primitives for the two types of network considered in this paper. The implementations that we give here are simple and modular but highly inefficient. assume that MQ Òore8Ùeˆ f2fi.cient ones can be obtained by modifying the algorithms in Figures 3 and 4. Hereafter, we 10.1 Reliable Send and Receive If a process ¥ returns from the invocation of sendŽRh ¨§ we say that ¥ completes the sending of message § to ¦ . With quasi reliable send and receive, it is possible that ¥ completes the sending of § to ¦ , then ¥ crashes, and ¦ never receives § (even though it does not crash). In contrast, with reliable send and receive primitives, if ¥ completes the sending of message § to a correct process ¦ then ¦ eventually receives § (even if ¥ crashes). More precisely, reliable send and receive satisfy Integrity (Section 4.2) and: k No Loss: For all ‚l 1, if ¦ is correct and ¥ completes the sending of § to ¦ exactly  times, then ¦ receives § from ¥ at least  times.11 11The No Loss and Quasi No Loss properties are very similar to the Strong Validity and Validity properties in Section 6 of [HT94]. 21 1 For every process 2 : 2 3 To execute uniform-broadcast(§ ): 4 broadcast§ 5 return 6 7 upon deliver§ do 8 for all 3Xˆ Π do SENDggh iy»²¼a½¤BF§ 9 wait until RECEIVEd y»²¼a½ÕBx§ from Q#` 1 processes 10 uniform-deliver(§ ) Figure 8: Quiescent implementation of uniform reliable broadcast for 8”ƒ 2Q Reliable send and receive primitives are denoted R-SEND/R-RECEIVE. As before, SEND/RECEIVE denote the quasi reliable primitives. Figure 7 shows a quiescent implementation of R-SEND and R-RECEIVE (the code consisting of lines 7 and 8 is executed atomically). It uses reliable broadcast and SEND/RECEIVE between every pair of processes. We have already shown that these primitives have quiescent implementations using (1) for the two types of network in consideration. Roughly speaking, when ¥ wishes to R-SEND § to ¦ , it broadcasts a message that contains § , ¥ , ¦ and a fresh sequence number, and then waits to RECEIVE Q#` 1 acknowledgements for that message before returning from this invocation of R-SEND. When a process 2 delivers this broadcast message, it SENDs an acknowledgement back to ¥ , and if 2“9À¦ then it also R-RECEIVEs § from ¥ . The proof of correctness is straightforward and thus omitted. 10.2 Uniform Reliable Broadcast The Agreement property of reliable broadcast states that if a correct process delivers a message § , then all correct processes eventually deliver § . This requirement allows a faulty process (i.e., one that subsequently crashes) to deliver a message that is never delivered by the correct processes. This behavior is undesirable in some applications, such as atomic commitment in distributed databases [Gra78, Had86, BT93]. For such applications, a stronger version of reliable broadcast is more suitable, namely, uniform reliable broadcast which satisfies Uniform Integrity, Validity (Section 5.2) and: k Uniform Agreement [NT90]: If any process delivers a message § , then all correct processes eventually deliver § . Figure 8 shows a quiescent implementation of uniform reliable broadcast which uses reliable broadcast and SEND/RECEIVE between every pair of processes. The proof of correctness is straightforward and thus omitted. 11 Using ¦š§ to Extend Previous Work (0) can be used to extend previous work in order to solve problems with algorithms that are both quiescent and tolerant of process crashes and messages losses. 22 11.1 Extending Existing Algorithms to Tolerate Link Failures (0) can be used to transform many existing algorithms that tolerate process crashes into quiescent algorithms that tolerate both process crashes and message losses. For example, consider the randomized consensus algorithms of [Ben83, Rab83, CMS89, FM90], the failure-detector based ones of [CT96, AT96], the probabilistic one of [BT85], and the algorithms for atomic broadcast in [CT96],  -set agreement in [Cha93], atomic commitment in [Gue95], and approximate agreement in [DLP6 86]. These algorithms tolerate process crashes, and they use quasi reliable send and receive, and/or reliable broadcast, as their sole communication primitives. All of these algorithms can be made to tolerate both process crashes and message losses (with fair links) in two simple steps: (1) implement (1) as described in Section 9, and (2) plug in the quiescent communication primitives given in Section 8.12 The resulting algorithms tolerate message losses and are quiescent. 11.2 Extending Results of [BCBT96] Another way to solve problems with quiescent algorithms that tolerate both process crashes and message losses is obtained by extending the results of [BCBT96]. That work addresses the following question: given a problem that can be solved in a system where the only possible failures are process crashes, is the problem still solvable if links can also fail by losing messages? One of the models of lossy links considered in [BCBT96] is called fair lossy. Roughly speaking, a fair lossy link 20j3 satisfies the following property: If 2 sends an infinite number of messages to 3 and 3 is correct, then 3 receives an infinite number of messages from 2 . Fair lossy and fair links differ in a subtle way. For instance, if process 2 sends the sequence of distinct messages § 1 BF§ 2 BF§ 3 BECECDC to 3 and 2žá3 is fair lossy, then 3 is guaranteed to receive an infinite subsequence, whereas if 2žÎ3 is fair, 3 may receive nothing (because each distinct message is sent only once). On the other hand, if 2 sends the sequence § 1 BF§ 2 BF§ 1 BF§ 2 BECDCEC and 2” 3 is fair lossy, 3 may never receive a copy of § 2 (while it receives § 1 infinitely often), whereas if 2mn3 is fair, 3 is guaranteed to receive an infinite number of copies of both § 1 and § 2.13 [BCBT96] establishes the following result: any problem 4 that can be solved in systems with process crashes can also be solved in systems with process crashes and fair lossy links, provided 4 is correct-restricted14 or a majority of processes are correct. For each of these two cases, [BCBT96] shows how to transform any algorithm that solves 4 in a system with process crashes, into one that solves 4 in a system with process crashes and fair lossy links. The algorithms that result from these transformations, however, are not quiescent: each transformation requires processes to repeatedly send messages forever. Given (0) , we can modify the transformations in [BCBT96] to ensure that if the original algorithm is quiescent then so is the transformed one. Roughly speaking, the modification consists of (1) adding message acknowledgements; (2) suppressing the sending of a message from 2 to 3 if either (a) 2 has received an acknowledgement for that message from (3) modifying that are not in 3, t¢¥he£ or (b) the heartbeat of 3 has not  m£ e  a2nianrge aocfttuhaellyopaeprpateinodne“datpope¢¤n£d  i¢¤n£ c £ r2e  .a£ sT  e1hdetosrien¢¥scu£el  tt£sh ein2”l[aBssotCttBihmaTte9o62 n],lsyceotnhmtebaeinlmeemedseswnatigstehinttho¢¤e3 £ ; a£ nd    1 above modification, show that if a problem 4 can be solved with a quiescent algorithm in a system with crash failures only, and either 4 is correct-restricted or a majority of processes are correct, then 4 is solvable with a quiescent algorithm that uses (1) in a system with crash failures and fair lossy links. 12Similar steps can be applied to algorithms that use reliable send/receive or uniform reliable broadcast, provided a majority of processes are correct, by plugging in the implementations given in Section 10. 13In [BCBT96], message piggybacking is used to overcome message losses. To avoid this piggybacking, in this paper we adopted the model of fair links: message losses can now be overcome by separately sending each message repeatedly. 14Intuitively, a problem ¦ is correct-restricted if its specification does not refer to the behavior of faulty processes [Gop92, BN92]. 23 12 Generalization to Networks that Partition In this paper, we assumed that every pair of correct processes are reachable from each other through fair paths. In [ACT97b], we drop this assumption and consider the more general problem of quiescent reliable communication in networks that may partition. In particular, we (a) generalize the definitions of quasi reliable send and receive and of reliable broadcast, (b) generalize the definition of the heartbeat failure detector and implement it in networks that may partition, and (c) show that this failure detector can be used to achieve quiescent reliable communication in such networks. In [ACT97b] we also consider the problem of consensus for networks that may partition, and we use (1) to solve this problem with a quiescent protocol (we also use a generalization of the Eventually Strong failure detector [CT96]). 13 Quiescence versus Termination In this paper we considered communication protocols that tolerate process crashes and message losses, and focused on achieving quiescence. What about achieving termination? A terminating protocol guarantees that every process eventually reaches a halting state from which it cannot take further actions. A terminating protocol is obviously quiescent, but the converse is not necessarily true. For example, consider the protocol described at the beginning of Section 1. In this protocol, (a) ¥ sends a copy of § repeatedly until it receives ¨©§ from ¦ , and then it halts; and (b) upon each receipt of § , ¦ sends ¨©§ back to ¥ . In the absence of process crashes this protocol is quiescent. However, the protocol is not terminating because ¦ never halts: ¦ remains (forever) ready to reply to the receipt of a possible message from ¥ . Can we use (0) to obtain reliable communication protocols that are terminating? The answer is no, even for systems with no process crashes. This follows from the result in [KT88] which shows that in a system with message losses (fair links) and no process crashes there is no terminating protocol that guarantees knowledge gain. Acknowledgments We are grateful to Anindya Basu and Vassos Hadzilacos for having provided extensive comments that improved the presentation of this paper. We would also like to thank Tushar Deepak Chandra for suggesting the name Heartbeat. References [ACT97a] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. On the weakest failure detector to achieve quiescence. Manuscript, April 1997. [ACT97b] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Quiescent reliable communication and quiescent consensus in partitionable networks. Technical Report 97-1632, Department of Computer Science, Cornell University, June 1997. [AT96] Marcos Kawazoe Aguilera and Sam Toueg. Randomization and failure detection: a hybrid approach to solve consensus. In Proceedings of the 10th International Workshop on Distributed Algorithms, Lecture Notes on Computer Science, pages 29–39. Springer-Verlag, October 1996. 24 [BCBT96] Anindya Basu, Bernadette Charron-Bost, and Sam Toueg. Simulating reliable links with unreliable links in the presence of process crashes. In Proceedings of the 10th International Workshop on Dis- tributed Algorithms, Lecture Notes on Computer Science, pages 105–122. Springer-Verlag, October 1996. [BDM97] O¨ zalp Babaog˘lu, Renzo Davoli, and Alberto Montresor. Partitionable group membership: specification and algorithms. Technical Report UBLCS-97-1, Dept. of Computer Science, University of Bologna, Bologna, Italy, January 1997. [Ben83] Michael Ben-Or. Another advantage of free choice: Completely asynchronous agreement protocols. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing, pages 27–30, August 1983. [BN92] R. Bazzi and G. Neiger. Simulating crash failures with many faulty processors. In A. Segal and S. Zaks, editors, Proceedings of the 6th International Workshop on Distributed Algorithms, volume 647 of Lecture Notes on Computer Science, pages 166–184. Springer-Verlag, 1992. [BT85] [BT93] Gabriel Bracha and Sam Toueg. Asynchronous consensus and broadcast protocols. Journal of the ACM, 32(4):824–840, October 1985. O¨ zalp Babaog˘lu and Sam Toueg. Non-blocking atomic commitment. In Sape J. Mullender, editor, Distributed Systems, chapter 6. Addison-Wesley, 1993. [Cha93] Soma Chaudhuri. More choices allow more faults: Set consensus problems in totally asynchronous systems. Information and Computation, 105(1):132–158, July 1993. [Cha97] Tushar Deepak Chandra, April 1997. Private Communication. [CHT96] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, July 1996. [CMS89] Benny Chor, Michael Merritt, and David B. Shmoys. Simple constant-time consensus protocols in realistic failure models. Journal of the ACM, 36(3):591–614, 1989. [CT96] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996. [DFKM96] Danny Dolev, Roy Friedman, Idit Keidar, and Dahlia Malkhi. Failure detectors in omission failure environments. Technical Report 96-1608, Department of Computer Science, Cornell University, Ithaca, New York, 1996. [DLP6 86] Danny Dolev, Nancy A. Lynch, Shlomit S. Pinter, Eugene W. Stark, and William E. Weihl. Reaching approximate agreement in the presence of faults. Journal of the ACM, 33(3):499–516, July 1986. [FLP85] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985. [FM90] Paul Feldman and Silvio Micali. An optimal algorithm for synchronous Byzantine agreement. Technical Report MIT/LCS/TM-425, Laboratory for Computer Science, Massachusetts Institute of Technology, June 1990. [GLS95] Rachid Guerraoui, Michael Larrea, and Andre´ Schiper. Non blocking atomic commitment with an unreliable failure detector. In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, pages 13–15, 1995. 25 [Gop92] [Gra78] [Gue95] [Had86] [HT94] [KT88] [LH94] [Lyn96] [NT90] [Rab83] [SM95] [vR97] Ajei Gopal. Fault-Tolerant Broadcasts and Multicasts: The Problem of Inconsistency and Contamination. PhD thesis, Cornell University, January 1992. James N. Gray. Notes on database operating systems. In R. Bayer, R. M. Graham, and G. Seegmuller, editors, Operating Systems: An Advanced Course, volume 66 of Lecture Notes on Computer Science. Springer-Verlag, 1978. Also appears as IBM Research Laboratory Technical report RJ2188. Rachid Guerraoui. Revisiting the relationship between non-blocking atomic commitment and consensus. In Proceedings of the 9th International Workshop on Distributed Algorithms, pages 87–100, Le Mont-St-Michel, France, 1995. Springer Verlag, LNCS 972. Vassos Hadzilacos. On the relationship between the atomic commitment and consensus problems. In Proceedings of the Workshop on Fault-Tolerant Distributed Computing, volume 448 of Lecture Notes on Computer Science, pages 201–208. Springer-Verlag, March 1986. Vassos Hadzilacos and Sam Toueg. A modular approach to fault-tolerant broadcasts and related problems. Technical Report 94-1425, Department of Computer Science, Cornell University, Ithaca, New York, May 1994. Richard Koo and Sam Toueg. Effects of message loss on the termination of distributed protocols. Information Processing Letters, 27(4):181–188, April 1988. Wai-Kau Lo and Vassos Hadzilacos. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the 8th International Workshop on Distributed Algorithms, pages 280–295, Terschelling, The Netherlands, 1994. Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, Inc., 1996. Gil Neiger and Sam Toueg. Automatically increasing the fault-tolerance of distributed algorithms. Journal of Algorithms, 11(3):374–419, 1990. Michael Rabin. Randomized Byzantine generals. In Proceedings of the 24th Symposium on Foundations of Computer Science, pages 403–409. IEEE Computer Society Press, November 1983. Laura S. Sabel and Keith Marzullo. Election vs. consensus in asynchronous systems. Technical Report 95-1488, Department of Computer Science, Cornell University, Ithaca, New York, Febrary 1995. Robbert van Renesse, April 1997. Private Communication. 26