Reducing Costs Of Byzantine Fault Tolerant Distributed Applications
Byzantine fault tolerance (BFT) is a powerful technique for building software that tolerates arbitrary failures. The technique has been developed since the 70s and has a rich research literature. Yet no production system has adopted the technique, despite an increasing need from today's large and complex systems. An important reason is that a BFT system incurs significantly higher costs than a crash-tolerant counterpart. The costs include developmental costs, as a BFT system is harder to design and implement correctly, and operational costs, as a BFT system requires to run on more machines, employs more expensive cryptographic operations, and sends more and larger messages. This dissertation attempts to reduce both developmental and operational costs of BFT distributed applications. In the first part, we propose an approach to translate existing crash-tolerant systems to Byzantine-fault-tolerant. Our approach makes fewer assumptions about the source crash-tolerant distributed applications, and thus it is applicable to a larger set of applications than existing approaches. We propose and prove correct our basic approach. Then we extend the basic approach to large scale distributed applications and propose additional mechanisms to deal with practical problems in such settings-namely, replication and host churns. We evaluate our scalable translation approach by a simulation and case studies. The second part of the dissertation presents a novel Byzantine replication approach that focuses on reducing resource costs (but can also be used as a translation). By leveraging an external reconfiguration service, our replication approach requires only t + 1 replicas and t witnesses (lighter hosts) to tolerate t Byzantine faults. Our approach also uses inexpensive HMAC signatures and imposes a chain communication pattern between the replicas to reduce CPU and network consumptions. We also propose a variation of the approach, in which Byzantine hosts do not forge CRC checksums, to further reduce the resource costs. In particular, it uses t + 1 replicas, CRC checksums, and t fewer rounds of communication. We evaluate the performance of a prototype of each variation in the common case, when it runs without failures.
Byzantine fault tolerance; asynchronous distributed applications; developmental costs; operational costs
Van Renesse, Robbert
Easley, David Alan; Halpern, Joseph Yehuda
Ph. D., Computer Science
Doctor of Philosophy
dissertation or thesis