Low Cost Management of Replicated Data in Fault-Tolerant Distributed Systems
Many distributed systems replicate data for fault tolerance or availability. In such systems, a logical update to a data item results in physical updates to a number of copies. The synchronization and communication required to keep the copies of replicated data consistent introduce a delay when operations are performed. In this paper, we describe a technique that relaxes the usual degree of synchronization, permitting copies of replicated data to be updated concurrently with other operations while still ensuring that correctness is not violated. The additional concurrency thus obtained improves response time for operations on replicated data. We also discuss how this technique performs in conjunction with roll-back and roll-forward failure recovery mechanisms.