Recovering from failures in distributed systems is far more challenging than in standalone databases. Distributed systems involve multiple interconnected nodes, each of which may fail independently. Ensuring consistency and fault tolerance across all nodes during recovery requires sophisticated coordination mechanisms.
In this lesson, we’ll explore the key challenges of recovery in distributed systems and examine the techniques used to address them, including Two-Phase Commit (2PC), Three-Phase Commit (3PC), and Distributed Checkpointing.
Distributed recovery must address several issues that arise due to the nature of distributed systems:
Network Failures: Network partitions or delays can cause nodes to become temporarily unreachable, complicating recovery coordination.
Independent Failures: Different nodes may fail or recover at different times, making it hard to synchronize their states.
Consistency Guarantees: Ensuring that all nodes either commit or abort a transaction while avoiding partial states is critical.
Concurrency: Multiple nodes may process transactions concurrently, increasing the risk of conflicts during recovery.
Scalability: Recovery protocols must remain efficient as the system grows in size and complexity.
Two-Phase Commit (2PC) is a distributed protocol used to ensure all nodes involved in a transaction either commit or abort the transaction as a single unit. It is widely used for achieving atomicity in distributed systems.
2PC consists of two phases:
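Prepare Phase: The coordinator asks every participant whether it can commit. Each participant makes its changes durable, records that it is prepared, and replies with a yes or no vote.
Commit Phase: If every participant voted yes, the coordinator records a commit decision and tells all participants to commit. If any participant voted no or did not reply in time, the coordinator tells all participants to abort.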
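The two phases can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production protocol: the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names chosen for this sketch, and a real system would use network messages and a durable log instead of direct method calls.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical in-memory participant (assumption: no real logging or I/O)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def prepare(self):
        # Prepare phase: vote YES only if the local work can be committed.
        if self.can_commit:
            self.state = "PREPARED"
            return Vote.YES
        self.state = "ABORTED"
        return Vote.NO

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"


def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(v is Vote.YES for v in votes) else "ABORT"

    # Phase 2: broadcast the single global decision to everyone.
    for p in participants:
        if decision == "COMMIT":
            p.commit()
        else:
            p.abort()
    return decision
```

A single no vote is enough to abort the transaction on every node, which is what gives 2PC its all-or-nothing behavior.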
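Coordinator Failure: If the coordinator crashes after participants have voted yes but before they learn the decision, those participants cannot safely commit or abort on their own. They must hold their locks and wait until the coordinator recovers and resends the decision from its log. This is the blocking problem of 2PC.
Participant Failure: If a participant fails before voting, the coordinator times out and aborts the transaction. If it fails after voting yes, it reads its log on recovery and asks the coordinator for the outcome.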
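One way to picture recovery in 2PC is as a decision a restarted participant makes from its own log. The sketch below assumes a simplified log made of the strings `"PREPARED"`, `"COMMIT"`, and `"ABORT"`, and a hypothetical `ask_coordinator` callback that returns the outcome or `None` when the coordinator is unreachable. Neither comes from a real library.

```python
def recover_participant(log, ask_coordinator):
    """Decide the fate of one transaction after a participant restarts.

    log: list of record strings written before the crash (assumed format).
    ask_coordinator: hypothetical callable returning "COMMIT", "ABORT",
        or None if the coordinator cannot be reached.
    """
    # The decision was already logged locally: just replay it.
    if "COMMIT" in log:
        return "COMMIT"
    if "ABORT" in log:
        return "ABORT"

    # The participant never voted, so the coordinator cannot have
    # decided to commit. Aborting unilaterally is safe.
    if "PREPARED" not in log:
        return "ABORT"

    # Voted yes but never heard the outcome: only the coordinator knows.
    outcome = ask_coordinator()
    if outcome is None:
        # Coordinator is down too: the participant must wait (blocking).
        return "BLOCKED"
    return outcome
```

The `"BLOCKED"` result is the case that motivates Three-Phase Commit: a prepared participant has no safe local choice while the coordinator is unavailable.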
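Three-Phase Commit (3PC) improves upon 2PC by adding a phase between voting and committing. The extra phase reduces the risk of blocking and lets participants make progress when the coordinator fails, provided the network does not partition and message delays are bounded.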
Prepare Phase: Similar to 2PC, the coordinator asks all nodes if they are ready to commit.
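Pre-Commit Phase: If every node voted yes, the coordinator sends a Pre-Commit message. Each node acknowledges it and now knows that all nodes are ready to commit. If any node voted no or timed out, the coordinator sends an abort instead.
Commit Phase: Once the coordinator has received all acknowledgments, it sends the final commit message and every node commits the transaction.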
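Coordinator Failure: Participants detect the failure by timeout. A participant that has received Pre-Commit knows every node voted yes and can go on to commit. A participant that has not yet received Pre-Commit aborts.
Participant Failure: If a participant fails before voting or before acknowledging Pre-Commit, the coordinator times out and aborts the transaction. When the participant recovers, it reads its log and asks the coordinator or its peers for the outcome.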
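The timeout rule that lets 3PC participants proceed without the coordinator can be written as a small state transition. This is a simplified sketch: the `PState` enum and `on_coordinator_timeout` function are hypothetical names, and the rule is only safe under the assumption that the network does not partition and that timeouts reliably indicate a crashed coordinator.

```python
from enum import Enum


class PState(Enum):
    INIT = "init"                    # has not voted yet
    READY = "ready"                  # voted yes, waiting for Pre-Commit
    PRE_COMMITTED = "pre_committed"  # received Pre-Commit
    COMMITTED = "committed"
    ABORTED = "aborted"


def on_coordinator_timeout(state):
    """Return the state a participant moves to when the coordinator is silent.

    Assumes no network partition: a timeout means the coordinator crashed.
    """
    if state is PState.PRE_COMMITTED:
        # Pre-Commit is only sent after every node voted yes,
        # so committing cannot contradict another node's vote.
        return PState.COMMITTED
    if state in (PState.INIT, PState.READY):
        # No Pre-Commit seen, so this participant aborts.
        return PState.ABORTED
    # Already finished: nothing left to decide.
    return state
```

Compare this with 2PC, where a participant that has voted yes has no safe move until the coordinator comes back.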
Non-Blocking: If the coordinator fails during the commit process, participants do not have to wait for it to recover. Each one can decide from its own state: commit if it has reached the Pre-Commit stage, abort otherwise.
Improved Fault Tolerance: Reduces the chance of nodes being stuck in an uncertain state.
Building on the previous 2PC example:
.....
.....
.....