Coordinated Recovery in Distributed Systems

Recovering from failures in distributed systems is far more challenging than in standalone databases. Distributed systems involve multiple interconnected nodes, each of which may fail independently. Ensuring consistency and fault tolerance across all nodes during recovery requires sophisticated coordination mechanisms.

In this lesson, we’ll explore the key challenges of recovery in distributed systems and examine the techniques used to address them, including Two-Phase Commit (2PC), Three-Phase Commit (3PC), and Distributed Checkpointing.

Challenges in Distributed Recovery

Distributed recovery must address several issues that arise due to the nature of distributed systems:

Network Failures: Network partitions or delays can cause nodes to become temporarily unreachable, complicating recovery coordination.
Independent Failures: Different nodes may fail or recover at different times, making it hard to synchronize their states.
Consistency Guarantees: Ensuring that all nodes either commit or abort a transaction while avoiding partial states is critical.
Concurrency: Multiple nodes may process transactions concurrently, increasing the risk of conflicts during recovery.
Scalability: Recovery protocols must remain efficient as the system grows in size and complexity.

Two-Phase Commit (2PC)

Two-Phase Commit (2PC) is a distributed protocol used to ensure all nodes involved in a transaction either commit or abort the transaction as a single unit. It is widely used for achieving atomicity in distributed systems.

How 2PC Works

2PC consists of two phases:

Prepare Phase:
- The coordinator node sends a Prepare message to all participating nodes, asking if they are ready to commit.
- Each node checks its local state and responds with either Yes (ready to commit) or No (cannot commit).
Commit Phase:
- If all nodes respond with "Yes," the coordinator sends a Commit message, and all nodes commit the transaction.
- If any node responds with "No," the coordinator sends an Abort message, and all nodes roll back the transaction.
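
The two phases above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production protocol: the Participant class, its can_commit flag, and the two_phase_commit function are hypothetical names introduced here, and a real system would exchange messages over the network and write each step to a durable log.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical in-memory participant; a real node would log to disk."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # stands in for the local state check
        self.state = "init"

    def prepare(self):
        # Prepare Phase: vote based on local state.
        if self.can_commit:
            self.state = "prepared"
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Prepare Phase: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Commit Phase: commit only if every vote is Yes, otherwise abort.
    decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

Calling two_phase_commit with two willing participants returns "commit"; if any participant is created with can_commit=False, every participant ends in the "aborted" state.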

Recovery Process in 2PC

Coordinator Failure:
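- If the coordinator fails:
  - Participants that have not yet voted can time out and abort on their own.
  - Participants that have already voted Yes (and logged Prepare) cannot abort unilaterally; they wait for the coordinator to recover, or learn the decision from another participant that has already received it.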
Participant Failure:
- Upon recovery, participants check their logs:
  - If they logged Prepare, they wait for the coordinator’s decision (Commit/Abort).
  - If no Prepare log exists, the transaction is assumed aborted.
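
The participant-side log check can be sketched as a small decision function. The record names ("PREPARE", "COMMIT", "ABORT") and the returned action strings are assumptions made for illustration. The first two branches are also an addition to the rules listed above: they cover the case where the coordinator's decision reached the log before the crash, in which case the participant simply applies it.

```python
def recover_participant(log_records):
    """Return the action a restarted participant takes for one transaction.

    log_records: list of hypothetical record names written for the
    transaction before the crash, e.g. ["PREPARE"] or ["PREPARE", "COMMIT"].
    """
    # Assumption: a logged decision is final, so it is simply re-applied.
    if "COMMIT" in log_records:
        return "redo-commit"
    if "ABORT" in log_records:
        return "undo-abort"
    # Voted Yes but never saw the decision: must ask or wait.
    if "PREPARE" in log_records:
        return "wait-for-coordinator"
    # No Prepare record: the participant never voted, so abort is safe.
    return "abort"
```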

Example Scenario

Transaction: Transfer $100 from Account A (on Node 1) to Account B (on Node 2).

Prepare Phase:
- The coordinator asks Node 1 and Node 2 if they can commit the transaction.
- Both nodes reply "Yes," indicating they are ready.
Commit Phase:
- The coordinator sends a Commit message to both nodes.
- The transaction is committed, ensuring consistency.
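
As a rough sketch of this scenario, the code below models each node as an object holding one account balance. The AccountNode class, its pending field, and the transfer function are hypothetical, and the starting balances used in the usage note are made up for illustration; the overdraft check stands in for whatever local validation a real node would perform before voting.

```python
class AccountNode:
    """Hypothetical node that stores a single account balance."""

    def __init__(self, balance):
        self.balance = balance
        self.pending = None  # change staged during the Prepare Phase

    def prepare(self, delta):
        # Vote Yes only if the change would not overdraw the account.
        if self.balance + delta < 0:
            return False
        self.pending = delta
        return True

    def commit(self):
        # Apply the staged change.
        self.balance += self.pending
        self.pending = None

    def abort(self):
        # Discard the staged change, if any.
        self.pending = None


def transfer(source, target, amount):
    changes = [(source, -amount), (target, amount)]
    # Prepare Phase: ask both nodes whether they can apply their change.
    votes = [node.prepare(delta) for node, delta in changes]
    # Commit Phase: apply on both nodes, or on neither.
    if all(votes):
        for node, _ in changes:
            node.commit()
        return True
    for node, _ in changes:
        node.abort()
    return False
```

With node_1 = AccountNode(500) and node_2 = AccountNode(200), transfer(node_1, node_2, 100) returns True and leaves balances of 400 and 300. If Account A cannot cover the amount, Node 1 votes No and neither balance changes.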

Limitations of 2PC

Blocking Problem: If the coordinator fails after participants have voted "Yes" but before they receive the decision, those participants remain blocked in an uncertain state, holding their locks, until the coordinator recovers.
Network Dependence: Network delays or partitions can prevent timely decision-making.

Three-Phase Commit (3PC)

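Three-Phase Commit (3PC) improves upon 2PC by introducing an additional phase that reduces the risk of blocking and lets participants make progress when the coordinator fails. This guarantee assumes that the network does not partition and that message delays are bounded; when those assumptions fail, nodes can still reach inconsistent decisions.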

How 3PC Works

Prepare Phase: Similar to 2PC, the coordinator asks all nodes if they are ready to commit.
Pre-Commit Phase:
- If all nodes respond "Yes," the coordinator sends a Pre-Commit message.
- Nodes log the Pre-Commit, acknowledge it, and prepare to commit, but do not finalize the transaction.
Commit Phase:
- After receiving acknowledgments for the Pre-Commit message, the coordinator sends a final Commit message.
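
The three phases can be sketched the same way as 2PC, with one extra round between voting and committing. The Participant3PC class, its in-memory log list, and the record names are hypothetical; the sketch leaves out timeouts and failures, which are what the extra phase exists to handle.

```python
class Participant3PC:
    """Hypothetical participant that records each step in an in-memory log."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []

    def prepare(self):
        # Prepare Phase: vote Yes (True) or No (False).
        if self.can_commit:
            self.log.append("PREPARE")
            return True
        return False

    def pre_commit(self):
        # Pre-Commit Phase: get ready to commit and acknowledge.
        self.log.append("PRE-COMMIT")
        return True

    def commit(self):
        self.log.append("COMMIT")

    def abort(self):
        self.log.append("ABORT")


def three_phase_commit(participants):
    # Prepare Phase: every participant must vote Yes.
    votes = [p.prepare() for p in participants]
    if not all(votes):
        for p in participants:
            p.abort()
        return "abort"
    # Pre-Commit Phase: wait for an acknowledgment from every participant.
    acks = [p.pre_commit() for p in participants]
    if not all(acks):
        for p in participants:
            p.abort()
        return "abort"
    # Commit Phase: finalize on every participant.
    for p in participants:
        p.commit()
    return "commit"
```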

Recovery Process in 3PC

Coordinator Failure:
- If the coordinator fails after the Pre-Commit, participants use their logs:
  - If Pre-Commit is logged, participants proceed to commit the transaction.
  - Otherwise, they abort the transaction.
Participant Failure:
- Upon recovery, participants check their logs:
  - If Pre-Commit is logged, they proceed with the commit.
  - If only Prepare is logged, they wait for the coordinator’s decision.
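
The two sets of recovery rules above can be written as a pair of small functions. The function names, record names, and returned action strings are assumptions for illustration. The final branch of on_participant_restart (nothing logged, so abort) is not stated in the rules above; it is assumed here by analogy with 2PC.

```python
def on_coordinator_failure(log_records):
    """Action a live participant takes when the coordinator has failed."""
    # Commit if Pre-Commit was logged, otherwise abort.
    return "commit" if "PRE-COMMIT" in log_records else "abort"


def on_participant_restart(log_records):
    """Action a participant takes after recovering from its own crash."""
    if "PRE-COMMIT" in log_records:
        return "commit"
    if "PREPARE" in log_records:
        return "wait-for-coordinator"
    # Assumption: with nothing logged, the participant never voted.
    return "abort"
```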

Advantages of 3PC

Non-Blocking: If the coordinator fails during the commit process, nodes can still decide whether to commit or abort based on the Pre-Commit stage.
Improved Fault Tolerance: Reduces the chance of nodes being stuck in an uncertain state.

Example Scenario

Building on the previous 2PC example:

Prepare Phase: Coordinator asks Node 1 and Node 2 if they are ready to commit. Both respond "Yes."
Pre-Commit Phase: Coordinator sends a Pre-Commit message. Nodes acknowledge it and get ready to commit without finalizing the transaction.
Commit Phase: Coordinator sends the final Commit message. Nodes finalize the transaction.

.....
