Disclaimer: This blog has been written with the help of an LLM.
This chapter is pessimistic in its approach, but that’s alright; one has to be paranoid when building systems. When you are a small company shipping fast, you don’t think about the negative flows, i.e. what happens when things fail. You only focus on making the positive flows work.
Once the company becomes a behemoth data guzzler, you have to be mindful and deliver on the promises made in the contracts with your customers.
There are so many ways in which a distributed system can fail. So many. We cannot prevent everything from going wrong, but we can have failsafes in place.
The fencing token idea was a good one, though. Learnt a new approach.
Faults and Partial Failures
Distributed systems have many more potential failure modes than single computers.
Single computers are deterministic; retrying an operation yields the same result, and failures are total rather than partial.
In single systems, failures usually result from software issues, and the system prefers to crash rather than return incorrect results.
Distributed systems consist of multiple connected computers, making them fundamentally different from single systems.
The focus is on commodity computers connected via a network, not supercomputers.
Distributed systems must handle both partial and complete failures, necessitating fault-tolerance mechanisms.
Fault handling is essential in the design of distributed systems, requiring a proactive approach.
Unreliable Networks
Distributed systems are "shared-nothing" computers communicating over a network.
Relying on networks introduces various risks, such as message loss, damage, rejection, or queuing delays.
Responses can also be lost, delayed, or affected by network misconfigurations or overloads.
The reason for a lack of response is often unknown, making failure detection difficult.
Timeouts are a common strategy to handle unresponsive requests, though they don't confirm message delivery.
Network failures are inevitable at all levels, from switches to data centers.
Systems must either handle or tolerate network problems; handling means taking precautions, while tolerating may mean simply showing the user an error message.
Regular testing is needed to assess how systems manage network issues and when to notify users.
Detecting Faults
Many systems need to detect faults automatically, but without a successful response there is no way to know the exact cause of a failure.
In distributed systems, faults can be detected at various levels, like load balancers or leader nodes.
Failures can occur anywhere in the network, including switches, routers, applications, and machines.
Some faults are undetectable, leading to the use of timeouts to declare a node as down.
Retries can be attempted, but eventually, a node may need to be declared dead.
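As a rough illustration of the above, a caller can probe a node a few times with a short timeout and only then declare it dead. This is a minimal sketch; the node address, port, and thresholds are made up, and a failed probe still tells us nothing about why the node didn’t answer.

```python
import socket

def is_reachable(host: str, port: int, attempts: int = 3, timeout: float = 1.0) -> bool:
    """Probe a node a few times; treat it as down only after repeated failures."""
    for _ in range(attempts):
        try:
            # A successful TCP connection only proves the machine and port answered,
            # not that the application running on it is healthy.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue  # timeout, refusal, or other network error; try again
    return False

if not is_reachable("10.0.0.5", 9042):  # hypothetical node address and port
    print("declaring node dead after repeated timeouts")
```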
Timeouts and Unbounded Delay
Setting effective timeout values is challenging:
Too short: May falsely declare a node as dead during temporary slowdowns.
Too long: Prolongs waiting for genuinely dead nodes, worsening user experience.
Using retries with short timeouts can cause problems, like:
Executing the same operation twice if redirected while the original request is still queued.
Triggering cascading failures if the new node is already overloaded.
Network delays are unpredictable; asynchronous networks have no upper limit on delay times.
Network switches can cause congestion by queuing messages on the same link, increasing wait times.
Protocols like TCP add delays due to retransmissions after packet loss, with inconsistent timeout values.
Systems like TCP and Cassandra use dynamic timeouts based on past network performance.
Defining fixed timeouts is nearly impossible; understanding network behavior and adjusting timeouts dynamically is crucial.
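A minimal sketch of that dynamic approach, loosely in the spirit of TCP’s retransmission timer (a smoothed round-trip time plus a multiple of its variability); the constants and class below are illustrative, not taken from any particular system:

```python
ALPHA, BETA = 0.125, 0.25  # common smoothing factors; illustrative, not prescribed

class AdaptiveTimeout:
    """Derive a timeout from observed round-trip times instead of a fixed constant."""
    def __init__(self, initial_rtt: float = 0.5):
        self.srtt = initial_rtt          # smoothed RTT estimate (seconds)
        self.rttvar = initial_rtt / 2    # smoothed RTT variability

    def observe(self, sample_rtt: float) -> None:
        self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - sample_rtt)
        self.srtt = (1 - ALPHA) * self.srtt + ALPHA * sample_rtt

    def timeout(self) -> float:
        # Leave headroom proportional to how jittery the network has been lately.
        return self.srtt + 4 * self.rttvar

est = AdaptiveTimeout()
for rtt in (0.05, 0.07, 0.30, 0.06):  # observed round-trip times in seconds
    est.observe(rtt)
print(round(est.timeout(), 3))        # the timeout grows when RTTs become more variable
```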
It’s important to know the differences between synchronous and asynchronous networks for better timeout estimation.
Synchronous versus Asynchronous Networks
Phone calls are synchronous: They allocate fixed bandwidth (circuit) end-to-end, ensuring no queuing and a fixed network latency (bounded delay).
Synchronous bandwidth allocation: For phone calls, if 10,000 connections are available, exactly 10,000 concurrent calls can be made without waiting, regardless of network routing.
TCP connections differ: Unlike phone calls, bandwidth needs are unknown when establishing a connection, making fixed allocation impractical.
Bandwidth challenges in TCP: Allocating a fixed bandwidth per connection (e.g., 100 kbps) would mean long waits for large transfers and wasted capacity whenever other slots sit empty; at 100 kbps, for instance, a 1 GB transfer would take roughly 8 × 10^9 bits / 10^5 bits per second ≈ 80,000 seconds, about 22 hours.
Queuing in TCP: To optimize for speed, TCP dynamically adjusts data transfer rates based on available network capacity, allowing connections to queue for bandwidth.
Unbounded delays: Network setups with queuing have unpredictable delays, making it impossible to guarantee specific delay times or reliability.
Timeout setting challenges: Due to variable network conditions, correct timeout values can only be determined through experimentation.
Experimentation requires clock knowledge: Understanding network timing and delays involves learning how clocks work.
Unreliable Clocks
Two types of clocks: Monotonic and time-of-day clocks serve different purposes.
Time-of-day clocks:
Based on an epoch (e.g., UTC Jan 1, 1970) without accounting for leap seconds.
Synchronized using Network Time Protocol (NTP) to maintain consistency across machines.
Can jump forward or backward during synchronization, making them unsuitable for measuring elapsed time (e.g., timeouts).
Monotonic clocks:
Continuously increase, like a counter tracking nanoseconds.
Do not track the actual time of day; they measure elapsed time accurately without jumps.
Each computer or CPU has its own monotonic clock, typically not synced between machines.
NTP can still adjust the speed of monotonic clocks (faster or slower), but cannot cause jumps, ensuring consistency in elapsed time measurement.
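A quick illustration of the difference, using Python’s standard clocks (nothing here is specific to any particular system):

```python
import time

start_wall = time.time()        # time-of-day clock: seconds since the Unix epoch
start_mono = time.monotonic()   # monotonic clock: only ever moves forward

time.sleep(2)                   # stand-in for some work

# If NTP steps the wall clock during the sleep, the first value can be wrong
# (even negative); the monotonic difference is always a valid elapsed duration.
print("wall-clock elapsed:", time.time() - start_wall)
print("monotonic elapsed:", time.monotonic() - start_mono)
```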
Relying on Synchronized Clocks
Clocks have reliability issues:
Quartz clocks drift, so clocks on different machines skew apart over time.
NTP synchronization helps but doesn't fully resolve timing inaccuracies due to network delays.
Time-of-day clock challenges:
Systems must account for clock faults and monitor offsets between machines to prevent large discrepancies.
Using timestamps for decision-making can be problematic due to inevitable offsets.
Decision-making issues with timestamps:
Clocks may not be accurate enough to determine the order of operations, especially for rapid or concurrent writes.
Identical timestamps may occur if the clock resolution is limited (e.g., milliseconds).
Conflict resolution strategies like Last Write Wins (LWW) can fail due to clock skew, leading to data loss.
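To make the LWW failure concrete, here is a toy sketch with made-up values: node B’s clock runs slightly fast, so a write that actually happened earlier carries the later timestamp and "wins", silently dropping the newer value.

```python
# Two writes to the same key on different replicas. write_b happened first in
# real time, but node B's time-of-day clock runs ~30 ms fast, so it carries the
# larger timestamp. All values here are invented for illustration.
write_a = {"value": "newer", "ts": 1_700_000_000.100}  # written second, accurate clock
write_b = {"value": "older", "ts": 1_700_000_000.130}  # written first, fast clock

winner = max(write_a, write_b, key=lambda w: w["ts"])
print(winner["value"])  # prints "older": LWW silently discards the causally newer write
```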
Improving conflict resolution:
Additional tiebreakers or version vectors can help resolve situations where timestamps are insufficient.
Google’s TrueTime API provides a confidence interval (earliest and latest possible time), increasing accuracy in determining operation order.
Snapshot Isolation and Repeatable Read issues:
Determining whether a read occurred after a write is challenging in distributed systems.
Google’s TrueTime-based transaction ID generation doesn’t fully solve the problem of ordering transactions.
Process Pauses
Leader Node Lease Mechanism:
A leader node obtains a lease from other nodes, similar to a lock with a timeout.
It must periodically renew the lease to maintain its leadership; if it fails, another node can take over.
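A minimal sketch of that lease loop, showing where a process pause becomes dangerous; the request source, lease object, handler, and 10-second margin are all hypothetical:

```python
import time

LEASE_SAFETY_MARGIN = 10.0  # seconds; an arbitrary value for illustration

while True:
    request = get_incoming_request()          # hypothetical helper
    # Renew the lease while it still has some time left on it.
    if lease.expiry - time.time() < LEASE_SAFETY_MARGIN:
        lease = lease.renew()                 # hypothetical lease API
    if lease.is_valid():
        # If the process pauses right here (a long GC, a suspended VM, ...),
        # the lease can expire and another node can take over, yet this node
        # will still happily call process() when it wakes up.
        process(request)                      # hypothetical handler
```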
Challenges with Timing Assumptions:
Various reasons can cause process pauses, including:
Garbage collection
Virtual machine suspensions
System sleep
Context-switching
Synchronous disk access
Paging
Signals like SIGSTOP
Timing cannot be reliably assumed in distributed systems.
Real-Time Response Requirements:
Some systems require guaranteed responses before specific deadlines (e.g., real-time operating systems).
These systems need documentation of worst-case execution times and may restrict dynamic memory allocation.
Handling Garbage Collection:
Garbage collection can be treated as a brief planned outage; if the runtime warns that a collection is coming, the application can stop sending new requests to that node while GC runs.
Alternatively, GC can be limited to short-lived objects, with periodic process restarts to manage memory.
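A rough sketch of the planned-outage idea; the load balancer API and drain helper are hypothetical, and only gc.collect() is a real Python call:

```python
import gc

def planned_gc_pause(load_balancer, node_id):
    """Treat a full collection like a short planned outage for one node."""
    load_balancer.remove(node_id)   # stop routing new requests here (hypothetical API)
    wait_for_inflight_requests()    # hypothetical helper: let current work finish
    gc.collect()                    # do the expensive collection while the node is idle
    load_balancer.add(node_id)      # put the node back into rotation
```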
Knowledge, Truth and Lies
Challenges in Building Reliable Systems:
Numerous potential failures in distributed systems make it difficult to identify specific problems.
Assumptions must be made to create systems based on certain truths and knowledge.
Truth Defined by Majority (Quorum):
Relying on a single node for decisions creates a single point of failure.
A quorum, or majority decision, must be used to prevent incorrect assumptions by isolated nodes.
Lock Management Example:
A node’s lock may expire while the node is paused for garbage collection; if it keeps writing after the pause, it can corrupt data.
Fencing tokens (monotonically increasing numbers) can be used to ensure only the most recent lock holder can write to storage.
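A minimal sketch of fencing at the storage service (the class and token values are made up): the service remembers the highest token it has seen and rejects any write carrying an older one.

```python
class FencedStorage:
    """Accept a write only if its fencing token is at least the highest seen so far."""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, key, value, token):
        if token < self.highest_token:
            raise RuntimeError(f"stale fencing token {token} (highest seen: {self.highest_token})")
        self.highest_token = token
        self.data[key] = value

storage = FencedStorage()
storage.write("file", "from lock holder 33", token=33)
storage.write("file", "from lock holder 34", token=34)   # new holder, higher token
try:
    storage.write("file", "late write from 33", token=33)  # paused old holder wakes up
except RuntimeError as err:
    print(err)                                              # rejected: 33 < 34
```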
Byzantine Faults:
Byzantine faults occur when nodes provide false information or fail to act correctly.
These are particularly relevant in environments where nodes may be compromised, but most systems operate in closed networks.
Input validation can mitigate risks, such as SQL injection or denial of service attacks.
System Models and Expectations:
Acknowledgment of various problems leads to the creation of expectations and assumptions about system behavior.
System models (synchronous, partially synchronous, asynchronous) help define these expectations regarding timing and fault types (crash-stop, crash-recovery, Byzantine).
Algorithm correctness is defined through safety (e.g., uniqueness of tokens) and liveness properties (e.g., eventual response).
Safety properties are expected to always hold, while liveness properties may have exceptions.
Fin.