Chapter 5 of Designing Data-Intensive Applications (Kleppmann)
Table of contents
- Replication vs partitioning (241)
- Leader-based replication (242)
- Consistency guarantees
  - Read-after-write consistency (257)
  - Monotonic reads (260)
  - Consistent prefix reads (262)
- Multi-leader replication
- Leaderless replication (279)
  - Quorum consistency (282-285)
  - Sloppy quorums (288)
  - Hinted handoff (289)
  - Detecting concurrent writes (291)
    - Last write wins (293)
    - Merging concurrently written values (301)
    - Version vectors (302)
- Replication problems and solutions
  - Read-after-write consistency (257)
  - Monotonic reads (260)
  - Consistent prefix reads (262)
  - Synchronous, asynchronous, and semi-synchronous replication (245)
Raw highlights and notes
Highlight(yellow) - Page 241 · Location 4671
In this chapter we will assume that your dataset is so small that each machine can hold a copy of the entire dataset.
Highlight(yellow) - Page 241 · Location 4676
All of the difficulty in replication lies in handling changes to replicated data.
Highlight(yellow) - Leaders and Followers > Page 243 · Location 4715
When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader (the followers are read-only from the client’s point of view).
Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 245 · Location 4759
The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader.
Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 245 · Location 4760
The disadvantage is that if the synchronous follower doesn’t respond (because it has crashed, or there is a network fault, or for any other reason), the write cannot be processed.
Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 245 · Location 4763
For that reason, it is impracticable for all followers to be synchronous: any one node outage would cause the whole system to grind to a halt.
Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 245 · Location 4764
In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous, and the others are asynchronous.
Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 246 · Location 4769
Often, leader-based replication is configured to be completely asynchronous. In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost.
Highlight(yellow) - Setting Up New Followers > Page 247 · Location 4802
setting up a follower can usually be done without downtime.
Highlight(yellow) - Setting Up New Followers > Page 247 · Location 4816
The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader’s replication log.
Note - Setting Up New Followers > Page 247 · Location 4817
Catching up a new replica after it loads a consistent snapshot
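A sketch I wrote (not from the book) of that setup sequence, with toy in-memory classes standing in for the leader and follower: take a snapshot that records its exact log position, load it, then replay every change made after that position.
```python
# Toy sketch of bootstrapping a follower from a snapshot + log position.
class Leader:
    def __init__(self):
        self.log = []                      # append-only list of (key, value) changes
        self.data = {}
    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
    def snapshot(self):
        return dict(self.data), len(self.log)   # snapshot tagged with its exact log position
    def changes_since(self, position):
        return self.log[position:]

class Follower:
    def __init__(self):
        self.data = {}
    def load(self, snapshot):
        self.data = dict(snapshot)
    def apply(self, change):
        key, value = change
        self.data[key] = value

leader, follower = Leader(), Follower()
leader.write("x", 1)
snapshot, position = leader.snapshot()
leader.write("y", 2)                       # writes keep arriving during the copy
follower.load(snapshot)
for change in leader.changes_since(position):
    follower.apply(change)                 # catch up on the backlog since the snapshot
assert follower.data == leader.data        # follower is now caught up and can stream live
```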
Note - Handling Node Outages > Page 248 · Location 4830
Followers that recover from a fault can catch up in the same way that new replicas catch up from a snapshot
Highlight(yellow) - Handling Node Outages > Page 249 · Location 4838
Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. This process is called failover.
Note - Handling Node Outages > Page 249 · Location 4841
Failover steps: 1. Detect leader has failed; 2. Followers elect a new leader; 3. Clients are redirected to the new leader.
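My own sketch of those three steps, using made-up helper data; a real system also has to fence the old leader and decide what to do with its unreplicated writes (the GitHub incident below is what can happen otherwise).
```python
# Toy sketch of failover: detect, elect, redirect.
import time

def detect_failure(last_heartbeat, timeout=30.0):
    # 1. Declare the leader dead if it has not responded within the timeout.
    return time.monotonic() - last_heartbeat > timeout

def elect_new_leader(followers):
    # 2. Pick the follower with the most up-to-date replication log position,
    #    to minimize lost writes.
    return max(followers, key=followers.get)

followers = {"follower-a": 4711, "follower-b": 4695}   # name -> log position (made up)
if detect_failure(last_heartbeat=time.monotonic() - 60.0):
    new_leader = elect_new_leader(followers)
    # 3. Reconfigure clients and the remaining followers to use new_leader.
    print("promote", new_leader)                        # promote follower-a
```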
Highlight(yellow) - Handling Node Outages > Page 250 · Location 4859
If the former leader rejoins the cluster after a new leader has been chosen, what should happen to those writes? The new leader may have received conflicting writes in the meantime.
Highlight(yellow) - Handling Node Outages > Page 250 · Location 4865
in one incident at GitHub [13], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users.
Note - Handling Node Outages > Page 251 · Location 4877
Split brain: two nodes both think that they are leader and accept writes, which causes conflicts that cannot be resolved.
Highlight(yellow) - Handling Node Outages > Page 251 · Location 4886
If the system is already struggling with high load or network problems, an unnecessary failover is likely to make the situation worse, not better.
Highlight(yellow) - Implementation of Replication Logs > Page 252 · Location 4903
In the simplest case, the leader logs every write request (statement) that it executes and sends that statement log to its followers.
Note - Implementation of Replication Logs > Page 252 · Location 4904
Problems with statement-based replication: 1. Operations that use local state (such as timestamps from NOW()) become non-deterministic. 2. Concurrent operations that are order-dependent can get executed in a different order on different replicas. 3. Statements with side effects (e.g., triggers, stored procedures) can play out non-deterministically unless the side effects are carefully made deterministic.
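A tiny example (mine) of problem 1: replaying the literal statement text means each replica evaluates NOW() at its own replay time, so the stored values can diverge.
```python
# Replaying the same statement text on two replicas at different times
# re-evaluates NOW() locally, so the replicas end up with different rows.
import time
from datetime import datetime, timezone

STATEMENT = "INSERT INTO events (created_at) VALUES (NOW())"

def replay(statement):
    # Pretend "execution": each replica evaluates NOW() at its own replay time.
    return datetime.now(timezone.utc)

on_leader = replay(STATEMENT)
time.sleep(0.01)                      # the follower applies the log slightly later
on_follower = replay(STATEMENT)
print(on_leader == on_follower)       # almost certainly False: the replicas have diverged
```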
Highlight(yellow) - Implementation of Replication Logs > Page 252 · Location 4915
because there are so many edge cases, other replication methods are now generally preferred.
Highlight(yellow) - Implementation of Replication Logs > Page 253 · Location 4928
In the case of a log-structured storage engine (see “SSTables and LSM-Trees”), this log is the main place for storage. Log segments are compacted and garbage-collected in the background.
Highlight(yellow) - Implementation of Replication Logs > Page 253 · Location 4930
In the case of a B-tree (see “B-Trees”), which overwrites individual disk blocks, every modification is first written to a write-ahead log so that the index can be restored to a consistent state after a crash.
Highlight(yellow) - Implementation of Replication Logs > Page 253 · Location 4932
In either case, the log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers.
Highlight(yellow) - Implementation of Replication Logs > Page 253 · Location 4938
The main disadvantage is that the log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks. This makes replication closely coupled to the storage engine.
Highlight(yellow) - Implementation of Replication Logs > Page 254 · Location 4948
An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals.
Note - Implementation of Replication Logs > Page 254 · Location 4950
Called “logical” or “row-based” log replication
Highlight(yellow) - Implementation of Replication Logs > Page 255 · Location 4974
if you want to only replicate a subset of the data, or want to replicate from one kind of database to another, or if you need conflict resolution logic (see “Handling Write Conflicts”), then you may need to move replication up to the application layer.
Highlight(yellow) - Implementation of Replication Logs > Page 256 · Location 4990
Trigger-based replication typically has greater overheads than other replication methods, and is more prone to bugs and limitations than the database’s built-in replication. However, it can nevertheless be useful due to its flexibility.
Highlight(yellow) - Problems with Replication Lag > Page 257 · Location 5014
if an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind.
Highlight(yellow) - Problems with Replication Lag > Page 257 · Location 5018
this effect is known as eventual consistency
Highlight(yellow) - Problems with Replication Lag > Page 257 · Location 5023
In general, there is no limit to how far a replica can fall behind.
Note - Problems with Replication Lag > Page 257 · Location 5023
called “replication lag”
Highlight(yellow) - Reading Your Own Writes > Page 258 · Location 5041
read-after-write consistency, also known as read-your-writes consistency [24]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves.
Note - Reading Your Own Writes > Page 258 · Location 5043
p259 discusses some solutions
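A sketch of one of the solutions the book describes around p259, with made-up names: track when each user last wrote, and route that user's reads to the leader for a short window afterwards.
```python
# Route a user's reads to the leader shortly after their own writes,
# so they always see their own updates; everyone else reads from followers.
import time

RECENT_WRITE_WINDOW = 60.0            # seconds; made-up tuning value
last_write_at = {}                    # user_id -> time of that user's last write

def record_write(user_id):
    last_write_at[user_id] = time.monotonic()

def choose_replica_for_read(user_id):
    recently_wrote = (
        user_id in last_write_at
        and time.monotonic() - last_write_at[user_id] < RECENT_WRITE_WINDOW
    )
    # A recent writer's updates may not have reached the followers yet.
    return "leader" if recently_wrote else "any-follower"

record_write("user-1234")
print(choose_replica_for_read("user-1234"))   # leader
print(choose_replica_for_read("user-5678"))   # any-follower
```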
Highlight(yellow) - Monotonic Reads > Page 262 · Location 5090
Monotonic reads [23] is a guarantee that this kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency.
Note - Monotonic Reads > Page 262 · Location 5091
“this kind of anomaly” refers to time appearing to go backward: an earlier read hits a fresh replica, a later read hits a more lagged replica, and data the user already saw seems to disappear
Highlight(yellow) - Monotonic Reads > Page 262 · Location 5093
One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica
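A sketch of that idea, with made-up replica names: pick the follower from a stable hash of the user ID, so the same user always reads from the same replica.
```python
# Stable hash of the user ID -> the same user always reads from the same follower,
# so their successive reads can never jump to a replica that is further behind.
import hashlib

FOLLOWERS = ["follower-1", "follower-2", "follower-3"]   # made-up replica names

def replica_for(user_id):
    digest = hashlib.sha256(user_id.encode()).digest()
    return FOLLOWERS[int.from_bytes(digest[:4], "big") % len(FOLLOWERS)]

assert replica_for("user-2345") == replica_for("user-2345")   # deterministic routing
print(replica_for("user-2345"))
```
The book notes the catch: if that replica fails, the user's reads have to be rerouted to another replica.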
Highlight(yellow) - Consistent Prefix Reads > Page 264 · Location 5115
Preventing this kind of anomaly requires another type of guarantee: consistent prefix reads [23]. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
Note - Consistent Prefix Reads > Page 264 · Location 5117
Prevents causality violations
Highlight(yellow) - Solutions for Replication Lag > Page 265 · Location 5129
Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.
Highlight(yellow) - Solutions for Replication Lag > Page 265 · Location 5132
It would be better if application developers didn’t have to worry about subtle replication issues and could just trust their databases to “do the right thing.” This is why transactions exist: they are a way for a database to provide stronger guarantees so that the application can be simpler.
Highlight(yellow) - Multi-Leader Replication > Page 266 · Location 5153
multi-leader configuration (also known as master–master or active/active replication). In this setup, each leader simultaneously acts as a follower to the other leaders.
Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 267 · Location 5166
In a multi-leader configuration, you can have a leader in each datacenter.
Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 268 · Location 5176
In a multi-leader configuration, every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters.
Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 268 · Location 5179
In a multi-leader configuration, each datacenter can continue operating independently of the others, and replication catches up when the failed datacenter comes back online.
Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 268 · Location 5195
Although multi-leader replication has advantages, it also has a big downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved
Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 269 · Location 5200
autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible
Note - Handling Write Conflicts > Page 274 · Location 5292
Naive conflict resolution strategy: last write wins
Highlight(yellow) - Handling Write Conflicts > Page 274 · Location 5295
- Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica. This approach also implies data loss.
- Somehow merge the values together—e.g., order them alphabetically and then concatenate them (in Figure 5-7, the merged title might be something like “B/C”).
- Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time (perhaps by prompting the user).
Note - Handling Write Conflicts > Page 274 · Location 5301
Conflict resolution strategies other than last-write-wins
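A quick sketch (mine) of two of the strategies listed above: "highest replica ID wins" (deterministic but lossy) versus keeping both values as siblings for the application to merge later.
```python
# Two conflict resolution strategies for the same concurrently written value.
from dataclasses import dataclass

@dataclass
class Write:
    replica_id: int
    value: str

def resolve_by_replica_id(a, b):
    # Deterministic but lossy: the write from the higher-numbered replica wins.
    return a if a.replica_id > b.replica_id else b

def resolve_by_keeping_both(a, b):
    # Preserve all information; application code (or the user) merges later.
    return [a, b]

print(resolve_by_replica_id(Write(1, "B"), Write(2, "C")).value)   # C (B is silently lost)
print(resolve_by_keeping_both(Write(1, "B"), Write(2, "C")))       # both siblings kept
```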
Highlight(yellow) - Multi-Leader Replication Topologies > Page 279 · Location 5409
To order these events correctly, a technique called version vectors can be used, which we will discuss later in this chapter (see “Detecting Concurrent Writes”).
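A sketch of how comparing version vectors works, assuming one counter per replica: one write happened before another only if its vector is less than or equal in every component; if neither dominates, the writes are concurrent and must be merged.
```python
# Compare two version vectors (dicts of per-replica counters).
def compare(a, b):
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"
    if b_le_a:
        return "b happened before a"
    return "concurrent"              # neither dominates: siblings must be merged

print(compare({"r1": 2, "r2": 1}, {"r1": 3, "r2": 1}))   # a happened before b
print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 2}))   # concurrent
```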
Highlight(yellow) - Leaderless Replication > Page 280 · Location 5431
Some data storage systems take a different approach, abandoning the concept of a leader and allowing any replica to directly accept writes from clients. Some of the earliest replicated data systems were leaderless [1, 44], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became a fashionable architecture for databases after Amazon used it for its in-house Dynamo system [37]. Riak, Cassandra, and Voldemort are open source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as Dynamo-style.
Highlight(yellow) - Writing to the Database When a Node Is Down > Page 281 · Location 5447
in a leaderless configuration, failover does not exist.
Highlight(yellow) - Writing to the Database When a Node Is Down > Page 282 · Location 5456
when a client reads from the database, it doesn’t just send its request to one replica: read requests are also sent to several nodes in parallel.
Highlight(yellow) - Writing to the Database When a Node Is Down > Page 283 · Location 5465
Read repair: When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in Figure 5-10, user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica. This approach works well for values that are frequently read.
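A minimal sketch (mine) of read repair with plain integer versions, mirroring the Figure 5-10 example: take the highest-versioned response and write it back to any replica that returned something older.
```python
# Read from all replicas, return the newest value, and repair stale replicas.
class Replica:
    def __init__(self, data):
        self.data = data                       # key -> (version, value)
    def get(self, key):
        return self.data[key]
    def put(self, key, versioned_value):
        self.data[key] = versioned_value

def read_with_repair(key, replicas):
    responses = [(r, r.get(key)) for r in replicas]
    latest = max(v for _, v in responses)      # highest (version, value) pair
    for replica, versioned_value in responses:
        if versioned_value[0] < latest[0]:
            replica.put(key, latest)           # write the newer value back to the stale replica
    return latest[1]

replicas = [Replica({"x": (7, "new")}), Replica({"x": (7, "new")}), Replica({"x": (6, "old")})]
print(read_with_repair("x", replicas))         # "new"; the third replica is repaired to version 7
```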
Highlight(yellow) - Writing to the Database When a Node Is Down > Page 283 · Location 5469
Anti-entropy process: In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before data is copied.
Highlight(yellow) - Writing to the Database When a Node Is Down > Page 283 · Location 5474
without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application.
Highlight(yellow) - Writing to the Database When a Node Is Down > Page 284 · Location 5484
if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes [44]. You can think of r and w as the minimum number of votes required for the read or write to be valid.
Highlight(yellow) - Writing to the Database When a Node Is Down > Page 285 · Location 5500
The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows: If w < n, we can still process writes if a node is unavailable. If r < n, we can still process reads if a node is unavailable.
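A brute-force check (mine) that the quorum condition guarantees overlap: when w + r > n, every possible write quorum shares at least one node with every possible read quorum, so some node in the read set holds the latest write.
```python
# Exhaustively verify that every write quorum intersects every read quorum.
from itertools import combinations

def quorums_overlap(n, w, r):
    nodes = range(n)
    return all(
        set(write_q) & set(read_q)
        for write_q in combinations(nodes, w)
        for read_q in combinations(nodes, r)
    )

print(quorums_overlap(3, 2, 2))   # True: w + r > n, as in the book's example
print(quorums_overlap(3, 1, 1))   # False: a read may miss the one up-to-date node
```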
Highlight(yellow) - Limitations of Quorum Consistency > Page 286 · Location 5521
If you have n replicas, and you choose w and r such that w + r > n, you can generally expect every read to return the most recent value written for a key.
Highlight(yellow) - Limitations of Quorum Consistency > Page 286 · Location 5533
With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value. On the upside, this configuration allows lower latency and higher availability:
Highlight(yellow) - Limitations of Quorum Consistency > Page 287 · Location 5538
However, even with w + r > n, there are likely to be edge cases where stale values are returned.
Highlight(yellow) - Limitations of Quorum Consistency > Page 287 · Location 5549
If a write succeeded on some replicas but failed on others (for example because the disks on some nodes are full), and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded.
Note - Sloppy Quorums and Hinted Handoff > Page 289 · Location 5588
My understanding in the preceding sections is that all of the nodes eventually have all of the values in a replicated, but not partitioned, cluster. You broadcast the write operation to all of the nodes, and you consider the write to be successful iff some minimum number of nodes respond. This section talks about “home nodes,” suggesting that some data lives on some of the nodes but not others. This means you have to keep track of which nodes have which data, which automatically sounds like partitioning to me. But the chapter starts with the claim that, for its entirety, we are assuming that all data goes to all nodes. I tried to discuss this with both ChatGPT and Claude, and they both gave those garbled, self-contradictory responses that they do when the answer is itself not well understood.
Highlight(yellow) - Sloppy Quorums and Hinted Handoff > Page 289 · Location 5595
quorums (as described so far) are not as fault-tolerant as they could be.