Chapter 5 of Kleppmann's Designing Data-Intensive Applications

Raw highlights and notes

Highlight(yellow) - Page 241 · Location 4671

In this chapter we will assume that your dataset is so small that each machine can hold a copy of the entire dataset.

Highlight(yellow) - Page 241 · Location 4676

All of the difficulty in replication lies in handling changes to replicated data.

Highlight(yellow) - Leaders and Followers > Page 243 · Location 4715

When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader (the followers are read-only from the client’s point of view).
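
As a rough sketch of what that routing looks like in client code (the Node class, its methods, and the round-robin read policy are my own illustration, not from the book):

```python
import itertools

class Node:
    """Tiny in-memory stand-in for a database replica."""
    def __init__(self, name):
        self.name = name
        self.data = {}
    def write(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

class LeaderBasedClient:
    """Sends writes to the leader only; reads may go to the leader or any follower."""
    def __init__(self, leader, followers):
        self.leader = leader
        self._read_nodes = itertools.cycle([leader] + list(followers))

    def write(self, key, value):
        # Writes are only accepted on the leader.
        self.leader.write(key, value)

    def read(self, key):
        # Reads can be served by any replica (possibly stale under async replication).
        return next(self._read_nodes).read(key)

client = LeaderBasedClient(Node("leader"), [Node("f1"), Node("f2")])
client.write("x", 1)
print(client.read("x"))   # 1: the round-robin happens to hit the leader first; a
                          # follower would return None, since this toy never replicates
```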

Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 245 · Location 4759

The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader.

Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 245 · Location 4760

The disadvantage is that if the synchronous follower doesn’t respond (because it has crashed, or there is a network fault, or for any other reason), the write cannot be processed.

Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 245 · Location 4763

For that reason, it is impracticable for all followers to be synchronous: any one node outage would cause the whole system to grind to a halt.

Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 245 · Location 4764

In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous, and the others are asynchronous.
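
A sketch of how such a semi-synchronous write might be coordinated, assuming the same kind of in-memory replica objects as the earlier sketch; the threading and timeout details are my own illustration:

```python
import threading

def semi_synchronous_write(leader, sync_follower, async_followers, key, value, timeout=1.0):
    """The write is confirmed once the leader and one synchronous follower have it;
    the remaining followers receive it asynchronously in the background."""
    leader.write(key, value)

    acked = threading.Event()
    def push_sync():
        sync_follower.write(key, value)
        acked.set()
    threading.Thread(target=push_sync, daemon=True).start()

    if not acked.wait(timeout):
        # The synchronous follower is unresponsive: the write cannot be confirmed.
        raise TimeoutError("synchronous follower did not acknowledge the write")

    # Asynchronous followers: the leader does not wait for their acknowledgment.
    for follower in async_followers:
        threading.Thread(target=follower.write, args=(key, value), daemon=True).start()
```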

Highlight(yellow) - Synchronous Versus Asynchronous Replication > Page 246 · Location 4769

Often, leader-based replication is configured to be completely asynchronous. In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost.

Highlight(yellow) - Setting Up New Followers > Page 247 · Location 4802

setting up a follower can usually be done without downtime.

Highlight(yellow) - Setting Up New Followers > Page 247 · Location 4816

The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader’s replication log.

Note - Setting Up New Followers > Page 247 · Location 4817

Catching up a new replica after it loads a consistent snapshot
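
A sketch of the whole bootstrap sequence; the leader methods (take_snapshot, log_entries_since) are hypothetical names for what a real system exposes, such as PostgreSQL's log sequence number or MySQL's binlog coordinates:

```python
def bootstrap_follower(leader, follower):
    """Set up a new follower without downtime: copy a consistent snapshot, then
    replay every change made after the snapshot's position in the replication log."""
    # 1. Snapshot, tagged with its exact position in the leader's replication log.
    snapshot, log_position = leader.take_snapshot()
    follower.load_snapshot(snapshot)

    # 2. Catch up: request and apply all data changes since that position, in order.
    for change in leader.log_entries_since(log_position):
        follower.apply(change)

    # 3. From here on, the follower keeps consuming new changes as they arrive.
```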

Note - Handling Node Outages > Page 248 · Location 4830

Followers that recover from a fault can catch up in the same way that new replicas catch up from a snapshot

Highlight(yellow) - Handling Node Outages > Page 249 · Location 4838

Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. This process is called failover.

Note - Handling Node Outages > Page 249 · Location 4841

Failover steps: 1. Detect leader has failed; 2. Followers elect a new leader; 3. Clients are redirected to the new leader.
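
Roughly what an automatic failover controller does, as a sketch; the last_heartbeat, log_position, and follow attributes are stand-ins for whatever liveness and replication-progress signals a real system uses:

```python
import time

def maybe_failover(leader, followers, heartbeat_timeout=30.0):
    """1. Declare the leader dead after a missed heartbeat window.
    2. Promote the most up-to-date follower (minimizes lost writes).
    3. Repoint the remaining followers (and, implicitly, clients) at it."""
    if time.monotonic() - leader.last_heartbeat <= heartbeat_timeout:
        return leader   # leader still looks alive; nothing to do

    new_leader = max(followers, key=lambda f: f.log_position)
    for follower in followers:
        if follower is not new_leader:
            follower.follow(new_leader)
    return new_leader
```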

Highlight(yellow) - Handling Node Outages > Page 250 · Location 4859

If the former leader rejoins the cluster after a new leader has been chosen, what should happen to those writes? The new leader may have received conflicting writes in the meantime.

Highlight(yellow) - Handling Node Outages > Page 250 · Location 4865

in one incident at GitHub [13], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users.

Note - Handling Node Outages > Page 251 · Location 4877

Split brain: two nodes both think that they are leader and accept writes, which causes conflicts that cannot be resolved.

Highlight(yellow) - Handling Node Outages > Page 251 · Location 4886

If the system is already struggling with high load or network problems, an unnecessary failover is likely to make the situation worse, not better.

Highlight(yellow) - Implementation of Replication Logs > Page 252 · Location 4903

In the simplest case, the leader logs every write request (statement) that it executes and sends that statement log to its followers.

Note - Implementation of Replication Logs > Page 252 · Location 4904

Problems with statement-based replication: 1. Operations that use local state (such as timestamps) become non-deterministic. 2. Concurrent operations that are order-dependent can get executed in a different order on different replicas. 3. Side effects can play out non-deterministically unless they are carefully implemented to avoid these problems.
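
A toy illustration of problem 1, using Python dicts as stand-ins for replicas: replaying the statement re-evaluates the clock on each replica, while shipping the resulting value (as row-based replication does) keeps them identical.

```python
import time

def apply_statement(replica):
    # Statement-based: re-evaluate the nondeterministic expression on each replica.
    replica["created_at"] = time.time()

def apply_logged_value(replica, logged_value):
    # Row-based: ship the value the leader computed, so every replica agrees.
    replica["created_at"] = logged_value

leader, follower = {}, {}
apply_statement(leader)
time.sleep(0.01)
apply_statement(follower)
print(leader["created_at"] == follower["created_at"])   # False: replicas diverged

apply_logged_value(follower, leader["created_at"])
print(leader["created_at"] == follower["created_at"])   # True: values now match
```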

Highlight(yellow) - Implementation of Replication Logs > Page 252 · Location 4915

because there are so many edge cases, other replication methods are now generally preferred.

Highlight(yellow) - Implementation of Replication Logs > Page 253 · Location 4928

In the case of a log-structured storage engine (see “SSTables and LSM-Trees”), this log is the main place for storage. Log segments are compacted and garbage-collected in the background.

Highlight(yellow) - Implementation of Replication Logs > Page 253 · Location 4930

In the case of a B-tree (see “B-Trees”), which overwrites individual disk blocks, every modification is first written to a write-ahead log so that the index can be restored to a consistent state after a crash.

Highlight(yellow) - Implementation of Replication Logs > Page 253 · Location 4932

In either case, the log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers.

Highlight(yellow) - Implementation of Replication Logs > Page 253 · Location 4938

The main disadvantage is that the log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks. This makes replication closely coupled to the storage engine.

Highlight(yellow) - Implementation of Replication Logs > Page 254 · Location 4948

An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals.

Note - Implementation of Replication Logs > Page 254 · Location 4950

Called “logical” or “row-based” log replication
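
A sketch of what logical (row-based) log entries might look like and how a replica applies them; the exact record format here is invented for illustration.

```python
# Logical log entries describe writes at the granularity of rows, not disk blocks,
# so they are independent of the storage engine (and readable by external systems).
logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 42, "name": "Ada"}},
    {"op": "update", "table": "users", "key": 42, "changes": {"name": "Ada Lovelace"}},
    {"op": "delete", "table": "users", "key": 42},
]

def apply_entry(store, entry):
    """Apply one logical log entry to a dict-of-dicts standing in for a database."""
    table = store.setdefault(entry["table"], {})
    if entry["op"] == "insert":
        table[entry["row"]["id"]] = dict(entry["row"])
    elif entry["op"] == "update":
        table[entry["key"]].update(entry["changes"])
    elif entry["op"] == "delete":
        del table[entry["key"]]

replica = {}
for entry in logical_log:
    apply_entry(replica, entry)
print(replica)   # {'users': {}}: the row was inserted, updated, then deleted
```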

Highlight(yellow) - Implementation of Replication Logs > Page 255 · Location 4974

if you want to only replicate a subset of the data, or want to replicate from one kind of database to another, or if you need conflict resolution logic (see “Handling Write Conflicts”), then you may need to move replication up to the application layer.

Highlight(yellow) - Implementation of Replication Logs > Page 256 · Location 4990

Trigger-based replication typically has greater overheads than other replication methods, and is more prone to bugs and limitations than the database’s built-in replication. However, it can nevertheless be useful due to its flexibility.

Highlight(yellow) - Problems with Replication Lag > Page 257 · Location 5014

if an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind.

Highlight(yellow) - Problems with Replication Lag > Page 257 · Location 5018

this effect is known as eventual consistency

Highlight(yellow) - Problems with Replication Lag > Page 257 · Location 5023

In general, there is no limit to how far a replica can fall behind.

Note - Problems with Replication Lag > Page 257 · Location 5023

called “replication lag”

Highlight(yellow) - Reading Your Own Writes > Page 258 · Location 5041

read-after-write consistency, also known as read-your-writes consistency [24]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves.

Note - Reading Your Own Writes > Page 258 · Location 5043

p259 discusses some solutions
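
One of the techniques the book describes is for the client to remember the timestamp of its most recent write and only read from a replica that has caught up to it. A minimal sketch, where replicated_up_to is a hypothetical attribute exposing how far a follower has replicated:

```python
import time

class ReadYourWritesClient:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.last_write_at = 0.0   # timestamp of this user's most recent write

    def write(self, key, value):
        self.leader.write(key, value)
        self.last_write_at = time.time()

    def read(self, key):
        for follower in self.followers:
            # Only read from a follower that has replicated past our last write.
            if follower.replicated_up_to >= self.last_write_at:
                return follower.read(key)
        # No follower has caught up to our own writes yet: fall back to the leader.
        return self.leader.read(key)
```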

Highlight(yellow) - Monotonic Reads > Page 262 · Location 5090

Monotonic reads [23] is a guarantee that this kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency.

Note - Monotonic Reads > Page 262 · Location 5091

“this kind of anomaly” refers to reads that appear to move backward in time: a user first sees newer data and then, on a later read from a more lagged replica, sees older data

Highlight(yellow) - Monotonic Reads > Page 262 · Location 5093

One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica
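
A minimal sketch of that idea: derive the replica from a hash of the user ID, so a given user's reads always land on the same replica (the hash choice and replica list are illustrative).

```python
import hashlib

def replica_for_user(user_id, replicas):
    """Always map the same user to the same replica, so their successive reads
    never move backwards in time. (Rerouting on replica failure is the hard part.)"""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

replicas = ["replica-1", "replica-2", "replica-3"]
print(replica_for_user(2345, replicas))   # the same replica every time for user 2345
```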

Highlight(yellow) - Consistent Prefix Reads > Page 264 · Location 5115

Preventing this kind of anomaly requires another type of guarantee: consistent prefix reads [23]. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

Note - Consistent Prefix Reads > Page 264 · Location 5117

Prevents causality violations

Highlight(yellow) - Solutions for Replication Lag > Page 265 · Location 5129

Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.

Highlight(yellow) - Solutions for Replication Lag > Page 265 · Location 5132

It would be better if application developers didn’t have to worry about subtle replication issues and could just trust their databases to “do the right thing.” This is why transactions exist: they are a way for a database to provide stronger guarantees so that the application can be simpler.

Highlight(yellow) - Multi-Leader Replication > Page 266 · Location 5153

multi-leader configuration (also known as master–master or active/active replication). In this setup, each leader simultaneously acts as a follower to the other leaders.

Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 267 · Location 5166

In a multi-leader configuration, you can have a leader in each datacenter.

Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 268 · Location 5176

In a multi-leader configuration, every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters.

Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 268 · Location 5179

In a multi-leader configuration, each datacenter can continue operating independently of the others, and replication catches up when the failed datacenter comes back online.

Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 268 · Location 5195

Although multi-leader replication has advantages, it also has a big downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved

Highlight(yellow) - Use Cases for Multi-Leader Replication > Page 269 · Location 5200

autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible

Note - Handling Write Conflicts > Page 274 · Location 5292

Naive conflict resolution strategy: last write wins

Highlight(yellow) - Handling Write Conflicts > Page 274 · Location 5295

- Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica. This approach also implies data loss.
- Somehow merge the values together—e.g., order them alphabetically and then concatenate them (in Figure 5-7, the merged title might be something like “B/C”).
- Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time (perhaps by prompting the user).

Note - Handling Write Conflicts > Page 274 · Location 5301

Conflict resolution strategies other than last-write-wins
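
A miniature of the naive last-write-wins approach next to a merge-based resolution, assuming each conflicting version carries a timestamp alongside its value (the record shape is invented for illustration):

```python
def last_write_wins(versions):
    """Keep only the version with the highest timestamp; the other writes are lost."""
    return max(versions, key=lambda v: v["timestamp"])["value"]

def merge_by_concatenation(versions):
    """Order the conflicting values and concatenate them, like the 'B/C' example."""
    return "/".join(sorted(v["value"] for v in versions))

conflict = [
    {"value": "B", "timestamp": 101},
    {"value": "C", "timestamp": 100},
]
print(last_write_wins(conflict))          # 'B': the write of 'C' is silently discarded
print(merge_by_concatenation(conflict))   # 'B/C': both values preserved
```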

Highlight(yellow) - Multi-Leader Replication Topologies > Page 279 · Location 5409

To order these events correctly, a technique called version vectors can be used, which we will discuss later in this chapter (see “Detecting Concurrent Writes”).

Highlight(yellow) - Leaderless Replication > Page 280 · Location 5431

Some data storage systems take a different approach, abandoning the concept of a leader and allowing any replica to directly accept writes from clients. Some of the earliest replicated data systems were leaderless [1, 44], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became a fashionable architecture for databases after Amazon used it for its in-house Dynamo system [37]. Riak, Cassandra, and Voldemort are open source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as Dynamo-style.

Highlight(yellow) - Writing to the Database When a Node Is Down > Page 281 · Location 5447

in a leaderless configuration, failover does not exist.

Highlight(yellow) - Writing to the Database When a Node Is Down > Page 282 · Location 5456

when a client reads from the database, it doesn’t just send its request to one replica: read requests are also sent to several nodes in parallel.

Highlight(yellow) - Writing to the Database When a Node Is Down > Page 283 · Location 5465

Read repair: When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in Figure 5-10, user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica. This approach works well for values that are frequently read.
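
A sketch of read repair, assuming each replica returns a (version, value) pair and accepts versioned writes; the interface is invented for illustration:

```python
def read_with_repair(replicas, key):
    """Read from several replicas, return the newest value, and write it back
    to any replica that responded with a stale (lower-versioned) copy."""
    responses = [(replica, replica.read(key)) for replica in replicas]
    _, (newest_version, newest_value) = max(responses, key=lambda r: r[1][0])
    for replica, (version, _) in responses:
        if version < newest_version:
            replica.write(key, newest_value, newest_version)   # read repair
    return newest_value
```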

Highlight(yellow) - Writing to the Database When a Node Is Down > Page 283 · Location 5469

Anti-entropy process: In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before data is copied.

Highlight(yellow) - Writing to the Database When a Node Is Down > Page 283 · Location 5474

without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application.

Highlight(yellow) - Writing to the Database When a Node Is Down > Page 284 · Location 5484

if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes [44]. You can think of r and w as the minimum number of votes required for the read or write to be valid.

Highlight(yellow) - Writing to the Database When a Node Is Down > Page 285 · Location 5500

The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows: If w < n, we can still process writes if a node is unavailable. If r < n, we can still process reads if a node is unavailable.
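
The arithmetic behind this, spelled out (the n = 3 numbers are the chapter's own example):

```python
def quorum_holds(n, w, r):
    """w + r > n guarantees the set of nodes written and the set of nodes read
    overlap in at least one node, so some read sees the latest write."""
    return w + r > n

def tolerated_unavailable(n, w, r):
    """How many nodes may be down while still accepting writes and serving reads."""
    return {"writes": n - w, "reads": n - r}

print(quorum_holds(3, 2, 2))           # True: the chapter's n=3, w=2, r=2 example
print(tolerated_unavailable(3, 2, 2))  # {'writes': 1, 'reads': 1}
print(quorum_holds(3, 1, 1))           # False: a read may miss the node that was written
```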

Highlight(yellow) - Limitations of Quorum Consistency > Page 286 · Location 5521

If you have n replicas, and you choose w and r such that w + r > n, you can generally expect every read to return the most recent value written for a key.

Highlight(yellow) - Limitations of Quorum Consistency > Page 286 · Location 5533

With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value. On the upside, this configuration allows lower latency and higher availability:

Highlight(yellow) - Limitations of Quorum Consistency > Page 287 · Location 5538

However, even with w + r > n, there are likely to be edge cases where stale values are returned.

Highlight(yellow) - Limitations of Quorum Consistency > Page 287 · Location 5549

If a write succeeded on some replicas but failed on others (for example because the disks on some nodes are full), and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded.

Note - Sloppy Quorums and Hinted Handoff > Page 289 · Location 5588

My understanding from the preceding sections is that all of the nodes eventually have all of the values in a replicated, but not partitioned, cluster. You broadcast the write operation to all of the nodes, and you consider the write to be successful iff some minimum number of nodes respond. This section talks about “home nodes,” suggesting that some data lives on some of the nodes but not others. This means you have to keep track of which nodes have which data, which automatically sounds like partitioning to me. But the chapter starts with the claim that, for its entirety, we are assuming that all data goes to all nodes. I tried to discuss this with both ChatGPT and Claude, and they both gave those garbled, self-contradictory responses that they give when the answer is itself not well understood.

Highlight(yellow) - Sloppy Quorums and Hinted Handoff > Page 289 · Location 5595

quorums (as described so far) are not as fault-tolerant as they could be.