#18 Consider Latency: CAP Theorem Revisited

When dealing with distributed systems, we fall into the bag of problems related to consistency, availability and partition tolerance. At the same time, we tend to forget about latency.

Jul 20, 2024

When dealing with distributed systems, we face the problem of sharing information between multiple nodes. There are usually two main problems:

How do we keep the data consistent?
How do we make sure that the data is always available?

Years ago, Eric Brewer introduced the CAP Theorem. The abbreviation describes the following characteristics:

Consistency. All nodes see the same data at the same time. When a write is performed on one node, it is instantly reflected on all other nodes before the write is considered complete.
Availability. Each request received by a working node must result in a response. In other words, the system remains operational and can always respond to requests. This might eventually lead to inconsistent data (the data in one node differs from the data in another).
Partition tolerance. The system continues to operate in the case of a network partition (division into separate subnets).

According to the CAP Theorem, you can choose any two:

CP (Consistency + Partition Tolerance)
AP (Availability + Partition Tolerance)
CA (Consistency + Availability)

In practice, to consider the latter's option, we must assume that we operate in an environment free from network partitions. That’s why, in real-world applications, you usually select CP or AP.

What is worth remembering is that your system can be built on top of both. Let me explain.

Imagine the application has several modules, each responsible for various business capabilities (high cohesion).

One of your modules is responsible for adding product reviews. There is probably no need to keep the data strongly consistent across multiple nodes because if some users don’t see the latest review of the product, nothing wrong will happen. So, you can prioritize availability over consistency (AP).

On the other hand, another module responsible for checking a product's availability might require strong consistency to avoid users seeing different product availability and potentially buying something that is no longer in stock. In this case, you might prioritize consistency over availability (CP). This means that during a network partition, the system will maintain consistency across the nodes that can still communicate with each other, even if it means some nodes become unavailable to some users. This ensures that all successful requests see the most up-to-date inventory, preventing overselling but at the cost of potential temporary unavailability for some users.

Now, I would like to touch on two areas that can be problematic if only these three characteristics (consistency, availability, and partition tolerance) are considered.

What happens when my system is available, but I have to wait 32 seconds (or longer) for the response because of latency? Can I still consider it available? Is such availability acceptable?

This fourth characteristic - latency - would be a reasonable consideration when planning your distributed systems.

There is one more issue related to availability. Is the system still available if I can execute read operations (available replicas) but cannot perform writes (unavailable primary)? Let me know what you think about it! I am curious—I have already talked with several architects, and we have various opinions.

For more details about the latency-centric approach, look at Martin Kleppmann's document, A Critique of the CAP Theorem. He proposes there an alternative called Delay Sensitivity. It’s an excellent read!

You can also read about a concept called PACELC by Daniel J. Abadi that also considers latency.

Do you consider latency while designing your software systems?

Join the next Advanced Distributed Systems Design course with Udi Dahan in London this fall and change the way you think about designing software systems. Learn more

Fractional Architect

Discussion about this post