The foundation of every algorithm in a distributed system is on what kind of underlining hardware system it will execute on. In practice, nodes (physical or virtual machines) and communication links between them (the physical network layer), can behave in a variety of ways. So for an algorithm to be correct, we need to assume what properties the underlining systems will have. The set of these assumptions is what is known as a system model.
Three main characteristics or properties a system model will describe are:
- How the network will behave
- How the node will behave
- How the timing will behave
When designing a system, we need to choose a model for each of these three parts.
Networks are unreliable. That is a fact of life we are all very well aware of. There are many reasons why a message can be dropped or delayed and of them is a cow stepping on a fiber optic cable.
Typically a distributed system is built on top of point-to-point communications links between two nodes and the assumption about them is what will determine a network model of the system. The network behavior is defined by one of the following link properties:
- Reliable link: the simplest model which assumes a perfect link over which a message will always go through. A message is received if and only if it is previously sent and if a message is sent then it will be received. We will only assume here that messages can be reordered. This is the strongest assumption.
- Fail-loss link: a message sent might get through or it might not, but with a certain amount of retries in a finite amount of time we can assume that it will eventually get through. Besides the possibility of messages being lost, they can also be duplicated or reordered.
- Arbitrary link: here the link can arbitrarily do anything to a message. The assumption is that, besides the problems we get with a fair-loss link, there can be some malicious active adversary which can do anything with a link and messages going over it (spoofing, modifying, dropping…)
The important and interesting thing about these three models of network behavior is that it is possible to convert one to other. So for example, if we retry the sending of a message and utilize deduplication on the recipient side (because it can receive the same message multiple times) we could turn a fair-loss link into a reliable one. And if we use some encryption protocol for our communication, like TLS, we can almost convert the arbitrary link to a fair-loss. I say almost because if a bad guy (or a girl) decides to block all communications over the link, we can’t do much about it besides cursing (not good) or something similar.
A common term popping up when talking about networks in a distributed system is network partition which simply means that the nodes are working fine, but the links between them are interrupted. Usually, this interruption is for some finite period of time and eventually, it will get repaired. The situation which can typically be produced with a network partition is that we can have two subgroups of nodes that can’t communicate with each other. This is called a split-brain problem and it’s probably one of the hardest problems in any distributed system.
Network partition is the P (partition tolerant) in the CAP theorem.
Like network links, nodes can (and will) fail in various ways. These behaviors are the next part describing a system model and they can be:
- Crash-stop (fail-stop): for this kind of failure we assume that when a node crashes (at any moment) it will forever stop with its execution. A node being dead forever is somewhat of a simplified assumption but it can actually happen, for example, if a phone (or some other device) is a node and it gets dropped in a pool of water it will probably be dead forever.
- Crash-recovery (fail recovery): a node can crash (also at any moment) but it can resume executing some time later. We assume that the node will lose its in-memory state (everything not written to some non-volatile storage).
- Byzantine (fail-arbitrary): like a byzantine general in a Byzantine general’s problem, a node, besides simply crashing, can do more-or-less anything, including malicious behavior. With this assumption, a faulty node is deviating from the specified algorithm which all nodes have to follow.
If a node is not faulty it is then called a correct node.
The important thing to note here is that a node does not necessarily know that some other node is correct or faulty. For that, there are different fault detection techniques, of which the most simple and powerful are timeouts.
The third part describing a system model is timing behavior. By timing behavior, it is simply meant the messages’ response times and latencies. Like with networks and nodes, here we can also choose one of three properties:
- Synchronous: in a synchronous system model a message latency will not go over some known upper bound. So when we send a message over a network, we know the maximum time it will take for it to reach its destination. Also, a node will execute the algorithm at a known speed. This is the strongest assumption.
- Partially synchronous: the message latencies are known, but for some finite, but unknown, periods of time the latencies can go over the known upper bounds. Node execution speed can also decrease from some usual values.
- Asynchronous: with this assumption, there are basically no timing guarantees. Messages can be delayed arbitrarily, and node execution can slow down or even pause (say stop-the-world GC pause) for some time. This is the weakest model.
Basically, a partially synchronous model is some compromise between the two extremes, the synchronous and asynchronous models. There is hardly a problem that can be solved with no timing guarantees at all and it is dangerous to assume synchronous timings. If we assume we have a synchronous model and it briefly goes asynchronous, our algorithm can fail catastrophically. Distributed algorithms are very sensitive to timing assumptions.
Usually, a system can be synchronous. Networks can have pretty predictable latencies (especially inside a data center) and nodes execute at almost known speeds, but there are some things that can violate this assumption like a message being lost or retried, various network congestions, stop-the-world GC pauses (depending on the GC algorithm and heap size, the pause can last over a minute) or page faults. Depending on the system we are building, these hiccups may not be only hiccups.
Consensus system models
Popular consensus algorithms like Paxos (Multi-Paxos) and Raft assume partially synchronous and crash-recovery system models. One of the reasons is that this model is more-or-less what we can typically find in the real world, network latencies that can spike and a node that can crash also can get recovered (if your node is in a docker container and you configure some restart policy, you have a crash-recovery node behavior).
Besides the above reason, the more important thing here is the timing behavior. You may ask why not choose an asynchronous system model, it would be ideal, a model with no timing guarantees means that we can run our algorithm on pretty much anything? The reason why a partially synchronous is the weakest model we can choose is because it is impossible to implement a deterministic consensus algorithm on an asynchronous system. The proof for this impossibility is the well-known so-called FLP result (Fischer, Linch, Paterson).
For a distributed algorithm to be correct it is crucial that it operates in the context or a system model, it is designed for, and for an algorithm to be designed correctly it is mandatory to assume (and assume right) certain properties of the underlining system. These assumptions are what is known as a system model and it will basically describe three key parts of a system: network behavior (e.g. message loss), node behavior (e.g. crashes), and timing behavior (e.g. latency).