
Out of the Tar Pit

March 20, 2015

Out of the Tar Pit – Moseley & Marks 2006

This is the final Desert Island Paper choice from Jonas Bonér, and a great way to round out the week. ‘Out of the Tar Pit’ was the 10th paper that I covered in the #themorningpaper series, but at that time I was only giving highlights on twitter, not via the blog format. So I’m grateful for a chance to revisit it. It’s a long read at 66 pages, but written in a very easy to digest essay style. The crux of the argument is that complexity lies at the root of our problems with software systems. By separating the essential (to the problem domain) complexity from any accidental complexity – and avoiding the latter altogether wherever possible – the authors argue for a better way of building software.

Complexity and its causes

Complexity is the root cause of the vast majority of problems with software today… The primary status of complexity as the major cause comes simply from the fact that being able to understand a system is a prerequisite for avoiding all of them, and of course it is this which complexity destroys.

How do we try to understand systems? We can try to understand a system from the outside, by testing it and seeing what it does, but this approach has limits. Far more important is the process of informal reasoning. This is greatly hampered by state.

The mental processes which are used to do this informal reasoning often revolve around a case-by-case mental simulation of behaviour… As the number of states – and hence the number of possible scenarios that must be considered – grows, the effectiveness of this mental approach buckles almost as quickly as testing… For every single bit of state that we add, we double the total number of possible states.
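The doubling is easy to check by brute force. This illustrative snippet (mine, not the paper's) simply enumerates every combination of n boolean flags:

```python
from itertools import product

def state_count(n_flags):
    """Count the distinct states of n independent boolean flags."""
    return len(list(product([False, True], repeat=n_flags)))

# Each additional flag doubles the space a reader must consider:
# 3 flags -> 8 states, 4 -> 16, 30 -> over a billion.
```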

State also contaminates stateless procedures if they make use of – even indirectly – any other procedure that is stateful.

…it is our belief that the single biggest remaining cause of complexity in most contemporary large systems is state, and the more we can do to limit and manage state, the better.

After state, control logic is the biggest source of complexity. “When a programmer is forced (through use of a language with implicit control flow) to specify the control, he or she is being forced to specify an aspect of how the system should work rather than simply what is desired.”

Like basic control flow such as branching (but unlike sequencing), concurrency is normally specified explicitly in most languages. The most common model is “shared state concurrency”, in which explicit synchronization must be specified. The impact this has on informal reasoning is well known: the difficulty comes from the further growth in the number of scenarios that must mentally be considered as the program is read.

In the same vein, we’ve seen again and again that distribution brings with it the same problems.

Once complexity is let into a system, things tend to go downhill fairly quickly: “Complexity breeds complexity.”

Duplication is a prime example of this – if (due to state, control, or code volume) it is not clear that the functionality already exists, or it is too complex to understand whether what exists does exactly what is required, there will be a strong tendency to duplicate. This is particularly true in the presence of time pressures.

Powerful languages can also be a source of complexity when trying to understand a system:

The bottom line is that the more powerful a language (i.e. the more that is possible within the language), the harder it is to understand systems constructed in it.

Moseley and Marks then turn their attention to programming language paradigms to see how they stack up. Object-orientation (‘from traditional passive objects to the active/actor styles’) can make it hard to enforce constraints across multiple objects, and introduces complexity of its own with the concept of object identity.

The bottom line is that all forms of OOP rely on state (contained within objects) and in general all behaviour is affected by this state. As a result of this, OOP suffers directly from the problems associated with state, and as such we believe that it does not provide an adequate foundation for avoiding complexity.

In contrast, the primary strength of (pure) functional programming is that by avoiding state and side-effects the entire system gains the property of referential transparency. The problem of state moving from encapsulation within an object to a growing list of parameters that are passed around from function to function is discussed. “This does not detract from the general power of the functional approach.”

More generally, we would argue that whatever the language being used there are large benefits to be had from avoiding hidden, implicit, mutable state.
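A minimal Python sketch of the contrast (the names are mine, not the authors'): the stateful version's answer depends on hidden history, while the pure version is referentially transparent.

```python
class StatefulAdder:
    """Hidden, mutable state: calling add twice with the same
    argument gives different answers, depending on history."""
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x
        return self.total

def pure_add(total, x):
    """Referentially transparent: same inputs, same output, always."""
    return total + x

s = StatefulAdder()
first = s.add(5)    # 5
second = s.add(5)   # 10 -- the same call, a different result
```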

Writing in 2006, the authors lament that ‘such arguments have been insufficient to result in widespread adoption of functional programming.’ (Adoption and awareness have certainly increased since then, but not to the point that Moseley and Marks would desire).

We must therefore conclude that the main weakness of functional programming is the flip side of its main strength – namely that problems arise when (as is often the case) the system to be built must maintain some kind of state…. One potential approach is the elegant system of monads used by Haskell… Again, despite their huge strengths, monads have as yet been insufficient to give rise to the widespread adoption of functional techniques.

Logic programming “offers the tantalising promise of the ability to escape from the complexity problems caused by control.”

Essential and Accidental Complexity

Essential Complexity is inherent in, and the essence of, the problem (as seen by the users).

Accidental Complexity is all the rest – complexity with which the development team would not have to deal in the ideal world (e.g. complexity arising from performance issues and from suboptimal language and infrastructure).

Note that the definition of essential is deliberately more strict than common usage. Specifically when we use the term essential we will mean strictly essential to the users’ problem (as opposed to – perhaps – essential to some specific, implemented system, or even – essential to software in general).

Essential complexity is that which even in the ideal world the team will have to be concerned with. However…

given that in the real world not all possible ways are practical, the implication is that any real development will need to contend with some accidental complexity. The definition does not seek to deny this – merely to identify its secondary nature.

State will be considered accidental state if it can be omitted in the ideal world, and the same applies to control. In the ideal world we would produce formal requirements ensuring that there is no relevant ambiguity in them, and then be able to simply execute those formal requirements:

This state of affairs is absolute simplicity – it does not seem conceivable that we can do any better than this even in the ideal world. It is interesting to note that what we have just described is in fact the very essence of declarative programming – i.e. that you need only specify what you require, not how it must be achieved.

All control logic is therefore considered to be accidental complexity.

Data may be provided directly to the system as input, or derived. All derived data is accidental state as it can always be re-derived. (See the Tachyon paper for an example of this concept in-the-large).
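A toy illustration of the point (hypothetical names): a cached total is accidental state because it can always be thrown away and re-derived from the input data.

```python
# Essential input state: data provided directly to the system.
orders = [("apples", 3), ("pears", 2), ("apples", 1)]

def derive_total(orders):
    """Derived data: recomputable from the essential state at any time."""
    return sum(qty for _, qty in orders)

# Accidental state: a performance cache. Dropping it loses nothing,
# because we can always re-derive it from `orders`.
cached_total = derive_total(orders)
```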

It is our belief that the vast majority of state (as encountered in typical contemporary systems) simply isn’t needed (in this ideal world). Because of this, and the huge complexity which state can cause, the ideal world removes all non-essential state.

How close is it possible to get to the ideal world in the real one?

The Way Forward

There are two possible reasons why in practice – even with optimal language and infrastructure – we may require complexity which strictly is accidental:

  • Performance – when accidental state and control is required for efficiency
  • Ease of expression – accidental state can be the most natural way to express logic in some cases

We believe that – despite the existence of required accidental complexity – it is possible to retain most of the simplicity of the ideal world in the real one. We now look at how this might be achievable. Our recommendations for dealing with complexity (as exemplified by both state and control) can be summed up as avoid, and separate.

Avoid accidental state and complexity whenever you can, and if you can’t, then separate it from the rest of the system. “There is nothing particularly profound in these recommendations, but they are worth stating because they are emphatically not the way most software is developed today.” The foremost separation is a logic/state split in which all complexity of any kind is separated out from the pure logic of the system.

One implication of this overall structure is that the system (essential + accidental but useful) should still function completely correctly if the “accidental but useful” bits are removed – albeit possibly unacceptably slowly.

If we’ve separated out the parts in this way, they will each be of a very different nature.

… as a result, it may be ideal to use different languages for each. These languages would each be oriented (i.e. restricted) to their specific goal – there is no sense in having control specification primitives in a language for specifying state.

The recommended architecture is shown below:

[Figure: Components of an FRP System]

Essential state is the foundation of the system. It makes no reference to any of the other parts. Essential logic is the heart of the system (sometimes termed the ‘business logic’). It expresses what must be true, but does not say anything about how, when, or why the state might change dynamically. (We could also describe this as the system invariants to tie it back to the concepts we were looking at yesterday). The logic specification references only the essential state. Accidental state and control is the third component, and changing it can never affect the other two.

The key difference between what we are advocating and existing approaches (as embodied by the various styles of programming language) is a high level separation into three components – each specified in a different language. It is this separation which allows us to restrict the power of each individual component, and it is this use of restricted languages which is vital in making the overall system easier to comprehend. Doing this separation when building a system may not be easy, but we believe that for any large system it will be significantly less difficult than dealing with the complexity that arises otherwise.

For specifying state the authors recommend the relational model, and the relational algebra for specifying manipulations. “Integrity in the relational model is maintained simply by specifying – in a purely declarative way – a set of constraints which must hold at all times.”

This is augmented by functional programming, to give “Functional Relational Programming (FRP).”

FRP is currently a purely hypothetical approach to system architecture that has not in any way been proved in practice. It is however based firmly on principles from other areas (the relational model, functional and logic programming) which have been widely proven. In FRP all essential state takes the form of relations, and the essential logic is expressed using relational algebra extended with pure user-defined functions.

Essential state is a relational definition of the stateful components of the system. Essential logic contains derived-relation definitions, integrity constraints, and pure functions. The accidental state and control component contains a declarative specification of a set of performance optimizations for the system.

Feeder components convert inputs into relational assignments – i.e. cause changes to the essential state. Observer components generate output in response to changes which they observe.

At a minimum, observers will only need to specify the name of the relation which they wish to observe. The infrastructure which runs the system will ensure that the observer is invoked (with the new relation value) whenever it changes.
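A hypothetical sketch of that infrastructure (every name here is mine, not the paper's): observers register against a relation name and are invoked with the new relation value whenever a feeder assigns to it.

```python
class Infrastructure:
    """Toy FRP-style runtime: relations are named sets of tuples;
    observers are called back with the new value on every change."""
    def __init__(self):
        self.relations = {}
        self.observers = {}   # relation name -> list of callbacks

    def observe(self, name, callback):
        """An observer need only name the relation it wishes to watch."""
        self.observers.setdefault(name, []).append(callback)

    def assign(self, name, new_value):
        """Feeders call this to change essential state; the runtime
        then notifies every observer of that relation."""
        self.relations[name] = frozenset(new_value)
        for cb in self.observers.get(name, []):
            cb(self.relations[name])

infra = Infrastructure()
seen = []
infra.observe("offers", seen.append)                 # observer registers
infra.assign("offers", {("14 Acacia Ave", 500000)})  # feeder updates state
```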

The concluding part of the paper is a worked example of a system to support a real estate business.

Dedalus, Bloom, and Edelweiss

It’s interesting to compare the vision set out in “Out of the Tar Pit” with some of the research undertaken by Joe Hellerstein and his team at Berkeley.

  • In The Declarative Imperative we see the imperative to adopt a declarative approach grounded in relational logic, and the creation of a datalog derivative called Dedalus for this purpose.

  • In Consistency analysis in Bloom: A CALM and collected approach we see how Bloom, based on Dedalus, can be used as a separated embedded language to specify essential state and logic.

  • And in Edelweiss we see the pursuit of a clean separation between essential and accidental complexity that can significantly simplify Bloom programs, leaving the programmer to specify the desired outcomes and Edelweiss to worry about achieving them efficiently.

Coordination Avoidance in Database Systems

March 19, 2015

Coordination Avoidance in Database Systems – Bailis et al. 2014

The very title of this paper speaks to the theme we’ve been looking at so far this week – how to reduce the amount of coordination needed in a distributed system. (Which seems fitting having just spent the prior two weeks looking at how costly and complex that coordination can be when it is needed). There’s a particularly stunning result with respect to the TPC-C benchmark – after looking at the problem from a fresh perspective, and without breaking any of the invariants required by TPC-C, the authors were able to create a linearly scalable system with 200 servers processing 12.7M tps – about 25x the next-best system.

One of the things that strikes me when I read the large-scale systems papers from the likes of Google and Amazon is that distributed system design involves many trade-offs, and these companies are operating at a scale whereby they can build custom distributed systems which make those trade-offs in such a manner as to best fit their workloads. Not all of us have the same luxury of being able to do that! But as we’ve seen with Bloom and CALM, and with CRDTs, we can try to build our applications in such a way as to minimise coordination requirements. This collaboration between application and underlying datastore seems to be key to unlocking the next level of progress. And it’s the area that Bailis et al. focus on in this paper.

Minimizing coordination, or blocking communication between concurrently executing operations, is key to maximizing scalability, availability, and high performance in database systems. However, uninhibited coordination-free execution can compromise application correctness, or consistency. When is coordination necessary for correctness? The classic use of serializable transactions is sufficient to maintain correctness but is not necessary for all applications, sacrificing potential scalability. In this paper, we develop a formal framework, invariant confluence, that determines whether an application requires coordination for correct execution.

When an application programmer expresses their correctness criteria in the form of invariants, the invariant confluence framework is able to determine which operations can safely be executed without coordination.

Section 2 of the paper contains a nice demonstration of just how costly coordination can be – and becomes ever more so as the RTT between coordinating peers increases and the number of participants increases. A system that can do over 1000 tps coordinating between two servers on a local network can see performance drop-off to only 2 tps when coordinating across all 8 EC2 availability zones. I’m going to assume you’re already convinced that coordination is costly, and move onto the parts of the paper looking at what we can do to safely avoid it.

An invariant is defined as a predicate over the state of the database at a replica – given the state of the database, it can be true or false. For example, a foreign-key constraint. We call a replica invariant-valid (I-valid) if all invariants are true at the replica, and a system is globally I-valid if all replicas always contain I-valid state. In the system model, …

Each transaction commits on its local replica, and the result of each transaction is reflected in the transaction’s local server state. After the transactions have completed, the servers exchange state and, after applying the merge operator, converge to the same state. Any transactions executing later on either server will obtain a replica that includes the effects of both transactions.

Whether an application requires coordination for correctness depends on both its invariants and its transactions. Take an invariant I, and a set of transactions T. Assume a starting state in which the invariant holds (I-valid). Each transaction should take us to a new state in which the invariant also holds. By chaining these together, we can reason about the states the database can get into – including when transactions initiated at different replicas are merged.

We say that Si is an I-T-reachable state if, given an invariant I and set of transactions T (with merge function ⊔), there exists a (partially ordered) sequence of transaction and merge function invocations that yields Si, and each intermediate state produced by transaction execution or merge invocation is also I-valid. We call these previous states ancestor states. Note that each ancestor state is either I-T-reachable or is instead the initial state.

The central notion in the paper, I-confluence, can now be defined:

A set of transactions T is I-confluent with respect to invariant I if, for all I-T-reachable states Di, Dj with a common ancestor state, Di merged with Dj is I-valid.

Roughly translated, we can let state diverge and then later on bring it back together with the merge operation, and the invariant will always hold as we do this. And if we have I-confluence, then we don’t need coordination:

A globally I-valid system can execute a set of transactions T with coordination-freedom, transactional availability, (and) convergence if and only if T is I-confluent with respect to I… If I-confluence holds, there exists a correct, coordination-free execution strategy for the transactions; if not, no possible implementation can guarantee these properties for the provided invariants and transactions.

If I-confluence does not hold, then at least one of the transaction sequences will have to forego availability or coordination-freedom, or the system will have to forego convergence. Given those choices, adding in some coordination seems like the best option! The informal rule is “coordination can only be avoided if all local commit decisions are globally valid.”
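For small examples the definition can be checked directly: take two I-valid states that diverged from a common ancestor, merge them, and test the invariant. This illustrative sketch (all names are mine) shows a "unique usernames" invariant failing that test under set-union merge, so concurrent insertion of usernames is not I-confluent and needs coordination.

```python
def merge(d1, d2):
    """Set-union merge of two replica states (sets of (id, username) rows)."""
    return d1 | d2

def unique_usernames(db):
    """Invariant I: no username appears in more than one row."""
    names = [name for _, name in db]
    return len(names) == len(set(names))

ancestor = {(1, "alice")}            # common ancestor state, I-valid
d_i = ancestor | {(2, "bob")}        # replica i inserts bob...
d_j = ancestor | {(3, "bob")}        # ...and so, concurrently, does replica j

# Each local commit decision was valid, but it was not globally valid:
locally_valid = unique_usernames(d_i) and unique_usernames(d_j)
globally_valid = unique_usernames(merge(d_i, d_j))
```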

I-confluence analysis is independent of any given implementation, and effectively “lifts” prior discussions of scalability, availability, and low latency to the level of application (i.e., not “I/O”) correctness. This provides a useful handle on the implications of coordination-free execution without requiring reasoning about low-level properties such as physical data location and the number of servers.

This does of course rely on the developer correctly specifying invariants:

I-confluence analysis only guards against violations of any provided invariants. If invariants are incorrectly or incompletely specified, an I-confluent database system may violate application-level correctness. If users cannot guarantee the correctness and completeness of their invariants and operations, they should opt for a more conservative analysis or mechanism such as employing serializable transactions. Accordingly, our development of I-confluence analysis provides developers with a powerful option—but only if used correctly. If used incorrectly, I-confluence allows incorrect results, or, if not used at all, developers must resort to existing alternatives.

Language design to support more automated I-confluence analysis is an area for future research. The authors found I-confluence analysis by hand to be ‘non-trivial, but feasible in practice.’

Several SQL constraints are analyzed with respect to I-confluence. A NOT NULL constraint for example is easily shown to be I-confluent. PRIMARY KEY and UNIQUE constraints are not I-confluent under insertion, but they are under reads and deletes. If the creation of unique values is delegated to the database, and it has a scheme that guarantees uniqueness across replicas (e.g. by including unique replica ids), then inserting can be made I-confluent too. Insertions under FOREIGN KEY constraints are I-confluent, as are cascading deletes, but arbitrary deletion of records is unsafe.
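The replica-id scheme for making insertion I-confluent can be sketched like this (illustrative, not the paper's code): each replica stamps generated keys with its own id, so keys generated concurrently at different replicas can never collide, and union-merge preserves uniqueness without coordination.

```python
class Replica:
    """Generates keys of the form (replica_id, local_counter):
    globally unique without any cross-replica coordination."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counter = 0

    def next_key(self):
        self.counter += 1
        return (self.replica_id, self.counter)

r1, r2 = Replica("r1"), Replica("r2")
# Concurrent inserts at two replicas -- the replica id keeps them disjoint.
keys = [r1.next_key(), r1.next_key(), r2.next_key()]
```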

To avoid ever growing immutable sets, many databases have used alternate strategies such as last-writer-wins:

…if we implement a user’s account balance using a “last writer wins” merge policy, then performing two concurrent withdrawal transactions might result in a database state reflecting only one transaction (a classic example of the Lost Update anomaly). To avoid variants of these anomalies, many optimistic, coordination-free database designs have proposed the use of abstract data types (ADTs), providing merge functions for a variety of uses such as counters, sets, and maps that ensure that all updates are reflected in final database state. For example, a database can represent a simple counter ADT by recording the number of times each transaction performs an increment operation on the counter. I-confluence analysis is also applicable to these ADTs and their associated invariants.

For example, a row level ‘greater than’ threshold invariant is I-confluent for counter increment and assign, but not for decrement.
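A quick arithmetic check of that claim (my own example): with a merge that reflects every replica's updates in the final value, increments can only move the value further above the threshold, while decrements that are each locally safe can jointly violate it.

```python
THRESHOLD = 0

def merged_value(start, deltas):
    """A merge that reflects every replica's updates in the final value."""
    return start + sum(deltas)

start = 100
# Each replica decrements by 60; each local result (40) respects the
# invariant value > THRESHOLD...
local_ok = (start - 60) > THRESHOLD
# ...but the merged state reflecting both decrements does not:
merged_ok = merged_value(start, [-60, -60]) > THRESHOLD
# Increments, by contrast, can never take a valid state below the threshold:
increments_ok = merged_value(start, [30, 45]) > THRESHOLD
```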

If users wish to “read their writes” or desire stronger “session” guarantees (e.g., maintaining recency on a per-user or per-session basis), they must maintain affinity or “stickiness” with a given (set of) replicas. These guarantees are also expressible in the I-confluence model and do not require coordination between different users’ or sessions’ transactions.

So much for the theory, what happened when the authors tried out these ideas on the TPC-C New Order benchmark workload?

The TPC-C benchmark is the gold standard for database concurrency control both in research and in industry, and in recent years has been used as a yardstick for distributed database concurrency control performance. How much coordination does TPC-C actually require for a compliant execution?

Of the 12 invariants found in TPC-C, 10 of them are I-confluent! This means that there exists some execution strategy for these ten that does not require coordination. Of course, you still have to build an implementation that achieves this.

As one coordination-free execution strategy that respects the foreign key and materialized view invariants, we can use RAMP transactions, which provide atomically visible transactional updates across servers without relying on coordination for correctness. In brief, RAMP transactions employ limited multi-versioning and metadata to ensure that readers and writers can always proceed concurrently: any client whose reads overlap with another client’s writes to the same item(s) can use metadata stored in the items to fetch any “missing” writes from the respective servers.

(We’ll look at RAMP transactions in more detail in a future edition of The Morning Paper).

Of the two remaining invariants in TPC-C, one relates to a transaction that the benchmark allows to be run asynchronously and in batch mode. That leaves the constraint that New Order IDs are sequentially assigned. It is always possible to fall back to serializable isolation here, but as a more efficient solution the authors introduce a layer of indirection and defer New-Order ID assignment until commit time.

In effect, the New-Order ID assignment can use a nested atomic transaction upon commit, and all coordination between any two transactions is confined to a single server.

Sounds like a plan…

We subsequently implemented the above execution strategy in a distributed database prototype to quantify the overheads associated with coordination in TPC-C New-Order. In brief, the coordination-avoiding query plan scales linearly to over 12.7M transactions per second on 200 servers while substantially outperforming distributed two-phase locking.

In conclusion:

These results begin to quantify the effects of coordination-avoiding concurrency control. If considering application-level invariants, databases only have to pay the price of coordination when necessary. We were surprised that the “current industry standard for evaluating the performance of OLTP systems” was so amenable to coordination-avoiding execution—at least for compliant execution as defined by the official TPC-C specification…. Anecdotally, our conversations and experiences with real-world application programmers and database developers have not identified invariants that are radically different than those we have studied here. A simple thought experiment identifying the invariants required for a social networking site yields a number of invariants but none that are particularly exotic (e.g., username uniqueness, foreign key constraints between updates, privacy settings). Nonetheless, we view the further study of real-world invariants to be a necessary area for future investigation. In the interim, these preliminary results hint at what is possible with coordination-avoidance as well as the costs of coordination if applications are not I-confluent.

A Comprehensive study of Convergent and Commutative Replicated Data Types

March 18, 2015

A comprehensive study of Convergent and Commutative Replicated Data Types – Shapiro et al. 2011

This is the third of five Desert Island Paper choices from Jonas Bonér, and it continues the theme of avoiding coordination overhead in a principled manner whenever you can. As we saw yesterday, there are trade-offs between consistency, failure tolerance, latency tolerance, and performance. Let us not confuse the family of eventual consistency approaches with an ill-defined ‘approximate’ consistency though:

[Eventual consistency] performs well (as the consensus bottleneck has been moved off the critical path), and the weaker consistency is considered acceptable for some classes of applications. However, reconciliation is generally complex. There is little theoretical guidance on how to design a correct optimistic system, and ad-hoc approaches have proven brittle and error-prone.

This paper introduces the concept of a CRDT, a “simple, theoretically sound approach to eventual consistency.” Let’s address one of the pressing distributed systems questions of our time right here: “what does CRDT stand for?” We’ve seen over the last couple of weeks that there are two fundamental approaches to replication: you can execute operations at a primary and replicate the resulting state, or you can replicate the operations themselves. If you’re replicating state, then given some convergence rules for state, you can create Convergent Replicated Data Types. If you’re replicating operations, then given operations carefully designed to commute, you can create Commutative Replicated Data Types. Conveniently both ‘convergent’ and ‘commutative’ begin with C, so we can call both of these CRDTs. In both cases, the higher order goal is to avoid the need for coordination by ensuring that actions taken independently can’t conflict with each other (and thus can be composed at a later point in time). Thus we might also call them Conflict-free Replicated Data Types.

Think of it a bit like this: early on, languages gave us standard data type implementations for set, list, map, and so on. Then we saw the introduction of concurrent versions of collections and related data types. With CRDTs, we are seeing the birth of distributed collections and related data types. Eventually any self-respecting language/framework will come with a distributed collections library – Riak already supports CRDTs, and Jonas has an Akka CRDT library on GitHub. As you read through the paper, it’s tempting to think “oh, these are pretty straightforward to implement,” but pay attention to the section on garbage collection – as we saw with Edelweiss, keeping state from growing without bound makes production implementations considerably more difficult.

Since, by design, a CRDT does not use consensus, the approach has strong limitations; nonetheless, some interesting and non-trivial CRDTs are known to exist.

Assume that some number of replicas of an object (e.g. a Set) are distributed. We’d like these to eventually converge to the same state, or more precisely, we’d like all query operations on the object to return the same result at each replica. For any two replicas i and j, this requires both a safety and a liveness condition:

  • Safety: if the causal history of i and j is the same, then the abstract state of i and j is equivalent.
  • Liveness: if some event e is in the causal history of i then it will eventually be in the causal history of j.

If we have this pairwise eventual convergence for any i and j, then this implies that any non-empty subset of replica objects will converge so long as all replicas eventually receive all updates.

State-based CRDTs

The terminology can seem intimidating, but the core idea is actually very simple. If you understand max(x,y) you can understand how state-based CRDTs work. Let’s suppose our state is a simple integer value. Given two values 4 and 6, then the max is 6. We could also describe 6 as the least upper bound of the two values: the smallest value such that 4 and 6 are both ≤ to it. Of course, the state values we want to compose won’t always be integers, but so long as we can define a meaningful least upper bound (LUB) function over the state, we can create a CRDT. You might hear the term join semilattice being thrown around. This simply means a set of values with an LUB-based partial order function defined over them. (Such as the set of integers, and max).

Take an object whose values come from such a semilattice, and define the merge operation for state values to be its least upper bound function. If the values of such an object only get larger (as defined by the least upper bound function) – a monotonic semilattice, then the type is a CRDT. Replicas of such a type will eventually converge.
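A minimal sketch of such a type (mine, not the paper's): the state is an integer, the partial order is ≤, and merge is max – the least upper bound. Because max is commutative, associative, and idempotent, replicas converge regardless of the order or duplication of merges.

```python
class MaxRegister:
    """State-based CRDT: state is an int, merge is max (the LUB)."""
    def __init__(self, value=0):
        self.value = value

    def update(self, v):
        self.value = max(self.value, v)   # state only ever grows: monotonic

    def merge(self, other):
        self.value = max(self.value, other.value)

a, b = MaxRegister(), MaxRegister()
a.update(4)        # replica a sees 4
b.update(6)        # replica b sees 6
a.merge(b)         # exchange state in either order...
b.merge(a)         # ...and both converge on the LUB, 6
```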

Operation-based CRDTs

For an operation-based object, we assume a delivery channel (such as TCP) that can deliver updates in the delivery order specified by the data type (e.g. causal delivery). Operations not covered by this order are said to be concurrent. If all such concurrent operations commute, then all execution orders consistent with the delivery order are equivalent, and all replicas will converge to the same state. As an example, addition and subtraction commute (+7, -5 gives the same result as -5, +7).

Interestingly, it is always possible to emulate a state-based object using the operation-based approach, and vice-versa.

Example CRDTs

On the surface, it may seem a little underwhelming. The clever part is that if these rules are all we’ve got to work with, how can we design data types that stick to the constraints and yet still do meaningful things? Because if we can do that, we get all the nice convergent properties. To a man with a hammer, everything looks like a nail. We have a hammer, it’s time to find some nails…

Starting with the humble counter, an operation-based counter is straightforward since addition and subtraction commute. The state-based counter highlights some of the interesting issues that arise in designing CRDTs though. Let’s start with a counter that only increments: if two independent replicas both increment the counter (from 0 to 1), and then we merge with max, we’ll end up with 1, not the desired 2. So let’s keep a more complex state structure modeled after a vector clock, with one entry in the vector for each replica; an increment at a given replica increments its own entry. Now merge can take the maximum of each entry, and the counter value is the sum of all entries…
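That vector design is usually known as a G-Counter; a sketch (replica ids and names are illustrative):

```python
class GCounter:
    """Grow-only counter: one entry per replica, merge is entrywise max."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.entries = {}   # replica id -> number of local increments

    def increment(self):
        self.entries[self.replica_id] = self.entries.get(self.replica_id, 0) + 1

    def value(self):
        return sum(self.entries.values())

    def merge(self, other):
        for rid, n in other.entries.items():
            self.entries[rid] = max(self.entries.get(rid, 0), n)

a, b = GCounter("a"), GCounter("b")
a.increment()     # concurrent increments at both replicas:
b.increment()     # a naive max-of-values merge would lose one of these,
a.merge(b)        # but entrywise max preserves both
```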

It is not straightforward to support decrement with the previous representation, because this operation would violate monotonicity of the semilattice. Furthermore, since merge is a max operation, decrement would have no effect.

But… we can solve that by keeping two counters, one for the number of increments, and one for the number of decrements. (The authors call this a PN-counter). Having a non-negative counter (e.g. to count the remaining quantity of some asset) turns out to be hard because ‘non-negative’ is a global invariant you can’t calculate locally. You can enforce the rule locally at each replica (you can’t decrement more than you increment) which will of course ensure the global property, but may be too restrictive.
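The PN-counter idea extends the previous sketch directly: two grow-only vectors, merged independently (again my own Python rendering, not the paper’s notation):

```python
class PNCounter:
    """State-based counter supporting decrement: two grow-only vectors."""

    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.incs = [0] * n_replicas  # per-replica increment counts
        self.decs = [0] * n_replicas  # per-replica decrement counts

    def increment(self):
        self.incs[self.replica_id] += 1

    def decrement(self):
        # Recorded as growth of the decrement vector, preserving monotonicity.
        self.decs[self.replica_id] += 1

    def value(self):
        return sum(self.incs) - sum(self.decs)

    def merge(self, other):
        # Merge each vector independently with element-wise max.
        self.incs = [max(x, y) for x, y in zip(self.incs, other.incs)]
        self.decs = [max(x, y) for x, y in zip(self.decs, other.decs)]

# Concurrent increments and a decrement merge to the expected total.
a, b = PNCounter(0, 2), PNCounter(1, 2)
a.increment()
a.increment()
b.decrement()
a.merge(b)
b.merge(a)  # both replicas now report 2 - 1 == 1
```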

Sadly, the remaining alternative is to synchronise. This might be only occasionally, e.g., by reserving in advance the right to originate a given number of decrements, as in escrow transactions.

Now that you’ve got an idea of what’s involved in designing a CRDT, let’s take a whistlestop tour through some of the other CRDTs defined in the paper:

  • Last-Writer-Wins register (a register is a cell that can store an object or value) – merge is based on the timestamp
  • Multi-value register – with merge based on a version vector
  • A Grow-only Set (supports add and lookup). This turns out to be a useful building block for higher types.
  • A 2P-Set (two-phase set), in which an item can be added and optionally removed, but never added again thereafter.
    • A U-Set (unique set): a simplified variant of a 2P-Set under the assumption that ‘each element to be added is unique’. This confused me for my first several readings – because by definition every element in a set is unique! What the authors seem to be capturing is that the element to be added is not drawn from some a priori known fixed set of values, but is instead ‘created’ at the point of addition, in such a way that the same element can never be created again, nor independently created at another replica. For example, suppose values are drawn from the set of all UUIDs, and replicas generate and then add UUIDs to the U-Set…
  • A Last-Writer-Wins element Set, which keeps an add-set and a remove-set, with timestamped entries
  • A PN-Set which keeps a counter for each element (has the interesting property that the counter can go negative, in which circumstance adding an element does not make it a set member…).
  • An Observed-Remove Set:

The preceding Set constructs have practical applications, but are somewhat counter-intuitive. In 2P-Set, a removed element can never be added again; in LWW-Set the outcome of concurrent updates depends on opaque details of how timestamps are allocated. We present here the Observed-Removed Set (OR-Set), which supports adding and removing elements and is easily understandable. The outcome of a sequence of adds and removes depends only on its causal history and conforms to the sequential specification of a set. In the case of concurrent add and remove of the same element, add has precedence (in contrast to 2P-Set).

  • A 2P2P-Graph (which is the combination of two 2P-Sets for vertices and edges).
  • An add-only monotonic DAG (an edge may be added only if it is oriented in the same direction as an existing path).
  • An add-and-remove partial order data type
  • A couple of data types supporting collaborative text editing.
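As a flavour of how the OR-Set achieves ‘add wins’, here is a rough Python sketch (my own reconstruction; the paper’s specification is more careful about how tags propagate between replicas). Each add is tagged with a fresh unique identifier, and a remove only cancels the tags it has actually observed:

```python
import uuid

class ORSet:
    """Observed-Remove Set sketch: a remove cancels only the add-tags the
    removing replica has observed, so a concurrent add wins (its fresh
    tag is not in the remove set)."""

    def __init__(self):
        self.adds = set()     # (element, unique_tag) pairs
        self.removes = set()  # tags that have been observed and removed

    def add(self, element):
        self.adds.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Cancel only the (element, tag) pairs visible at this replica.
        self.removes |= {tag for (e, tag) in self.adds if e == element}

    def lookup(self, element):
        return any(e == element and tag not in self.removes
                   for (e, tag) in self.adds)

    def merge(self, other):
        self.adds |= other.adds
        self.removes |= other.removes

# Concurrent add and remove of the same element: add has precedence.
a, b = ORSet(), ORSet()
a.add('x')
b.merge(a)
b.remove('x')  # b removes the tag it has observed
a.add('x')     # concurrent re-add at a, with a fresh tag
a.merge(b)     # the fresh tag survives, so 'x' is still a member at a
```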

Shopping Carts Again

In Monday’s paper Alvaro et al. showed us how to reduce the amount of coordination required in a shopping cart using a CALM analysis in Bloom. Shapiro et al. pull off the same trick by modeling the shopping cart using CRDTs.

We define a shopping cart data type as a map from an ISBN number (a unique number representing the book edition) to an integer representing the number of units of the book the user wants to buy. Any of the Set abstractions presented earlier extends readily to a Map; we choose to extend OR-set as it minimises anomalies. An element is a (key, value) pair; concretely the key is a book ISBN (a unique product identifier), and the value is a number of copies.

Going one stage further, we can model a bookstore’s worth of shopping carts:

Our e-commerce bookstore maintains the following information. Each user account has a separate OR-Cart. Assuming accounts are uniquely identified, the mapping from user to OR-Cart can be maintained by a U-Map, derived from U-Set in the obvious way. The shopping cart is created when the account is first created, and removed when it is deleted from the system.

In the multi-version approach Amazon uses with Dynamo, version branching can occur under failures with concurrent updates. Added items are never lost, but a deleted item may resurface.

This (CRDT-based) design remains simple and does not incur the remove anomaly reported for Dynamo, and does not bear the cost of the version vector needed by Dynamo’s MV-Register approach.

A note on CRDTs and CALM

Alvaro et al.’s so-called CALM approach ensures eventual consistency by enforcing a monotonic logic. This is somewhat similar to our rule for CvRDTs, that every update or merge operation move forward in the monotonic semilattice. Their Bloom domain-specific language comes with a static analysis tool that analyses program flow and identifies non-monotonicity points, which require synchronization. This approach encourages programmers to write monotonic programs and makes them aware of synchronization requirements. Monotonic logic is more restrictive than our monotonic semilattice. Thus, Bloom does not support remove without synchronisation.

Consistency, Availability, and Convergence + COPS

March 17, 2015

Consistency, Availability, and Convergence Mahajan et al. 2014, and
Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS – Lloyd et al. 2011

This is the second of five Desert Island Paper selections from Jonas Bonér that we’ll be looking at this week. I’ve turned this post into a double-header since Consistency, Availability, and Convergence has made a previous cursory appearance on The Morning Paper. My earlier posting just focused on the main results, including:

No consistency stronger than real-time causal consistency (RTC) can be provided in an always available, one-way convergent system, and RTC can be provided in an always available one-way convergent system.

Today I’d like to go a little bit deeper into what RTC (and availability, and convergence) really mean in the statement above. Then in part two we’ll look at the ‘COPS’ paper which introduces and motivates the causal+ consistency model and an embodiment in the COPS key-value store.

Part One: CAC

Hang on, I thought it was CAP, not CAC! Why are Mahajan et al. looking at Consistency, Availability, and Convergence, not Consistency, Availability, and Partition Tolerance?

The CAP (consistency, availability, partition-resilience) formulation mixes properties (consistency and availability) with the system model (network reliability assumptions). In our formulation, we decouple the model from the properties so that we can separately consider bounds on properties achievable under both omission and Byzantine failure models. Additionally, CAP does not explicitly consider convergence because linearizability and sequential consistency embed a convergence requirement. When we examine weaker semantics like causal consistency, we find that we must explicitly consider convergence.

The ‘convergence’ problem is often mentioned in passing in papers discussing CAP, since in its absence you can trivially achieve consistency through a number of less desirable schemes such as just agreeing on a single fixed value a priori and never changing it. So a model that includes convergence tells us something important that we actually care about in real-world systems.

Informally, convergence refers to an implementation’s ability to ensure that writes issued by one node are observed by others. Convergence can be formally defined by describing the set of environment conditions (network, local-clocks etc) under which nodes can observe each other’s writes… A simple convergence property is eventual consistency. One common definition requires that if a system stops accepting writes and sufficient communication occurs, then the system reaches a state in which for any object o, a read of o would return the same value at all nodes. This formulation defines a weak convergence property; for example, it makes no promises about intervals in which some nodes are partitioned from others.

For maximal liveness, we would like any subset of connected nodes to converge on a common state. The authors define one-way convergence between any pair of nodes A and B, which permits convergence with two steps of one-way communication: first A sends updates to B, and then B sends updates to A.

Informally, if consistency is the property that we all agree, convergence is the property that what we all agree on is in fact a desirable and useful state.

What about availability?

Availability, informally, refers to an implementation’s ability to ensure that read and write operations complete. The availability of an implementation is defined by describing the environment conditions (network, local-clocks etc) under which all issued operations complete. An implementation is always available if for any workload, all reads and writes can complete regardless of which messages are lost and which nodes can communicate.

Finally we turn our attention to consistency, and in particular, if real-time causal consistency is the best we can do, then what is that exactly?

Given a set of nodes, and a set of mutable data items, then an execution consists of a set of read and write events. A write event includes the nodeId of the node performing the write, the objId of the data item being written, the value being written, the start time of the write operation, and the end time of the write operation. (Think key-value store). It looks a bit odd to see start time and end time in there, but these are required to model the ‘real-time’ aspect of real-time causal consistency. And remember that this is a model, so we can assume an absolute global time visible to all nodes even though we can’t have that in a practical implementation (see Google’s Spanner and the TrueTime API it introduces for a real-world system that comes pretty darn close… “as a community, we should no longer depend on loosely synchronized clocks and weak time APIs in designing distributed algorithms”!).

A read event includes the nodeId of the node performing the read, the objId of the data item being read, the writeList (of all write operations that produced the values a read returns), and the start time and end time. Reads are allowed to return multiple results in order to handle logically concurrent updates without having to worry about conflict resolution in the model.

Given this model, we can start to reason about consistency. In particular, a consistency model ‘accepts’ (allows) certain executions, but not others. A consistency semantics C-strong is stronger than a consistency semantics C-weak if the set of executions accepted by C-strong is a subset of those accepted by C-weak. If neither of two models is stronger according to this definition, then they are incomparable.

What set of executions does causal consistency allow? Take each of the events and make them nodes in a directed acyclic graph, where an edge from event a to event b represents that a precedes, or “happens before,” b. This graph therefore imposes a partial order on the overall set of events. For causal consistency we place two requirements on this graph:

  • Given any two operations a and b taking place at the same node, a precedes b if and only if a’s start time is earlier than b’s start time.
  • A read returns the latest preceding writes (there may be several, if they are concurrent). The return value of a read is encoded in its writeList: this writeList must contain every write w that precedes the read and has not been overwritten by another write that follows w and also precedes the read.

And real-time causal consistency adds a third requirement:

  • Time can’t travel backwards. If the end time of event a is before the start time of event b, then b cannot precede a. (Which matches our common sense definition of ‘happens before’).
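To make the third rule concrete, here is a small Python sketch (my own illustration of the model, not anything from the paper) that checks the ‘time can’t travel backwards’ condition over a set of precedence edges:

```python
from dataclasses import dataclass

@dataclass
class Event:
    node_id: str
    start: float  # absolute global time (available in the model)
    end: float

def rtc_time_rule_ok(precedes, events):
    """precedes is a set of (p, q) index pairs meaning events[p] precedes
    events[q]. The RTC rule: if q ended before p started, then p must not
    precede q -- otherwise time would travel backwards."""
    for p, q in precedes:
        if events[q].end < events[p].start:
            return False
    return True

# e0 finishes (t=1.0) before e1 starts (t=2.0), so e0 may precede e1,
# but e1 must not precede e0.
events = [Event("n1", 0.0, 1.0), Event("n2", 2.0, 3.0)]
```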

… most systems that claim to implement causal consistency actually implement stronger semantics (e.g. RTC). Lloyd et al. [our second paper today] explicitly note that their system’s causal and per-object sequential semantics are stronger than causal consistency. In particular, these semantics enforce a variation of causal consistency in which writes to each key are totally ordered.

Part Two: COPS

In “Don’t settle for eventual” Lloyd et al. introduce the concept of causal+ consistency and then show that it can be implemented efficiently in their COPS system (Clusters of Order-Preserving Servers). In keeping with today’s theme, I’m going to focus mostly on the causal+ aspects, and refer you to the paper for the full details of how COPS was constructed.

A distributed storage system has multiple, sometimes competing, goals: availability, low latency, and partition tolerance to provide an “always on” user experience; scalability to adapt to increasing load and storage demands; and a sufficiently strong consistency model to simplify programming and provide users with the system behavior that they expect.

The first four of these properties are described as the ‘ALPS’ properties: Availability, Low-Latency, Partition-tolerance, and Scalability.

  • Availability: all operations complete successfully and no operation can block indefinitely or return an error indicating that data is unavailable.
  • Low-latency: target response times on the order of a few milliseconds
  • Partition tolerance: the data store continues to operate under network partitions
  • Scalability: the data store scales out linearly

Given that ALPS systems must sacrifice strong consistency (i.e., linearizability), we seek the strongest consistency model that is achievable under these constraints. Stronger consistency is desirable because it makes systems easier for a programmer to reason about. In this paper, we consider causal consistency with convergent conflict handling, which we refer to as causal+ consistency.

Causal consistency we addressed above. Note that if a does not precede b, and b does not precede a, then a and b are concurrent. Causal consistency does not impose any order on concurrent operations.

Normally, this allows increased efficiency in an implementation: Two unrelated put operations can be replicated in any order, avoiding the need for a serialization point between them. If, however, a and b are both puts to the same key, then they are in conflict.

If each node is free to resolve the conflict in its own way, then it is possible for replicas to diverge forever. The convergent conflict handling of causal+ consistency therefore adds the requirement that all conflicting updates be handled in the same manner at all replicas. For COPS, the default strategy is last-writer-wins, with a Lamport clock-based implementation:

The primary storage node uses a Lamport timestamp to assign a unique version number to each update. The node sets the version number’s high-order bits to its Lamport clock and the low-order bits to its unique node identifier. Lamport timestamps allow COPS to derive a single global order over all writes for each key. This order implicitly implements the last-writer-wins convergent conflict handling policy.
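The packing scheme can be sketched in a few lines (the bit widths below are my assumption for illustration; COPS’s actual field sizes may differ):

```python
def version_number(lamport_clock, node_id, id_bits=16):
    # High-order bits: Lamport clock; low-order bits: unique node id.
    # Comparing version numbers as plain integers then gives last-writer-wins,
    # with the node id acting as a tie-breaker for equal clocks.
    return (lamport_clock << id_bits) | node_id

v1 = version_number(lamport_clock=5, node_id=1)
v2 = version_number(lamport_clock=5, node_id=2)  # clock tie: node id decides
v3 = version_number(lamport_clock=6, node_id=1)  # later clock always wins
```

Because node ids are unique, no two writes ever produce the same version number, so the global order over writes to a key is total.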

Once a replica has returned a given version of a key, causal+ consistency ensures that it will only ever return that version or a causally later one. The returned version number thus increases monotonically, which is referred to as the progressing property.

The implementation of COPS itself assumes a small number of COPS clusters (each wholly contained within a datacenter) connected together into a single datastore.

Each local COPS cluster is set up as a linearizable (strongly consistent) key-value store. Linearizable systems can be implemented scalably by partitioning the keyspace into N linearizable partitions.

Replication between COPS clusters happens asynchronously. One interesting design point in COPS is the extension to support get transactions. If a store only supports a single key get operation, then even if that store is causally+ consistent, you won’t be able to read a causally+ consistent set of dependent keys. In such a model there is no canonically correct ordering of the item values.

… a better programming interface would allow the client to obtain a causal+ consistent view of multiple keys. The standard way to achieve such a guarantee is to read and write all related keys in a transaction; this, however, requires a single serialization point for all grouped keys, which COPS avoids for greater scalability and simplicity. Instead, COPS allows keys to be written independently (with explicit dependencies in metadata), and provides a get_trans operation for retrieving a consistent view of multiple keys.

The implementation proceeds in two rounds. The first round fetches the most recent version of each requested key, together with the ‘dependency list’ of keys (and versions) it depends on. If any of the originally requested keys also appears in a dependency list, then a check is made to ensure that the version retrieved is at least as recent as the one in the list. If one of these checks fails, a second round fetches the offending key by version, using the newest version seen in any dependency list.
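Here is a rough Python reconstruction of that two-round protocol against a toy in-memory store (the store interface and all names are mine, invented for illustration; see the paper for the real COPS algorithm):

```python
class TinyStore:
    """In-memory stand-in for a COPS replica: versioned values plus
    dependency metadata, with a 'visible' version to simulate a replica
    that lags behind the newest write."""

    def __init__(self):
        self.data = {}     # key -> {version: (value, deps)}
        self.visible = {}  # key -> version returned by a plain get

    def put(self, key, value, version, deps=None):
        self.data.setdefault(key, {})[version] = (value, deps or {})
        self.visible[key] = version

    def get(self, key):
        v = self.visible[key]
        value, deps = self.data[key][v]
        return (value, v, deps)

    def get_by_version(self, key, version):
        value, deps = self.data[key][version]
        return (value, version, deps)

def get_trans(store, keys):
    # Round 1: fetch the latest visible version of each requested key.
    results = {k: store.get(k) for k in keys}

    # For each requested key, find the newest version any other result's
    # dependency list demands.
    needed = {}
    for value, version, deps in results.values():
        for dep_key, dep_version in deps.items():
            if dep_key in results:
                needed[dep_key] = max(needed.get(dep_key, 0), dep_version)

    # Round 2: re-fetch by version any key whose round-1 version is too old.
    for k, need in needed.items():
        if results[k][1] < need:
            results[k] = store.get_by_version(k, need)

    return {k: v for k, (v, _, _) in results.items()}

# A write of 'photo' depends on the newer ACL, but this replica still
# returns the stale ACL on a plain get: the second round repairs it.
store = TinyStore()
store.put("acl", "public", 1)
store.put("acl", "friends-only", 2)
store.put("photo", "beach.jpg", 3, {"acl": 2})
store.visible["acl"] = 1  # simulate a replica that hasn't applied acl v2
```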

Consistency analysis in Bloom: a CALM and collected approach

March 16, 2015

Consistency analysis in Bloom: a CALM and collected approach – Alvaro et al. 2011

This week I’m delighted to bring you another edition of Desert Island Papers, featuring Jonas Bonér. And it seems fitting that Jonas’ first choice is a paper by our previous Desert Island Paper guest, Peter Alvaro.

There are several big ideas in this paper: first, the introduction of the CALM principle for reasoning about distributed system behaviour; second, a declarative language called Bloom that encourages CALM programming and is well-suited to the inherent characteristics of distribution; and third, practical guidance and tools for developing software in this manner.

We show that we can bring that theory to bear on the practice of software development via “disorderly” programming patterns, complemented with automatic analysis techniques for identifying and managing a program’s points of order in a principled way.

If you’re interested in digging deeper into the theoretical underpinnings of Bloom, you might want to check out Joe Hellerstein’s ‘The Declarative Imperative’ paper. Also, don’t forget to read about Peter Alvaro’s follow-on work with Edelweiss.

Staying CALM in the face of uncertainty

The challenges of distribution—concurrency and asynchrony, performance variability, and partial failure—often translate into tricky data management challenges regarding task coordination and data consistency…. There are two main bodies of work to guide programmers through these issues. The first is the “ACID” foundation of distributed transactions, grounded in the theory of serializable read/write schedules and consensus protocols like Paxos and Two-Phase Commit. […] The second point of reference is a long tradition of research and system development that uses application-specific reasoning to tolerate “loose” consistency arising from flexible ordering of reads, writes and messages.

The latter approach can give better availability and/or lower latency, but “it is typically unclear what guarantees are provided by systems built in this style.” Temporal nondeterminism (i.e., not knowing when things are going to happen) is at the root of a lot of the difficulties. However, we can be eventually consistent if we have order independence (tolerance of temporal nondeterminism). Declarative languages based on sets tend to have this property, and the theory of relational databases and logic programming helps us to reason about them.

Monotonic programs—e.g., programs expressible via selection, projection and join (even with recursion)—can be implemented by streaming algorithms that incrementally produce output elements as they receive input elements. The final order or contents of the input will never cause any earlier output to be “revoked” once it has been generated. Non-monotonic programs—e.g., those that contain aggregation or negation operations—can only be implemented correctly via blocking algorithms that do not produce any output until they have received all tuples in logical partitions of an input set.

Monotonic programs are therefore easy to distribute and can tolerate message reordering and delays. “By contrast, even simple non-monotonic tasks like counting are difficult in distributed systems.”
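The contrast can be sketched in a few lines of Python (my own illustration): a monotonic selection can emit results as input arrives, while a count can only be emitted once the input is known to be complete.

```python
def monotonic_select(stream, predicate):
    """Monotonic: emit each matching tuple as soon as it arrives.
    Later input can never revoke earlier output."""
    for item in stream:
        if predicate(item):
            yield item

def blocking_count(stream):
    """Non-monotonic: a correct count cannot be produced until the
    entire input (including stragglers) has been seen."""
    n = 0
    for _ in stream:
        n += 1
    return n  # only safe to emit once the stream is known to be complete
```

The generator can run over an unbounded, arbitrarily reordered stream; the count has to wait, which is exactly where coordination logic enters.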

As a mnemonic, we say that counting requires waiting in a distributed system: in general, a complete count of distributed data must wait for all its inputs, including stragglers, before producing the correct output. “Waiting” is specified in a program via coordination logic (Paxos, 2PC, …).

As we’ve seen over the last two weeks, distributed consensus also involves counting (to ensure a quorum). Thus we can also say ‘waiting requires counting.’

And here comes the big idea:

Our observations about waiting and counting illustrate the crux of what we call the CALM principle: the tight relationship between Consistency And Logical Monotonicity. Monotonic programs guarantee eventual consistency under any interleaving of delivery and computation. By contrast, non-monotonicity—the property that adding an element to an input set may revoke a previously valid element of an output set—requires coordination schemes that “wait” until inputs can be guaranteed to be complete.

We’d like to minimize the amount of coordination (for latency and availability reasons). By building the program on top of a logic language we can develop checks for distributed consistency since conservative tests for monotonicity are well understood in that domain.

In cases where an analysis cannot guarantee monotonicity of a whole program, it can instead provide a conservative assessment of the points in the program where coordination may be required to ensure consistency.

If all this sounds a bit daunting, don’t worry. The ideas can be embodied in a 2,400 line Ruby DSL with surprising power.

Keep CALM and code on

von Neumann architectures bring with them a legacy of implicit ordering assumptions…

Traditional imperative programming grew out of these pervasive assumptions about order. Therefore, it is no surprise that popular imperative languages are a bad match to parallel and distributed platforms, which make few guarantees about order of execution and communication. By contrast, set-oriented approaches like SQL and batch dataflow approaches like MapReduce translate better to architectures with loose control over ordering. Bloom is designed in the tradition of programming styles that are “disorderly” by nature. State is captured in unordered sets. Computation is expressed in logic: an unordered set of declarative rules, each consisting of an unordered conjunction of predicates.

A Bloom program comprises a set of local collections (state) and a set of declarative statements concerning those collections.

Bloom statements are defined with respect to atomic “timesteps,” which can be implemented via successive rounds of evaluation. In each timestep, certain “ground facts” exist in collections due to persistence or the arrival of messages from outside agents (e.g., the network or system clock). The statements in a Bloom program specify the derivation of additional facts, which can be declared to exist either in the current timestep, at the very next timestep, or at some non-deterministic time in the future at a remote node.

Bloom is side-effect free and has no mutable state. Collections can be one of five basic types, and statements follow a simple ‘lhs rhs’ format. The collection types are:

  • table – a collection that persists across timesteps
  • scratch – a collection whose contents last for only a single timestep
  • channel – a scratch collection with a special location specifier attribute. Tuples ‘appear’ at the network address specified by this location specifier
  • periodic – a scratch collection in which tuples ‘appear’ approximately every period seconds with a unique id and the current wallclock time
  • interface – a scratch collection especially designated as an interface point between modules

And the operations are:

  • scratch = rhs (the contents of scratch are determined by the rhs for this timestep)
  • table or scratch <= rhs (lhs includes the contents of rhs in the current timestep)
  • table or scratch <+ rhs (lhs will include the contents of rhs in the next timestep)
  • table <- rhs (lhs will not include tuples in rhs in the next timestep)
  • channel <~ rhs (tuples in rhs will appear in the remote lhs at some point in the future)

This should be enough information for you to get a good impression of how the reliable unicast messaging program below works. Please refer to the full paper for a worked key-value store example including an abstract specification of a store and both single node and distributed implementations.

    module DeliveryProtocol
      def state
        interface input, :pipe_in,
          ['dst','src','ident'], ['payload']
        interface output, :pipe_sent,
          ['dst','src','ident'], ['payload']
      end
    end

    module ReliableDelivery
      include DeliveryProtocol

      def state
        channel :data_chan, ['@dst','src','ident'], ['payload']
        channel :ack_chan, ['@src','dst','ident']
        table :send_buf, ['dst','src','ident'], ['payload']
        periodic :timer, 10
      end

      declare
      def send_packet
        send_buf <= pipe_in
        data_chan <~ pipe_in
      end

      declare
      def timer_retry
        data_chan <~ join([send_buf, timer]).map{|p, t| p}
      end

      declare
      def send_ack
        ack_chan <~ data_chan.map{|p| [p.src, p.dst, p.ident]}
      end

      declare
      def recv_ack
        got_ack = join [ack_chan, send_buf],
                       [ack_chan.ident, send_buf.ident]
        pipe_sent <= got_ack.map{|a, sb| sb}
        send_buf <- got_ack.map{|a, sb| sb}
      end
    end

(See the aforementioned Edelweiss paper for a way to make this program even simpler).

Thinking CALM thoughts

A Bloom program may be viewed as a dataflow graph with external input interfaces as sources, external output interfaces as sinks, collections as internal nodes, and rules as edges. This graph represents the dependencies between the collections in a program and is generated automatically by the Bud interpreter.

Generating such a graph makes it easy to visualise a program, and especially to see the points where coordination is required due to non-monotonicity. (Of course, it’s the fundamental properties of the logic-based language that enable this kind of analysis.) This in turn helps you to reason about your design, and perhaps to choose alternatives that reduce the amount of coordination needed. A worked shopping cart example shows these ideas in action. The first alternative deletes items from the shopping cart in response to a delete request (which seems like a good idea on the surface!). But the second alternative simply accumulates update requests (adding and removing items) in a monotonically increasing set. This set is ‘summed up’ only at checkout. Instead of requiring coordination on every cart action, now we only require it on checkout.
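The disorderly design can be sketched outside of Bloom too; here is a rough Python rendering (all names are mine) of ‘accumulate actions monotonically, sum only at checkout’:

```python
class DisorderlyCart:
    """Accumulate cart actions in a grow-only set; sum only at checkout.
    A sketch of the paper's 'disorderly' cart design, not its Bloom code."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.actions = set()  # (unique_op_id, item, delta): grows monotonically
        self._next = 0

    def _fresh_id(self):
        # Replica id plus local counter keeps op ids globally unique.
        self._next += 1
        return (self.replica_id, self._next)

    def add(self, item, qty=1):
        self.actions.add((self._fresh_id(), item, qty))

    def remove(self, item, qty=1):
        # A removal is just another monotonically accumulated action.
        self.actions.add((self._fresh_id(), item, -qty))

    def merge(self, other):
        # Set union is order-insensitive: replication needs no coordination.
        self.actions |= other.actions

    def checkout(self):
        # The single point of order: summing is non-monotonic, so this is
        # where coordination would be required in a distributed setting.
        totals = {}
        for _, item, delta in self.actions:
            totals[item] = totals.get(item, 0) + delta
        return {item: n for item, n in totals.items() if n > 0}

# Replicas accept actions independently and merge in any order.
a, b = DisorderlyCart("A"), DisorderlyCart("B")
a.add("book", 2)
b.remove("book", 1)
a.merge(b)
b.merge(a)
```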

Strictly monotonic programs are rare in practice, so adding some amount of coordination is often required to ensure consistency. In this running example we studied two candidate implementations of a simple distributed application with the aid of our program analysis. Both programs have points of order, but the analysis tool helped us reason about their relative coordination costs. Deciding that the disorderly approach is “better” required us to apply domain knowledge: checkout is a coarser-grained coordination point than cart actions and their replication.

If adding additional coordination is undesirable, Bloom’s ‘point-of-order’ analysis can help programmers figure out where they need to take appropriate action to tolerate inconsistency. Helland and Campbell’s “Building on Quicksand” paper is cited as a model from which we may draw inspiration (definitely one I’ll cover on The Morning Paper sometime soon). This introduces the notions of memories, guesses, and apologies.

Most of these patterns can be implemented as automatic program rewrites. We envision building a system that facilitates running low-latency, “guess”-driven decision making in the foreground, and expensive but consistent logic as a background process. When the background process detects an inconsistency in the results produced by the foreground system (e.g., because a “guess” turns out to be mistaken), it can then take corrective action by generating an “apology.” Importantly, both of these subsystems are implementations of the same high-level design, except with different consistency and coordination requirements; hence, it should be possible to synthesize both variants of the program from the same source code. Throughout this process—making calculated “guesses,” storing appropriate “memories,” and generating the necessary “apologies”—we see significant opportunities to build scaffolding and tool support to lighten the burden on the programmer.

Desert Island Papers: Jonas Bonér

March 15, 2015

With the Desert Island Papers weeks, I invite a guest to select 5 of their favourite papers and give a short introduction to them (see last month’s edition with Peter Alvaro for a longer explanation of the concept). This week, I’m delighted to have Jonas Bonér introduce his selections. I’ve known Jonas for a long time, going back to the early 2000’s when we were both active in the aspect-oriented programming community. Of course nowadays many of you probably know him best for his work with Akka and the reactive manifesto. If you’re in San Francisco this week, you can catch him speaking at the Scala Days conference this Tuesday.

Jonas, would you like to briefly introduce yourself and your research interests?

I work as CTO at Typesafe. I’ve been involved in Open Source for most of my career and started the Akka project (a runtime for concurrency and distribution) about 6 years ago. I’ve always been interested in distributed systems, but over the last few years it has grown into a deep passion of mine, which naturally goes hand in hand with my love of reading academic papers and CS books.

And now, can you please tell us a little bit about your selections and what it is that makes them special to you?

When you first asked me to come up with my list of 5 desert island papers, the ones that came to mind were some of the classics (Backus’ Turing Award Lecture, Lamport’s Time, Clocks and Ordering of Events paper, A Note on Distributed Computing, and others). These are all papers that I love and re-read, but after thinking some more about it I had another idea. First, I am very passionate about distributed systems; second, there is so much amazing research in distributed systems happening right now (in the last 5-10 years). Why not instead dedicate my first 4 papers to contemporary research gems that point towards the future, and that will most likely have a huge impact on how we build systems in the not-too-distant future? I liked this idea better and came up with a new list, which will hopefully inspire you as much as it has me. The final one I’ll reserve for an old-time favorite.

(Note: all three of the papers that Jonas cites above have previously featured in The Morning Paper, as they are also among some of my favourites. This was in the pre-blog era of #themorningpaper, when I gave short highlights on twitter only, so there is no permanent record on this blog for the first 44 papers I covered. Some of these I would love to revisit in future editions to fix that. One of them we get to cover this week!)

I’ll be featuring each of Jonas’ selections below throughout the week. As always, look out for the notifications on twitter or subscribe to the email list to make sure you don’t miss any!

Consistency Analysis in Bloom: a CALM and Collected Approach

Consistency analysis in Bloom – Alvaro et al. 2011

I’m a big believer in Logic Programming, and ever since I took my first Prolog class in school I’ve had a strong belief that declarative programming is the right way to develop systems, and that it would take over the world as soon as the hardware got good enough. I still do. It just has to.

This is one of the reasons I got so excited when Joe Hellerstein published the CALM Conjecture (later proved, becoming the CALM Theorem). His (and his team’s) work on marrying logic with distributed systems was a genius move, and the result is as thrilling and mind-blowing as it is dead obvious. It feels intuitive when you see it, which is a trademark of most brilliant ideas.

The essence of the idea is that a program that can be expressed in monotonic logic does not need to coordinate its changes — it is deterministic (confluent) and therefore eventually consistent. This has huge implications, since coordination is the most expensive thing you can do in a distributed system. Avoiding it means that we can eliminate both contention (wait time) and the coherency costs of maintaining a shared view of the data.
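The intuition behind monotonicity and confluence can be sketched in a few lines of plain Python (this is illustrative only, not Bloom or Dedalus code): a program that only ever accumulates facts converges to the same answer under every message delivery order, while a program with negation over evolving state does not.

```python
from itertools import permutations

# Monotonic program: accumulate facts with set union. Conclusions are
# never retracted, so every delivery order converges to the same final
# state -- no coordination needed.
def monotonic_eval(messages):
    facts = set()
    for m in messages:
        facts |= {m}
    return facts

# Non-monotonic program: a negation ("has 'b' *not* arrived yet?") can
# give different answers under different delivery orders, so it needs
# coordination (e.g. waiting for complete input) to be deterministic.
def non_monotonic_eval(messages):
    seen = set()
    observations = []
    for m in messages:
        observations.append("b" not in seen)  # negation over evolving state
        seen.add(m)
    return observations

msgs = ["a", "b", "c"]
# Every arrival order of the same facts yields the same monotonic result...
assert len({frozenset(monotonic_eval(p)) for p in permutations(msgs)}) == 1
# ...but the non-monotonic query is order-sensitive.
assert len({tuple(non_monotonic_eval(p)) for p in permutations(msgs)}) > 1
```

CALM's insight is exactly this dividing line: the monotonic fragment can run coordination-free, and only the non-monotonic points of a program (the negations) need a coordination barrier.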

Now the question is: when can a program be expressed in monotonic logic and how can it be made practical and approachable for the general audience of programmers? Ideally you would have your language tell you exactly when and where you need to coordinate, and guarantee you that the rest of the program can freely exploit all concurrency and parallelism that you make available to it, without you having to worry. Peter Alvaro et. al’s ground-breaking work on Bloom (and Dedalus) gives you the theoretical foundation for such a language and I believe it gives us a glimpse of the future.

Consistency, Availability, and Convergence

Consistency, Availability, and Convergence – Mahajan et al. 2011

In a world of impossibility theorems (FLP, CAP etc.) it is refreshing to read one that shows what is actually possible (even though it does so through an impossibility result).

We all know that Eventual Consistency is often too weak, and can be unintuitive and hard to work with. On the other hand, most problems do not need Linearizability (which is what the CAP theorem argument is really all about). There is a grey zone between these two extremes, and it is IMO in this grey zone that most of the interesting use-cases and applications reside.

What is so exciting about the CAC (Consistency, Availability, and Convergence) paper is that it wanders out into this grey zone and manages to come back with a ceiling for the strongest level of consistency achievable in an always-available system: Real-Time Causal Consistency. The reason I’m so excited about this result is that causal consistency is usually what users expect; it is intuitive and provides a level of consistency that is often sufficient in real-world applications.
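To make "causal consistency is what users expect" concrete, here is a hedged sketch of causal ordering via vector clocks — one common way causality is tracked in practice (the paper itself establishes the achievable bound rather than prescribing a mechanism; the scenario names are made up):

```python
# Vector clocks: map from node id to logical counter.
# Event a happened-before event b iff a's clock is component-wise <= b's
# and the two clocks differ; otherwise the events are concurrent.
def happened_before(vc_a, vc_b):
    keys = set(vc_a) | set(vc_b)
    return all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys) and vc_a != vc_b

write = {"alice": 1}                 # Alice posts a photo
reply = {"alice": 1, "bob": 1}       # Bob comments after seeing the post
concurrent = {"carol": 1}            # Carol writes independently

assert happened_before(write, reply)          # the comment depends on the post
assert not happened_before(write, concurrent) # unrelated: concurrent
assert not happened_before(concurrent, write)
```

A causally consistent store must never show Bob's comment to a reader without also showing (or having shown) Alice's post it depends on — which matches what users intuitively expect.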

A related paper (with concurrent research) is the COPS paper (Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS) which defines Causal+ consistency and is also worth reading.

Note: I’ve previously covered ‘Consistency, Availability, and Convergence’ on this blog, but only with a highlight on the main results. On Tuesday this week we’ll take a deeper look, and put it up in a double-header with the COPS paper.

A comprehensive study of Convergent and Commutative Replicated Data Types


A comprehensive study of Convergent and Commutative Replicated Data Types – Shapiro et al. 2011

The Dynamo paper was highly influential on the industry. It introduced and popularized an interesting combination of features — Eventual Consistency, epidemic gossiping, a node ring, consistent hashing, read repair, hinted handoff etc. — which kicked off the NoSQL movement. In its footsteps we saw lots of Key-Value stores emerge: databases that chose availability over (strong) consistency (sometimes it felt like we got a new K/V store every day).

This was all good. But after the hype had settled it dawned on us that there are a lot of problems that don’t naturally fit the key-value model. We as developers are used to modelling our data in richer data structures such as maps, sets and graphs, and we are used to being able to compose these to model our world. The question is: is there a way to get the best of both worlds? To be able to model our data in rich, composable data structures, while retaining the benefits of the eventually consistent model?

Marc Shapiro et al.’s work on CRDTs showed us how, and their work has already been highly influential in the industry, with production-grade implementations in both Riak and Akka.
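A minimal example of the idea is the state-based G-Counter (grow-only counter) from the paper's catalogue of CRDTs — sketched here in Python rather than any production implementation. Merge is an element-wise max, which is commutative, associative, and idempotent, so replicas converge regardless of how often or in what order they gossip:

```python
# State-based G-Counter CRDT sketch: each replica tracks a count per node;
# the value is the sum, and merge takes the per-node maximum.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        # Only the owning replica bumps its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent.
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)   # gossip state in either direction, any number of times
b.merge(a)
b.merge(a)   # idempotent: re-delivery of the same state is harmless
assert a.value() == b.value() == 5
```

Richer types (OR-Sets, maps, graphs) follow the same recipe: design the state as a join-semilattice so that merge is always safe, then compose.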

Coordination Avoidance in Database Systems

Coordination Avoidance in Database Systems – Bailis et al. 2014

As we have discussed above, coordination of data is one of the most expensive things you can do in a distributed system (or a local multi-core system, for that matter). It introduces contention and coherency costs, and — as Neil Gunther’s Universal Scalability Law shows us — it creates scalability bottlenecks and makes it hard to write resilient systems. This raises the question: “When does consistency require coordination?”.

Peter Bailis et al. set out to answer this question in the general sense — a bold task if you ask me — and in their most recent paper they came back with a very interesting answer. First they define a property called Invariant Confluence (I-confluence), which requires that “all local commit decisions must be globally invariant-preserving”. Then they show that if I-confluence holds for a certain commit, then coordination of this commit is not necessary — at all. Quite an astonishing result in its simplicity and power, and it feels quite intuitive once you start thinking about it.
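The intuition can be sketched with the classic bank-balance example (an informal illustration of the idea, not the paper's formalism — the names and merge function here are made up). Deposits against an invariant of a non-negative balance are I-confluent; withdrawals are not:

```python
# Invariant: the account balance must never go negative.
INVARIANT = lambda balance: balance >= 0

def merge(base, local_deltas):
    # Merge concurrent, locally-committed deltas onto a shared base state.
    return base + sum(local_deltas)

base = 100

# Deposits: each local commit preserves the invariant, and so does any
# merge of them -- I-confluent, so no coordination is needed.
deposits = [+30, +50]
assert all(INVARIANT(base + d) for d in deposits)
assert INVARIANT(merge(base, deposits))

# Withdrawals: each is locally invariant-preserving, yet the merged
# outcome violates the invariant -- not I-confluent, so these commits
# must coordinate (e.g. take a lock or run a consensus round).
withdrawals = [-70, -60]
assert all(INVARIANT(base + w) for w in withdrawals)
assert not INVARIANT(merge(base, withdrawals))
```

The payoff is that coordination becomes a property of (operation, invariant) pairs rather than a blanket cost: the deposit path of an application can run coordination-free even while the withdrawal path coordinates.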

However, defining these invariants in the application and making sure that they are maintained over its lifetime, is more easily said than done. Here tools or language support would be helpful. I would be interested in seeing the dynamics between CALM and I-confluence explored in more depth.

Out of the Tar Pit

Out of the tar pit – Moseley & Marks 2006

This paper is one of my all-time favorites and one that I try to reread every year. It covers a lot of ground but essentially tries to answer the question: why is programming still so hard, and what can we do about it? It starts out by trying to understand what software complexity is all about — its different flavors (essential and accidental complexity) and its roots.

The authors identify two main sources of complexity — mutable state and control — and propose a better way of managing them. First, move the mutable state into relations, which are then managed through declarative logic. Second, design systems in three layers: a completely stand-alone core of Essential State, managed through a layer of Essential Logic (the business logic), shielded from the rest of the system — the so-called Accidental State and Control layer.
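A tiny sketch of that prescription, with hypothetical relation and function names (the paper's own proposal is a relational, functional language, not Python): essential state held as immutable relations, and essential logic as pure derivations over them. Accidental concerns (storage, caching, control flow) would live in a separate, replaceable layer.

```python
# Essential State: immutable relations, i.e. sets of tuples.
employees = frozenset({("alice", "eng"), ("bob", "eng"), ("carol", "sales")})
salaries = frozenset({("alice", 100), ("bob", 90), ("carol", 80)})

def join(rel_a, rel_b):
    # Pure relational join on the first attribute; no mutable state anywhere.
    return frozenset((k, a, b)
                     for (k, a) in rel_a
                     for (k2, b) in rel_b
                     if k == k2)

def dept_payroll(dept):
    # Essential Logic: a derived value, recomputed from essential state on
    # demand. Whether it is cached or materialized is an accidental concern.
    return sum(pay for (_, d, pay) in join(employees, salaries) if d == dept)

assert dept_payroll("eng") == 190
```

Because all derived data is a pure function of the essential relations, there is no hidden state to mentally simulate — which is exactly the complexity the paper is trying to eliminate.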

I think this paper is becoming more relevant every year, and a lot of it is applicable today: for example, through the increased use of Functional and Dataflow Programming for managing state and state dependencies, or using the Error Kernel pattern when programming with Actors as a way of shielding the Essential State and Logic from the riskier Accidental State and Control layers below. We are getting there, step by step.

Runners-Up

Here are some other recent distributed systems favorites that made it to the final round but that I had to leave at home (5 is too few :-)):

Highly Available Transactions: Virtues and Limitations – Bailis et al. 2014.

(See the write-up on The Morning Paper).

Distributed transactions strike back! This paper got me to reconsider my pretty strong opinions on the usefulness of distributed transactions. A great reality check. Luckily it has been covered on this blog before.

Consistency without Borders – Alvaro et al. 2013

I really like this paper. It tells a very nice story about what consistency is really all about, what makes it hard, what is wrong with the way we approach it in today’s systems, and what we can do about it–at different levels in our programming model. It leaves more questions than answers, which I like. Food for thought.

Derflow: Distributed Deterministic Dataflow Programming for Erlang – Bravo et al. 2014

(See the write-up on The Morning Paper).

This research builds upon a long-time love of mine: Oz’s deterministic dataflow concurrency. It takes it distributed and then mixes in some CRDTs. Points towards the future.

Life Beyond Distributed Transactions – Helland 2007

(See the write-up on The Morning Paper).

This is one of my top favorites and a HUGE influence on me and on the work we are doing in Akka.

Raft Refloated: Do we have consensus?

March 13, 2015

Raft Refloated: Do we have consensus? – Howard et al. 2015

This is part ten of a ten-part series on consensus and replication.

We’re nearing the end of this journey, after looking at Viewstamped Replication (and VRR), Paxos, ZooKeeper’s atomic broadcast, and Raft. Not that we’ve exhausted all the literature on these topics – far from it! We did manage to exhaust my brain though :). These are hard papers to read critically, and difficult to provide concise summaries of – every little detail is often crucial to correct operation. Of course one of the main points of Raft is to reduce this cognitive overload, and one of the goals of Howard et al. is to reproduce the Raft work and find out whether this is really true.

Hopefully you’ve got a feel for how these algorithms work and the basic ideas they share in common. You can also learn the basic ideas of how an internal combustion engine works. But would you therefore decide it’s a good idea to build your own car engine? There’s a big gap between the theory of how an internal combustion engine works, and the practice of building a modern car engine. One of the themes that comes out through these papers is that building production quality consensus engines is hard. And those ‘obvious’ performance optimisations you’re tempted to slip in along the way are almost certainly not a good idea (unless you verify them with the same degree of rigour that the original protocols were developed with). Bugs can be very subtle and only occur in deep system traces with multiple levels of failures.

Consensus is at the very core of distributed systems design, and bad things can happen when it goes wrong. In the debate that often rages about whether software engineering can really justify the ‘engineering’ tag, this is one case where true engineering discipline is required. You really want to be using an algorithm that has been formally modeled, with proofs of its important properties (as Raft was modeled in TLA+). Any subsequent optimisations you’re tempted to make need to be validated against this model – it’s all too easy to make an ‘innocent’ change and find out you’ve broken a subtle assumption somewhere along the line. Model checkers can explore permissible system traces to flush out deep and subtle bugs. And then you need to instrument your actual implementation to ensure that it faithfully follows the modeled system, and to catch any problems that occur in the real world. Alternatively you could use a consensus library that meets all these criteria, and plug in your state machine on top…

In Raft Refloated Howard et al. demonstrate just this kind of rigour. Here we find an independent implementation of Raft to assess how easy it is to understand and how complete the original specification was; a full distributed systems simulator for reproducing performance results and exploring scenarios; and a formal model in Statecall Policy Language (SPL) used to validate over 10,000 traces of the system. The source code is all available under the MIT license too:

Our source code is available as open-source software under a MIT license at https://github.com/heidi-ann/ocaml-raft with tag v1.0, and can be installed via the OPAM package manager as raft-sim.1.0. The datasets used are also available at: https://github.com/heidi-ann/ocaml-raft-data, also tagged as v1.0.

The paper includes a nice summary of the Raft protocol, but we can skip that having covered Raft yesterday. The project is written in OCaml.

We chose OCaml as the implementation language for reproduction (as compared to the original implementation’s C++) due to its static typing and powerful module system. The core protocol implementation is pure and does not include any side-effecting code, and uses algebraic data types to restrict the behaviour of the protocol to its own safety criteria. Where the static type system was insufficient, we made liberal use of assertion checks to restrict run-time behaviour.

By modeling Raft’s state transitions in SPL it is possible to use model checking tools, and also to generate OCaml code that acts as a safety monitor at run-time:

Raft’s state transition models (such as Figure 3) are encoded in Statecall Policy Language (SPL). SPL is a first order imperative language for specifying non-deterministic finite state automata (NFA). We chose to use SPL due to its ability to be compiled to either Promela, for model checking in SPIN, or to OCaml, to act as a safety monitor at run-time. Alternatives to SPL include using systems such as MoDist that model-check protocol implementations directly, or encoding the model directly in Promela, as has been recently done for Paxos.

The simulator is also built in OCaml. In simulation it is possible to have a holistic view of the cluster state to aid in verification.

In order to evaluate our Raft protocol implementation across a range of diverse network environments, we also built a message-level network simulation framework in OCaml. Beyond evaluating the performance of the protocol, we can also use our simulation traces to catch subtle bugs in our implementation or the protocol specification. Since such issues may only occur very rarely (e.g. once in 10,000 protocol runs), a fast simulator offers the unique opportunity to address them via simulation trace analysis. Furthermore, we can use our simulator’s holistic view of the cluster to ensure that all nodes’ perspectives of the distributed system are consistent with respect to the protocol. To meet these domain-specific requirements, we designed our own simulation framework, instead of opting for traditional event-driven network simulators like ns3 or OMNeT++.

Using the simulator the authors were able to reproduce the performance results from the original Raft paper. “We were then able to use our framework to rapidly prototype optimizations to the protocol by virtue of having calibrated our simulation by replicating the authors’ original experimental setup.”

  • Instead of having a single timer used for both follower and candidate timeouts, elections proceed faster if two separate timers are used, with the candidate timeout set lower than the follower one. Under simulation of a highly contested environment, this led to leaders being established within 281ms compared to 1330ms without the optimization.
  • Binary exponential backoff for candidates rejected by a majority of leaders also improves election times. However, combining this with the lower candidate timer performed slightly worse than just using lower candidate timers in isolation.
  • A Tail at Scale effect exists for client command completions, with a small number of outliers (most commonly when a leader fails and a new one is subsequently elected) taking much longer than the normal case. The client commit timeout value is normally set much higher than the RTT to accommodate this. The introduction of a separate ClientCommit acknowledgement would enable the use of two distinct timers to separate the normal and election cases.
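The first optimization above can be sketched in a few lines (illustrative Python, not the authors' OCaml; the timeout ranges are hypothetical): draw follower and candidate timeouts from separate randomized ranges, with the candidate range set lower so that split elections are retried quickly.

```python
import random

# Hypothetical timeout ranges in milliseconds. Followers wait longer
# before suspecting a dead leader; candidates whose election splits the
# vote retry on a much shorter, still-randomized timer.
FOLLOWER_TIMEOUT_MS = (150, 300)
CANDIDATE_TIMEOUT_MS = (30, 60)

def next_timeout(role):
    # Randomization desynchronizes competing candidates, reducing
    # repeated split votes; separate ranges speed up retry after a split.
    lo, hi = FOLLOWER_TIMEOUT_MS if role == "follower" else CANDIDATE_TIMEOUT_MS
    return random.uniform(lo, hi)

# A candidate always retries faster than any follower would even begin
# a fresh election, so contested elections resolve sooner.
assert all(next_timeout("candidate") < min(FOLLOWER_TIMEOUT_MS)
           for _ in range(1000))
```

With a single shared timer, a failed candidate sits out a full follower-scale timeout before retrying; the two-timer variant is what the simulation credits for the 281ms vs 1330ms difference in a highly contested environment.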

The simulation traces were checked using SPL, and at no point were safety guarantees compromised.

However, we did observe permanent livelock in some of our simulation traces, caused by the interaction between the extra condition on commitment (detailed in §2.3.3) and our placement of the client request cache. We recommend that if a client request is blocked by the extra condition on commitment, the leader should create a no-op entry in its log and replicate this across the cluster. We refer the reader to our technical report for a detailed analysis of this issue.

(The “extra condition on commitment” refers to the fact that a new leader can only commit entries from a previous term once it has successfully replicated an entry from the current term).

And this brings us to the big question: is Raft really easier to understand than the alternatives?

In our experience, Raft’s high level ideas were easy to grasp—more so than with Paxos. However, the protocol still has many subtleties, particularly regarding the handling of client requests. The iterative protocol description modularizes and isolates the different aspects of the protocol for understandability by a reader, but this in our experience hinders implementation as it introduces unexpected interaction between components (see previous section). As with Paxos, the brevity of the original paper also leaves many implementation decisions to the reader. Some of these omissions, including detailed documentation of the more subtle aspects of the protocol and an analysis of the authors’ design decisions, are rectified in Ongaro’s PhD thesis on Raft. However, this only became available towards the end of our effort. Despite this, we believe that Raft has achieved its goal of being a “more understandable” consensus algorithm than Paxos.
