This is a very good question! I do not understand AWS's replication architecture well enough to reimplement it with standard Postgres yet. This behavior doesn't happen in single-node Postgres, as far as I can tell, but it might happen in some replication setups!
There is a universe where cloud providers announce each new database offering by commissioning a Jepsen test and iterating on the results until every issue has been resolved or at least documented.
Unfortunately reliability is not that high on the priority list here. Keep up the good work!
Folks on HN are often upset with the titles of Jepsen reports, so perhaps a little more context is in order. Jepsen reports are usually the product of a long collaboration with a client. Clients often have strong feelings about how the report is titled--is it too harsh on the system, or too favorable? Does it capture the most meaningful of the dozen-odd issues we found? Is it fair, in the sense that Jepsen aims to be an honest broker of database safety findings? How will it be interpreted in ten years when people link to it routinely, but the findings no longer apply to recent versions? The resulting discussions can be, ah, vigorous.
The way I've threaded this needle, after several frustrating attempts, is to have a policy of titling all reports "Jepsen: <system> <version>". HN is of course welcome to choose their own link text if they prefer a more descriptive, or colorful, phrase. :-)
Given that author and submitter (and commenter!) are all the same person I think we can go with your choice :)
The fact that the thread is high on HN, plus the GP comment is high in the thread, plus that the audience knows how interesting Jepsen reports get, should be enough to convey the needful.
This isn't just stale data, in the sense of "a point-in-time consistent snapshot which does not reflect some recent transactions". I think what's going on here is that a read-only transaction against a secondary can observe some transaction T, but also miss transactions which must have logically executed before T.
"I think what's going on here is that a read-only transaction against a secondary can observe some transaction T, but also miss transactions which must have logically executed before T."
I was intuitively wondering the same, but I'm having trouble reasoning about how the post's example with transactions 1, 2, 3, 4 exhibits this behavior. In the example, is transaction 2 the only read-only transaction and therefore the only transaction to read from the read replica? I.e., transactions 1, 3, 4 use the primary and transaction 2 uses the read replica?
Ah, so something like... if the primary ordered transaction 3 < transaction 1, but transaction 2 observes only transaction 1 on the read-only secondary, potentially because the secondary orders transaction 1 < transaction 3?
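Something like this toy example, maybe (in Python, and entirely made up -- the registers x and y and their values are mine, not the post's):

    # Toy check of the kind of history I mean. Entirely illustrative.
    #   T1 (primary):   write x = 1
    #   T3 (primary):   write y = 1
    #   T4 (primary):   read  x = 0, y = 1   (saw T3, missed T1)
    #   T2 (secondary): read  x = 1, y = 0   (saw T1, missed T3)
    from itertools import permutations

    writes = {"T1": {"x": 1}, "T3": {"y": 1}}
    reads  = {"T4": {"x": 0, "y": 1}, "T2": {"x": 1, "y": 0}}

    def consistent(read, order):
        # A read is consistent with a serial order of the writers if it
        # matches the state after some prefix of that order.
        state = {"x": 0, "y": 0}
        if read == state:
            return True
        for t in order:
            state = {**state, **writes[t]}
            if read == state:
                return True
        return False

    for order in permutations(writes):
        ok = all(consistent(r, order) for r in reads.values())
        print(order, "works for both reads?", ok)
    # Neither (T1, T3) nor (T3, T1) works: the primary and the secondary
    # effectively disagree on which write came first.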
The fundamental definition of consensus already requires that some proposal will eventually (given sufficient communicating, non-faulty, non-malicious nodes) win, so that doesn't seem particularly novel: https://lamport.azurewebsites.net/pubs/lower-bound.pdf
The core problem here is that it's impossible for any replica to tell whether its speculatively-executed operations were legal or not; it might have to admit "sorry, I lied to you", and back up at any point. That seems to be what they mean by "partial progress". You can guess that things might happen, but you will often be wrong.
This idea's been around for a while--Fekete et al outlined speculative execution at local replicas with eventual convergence on a Serializable history in their 1996 PODC paper "Eventually Serializable Data Services": https://groups.csail.mit.edu/tds/papers/Lynch/podc96-esds.pd.... These systems have essentially two tiers of operations: weak operations, which the system might re-order (which could cause totally different results, including turning successes to failures and vice-versa!), and strong operations, which are Serializable. I assume Cassandra does the same. Assuming it's correct, the strong operations work like any other consensus system: they must block (or abort) when there isn't sufficient communication with other replicas. The weak ones might give arbitrarily weird results. In CAP's terms, the strong operations are CP, the weak ones are AP.
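To make the "turning successes to failures" point concrete, here's a toy example (my own, not from the ESDS paper or Cassandra): two weak withdrawals against the same account, where the eventual order decides which one actually succeeds.

    # Toy example: reordering weak operations flips which one succeeds.
    ops = {"w60": 60, "w70": 70}   # two withdrawals against one account

    def run(order, balance=100):
        outcomes = {}
        for name in order:
            amount = ops[name]
            if balance >= amount:
                balance -= amount
                outcomes[name] = "ok"
            else:
                outcomes[name] = "insufficient funds"
        return outcomes

    # A replica speculatively runs w60 first and tells that client "ok"...
    print(run(["w60", "w70"]))   # {'w60': 'ok', 'w70': 'insufficient funds'}
    # ...but if the converged order puts w70 first, w60 retroactively fails.
    print(run(["w70", "w60"]))   # {'w70': 'ok', 'w60': 'insufficient funds'}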
You see similar dynamics at play in probabilistic consensus systems, like Bitcoin, by the way. In Bitcoin technically all operations are weak ones, but the probability of non-Serializable outcomes should decrease quickly over time.
Having a consensus system that merges conflicting proposals is a nice idea, but I don't think Cassandra is novel here either. I don't have a citation handy, but I recall a conversation with Heidi Howard at HPTS (maybe 2017?) where she explained that one of the advantages of leaderless Paxos is that when you're building a replicated state machine, you can treat what would normally be conflicting proposals from multiple replicas as sets of transitions. Instead of rejecting all but one proposal, you can union them, then apply them to the state machine in some (deterministic) order--say, by lexicographically sorting the proposals.
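Roughly, and in made-up pseudocode (this is my sketch of the idea, not Cassandra's or anyone's actual protocol):

    # Sketch: merge conflicting proposals instead of rejecting all but one.
    def decide(proposals):
        # proposals: sets of state-machine transitions offered by different
        # replicas for the same slot. Union them, then pick a deterministic
        # order -- here, plain lexicographic sorting.
        merged = set().union(*proposals)
        return sorted(merged)

    def apply_all(state, transitions):
        # Every replica applies the same decided transitions in the same
        # order, so all replicas converge on the same state.
        for key, value in transitions:
            state[key] = value
        return state

    replica_a = {("x", 1)}
    replica_b = {("x", 3), ("y", 2)}

    decided = decide([replica_a, replica_b])
    print(decided)                 # [('x', 1), ('x', 3), ('y', 2)]
    print(apply_all({}, decided))  # {'x': 3, 'y': 2} everywhere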
Kafka actually does call these transactions! However (and this is a loooong discussion I can't really dig into right now) there's sort of two ways to look at "exactly once". One is in the sense that DB transactions are "exactly once"; a transaction's effects shouldn't be duplicated or lost. But in another sense "exactly once" is a sort of dataflow graph property that relates messages across topic-partitions. That's a little more akin to ACID "consistency".
You can use transactions to get to that dataflow property, in the same sort of way that Serializable transaction systems guarantee certain kinds of domain-level consistency. For example, Serializability guarantees that any invariant preserved by a set of transactions, considered purely in isolation, is also preserved by concurrent histories of those transactions. I think you can argue Kafka intends to reach "exactly-once semantics" through transactions in that way.
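For concreteness, the consume-transform-produce loop that story is built around looks roughly like this (a sketch using the confluent-kafka Python client; topic names, configs, and error handling are placeholders, and I'm going from memory on the API):

    from confluent_kafka import Consumer, Producer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "example-group",
        "isolation.level": "read_committed",  # only observe committed txns
        "enable.auto.commit": False,          # offsets commit inside the txn
    })
    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "transactional.id": "example-txn-id",
    })

    consumer.subscribe(["input-topic"])
    producer.init_transactions()

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        producer.begin_transaction()
        try:
            producer.produce("output-topic", value=msg.value())
            # The consumer's offsets ride in the same transaction as the
            # produced record: the read and the write become visible (or
            # are discarded) together.
            producer.send_offsets_to_transaction(
                consumer.position(consumer.assignment()),
                consumer.consumer_group_metadata())
            producer.commit_transaction()
        except Exception:
            producer.abort_transaction()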
Both are true, but we use "transactions" for clarity, since the semantics of consumers outside transactions is even murkier. Every read in this workload takes place in the context of a transaction, and goes through the transactional offset commit path.
Ah, got it; I was assuming that “transactions” was referring to the transactions mentioned as the subject of the previous sentence, not the transactions active in consumers observing those. My mistake!
If you are looking for fun targets, may I suggest KubeMQ too? Its author claims that it's better than Kafka, Redis, and RabbitMQ. It's also "Kubernetes native", but the open source version refuses to start if it detects Kubernetes.
Not that I'm a Kafka user, but I greatly appreciate your posts, so thank you :)
Maybe Kafka users should do a crowdfund for it if the companies aren't willing. Realistically, what would the goal of the crowdfund have to be for you to consider it?
I don't think aphyr would disagree with me when I say that FDB's testing regime is the gold standard and Jepsen is trying to get there, not the other way around.
I'm not sure. I've worked on a few projects now which employed simulation testing and passed, only to discover serious bugs using Jepsen. State space exploration and oracle design are hard problems, and I'm not convinced there's a single, ideal path for DB testing that subsumes all others. I prefer more of a "complete breakfast" approach.
On another axis: Jepsen isn't "trying to get there [to FDB's testing]" because Jepsen and FDB's tests are solving different problems. Jepsen exists to test arbitrary, third-party databases without their cooperation, or even access to the source. FoundationDB's test suite is designed to test FoundationDB, and they have political and engineering buy-in to design the database from the ground up to cooperate with a deterministic (and, I suspect, protocol-aware) simulation framework.
To some extent Antithesis may be able to bridge the gap by rendering arbitrary distributed binaries deterministic. Something I'd like to explore!
Has your opinion changed on that in the last few years? I could have sworn you were on record as saying this about foundation in the past but I couldn’t find it in my links.
I don't think so, but I've said a lot about databases in the last fifteen years haha.
Sometimes I look at what people say about FDB and it feels like... folks are putting words in my mouth that I don't recognize. I was very impressed by a short phone conversation with their engineers ~12 years ago. That's good, but that's not, like, a substantive experimental evaluation. That's "I focus my unpaid efforts on databases which seem more likely to yield fun, interesting results".
Hey mate, think we interacted briefly on the Confluent Slack while you were working on this, something about outstanding TXes potentially interfering with consumption in the same process IIRC?
This isn't the first time you've discussed how parlous the Kafka tx spec is - not that that's even really a spec as such. I think this came up in your Redpanda analysis.
(And totally agree with you btw, some of the worst ever customer Kafka issues I dealt with at RH involved transactions.)
So was wondering what your ideal spec would look like, because I'd be interested in trying to capture the tx semantics in something like TLA+ as a learning experience - and because it would only help FOSS Kafka and FOSS clients improve, especially now that Confluent has withdrawn so much from Apache Kafka development.
I'm not really sure how to answer this question, but even a few chapters' worth of clear prose would go a long way. We lay out a bunch of questions in the discussion section; answers to those would be really helpful in firming up the intended txn semantics.
I also understand there are lots of ways to do Postgres replication in general, with varying results. For instance, here's Bin Wang's report on Patroni: https://www.binwang.me/2024-12-02-PostgreSQL-High-Availabili...