Description
Observed behavior
In nats-server 2.12.1, membership changes, process kills, and network partitions can cause the loss of some or all acknowledged records to a Jetstream stream, even with sync-interval always.
For example, take this test run, in which five out of 5236 acknowledged records published in the middle of the test were lost. The following plot shows published records arranged by the time a client sent the publish message for that record. Each record is a single point, and the color indicates whether the record was "ok" (eventually read), "lost" (acknowledged but never read) or "unknown" (the publish operation was not acknowledged, and the message was never read). Records are spread over the vertical axis to show approximate throughput over time. The data loss event is visible as a small red streak around 16 seconds:
This test involves five nodes: n1 through n5. Around 16 seconds into the test, we partitioned node n1 away from the other four, then killed nodes n1, n2, and n3, then removed node n4, in quick succession. While this was occurring, we published records 7-63, 8-63, 9-63, 11-63, and 12-63, all of which were acknowledged successfully by their respective nodes.
{:index 2022, :time 16109960074, :type :ok, :process 4, :f :publish, :value "4-63"}
{:index 2023, :time 16112167773, :type :invoke, :process 5, :f :publish, :value "5-63"}
{:index 2024, :time 16114235936, :type :info, :process :nemesis, :f :start-partition, :value [:isolated {"n1" #{"n2" "n5" "n4" "n3"}, "n2" #{"n1"}, "n5" #{"n1"}, "n4" #{"n1"}, "n3" #{"n1"}}]}
{:index 2025, :time 16114263329, :type :info, :process :nemesis, :f :kill, :value :majority}
{:index 2026, :time 16122492214, :type :invoke, :process 7, :f :publish, :value "7-63"}
{:index 2027, :time 16123329080, :type :ok, :process 7, :f :publish, :value "7-63"}
{:index 2028, :time 16124539204, :type :invoke, :process 8, :f :publish, :value "8-63"}
{:index 2029, :time 16125058438, :type :ok, :process 8, :f :publish, :value "8-63"}
{:index 2030, :time 16132755900, :type :invoke, :process 9, :f :publish, :value "9-63"}
{:index 2031, :time 16133716428, :type :ok, :process 9, :f :publish, :value "9-63"}
{:index 2032, :time 16134796130, :type :invoke, :process 10, :f :publish, :value "10-63"}
{:index 2033, :time 16145849910, :type :invoke, :process 11, :f :publish, :value "11-63"}
{:index 2034, :time 16146687908, :type :ok, :process 11, :f :publish, :value "11-63"}
{:index 2035, :time 16164675317, :type :invoke, :process 12, :f :publish, :value "12-63"}
{:index 2036, :time 16165734007, :type :ok, :process 12, :f :publish, :value "12-63"}
{:index 2037, :time 16179824530, :type :info, :process :nemesis, :f :kill, :value {"n1" :killed, "n3" :killed, "n2" :killed}}
{:index 2038, :time 16179871864, :type :info, :process :nemesis, :f :leave, :value "n4"}

After a few hundred seconds of additional faults, the cluster recovered, and was able to acknowledge more published messages. At the end of the test we rejoined all nodes, and, on each node, attempted to read every message from the stream using repeated calls to fetch. These reads observed stream values like:
{:index 15568, :time 319383763266, :type :invoke, :process 362, :f :fetch, :value nil}
{:index 15569, :time 319384992110, :type :ok, :process 362, :f :fetch, :value ["1-58" ... "4-63" "158-5" "158-6" ...]}

Note that value 4-63, acknowledged immediately prior to the partition at 16 seconds, is present in the final read. However, values like 7-63, which were also acknowledged as successful, are missing. Instead, the stream's values jump directly to 158-5, which was written much later in the test.
The problem here is not necessarily that NATS lost data--we should expect that some combinations of faults should be capable of breaking the consensus mechanism permanently. The problem is that NATS lost data silently--it went on to accept more reads and writes after this time as if nothing had gone wrong!
I'm still trying to narrow down the conditions under which this can happen. I've got lots of cases of total data loss, but that may actually be OK, given that the test may have removed too many nodes from the cluster. If you've got any insight, I'd love to hear about it! :-)
Expected behavior
NATS should not violate consensus; once data loss has occurred, it should refuse further writes and reads.
Server and client version
This is with nats-server 2.12.1, and the jnats client at version 2.24.0.
Host environment
These nodes are running in LXC containers on Linux Mint (kernel 6.8.0-85-generic), AMD x64.
Steps to reproduce
You can reproduce these results with the Jepsen test harness for NATS, at commit 50fae2030ac3995703bccca6ae72bc35b9bce8b6, by running:
lein run test-all --nemesis membership,partition,kill --time-limit 300 --leave-db-running --version 2.12.1 --sync-interval always --rate 100 --test-count 50 --no-lazyfs
Activity
aphyr commented on Nov 13, 2025
Here's a second case, which lost 55 writes in the middle of the test. In this case, the membership view derived from nats server request jetstream --leader --streams never dropped below three out of five nodes. At least... this is what I'm doing to try and figure out which nodes are members, which might be wrong.

MauriceVanVeen commented on Nov 13, 2025
Thank you for reporting! I can also reproduce this on my end using:
There are a couple of topics here that need to be discussed.
NATS JetStream (mostly) doesn't refuse to serve reads or writes permanently. One example is when using a replicated in-memory stream. Now, this is different from a file-based stream: since the data lives only in memory, if all servers hosting that stream restart, the data is (obviously) gone. But once the servers come back up, they'll still remember through the meta layer (which is still persisted on disk) that they have been assigned that stream. That in-memory stream will then come up empty, and that would be expected.
This doesn't quite extend to file-based streams, but it illustrates that there are cases where it's expected that the system will continue to be available. I have tested the same with a replicated file-based stream, manually removing the stream directories on disk (but preserving the meta state/assignment). There the expectation of the system refusing any changes holds: no leader could be elected and all servers were essentially "stuck", requiring manual intervention to recreate the stream. However, this still assumes stable and unchanged membership.
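For context, a replicated file-based stream of the kind discussed here can be created from the CLI roughly as follows. This is only a sketch; the stream name events and subject events.> are placeholders, not the test's actual configuration:

```sh
# Sketch: a file-backed stream replicated across all five nodes.
# Stream name and subject are placeholders.
nats stream add events \
  --subjects 'events.>' \
  --storage file \
  --replicas 5 \
  --defaults
```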
I'm thinking it's the membership changes that have issues here, likely a combination of too many membership changes happening too quickly. When running the above command without the membership nemesis, everything seems to be good. Whereas when using membership, I had some runs where nodes would be removed too quickly and either the stream could end up losing data, or the stream itself would cease to exist.

Membership changes seem like a topic where NATS JetStream may differ from other systems. Some context though: a "Core" NATS system without JetStream doesn't require any backing state and can scale up and down in cluster size dynamically, with no need for peer-remove. When using JetStream, however, there needs to be persistent state, and there are more things to consider around membership.

A couple of comments around the use of peer-remove. See also the mention in the docs: https://docs.nats.io/running-a-nats-service/configuration/clustering/jetstream_clustering/administration#evicting-a-peer, and in the CLI when not passing the -f/--force flag: https://github.com/nats-io/natscli/blob/9cebb956791ff8f9f10d35f11af8d471af0cdbd8/cli/server_cluster_command.go#L166-L176

Peer-removing servers is meant to be a permanent action where nodes are already gone and never coming back. For example, in a system with 5 nodes and an R3 stream, if you peer-remove one of the 3 servers hosting the stream, the stream assignment will automatically be reassigned so that one of the 2 other nodes in the cluster is assigned and starts catching up on data, such that it can be a complete R3 stream again. Or, on a 5 node R5 stream, you'd keep the dead node, add a new node to replace the dead one, and then peer-remove the dead node so that data can be replicated to the new node.
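As a concrete example, a cluster-level peer-remove from the CLI looks roughly like this (n4 is just an example node name):

```sh
# Sketch: permanently remove a node from the JetStream meta cluster.
# Without -f/--force the CLI warns that this is intended for servers
# that are gone and never coming back, as linked above.
nats server cluster peer-remove n4
```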
For the sake of testing membership changes, peer-remove may be possible to use, but it requires an addition to be safer. When the server comes back, you should not only wait for the node to show up in various reports (which it will do almost immediately after coming online empty), but also do a HEALTHZ request and confirm it reports 200 OK for that node. That should ensure the wiped node catches up with the meta layer to be assigned the stream again, and then also catches up on the stream data. Health checking can be done using nats server report health, which visualizes the data that nats server request jetstream-health returns for all servers.

Ideally you'd only change membership of one node at a time, while confirming the health check is happy before continuing. The system will currently allow you to forcefully perform any changes, like peer-removes, without stopping you. This means that in situations like these, safety is currently also in the hands of the user, which indeed can be tricky to ensure when automating this. (And there likely will still be cases where the system can be improved to provide more safety guarantees, so the user doesn't need to know.)
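A rough sketch of that wait, assuming each node exposes the HTTP monitoring port (8222 here, a placeholder) so its HEALTHZ endpoint can be polled directly:

```sh
# Poll the rejoined node's HEALTHZ endpoint until it returns 200 OK.
until curl -fsS "http://n4:8222/healthz" >/dev/null; do
  sleep 1
done

# Cross-check the cluster-wide view as well.
nats server report health
```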
aphyr commented on Nov 14, 2025
Just in case I've misunderstood--Jetstream streams are on-disk, not in-memory, right?
I concur--the membership nemesis is probably leaving/joining nodes "too fast" for Jetstream to keep up. This raises a sort of awkward question: is there any safe way to remove a node from a JetStream cluster? As far as I can tell, operators are expected to simply destroy nodes and optionally inform the cluster that they're gone. I try to keep this safe by looking at the output of nats-cli commands, but of course this has race conditions: the picture the cluster gives you could be out-of-date--especially when the cluster is under stress--and so even if nats-cli says you have enough capacity to remove a node, it could be wrong.

To wit, the Evicting a Peer docs say "A 5 node cluster can withstand 2 nodes being down", but even though our original five-node cluster never dipped below three nodes--at least from nats-cli's point of view--we still lost data.

Ah, this is a little surprising--usually you can re-join a physical machine to a cluster as if it were a fresh node. Is NATS interpreting DNS names as node identities, or do nodes generate some sort of unique ID on startup, say via consensus, and store it in their data files? What is the proper procedure for re-joining a node which has left the cluster?
I wanted to ask about this too: it looks to me like rejoined nodes stall for a couple of minutes before accepting any requests or showing up in the membership list again. Is there a way to speed that up?
Right, so this brings me back to the core point above: I agree that the current membership nemesis is likely unsafe, and can create conditions under which consensus is impossible. Ideally, on violating consensus, the stream should shut down in its entirety--otherwise you've got silent data loss. A worse, but still not as awful, behavior would be to lose a prefix of the stream. What Jetstream does is worse than that: it loses records in the middle of the stream. This is bad, no?
aphyr commented on Nov 14, 2025
Here's a run, incidentally, which lost a prefix of the stream with membership changes and network partitions: 20251112T211628.337-0600.zip
MauriceVanVeen commented on Nov 14, 2025
They can be configured to be either file-based or memory-based, as well as replicated or not replicated. By default a stream is not replicated (R1) and file-based. Your setup seems to be correct, by explicitly setting the storage type to file-based and setting replicas to 5 (your cluster size). The memory-based example was just for context.
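For completeness, a stream's effective storage type and replica count can be confirmed from the CLI (the stream name events is a placeholder):

```sh
# Shows the stream's configuration, including storage type, replicas,
# and current replica placement across the cluster.
nats stream info events
```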
If a node dies or is destroyed, you must always inform JetStream, by using a peer-remove, if this node will never come back. JetStream does not automatically peer-remove a node on its own. If the system is not informed that a node died and will not come back, then that node will keep counting towards quorum, making the cluster think it's larger than it really is.
The "5 nodes can withstand 2 down" case describes the situation when not performing peer-removes. If you peer-remove a single server in a 5 node system, your system then becomes 4 nodes (quorum of 3). The membership nemesis doing peer-removes too quickly will likely mean the cluster then shrinks to a size of 3 nodes with a quorum of 2, and so on.
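Concretely, quorum tracks the cluster size the meta layer believes it has:

```sh
# quorum(N) = floor(N/2) + 1
echo $(( 5 / 2 + 1 ))   # 5 nodes                -> quorum 3
echo $(( 4 / 2 + 1 ))   # after one peer-remove  -> quorum 3
echo $(( 3 / 2 + 1 ))   # after two peer-removes -> quorum 2
```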
The server_name is what constitutes a node's identity, so it must stay the same across restarts, etc. This is already properly done in your case by having static names like n1, n2, n3, etc.

The proper way to join a new node (which can still be the same physical machine) is to ensure it's empty as well as use a new identity, i.e. a new server_name, so it restores data from its peers from a clean slate. (The answer continues below the next response.)

This is correct, as the intention behind a peer-remove is that the server is dead/destroyed and will never come back. There's otherwise no reason to peer-remove a node on the cluster level if your intention is that the server comes back after some downtime due to a restart.
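As a rough illustration of rejoining the same machine under a fresh identity (every name, port, and path below is a placeholder, not the test's actual configuration):

```sh
# Sketch: start the wiped machine as a "new" node with a fresh server_name
# and an empty JetStream store directory.
cat > /etc/nats/server.conf <<'EOF'
server_name: n6              # fresh identity; do not reuse the removed name
listen: 0.0.0.0:4222
jetstream {
  store_dir: /var/lib/nats   # must start empty
}
cluster {
  name: jsc
  listen: 0.0.0.0:6222
  routes: [ nats://n2:6222, nats://n3:6222, nats://n5:6222 ]
}
EOF
nats-server -c /etc/nats/server.conf
```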
Small tangent: you can separately peer-remove on the stream level only, but that really only makes sense if you have a 5 node cluster with an R3 stream and you'd like one of the followers to not be on node A but on a different remaining node. But that still isn't something you'd need to do normally, unless you want to manually "balance" followers when you've got many stream/consumer assets on a particular node.
What you're seeing is the peer-removed node not being allowed to return, as per the timeout defined here, which is 5 minutes:

nats-server/server/raft.go, line 266 in aa1b2fc
As mentioned by the peer-remove command in the CLI, this can be sped up by restarting the entire cluster, as servers keep in memory that the node was peer-removed until a restart: https://github.com/nats-io/natscli/blob/9cebb956791ff8f9f10d35f11af8d471af0cdbd8/cli/server_cluster_command.go#L166-L176
Likely you don't actually need to restart the full cluster; you can probably restart a single server and then elect that server to become the new meta leader. But then again, that's not the intention behind peer-remove. Peer-remove means the server is never coming back, so there's no easy way to speed that up, as it's not meant to ever happen that way.
This is indeed bad, and likely a result of not performing health checks before performing more peer-removes and rejoins.
If you're not waiting for the health check, that rejoined server will probably only catch up with half of the stream ("the middle"). Then the nemeses go on to peer-remove, hard kill, and/or partition more servers, which isn't safe. Then new writes will replace what was after "the middle", and you'd have lost data.
Important to mention though: if the cluster nodes are not peer-removed but only temporarily down, for example due to a restart, hard kill, or version upgrade, and as a result of that we can't achieve quorum, then the stream indeed stops and refuses writes as you'd expect. The issue here is that by peer-removing you are deliberately telling the cluster that the cluster size, and the quorum needed, will be lowered.
So I'm expecting that if you peer-remove node X, then have it rejoin and wait for the health check to report 200 OK, then node X's stream state will be fully restored, not only partially, and there will be no data loss.
Like I mentioned, the system currently allows you to do things that are unsafe by forcing the system's state to be a certain way. That's what the membership nemesis is currently taking advantage of to lose data. Waiting for the health check in that nemesis should resolve that. I agree though the system can be better at ensuring safety during these member changes, so this could be improved in the future such that you can't get yourself into this situation in the first place.
[Title changed from "Possible data loss with membership changes, crashes, and partitions" to "Data loss with membership changes and partitions"]

aphyr commented on Nov 20, 2025
Thanks again for the note about node names! I've modified the test in 9f85a930713e240f1728c3fe3dc18c013ce0d509 to generate fresh node names each time it wipes a node. I don't have a ton of time to dig into this, beyond saying "yup, still looks like data loss"--but hopefully that'll help make it easier to explore behavior going forward!
aphyr commented on Nov 20, 2025
On the plus side, it looks like now that we're doing fresh node names, all the failures I'm seeing with these (admittedly cursory) tests are due to the cluster being totally unavailable for reads at the end. This could be fine--that's not really proof of a safety violation any more.
On the other hand, it's weird that membership changes which leave 3/5 nodes intact for the entire test can break quorum. For instance, in 20251119T215937.898-0600.zip, we only ever removed nodes n2 and n4; n1, n3, and n5 were in place for the entire test. With a target replication factor of five, there should have been at least three copies of every acked write on the five-node cluster. And with one node gone, there should be four replicas, which still requires a three-node majority... More to investigate, I imagine!
MauriceVanVeen commented on Nov 20, 2025
Have opened a draft PR with a couple fixes for the membership nemesis: jepsen-io/nats#1 (Not sure if this is the usual process, otherwise let me know)
Before this I'd never coded in Clojure, so it could be either "good" or "very bad", but I can't tell that myself just yet 😅
The PR description outlines some of the fixes, and for me it's looking a bit better.
But there still seems to be one nasty issue where the membership nemesis just runs through all nodes and peer-removes all of them. Then, coming up to the "fetching stage", it re-joins all the empty nodes.
I'm not sure why the nemesis doesn't keep track of only peer-removing a minority and can sometimes wipe all nodes; I haven't figured that out yet. I'm thinking it may be due to the bump in node names.
MauriceVanVeen commented on Nov 21, 2025
Have opened another draft PR, and it's looking a lot better: jepsen-io/nats#2
Peer-removals now seem to be safely handled, apart from one server issue mentioned in the description: the meta leader responds immediately to the peer-remove request instead of waiting for the peer removal to get quorum, resulting in cluster operations safely halting until the peer-remove is manually retried, at which point all will be operational again.
[FIXED] Meta peer-remove response after quorum & disallow concurrent …
MauriceVanVeen commented on Nov 27, 2025
The docs have been updated to be clearer about peer-removes, nats-io/nats.docs#893.
https://docs.nats.io/running-a-nats-service/configuration/clustering/jetstream_clustering/administration#evicting-a-peer
aphyr commented on Nov 29, 2025
TY!