With org.apache.kafka/kafka-clients 3.0.0, under rare circumstances involving multiple node and network failures, I've observed a call to `producer.send()` throw `NotLeaderOrFollowerException` for a message which later appears in `consumer.poll()` return values.
I don't have a reliable repro case for this yet, but the case I hit involved retries=1000, acks=all, and idempotence enabled. I suspect what might be happening here is that an initial attempt to send the message makes it to the server and is committed, but the acknowledgement is lost e.g. due to timeout; the Kafka producer then automatically retries the send attempt, and on that retry hits a NotLeaderOrFollowerException, which is thrown back to the caller. If we interpret NotLeaderOrFollowerException as a definite failure, then this would constitute an aborted read.
I've seen issues like this in a number of databases around client or server-internal retry mechanisms, and I think the thing to do is: rather than throwing the most recent error, throw the most indefinite. That way clients know that their request may have actually succeeded, and they won't (e.g.) attempt to re-submit a non-idempotent request again.
As a side note: is there... perhaps documentation on which errors in Kafka are supposed to be definite vs indefinite? NotLeaderOrFollowerException is a subclass of RetriableException, but it looks like RetriableException is more about transient vs permanent errors than whether it's safe to retry.
aphyr Thanks for the bug report. My guess is that you might be hitting a race condition such as the following during a leader change:
1. Leader takes the write at offset N and parks the request in purgatory to await replication.
2. Follower acks offset N-1 and writes the data up to offset N.
3. Follower becomes new leader before it is able to ack offset N.
4. Old leader sees the leader change and returns NOT_LEADER to the still-parked produce request.
In this scenario, the new leader has the data and it may become committed even though the produce request returned NOT_LEADER. There might be a better error to return in this scenario, but we do want the producer to realize that it needs to find the new leader. I guess we could also hold the request a bit longer until we know whether or not it was committed. Failing that, we should make it clear that NOT_LEADER does not necessarily indicate a failure to write the data.
Jason Gustafson
added a comment - aphyr Thanks for the bug report. My guess is that you might be hitting a race condition such as the following during a leader change:
1. Leader takes the write at offset N and parks the request in purgatory to await replication.
2. Follower acks offset N-1 and writes the data up to offset N.
3. Follower becomes new leader before it is able to ack offset N.
4. Old leader sees the leader change and returns NOT_LEADER to the still-parked produce request.
In this scenario, the new leader has the data and it may become committed even though the produce request returned NOT_LEADER. There might be a better error to return in this scenario, but we do want the producer to realize that it needs to find the new leader. I guess we could also hold the request a bit longer until we know whether or not it was committed. Failing that, we should make it clear that NOT_LEADER does not necessarily indicate a failure to write the data.
junrao Any thoughts?
aphyr : Thanks for reporting this issue. hachikuji : Thanks for the investigation.
NOT_LEADER_OR_FOLLOWER is considered an indefinite error, which means that the produce request may or may not have succeeded. We could wait a bit longer to return a more precise response. However, we need to make sure that the record committed at offset N is indeed the one from the producer to be acknowledged. To do that, we could save the leader epoch after the produce records are appended to the leader's log. When checking the purgatory for completion, we could compare the records' committed leader epoch (probably through leader epoch cache) with the expected one.
Jun Rao
added a comment - aphyr : Thanks for reporting this issue. hachikuji : Thanks for the investigation.
NOT_LEADER_OR_FOLLOWER is considered an indefinite error, which means that the produce request may or may not have succeeded. We could wait a bit longer to return a more precise response. However, we need to make sure that the record committed at offset N is indeed the one from the producer to be acknowledged. To do that, we could save the leader epoch after the produce records are appended to the leader's log. When checking the purgatory for completion, we could compare the records' committed leader epoch (probably through leader epoch cache) with the expected one.
> NOT_LEADER_OR_FOLLOWER is considered an indefinite error
Aha! In that case this isn't a bug at all, and we can close it!
Is there documentation for which errors are definite vs indefinite? I've been relying on the javadocs and asking kafka users how they interpret these errors, and that may have led me astray here.
Kyle Kingsbury
added a comment - > NOT_LEADER_OR_FOLLOWER is considered an indefinite error
Aha! In that case this isn't a bug at all, and we can close it!
Is there documentation for which errors are definite vs indefinite? I've been relying on the javadocs and asking kafka users how they interpret these errors, and that may have led me astray here.
Another case like this–I'm seeing transactions fail with `InvalidTxnStateException: The producer attempted a transactional operation in an invalid state.`, which sounds like it should be a definite failure, but... sometimes appears to actually commit. Is this error intended to be definite? Is this documented anywhere?
Kyle Kingsbury
added a comment - Another case like this–I'm seeing transactions fail with `InvalidTxnStateException: The producer attempted a transactional operation in an invalid state.`, which sounds like it should be a definite failure, but... sometimes appears to actually commit. Is this error intended to be definite? Is this documented anywhere?
We are 2 years later this bug report that assess exactly once processing is broken.
I see some ideas, but I see no beginning of a solution. Is there any workaround or fix ?
Do you guys think that this is not important ?
Maybe some warning notice should be added on the documentation, because people choosing to pay the price of exactly-once generally care a lot about consistency !
Laurenceau Julien
added a comment - We are 2 years later this bug report that assess exactly once processing is broken.
I see some ideas, but I see no beginning of a solution. Is there any workaround or fix ?
Do you guys think that this is not important ?
Maybe some warning notice should be added on the documentation, because people choosing to pay the price of exactly-once generally care a lot about consistency !
People
Unassigned
Kyle Kingsbury
Votes:
1Vote for this issue
Watchers:
5Start watching this issue
Dates
Created:
Updated:
{"report":{"fcp":4138,"ttfb":873.6999988555908,"pageVisibility":"visible","entityId":13420157,"key":"jira.project.issue.view-issue","isInitial":true,"threshold":1000,"elementTimings":{"app-header":4138},"userDeviceMemory":8,"userDeviceProcessors":4,"apdex":0,"firstPaint":4138,"journeyId":"96559f31-7e7f-44d8-a519-ca2cb249f8a4","navigationType":0,"readyForUser":4172.699998855591,"redirectCount":0,"resourceLoadedEnd":3863.3999996185303,"resourceLoadedStart":881.7999992370605,"resourceTiming":[{"duration":552.7000007629395,"initiatorType":"link","name":"https://issues.apache.org/jira/s/b62489a2eaac59d9b8a093c1a51d034f-CDN/xd97tr/820010/13pdxe5/49fa3aa3d35a2cc689cbf274e66cc41a/_/download/contextbatch/css/_super/batch.css","startTime":881.7999992370605,"connectEnd":881.7999992370605,"connectStart":881.7999992370605,"domainLookupEnd":881.7999992370605,"domainLookupStart":881.7999992370605,"fetchStart":881.7999992370605,"redirectEnd":0,"redirectStart":0,"requestStart":1132.5,"responseEnd":1434.5,"responseStart":1413.599998474121,"secureConnectionStart":881.7999992370605},{"duration":1517.1999988555908,"initiatorType":"link","name":"https://issues.apache.org/jira/s/56490edcf9d54e35149505f78cca6a47-CDN/xd97tr/820010/13pdxe5/72cb823bcc50211a60c1ebe830467cae/_/download/contextbatch/css/jira.browse.project,jira.view.issue,project.issue.navigator,atl.general,atl.global,jira.global,jira.general,-_super/batch.css?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&richediton=true&slack-enabled=true","startTime":881.8999996185303,"connectEnd":1399.1999988555908,"connectStart":883.3999996185303,"domainLookupEnd":881.8999996185303,"domainLookupStart":881.8999996185303,"fetchStart":881.8999996185303,"redirectEnd":0,"redirectStart":0,"requestStart":1399.1999988555908,"responseEnd":2399.099998474121,"responseStart":1662.599998474121,"secureConnectionStart":1145.8999996185303},{"duration":1370.2999992370605,"initiatorType":"script","name":"https://issues.apache.org/jira/s/5263129088916436ab9aeb2417075b3f-CDN/xd97tr/820010/13pdxe5/49fa3aa3d35a2cc689cbf274e66cc41a/_/download/contextbatch/js/_super/batch.js?locale=en-UK","startTime":882,"connectEnd":882,"connectStart":882,"domainLookupEnd":882,"domainLookupStart":882,"fetchStart":882,"redirectEnd":0,"redirectStart":0,"requestStart":1434.5,"responseEnd":2252.2999992370605,"responseStart":1721.7999992370605,"secureConnectionStart":882},{"duration":2981.3999996185303,"initiatorType":"script","name":"https://issues.apache.org/jira/s/611c208bd094adb71a6f4f3e7f6fff3d-CDN/xd97tr/820010/13pdxe5/72cb823bcc50211a60c1ebe830467cae/_/download/contextbatch/js/jira.browse.project,jira.view.issue,project.issue.navigator,atl.general,atl.global,jira.global,jira.general,-_super/batch.js?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&locale=en-UK&richediton=true&slack-enabled=true","startTime":882,"connectEnd":1435.099998474121,"connectStart":883.7999992370605,"domainLookupEnd":882,"domainLookupStart":882,"fetchStart":882,"redirectEnd":0,"redirectStart":0,"requestStart":1435.099998474121,"responseEnd":3863.3999996185303,"responseStart":1724.6999988555908,"secureConnectionStart":1163.8999996185303},{"duration":813.9000015258789,"initiatorType":"link","name":"https://issues.apache.org/jira/s/981f587853769311cda7c3b845131a06-CDN/xd97tr/820010/13pdxe5/cb5a5495a038c0744457f25821ba9ee8/_/download/contextbatch/css/jira.global.look-and-feel,-_super/batch.css","startTime":882.0999984741211,"connectEnd":1426.599998474121,"connectStart":884.0999984741211,"domainLookupEnd":882.0999984741211,"domainLookupStart":882.0999984741211,"fetchStart":882.0999984741211,"redirectEnd":0,"redirectStart":0,"requestStart":1426.599998474121,"responseEnd":1696,"responseStart":1695.599998474121,"secureConnectionStart":1159.8999996185303},{"duration":830.9000015258789,"initiatorType":"script","name":"https://issues.apache.org/jira/s/d41d8cd98f00b204e9800998ecf8427e-CDN/xd97tr/820010/13pdxe5/1.0/_/download/batch/jira.webresources:calendar-en/jira.webresources:calendar-en.js","startTime":882.0999984741211,"connectEnd":1437.1999988555908,"connectStart":883.8999996185303,"domainLookupEnd":882.0999984741211,"domainLookupStart":882.0999984741211,"fetchStart":882.0999984741211,"redirectEnd":0,"redirectStart":0,"requestStart":1437.1999988555908,"responseEnd":1713,"responseStart":1712.599998474121,"secureConnectionStart":1164.7999992370605},{"duration":1081.900001525879,"initiatorType":"script","name":"https://issues.apache.org/jira/s/d41d8cd98f00b204e9800998ecf8427e-CDN/xd97tr/820010/13pdxe5/1.0/_/download/batch/jira.webresources:calendar-localisation-moment/jira.webresources:calendar-localisation-moment.js","startTime":882.0999984741211,"connectEnd":882.0999984741211,"connectStart":882.0999984741211,"domainLookupEnd":882.0999984741211,"domainLookupStart":882.0999984741211,"fetchStart":882.0999984741211,"redirectEnd":0,"redirectStart":0,"requestStart":1696,"responseEnd":1964,"responseStart":1963.6999988555908,"secureConnectionStart":882.0999984741211},{"duration":1092.3000011444092,"initiatorType":"script","name":"https://issues.apache.org/jira/rest/api/1.0/shortcuts/820010/5840efff50357da9055d4714dc0713f/shortcuts.js?context=issuenavigation&context=issueaction","startTime":882.0999984741211,"connectEnd":882.0999984741211,"connectStart":882.0999984741211,"domainLookupEnd":882.0999984741211,"domainLookupStart":882.0999984741211,"fetchStart":882.0999984741211,"redirectEnd":0,"redirectStart":0,"requestStart":1704.2999992370605,"responseEnd":1974.3999996185303,"responseStart":1973.8999996185303,"secureConnectionStart":882.0999984741211},{"duration":789.1999988555908,"initiatorType":"link","name":"https://issues.apache.org/jira/s/3ac36323ba5e4eb0af2aa7ac7211b4bb-CDN/xd97tr/820010/13pdxe5/efa42a25652b26dfd802540c024826b3/_/download/contextbatch/css/com.atlassian.jira.projects.sidebar.init,-_super,-jira.view.issue,-project.issue.navigator/batch.css?jira.create.linked.issue=true&richediton=true","startTime":914.8999996185303,"connectEnd":1427.8999996185303,"connectStart":915.7999992370605,"domainLookupEnd":914.8999996185303,"domainLookupStart":914.8999996185303,"fetchStart":914.8999996185303,"redirectEnd":0,"redirectStart":0,"requestStart":1428,"responseEnd":1704.099998474121,"responseStart":1698.2999992370605,"secureConnectionStart":1160.1999988555908},{"duration":1073.8999996185303,"initiatorType":"script","name":"https://issues.apache.org/jira/s/efa8931cd5ac13ed95c56ca8a1dc1967-CDN/xd97tr/820010/13pdxe5/efa42a25652b26dfd802540c024826b3/_/download/contextbatch/js/com.atlassian.jira.projects.sidebar.init,-_super,-jira.view.issue,-project.issue.navigator/batch.js?jira.create.linked.issue=true&locale=en-UK&richediton=true","startTime":915,"connectEnd":915,"connectStart":915,"domainLookupEnd":915,"domainLookupStart":915,"fetchStart":915,"redirectEnd":0,"redirectStart":0,"requestStart":1713.1999988555908,"responseEnd":1988.8999996185303,"responseStart":1987.1999988555908,"secureConnectionStart":915},{"duration":1660.2000007629395,"initiatorType":"script","name":"https://issues.apache.org/jira/s/d41d8cd98f00b204e9800998ecf8427e-CDN/xd97tr/820010/13pdxe5/1.0/_/download/batch/jira.webresources:bigpipe-js/jira.webresources:bigpipe-js.js","startTime":1132.599998474121,"connectEnd":1132.599998474121,"connectStart":1132.599998474121,"domainLookupEnd":1132.599998474121,"domainLookupStart":1132.599998474121,"fetchStart":1132.599998474121,"redirectEnd":0,"redirectStart":0,"requestStart":2523.8999996185303,"responseEnd":2792.7999992370605,"responseStart":2792.5,"secureConnectionStart":1132.599998474121},{"duration":1665,"initiatorType":"script","name":"https://issues.apache.org/jira/s/d41d8cd98f00b204e9800998ecf8427e-CDN/xd97tr/820010/13pdxe5/1.0/_/download/batch/jira.webresources:bigpipe-init/jira.webresources:bigpipe-init.js","startTime":1132.599998474121,"connectEnd":1132.599998474121,"connectStart":1132.599998474121,"domainLookupEnd":1132.599998474121,"domainLookupStart":1132.599998474121,"fetchStart":1132.599998474121,"redirectEnd":0,"redirectStart":0,"requestStart":2526.099998474121,"responseEnd":2797.599998474121,"responseStart":2797.199998855591,"secureConnectionStart":1132.599998474121},{"duration":260.70000076293945,"initiatorType":"xmlhttprequest","name":"https://issues.apache.org/jira/rest/webResources/1.0/resources","startTime":2450.699998855591,"connectEnd":2450.699998855591,"connectStart":2450.699998855591,"domainLookupEnd":2450.699998855591,"domainLookupStart":2450.699998855591,"fetchStart":2450.699998855591,"redirectEnd":0,"redirectStart":0,"requestStart":2451.7999992370605,"responseEnd":2711.3999996185303,"responseStart":2711,"secureConnectionStart":2450.699998855591}],"fetchStart":1,"domainLookupStart":1,"domainLookupEnd":1,"connectStart":2,"connectEnd":575,"secureConnectionStart":302,"requestStart":575,"responseStart":874,"responseEnd":1133,"domLoading":876,"domInteractive":4197,"domContentLoadedEventStart":4197,"domContentLoadedEventEnd":4214,"domComplete":4590,"loadEventStart":4590,"loadEventEnd":4592,"userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36","marks":[{"name":"bigPipe.sidebar-id.start","time":4181.89999961853},{"name":"bigPipe.sidebar-id.end","time":4182.39999961853},{"name":"bigPipe.activity-panel-pipe-id.start","time":4182.39999961853},{"name":"bigPipe.activity-panel-pipe-id.end","time":4183.099998474121},{"name":"activityTabFullyLoaded","time":4219.799999237061}],"measures":[],"correlationId":"ce201b4b6e0d56","effectiveType":"4g","downlink":1.15,"rtt":250,"serverDuration":110,"dbReadsTimeInMs":2,"dbConnsTimeInMs":10,"applicationHash":"ace47f9899e9ee25d7157d59aa17ab06aee30d3d","experiments":[]}}
aphyr Thanks for the bug report. My guess is that you might be hitting a race condition such as the following during a leader change:
1. Leader takes the write at offset N and parks the request in purgatory to await replication.
2. Follower acks offset N-1 and writes the data up to offset N.
3. Follower becomes new leader before it is able to ack offset N.
4. Old leader sees the leader change and returns NOT_LEADER to the still-parked produce request.
In this scenario, the new leader has the data and it may become committed even though the produce request returned NOT_LEADER. There might be a better error to return in this scenario, but we do want the producer to realize that it needs to find the new leader. I guess we could also hold the request a bit longer until we know whether or not it was committed. Failing that, we should make it clear that NOT_LEADER does not necessarily indicate a failure to write the data.
junrao Any thoughts?