The point the author makes is: partitions are, for many users, an implementation detail of today’s Kafka as a distributed system that ideally should not be exposed to producers and consumers. As a producer or consumer of events, everything should work the same way before and after the suggested change of “no partitions”.
Partitions are one of the most common confusions, particularly for Kafka newcomers. “What is a partiton? Why do I need one, and if I do, how many? If I picked the wrong number, how can I change that? What happens to existing data in my Kafka topic, and how can I migrate the data?”
And when developing apps that process the data: “Oh, the parallelism of my app is bound by the number of partitions? Why didn’t anyone tell me? Now how can I… (see above)”
A common problem is: I need global ordering for my data and thus have to use a topic with a single partition, but now the parallelism of any downstream app that consumes that data is limited to 1 (you can have many separate apps, but each individual app is limited to a parallelism of 1). This limitation sucks for larger volumes of data, because your app may fail to consume data as fast as it is coming in. So one starts to come up with workarounds that depend on the specific use case and the specific shape of the data. Honestly, this is very cumbersome.
Beyond that example, at least for Kafka as of today, partitions are cheap in Kafka. You can have millions of partitons in a cluster, and it will be fine. So people tend to over-partition their Kafka topics to future-proof them “just in case” because changing a topic‘s partition number is so cumbersome in practice. Still, this is a problem area that you ideally don’t want to have in the first place.
Doing away with partitions would be a tremendous improvement for Kafka‘s developer experience, in my personal opinion.
(Disclaimer: I spent many years working with+on Kafka and at Confluent, and part of that work was educating and explaining masses of developers how Kafka works and how to use it. Similar to the author of the linked post, my #1 wishlist feature for Kafka would be getting rid of partitons.)
ok, I think I get it. It’s removing an implementation detail from needing input.
Not a kafka expert, but I had problems in the past, where a, sort of, event sourcing model depends on messages in keys being in order, and works great. Things aren’t over or under partitioned, but if a key becomes hot, it doesn’t matter how many partitions you have, because that key needs to be processed in order.
So the ordering that made the model such a great fit for Kafka also meant your parallelism is reduced to one sometimes when it really counts. So you end up having to understand the computation model when you design your system regardless.
Yeah, exactly. Issues resulting from hot partitions or hot keys are another good example of the problem space created by the partition concept, as you said.
I don’t understand that whole paragraph in the post either. I think the author is thinking of one special case where partitioning is not great (not saying it’s great but in practice it has usually not annoyed me too much).
Interesting!
How would key level ordering be different than partitions? Like as a producer or consumer of events, would anything be different?
The point the author makes is: partitions are, for many users, an implementation detail of today’s Kafka as a distributed system that ideally should not be exposed to producers and consumers. As a producer or consumer of events, everything should work the same way before and after the suggested change of “no partitions”.
Partitions are one of the most common confusions, particularly for Kafka newcomers. “What is a partiton? Why do I need one, and if I do, how many? If I picked the wrong number, how can I change that? What happens to existing data in my Kafka topic, and how can I migrate the data?”
And when developing apps that process the data: “Oh, the parallelism of my app is bound by the number of partitions? Why didn’t anyone tell me? Now how can I… (see above)”
A common problem is: I need global ordering for my data and thus have to use a topic with a single partition, but now the parallelism of any downstream app that consumes that data is limited to 1 (you can have many separate apps, but each individual app is limited to a parallelism of 1). This limitation sucks for larger volumes of data, because your app may fail to consume data as fast as it is coming in. So one starts to come up with workarounds that depend on the specific use case and the specific shape of the data. Honestly, this is very cumbersome.
Beyond that example, at least for Kafka as of today, partitions are cheap in Kafka. You can have millions of partitons in a cluster, and it will be fine. So people tend to over-partition their Kafka topics to future-proof them “just in case” because changing a topic‘s partition number is so cumbersome in practice. Still, this is a problem area that you ideally don’t want to have in the first place.
Doing away with partitions would be a tremendous improvement for Kafka‘s developer experience, in my personal opinion.
(Disclaimer: I spent many years working with+on Kafka and at Confluent, and part of that work was educating and explaining masses of developers how Kafka works and how to use it. Similar to the author of the linked post, my #1 wishlist feature for Kafka would be getting rid of partitons.)
ok, I think I get it. It’s removing an implementation detail from needing input.
Not a kafka expert, but I had problems in the past, where a, sort of, event sourcing model depends on messages in keys being in order, and works great. Things aren’t over or under partitioned, but if a key becomes hot, it doesn’t matter how many partitions you have, because that key needs to be processed in order.
So the ordering that made the model such a great fit for Kafka also meant your parallelism is reduced to one sometimes when it really counts. So you end up having to understand the computation model when you design your system regardless.
Yeah, exactly. Issues resulting from hot partitions or hot keys are another good example of the problem space created by the partition concept, as you said.
I don’t understand that whole paragraph in the post either. I think the author is thinking of one special case where partitioning is not great (not saying it’s great but in practice it has usually not annoyed me too much).