Discussion about this post

Evan R.

You've done truly impressive reverse engineering and benchmarking here. I'm interested in your comment "The ideal LLM inference strategy on M4 is hybrid: prefill (large batch, high throughput) on ANE, decode (single token, latency-sensitive) on SME."

It seems to me that the KV cache will not fit in on-chip SRAM for any but the very smallest LLMs. I would therefore expect the KV cache to be stored in DRAM for Apple silicon. The KV cache has to be read in full for each token output during the decode phase of LLM inference. Since Apple's GPU has access to more DRAM bandwidth than SME, I would expect the decode phase of LLM inference to be done by Apple's GPU. I agree the prefill phase of LLM inference could be done on the ANE.
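To put rough numbers on the SRAM claim: a minimal back-of-envelope sketch (my figures, not from the post), using a Llama-3-8B-like configuration (32 layers, 8 KV heads via GQA, head dim 128, fp16) as an illustrative assumption:

```python
# KV cache size per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

# Llama-3-8B-like config in fp16 (assumed, for illustration only)
per_token = kv_cache_bytes(32, 8, 128, 1)       # 131072 bytes = 128 KiB/token
cache_8k = kv_cache_bytes(32, 8, 128, 8192)     # 2**30 bytes = 1 GiB at 8k context
print(per_token, cache_8k / 2**30)
```

Even this mid-sized model needs on the order of a gigabyte of KV cache at an 8k context, orders of magnitude beyond any on-chip SRAM, so the cache lives in DRAM and decode becomes a DRAM-bandwidth problem.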

Vol. 7 in the link below contains Maynard Handley's 122 page description of the ANE based on his analysis of Apple patents.

https://github.com/name99-org/AArch64-Explore

manjeet singh

Thanks for the follow-up. I agree that the full KV cache for any decently sized model can't stay in SRAM, so doing decode on the CPU/GPU makes more sense.
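A quick roofline sketch of why decode throughput tracks DRAM bandwidth (my assumptions, not measured numbers): each decoded token must stream the full weights plus the KV cache, so tokens/sec is roughly bandwidth divided by bytes read per step. The 120 GB/s figure below is an assumed base-M4 bandwidth for illustration.

```python
# Bandwidth-bound decode: tokens/sec ~= DRAM bandwidth / bytes read per token.
def decode_tok_per_s(bandwidth_gb_s, weight_bytes, kv_bytes_per_step):
    return bandwidth_gb_s * 1e9 / (weight_bytes + kv_bytes_per_step)

weights = 8e9 * 2            # assumed 8B params in fp16 ~= 16 GB
kv = 131072 * 4096           # 128 KiB/token of KV at a 4k context ~= 0.5 GiB
print(decode_tok_per_s(120, weights, kv))   # single-digit tokens/sec
```

Under these assumptions the engine with the most usable DRAM bandwidth wins decode, regardless of its compute peak, which is exactly the argument for routing decode to the GPU.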

