
Implementation of LFM2-350M Dense model within a Continuous Batching and Dynamic Scheduling inference pipeline


marvinmboya/LFMs-continuous-batching


Continuous Batching & Dynamic Scheduling

By Marvin Mboya | Featuring State-of-the-Art LFM2-350M

Technical Article Documentation
Large Language Models (LLMs) are autoregressive models that, given a prompt, repeatedly predict the next token (a word or sub-word) until an end-of-sequence token is produced. The response is generated through this sequential, stochastic decoding.

Definition: Autoregressive means that observations from previous time steps are used to predict the observation at the current time step. In LLMs, each newly predicted token is appended to the input, where it conditions the prediction of the tokens that follow.
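
A minimal sketch of this loop, assuming a hypothetical `model` callable that maps a (1, seq_len) tensor of token IDs to next-token logits (the names here are illustrative, not this repository's API):

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, eos_id, max_new_tokens=64, temperature=1.0):
    """Autoregressive decoding: every predicted token is appended to the
    input before the next prediction. `model` is assumed to return logits
    of shape (1, seq_len, vocab_size) for a (1, seq_len) ID tensor."""
    ids = prompt_ids.clone()                     # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]            # logits at the last position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1)    # stochastic (sampled) decoding
        ids = torch.cat([ids, next_id], dim=-1)  # feed the prediction back in
        if next_id.item() == eos_id:             # stop at the end token
            break
    return ids
```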

The pioneering paper, published by Google in 2017, introduced the Transformer architecture that underlies today's LLMs. The first "LLM moment", however, came from OpenAI in 2018 with GPT-1. Since then, frontier labs have trained LLMs on vast amounts of text, and scaling laws for Transformer-based models show performance improving with model size, data size, and compute. Powerful open-source models such as DeepSeek, Llama, GPT-OSS, and Mistral have thus been released; however, their size limits usability on low-memory, low-compute CPU devices. Companies like NVIDIA have therefore focused on smaller, specialized models, Small Language Models (SLMs), which are effective in agentic systems and bring capable language models to low-memory, low-compute devices. This article takes one such SLM, the Liquid Foundation Model LFM2-350M, builds its model graph from scratch, and implements a CPU inference pipeline that dynamically schedules token generation.
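
For a reference point, the stock model can be run on CPU through the Hugging Face transformers library. The snippet below is the off-the-shelf path rather than the from-scratch graph built in this project, and it assumes the hub repo ID LiquidAI/LFM2-350M and a transformers release that supports the LFM2 architecture:

```python
# Baseline CPU inference via the stock Hugging Face implementation.
# Assumes the hub repo ID "LiquidAI/LFM2-350M" and a recent `transformers`
# release with LFM2 support; the from-scratch graph in this project
# replaces this path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M")
model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-350M", torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("Continuous batching lets an inference server", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```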

Implementation & Results

By building the graph from scratch in PyTorch and combining conv and KV caching, dynamic scheduling, and ragged batching (a sketch of the scheduling loop follows the list below), the pipeline achieves:

  • Prefill: ~1.2 tokens/second
  • Decode: 45 tokens/second
  • Overall: more than a 16× throughput improvement when batching five prompts.
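
A minimal sketch of the dynamic-scheduling idea behind these numbers, assuming a hypothetical model_step function that decodes one token for every active sequence while each sequence carries its own conv/KV cache (all names are illustrative; the repository's scheduler is more involved):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sequence:
    """One request: its prompt length, token IDs so far, and its own cache."""
    prompt_len: int
    ids: list
    cache: object = None   # per-sequence conv/KV cache (opaque here)

def continuous_batch_decode(model_step, prompts, eos_id, max_batch=4, max_new=64):
    """Continuous batching with dynamic scheduling: finished sequences leave
    the batch immediately and waiting prompts are admitted in their place.
    `model_step(active)` is assumed to return one next-token ID per active
    sequence and to update each sequence's cache in place."""
    waiting = deque(Sequence(prompt_len=len(p), ids=list(p)) for p in prompts)
    active, finished = [], []
    while waiting or active:
        # Admit waiting requests whenever a batch slot is free.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step over the ragged batch: sequences may have different
        # lengths because each one keeps its own cache, so no padding is needed.
        next_ids = model_step(active)
        still_active = []
        for seq, tok in zip(active, next_ids):
            seq.ids.append(tok)
            generated = len(seq.ids) - seq.prompt_len
            if tok == eos_id or generated >= max_new:
                finished.append(seq)    # evict immediately, freeing the slot
            else:
                still_active.append(seq)
        active = still_active
    return finished
```

Because admission and eviction happen at every step, batch slots are never wasted on sequences that have already finished, which is where the batched-throughput gain comes from.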

References & Links

Research Papers

Acknowledgments

Many thanks to LiquidAI for open-sourcing the powerful LFM models, and to the Hugging Face team for the article on Continuous Batching by Reboul, Zucker, and Georges, which motivated the writing of this article!
