
Implementation of LFM2-350M Dense model within a Continuous Batching and Dynamic Scheduling inference pipeline


marvinmboya/LFMs-continuous-batching


Continuous Batching & Dynamic Scheduling

By Marvin Mboya | Featuring State-of-the-Art LFM2-350M

Technical Article Documentation
Large Language Models (LLMs) are autoregressive models that, given a prompt, repeatedly predict the next token (a word or sub-word) until an end-of-sequence token is produced. The response is generated through this sequential, stochastic decoding.

Definition: Autoregressive means that observations from previous time steps are used to predict the observation at the current time step. In LLMs, each newly predicted token is appended to the input, where it conditions the prediction of the tokens that follow.
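
A minimal sketch of this loop, assuming a hypothetical `model` callable that maps a (1, seq_len) tensor of token IDs to next-token logits (the names here are illustrative, not this repository's API):

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, eos_id, max_new_tokens=64, temperature=1.0):
    """Autoregressive decoding: every predicted token is appended to the
    input before the next prediction. `model` is assumed to return logits
    of shape (1, seq_len, vocab_size) for a (1, seq_len) ID tensor."""
    ids = prompt_ids.clone()                     # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]            # logits at the last position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1)    # stochastic (sampled) decoding
        ids = torch.cat([ids, next_id], dim=-1)  # feed the prediction back in
        if next_id.item() == eos_id:             # stop at the end token
            break
    return ids
```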

The pioneering paper, published by Google in 2017, introduced the Transformer architecture that underlies today's LLMs. The first "LLM moment", however, came from OpenAI in 2018 with GPT-1. Since then, frontier labs have trained LLMs on vast amounts of text, and scaling laws for Transformer-based models show performance improving with model size, data size, and compute. Powerful open-source models such as DeepSeek, Llama, GPT-OSS, and Mistral have thus been released; however, their size limits usability on low-memory, low-compute CPU devices. Companies like NVIDIA have therefore focused on smaller, specialized models, Small Language Models (SLMs), which are effective in agentic systems and bring capable language models to low-memory, low-compute devices. This article takes one such SLM, the Liquid Foundation Model LFM2-350M, builds its model graph from scratch, and implements a CPU inference pipeline that dynamically schedules token generation.
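
For a reference point, the stock model can be run on CPU through the Hugging Face transformers library. The snippet below is the off-the-shelf path rather than the from-scratch graph built in this project, and it assumes the hub repo ID LiquidAI/LFM2-350M and a transformers release that supports the LFM2 architecture:

```python
# Baseline CPU inference via the stock Hugging Face implementation.
# Assumes the hub repo ID "LiquidAI/LFM2-350M" and a recent `transformers`
# release with LFM2 support; the from-scratch graph in this project
# replaces this path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M")
model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-350M", torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("Continuous batching lets an inference server", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```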

Implementation & Results

By building the graph from scratch in PyTorch and combining conv and KV caching, dynamic scheduling, and ragged batching (a sketch of the scheduling loop follows the list below), the pipeline achieves:

  • Prefill: ~1.2 tokens/second
  • Decode: 45 tokens/second
  • Overall: more than a 16× throughput improvement when batching five prompts.
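
A minimal sketch of the dynamic-scheduling idea behind these numbers, assuming a hypothetical model_step function that decodes one token for every active sequence while each sequence carries its own conv/KV cache (all names are illustrative; the repository's scheduler is more involved):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sequence:
    """One request: its prompt length, token IDs so far, and its own cache."""
    prompt_len: int
    ids: list
    cache: object = None   # per-sequence conv/KV cache (opaque here)

def continuous_batch_decode(model_step, prompts, eos_id, max_batch=4, max_new=64):
    """Continuous batching with dynamic scheduling: finished sequences leave
    the batch immediately and waiting prompts are admitted in their place.
    `model_step(active)` is assumed to return one next-token ID per active
    sequence and to update each sequence's cache in place."""
    waiting = deque(Sequence(prompt_len=len(p), ids=list(p)) for p in prompts)
    active, finished = [], []
    while waiting or active:
        # Admit waiting requests whenever a batch slot is free.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step over the ragged batch: sequences may have different
        # lengths because each one keeps its own cache, so no padding is needed.
        next_ids = model_step(active)
        still_active = []
        for seq, tok in zip(active, next_ids):
            seq.ids.append(tok)
            generated = len(seq.ids) - seq.prompt_len
            if tok == eos_id or generated >= max_new:
                finished.append(seq)    # evict immediately, freeing the slot
            else:
                still_active.append(seq)
        active = still_active
    return finished
```

Because admission and eviction happen at every step, batch slots are never wasted on sequences that have already finished, which is where the batched-throughput gain comes from.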

References & Links

Research Papers

Acknowledgments

Many thanks to LiquidAI for open-sourcing the powerful LFM models, and to the Hugging Face team for the article on Continuous Batching by Reboul, Zucker, and Georges, which motivated the writing of this article!
