Continuous Batching & Dynamic Scheduling
by Marvin Mboya
featuring State-of-the-Art LFM2-350M1
Large Language Models (LLMs) are large autoregressive2 models that, given a prompt, predict the next tokens (words or subwords) until an end token is produced. Through this sequential stochastic decoding, the response to the prompt is generated.
The pioneering paper was published by Google in 20173, revealing the first such network architecture, the Transformer. However, the first “LLM moment” came from OpenAI in 2018 with GPT-1. Since then, frontier labs have trained LLMs on billions and billions of tokens of text data. Scaling laws for Transformer-based models show performance improving with model size, data size, and compute4. Really powerful open-source models have thus been released, such as DeepSeek, Llama, GPT-OSS and Mistral5. However, their sizes limit usability on low-memory, low-compute CPU devices. Companies like NVIDIA have therefore focused on smaller specialized models, Small Language Models (SLMs), which are powerful in agentic systems6. This pioneers running capable language models on low-memory, low-compute devices. This article takes one such SLM, the Liquid Foundational Model LFM2-350M, builds its model graph from scratch in PyTorch, and assembles a CPU inference pipeline that dynamically schedules token generation. By combining conv and KV caching, dynamic scheduling, and ragged batching, the pipeline achieves about 1.2 tokens/second prefill and 45 tokens/second decode. Overall, batched inference over five prompts achieves more than a 16× speedup.
The figure above, courtesy of LiquidAI, shows that the much smaller LFM2 models perform better than bigger SLMs.
1 arXiv:2511.23404
LFM2 Technical Report
Amini et al., 2025
2
autoregressive ∼ previous time-step observations are used
to predict current time-step observation. In LLMs, as more
tokens are predicted, they are continuously appended to the
input to predict even more tokens.
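A minimal sketch of that loop with greedy decoding (encode, model, and eos_id are hypothetical placeholders, not the implementations built later in this article):
ids = encode(prompt)                       # token ids of the prompt
while ids[-1] != eos_id:                   # stop at the end token
    logits = model(ids)                    # scores over the vocabulary
    ids.append(int(logits[-1].argmax()))   # greedily append the next token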
3 arXiv:1706.03762
Attention Is All You Need
Vaswani et al., 2017
4 arXiv:2001.08361
Scaling Laws for Neural Language Models
Kaplan et al., 2020
5 Model Links
DeepSeek-R1: https://huggingface.co/deepseek-ai/DeepSeek-R1
Llama3: https://huggingface.co/meta-llama/Meta-Llama-3-8B
GPT-OSS-120B: https://huggingface.co/openai/gpt-oss-120b
Mistral: https://huggingface.co/mistralai/Mistral-7B-v0.1
6 arXiv:2506.02153
Small Language Models are the Future of Agentic AI
Belcak et al., 2025
Many thanks to Hugging Face’s awesome article on Continuous Batching, written by Reboul, Zucker, and Georges, which motivated this one!
From words to vectors
Given a prompt, “The ruler of a kingdom is a”, an LLM cannot process it in its raw string format, hence the need to convert it into a form compatible with computation.
The prompt is first split into tokens7, and each token is then uniquely mapped to an integer, resulting in a vector of integers known as token ids that an LLM can consume. This process is collectively called encoding. The algorithm used for LFMs is byte-level Byte-Pair Encoding (BBPE)8.
Let’s not build the encoder from scratch, as that’s not the article’s objective. Rather, let’s use LFM2-350M’s packaged encoding from the Hugging Face remote repository.
import os; os.system("pip install -q huggingface_hub")
from huggingface_hub import hf_hub_download

# download LFM2-350M's tokenizer definition
hf_hub_download(
    repo_id   = "LiquidAI/LFM2-350M",
    local_dir = "./",
    filename  = "tokenizer.json"
)
Let’s use the tokenizers package to load the encoding:
os.system("pip install -q -U tokenizers")
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
Now, with the tokenizer instance loaded, note that prompt inputs to LFMs require a chat template before encoding. This is implemented as:
prompt = "The ruler of a kingdom is a"

def wrap_chat(prompt):
    s = (f"<|startoftext|>"
         f"<|im_start|>user\n{prompt}<|im_end|>\n"
         f"<|im_start|>assistant\n")
    return s

wrapped_prompt = wrap_chat(prompt)
Once the prompt is wrapped, the chat tags are split out so that the tags and the plain text can be encoded separately:
SPECIALS = ['<|startoftext|>', '<|im_start|>', '<|im_end|>']
specials_to_ids = {
    t: tokenizer.token_to_id(t) for t in SPECIALS
}

import re; SPLIT_RE = re.compile(r"(<\|[^>]+?\|>)")
split_prompt = SPLIT_RE.split(wrapped_prompt)
# ...
7 token ∼ common chunks of text in a vocabulary that combine with other chunks to form words.
8 arXiv:1909.03341
Neural Machine Translation with Byte-Level Subwords
Wang et al., 2019
# ...
ids = []
for part in filter(None, split_prompt):
    if part in SPECIALS:
        ids.append(specials_to_ids[part])
    else:
        ids.extend(tokenizer.encode(part).ids[1:])
Thus, formally, prompt encoding using the encode function9 is as below:
prompt = "The ruler of a kingdom is a"
encoded_prompt = encode(
    prompt, tokenizer, specials_to_ids
)
# [1, 6, 6423, 708, 1098, 30095, 803, 768, 19662, 856,
#  768, 7, 708, 6, 64015, 708]
which gives the expected encoding10.
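As an optional roundtrip check (not in the original flow; it relies on the tokenizers library’s Tokenizer.decode), the ids can be decoded back to confirm the prompt survives inside the chat template:
# decode the token ids back to text; the chat-templated string should
# contain the original prompt verbatim
decoded = tokenizer.decode(encoded_prompt, skip_special_tokens=False)
print(decoded)
assert prompt in decoded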
Liquid Foundational Models
As SLMs were adopted for specialized tasks and agentic systems, one company, LiquidAI, set out to build efficient general-purpose AI at every scale. The second generation of LFMs, powerful multimodal SLMs, was released for edge devices with open weights11. A technical report describing the end-to-end training pipeline accompanied this release. This article builds on one of these models, the LFM2 350-million-parameter dense model. The model has a hybrid backbone combining gated short convolutions with grouped query attention blocks.
The LFM2 architecture (Fig. 2 in the LFM2 Technical Report) shows the model graph with the hybrid backbone for both dense and MoE variants. LFM2-350M is purely dense.
9 encode a prompt to a vector of integers (token ids) compatible with an LLM
def encode(prompt, tokenizer, s_ids):
    ids = []
    _wrap = wrap_chat(prompt)
    if _wrap in s_ids and "\n" not in _wrap:
        return [s_ids[_wrap]]
    _split = SPLIT_RE.split(_wrap)
    _split = filter(None, _split)
    for part in _split:
        if part in s_ids:
            ids.append(s_ids[part])
        else:
            ids.extend(tokenizer.encode(part).ids[1:])
    return ids
10 asserting the expected encoding
def run_transformers_tokenizer(prompt):
    from transformers import AutoTokenizer
    tokenizer = \
        AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M")
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

assert run_transformers_tokenizer(prompt) == encoded_prompt
11 LiquidAI Models and Technical Report
https://huggingface.co/LiquidAI
LFM2-350M Architecture Layers (diagram). An embeddings layer converts the low-dimensional vector of token ids into a high-dimensional, latent-rich token representation; tensors flow through the graph with shape (batch, seqlen, dim). For each layer i = 0, …, 15, a pre-norm RMSNorm, which ensures uniform gradients at all layers during initialization12, computed as13 $\mathrm{rmsnorm}(x) = \frac{x}{\mathrm{rms}(x)} * w + b$ with $\mathrm{rms}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}$ and $b = \vec{0}$, feeds either a Gated Short Convolution Block (linear projections, a Conv1d, and elementwise B and C gates, followed by a linear output projection) or, if i ∈ {2, 5, 8, 10, 12, 14}, a GQA block14 (Q, K, V linear projections across heads and KV groups, rotary positions applied to Q and K, repeat-interleaved K and V, SDPA, and a linear output projection). Each block is followed by a residual add, a second RMSNorm, and a (Tri) Feedforward Block (Linear, SiLU gate, Linear) with its own residual add. After layer i = 15, a final RMSNorm and a Linear head close the graph.
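To make the layer loop concrete before the individual blocks are built, here is a minimal, non-authoritative sketch of the top-level control flow. Only the layer count (16), the attention-layer indices {2, 5, 8, 10, 12, 14}, the pre-norm placement, and the residual adds are taken from the diagram; LFM2Skeleton, the stub modules, and the bias-free head are hypothetical stand-ins for what the rest of the article builds step by step.
import torch, torch.nn as nn

# hypothetical stand-ins so the skeleton runs end to end; the article
# builds the real RMSNorm, ConvBlock, GQABlock and FeedForward modules later
def _stub(d_model):
    return nn.Identity()
RMSNorm = ConvBlock = GQABlock = FeedForward = _stub

ATTN_LAYERS = {2, 5, 8, 10, 12, 14}   # GQA blocks; every other layer is conv
N_LAYERS = 16                          # i runs from 0 to 15

class LFM2Skeleton(nn.Module):
    def __init__(self, n_vocab=65_536, d_model=1_024):
        super().__init__()
        self.embedding = nn.Embedding(n_vocab, d_model, padding_idx=0)
        self.layers = nn.ModuleList()
        for i in range(N_LAYERS):
            mixer = GQABlock(d_model) if i in ATTN_LAYERS else ConvBlock(d_model)
            self.layers.append(nn.ModuleList([
                RMSNorm(d_model), mixer,                  # pre-norm + token mixer
                RMSNorm(d_model), FeedForward(d_model),   # pre-norm + gated MLP
            ]))
        self.final_norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, n_vocab, bias=False)  # bias-free head is an assumption

    def forward(self, x):                     # x: (batch, seqlen) token ids
        h = self.embedding(x)                 # (batch, seqlen, d_model)
        for norm1, mixer, norm2, ffn in self.layers:
            h = h + mixer(norm1(h))           # residual around the mixer
            h = h + ffn(norm2(h))             # residual around the feedforward
        return self.lm_head(self.final_norm(h))

logits = LFM2Skeleton()(torch.tensor([[1, 6, 6423, 708]]))
print(logits.shape)   # torch.Size([1, 4, 65536])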
12 arXiv:2002.04745
On Layer Normalization in the Transformer Architecture
Xiong et al., 2020
13 arXiv:1910.07467
Root Mean Square Layer Normalization
Zhang & Sennrich, 2019
14 arXiv:2305.13245
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Ainslie et al., 2023
Embeddings
This layer maps the vector of token ids $\in \mathbb{R}^{batch \times seqlen}$ to a higher-dimensional, latent-rich float representation $\in \mathbb{R}^{batch \times seqlen \times d_{model}}$. That is, each token id in the sequence is mapped to a vector in $\mathbb{R}^{d_{model}}$. With a token vocabulary of 65,536 for LFMs, the embedding layer initializes a lookup matrix mapping each token id to a vector of abstract dimension $d_{model}$, chosen as 1024 for LFM2-350M.
Having been one to buy Dr. Raschka’s great book15, I’d also direct you to his in-depth article on Embeddings16!
In brief, consider a tiny vocabulary of 10 tokens, and then a sequence of four tokens whose ids are as shown below:
import torch, torch.nn as nn
seq = torch.tensor([[1, 5, 9, 3]])
Now, choosing $d_{model}$ = 14, the instantiated embedding layer will be:
n_vocab = 10
d_model = 14
embedding = nn.Embedding(n_vocab, d_model)
The first token id (1) is mapped to the second row of the embedding matrix, the second id (5) to the sixth row, and so on. The resulting embedding output is a tensor $\in \mathbb{R}^{batch \times seqlen \times d_{model}}$ with batch = 1, seqlen = 4, $d_{model}$ = 14.
embed_out = embedding(seq)
print(embed_out.shape) # torch.Size([1, 4, 14])
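Since the embedding layer is nothing more than a row lookup into its weight matrix, we can verify the indexing directly (a small check added here, not part of the original listing):
# embedding(seq) is exactly the rows of embedding.weight selected by seq
assert torch.equal(embed_out, embedding.weight[seq])
print(embedding.weight[1, :4])   # start of the row that token id 1 maps to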
Now, let’s implement the Embeddings layer specifically used for
LFM2-350M
n_vocab, d_model = 65_536, 1_024
embedding = nn.Embedding(n_vocab, d_model, padding_idx=0)

encoded_prompt_d = torch.tensor(
    encoded_prompt,
    device = "cpu"   # for now
).unsqueeze(0)       # create batch dim

embed_out = embedding(encoded_prompt_d)
print(embed_out.shape) # torch.Size([1, 16, 1024])
This summarizes the mapping of token ids to higher-dimensional token embeddings17.
15 Build a Large Language Model (From Scratch)
Sebastian Raschka
16 embeddings-and-linear-layers
Sebastian Raschka
17 And the first piece of our model from scratch, given the embeddings layer, is then
class LFM2350M(nn.Module):
    def __init__(self, n_vocab, d_model):
        super().__init__()
        self.embedding = nn.Embedding(
            n_vocab,
            d_model,
            padding_idx=0
        )
    def forward(self, x):
        return self.embedding(x)

model = LFM2350M(n_vocab, d_model)
device = "cpu" # for now
model.to(device)
embed_out = model(encoded_prompt_d)
print(embed_out.shape) # torch.Size([1, 16, 1024])
Pre-norm RMSNorm
As discussed in depth by Zhang & Sennrich13, RMSNorm replaced LayerNorm as a computationally efficient way of re-scaling inputs invariantly. The normalization also gives the model an implicit learning-rate adaptation ability.
Why pre-norm? Building on the brief note in the architecture diagram12, post-norm leaves gradients poorly scaled at initialization, so smaller learning rates (and warm-up) are needed for stability and convergence is slow. Pre-norm produces well-behaved gradients, enabling stable training with faster convergence.
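The difference is simply where the normalization sits relative to the residual connection. A minimal, generic sketch (sublayer and norm are placeholders, not LFM2 modules):
def post_norm_block(x, sublayer, norm):
    # original Transformer ordering: residual add first, then normalize;
    # gradients must pass through every norm on the residual path
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # pre-norm ordering (used here): normalize only the sublayer input,
    # keeping the residual path an identity from input to output
    return x + sublayer(norm(x))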
On building RMSNorm
The equation for $\mathrm{rms}(x)$:

$$\mathrm{rms}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2}}$$
can be implemented in code as:
var = x.pow(2).mean(dim = -1, keepdim = True)
s_var = torch.sqrt(var)
Now, to implement 1/rms(x), let’s use torch’s rsqrt:
rs_var = torch.rsqrt(var)
Now, onto the next part, the overall equation:

$$\mathrm{rmsnorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2} + \epsilon}} * w + b$$
can be implemented in code as:
var_eps = 1e-5
norm_x = x * torch.rsqrt(var + var_eps)
# note that last dim of x is now 1_024
weight = torch.ones(d_model)
bias = torch.zeros(d_model)
rms_norm = norm_x * weight + bias
Note: in the final RMSNorm implementation, weight and bias are wrapped in nn.Parameter so that optimizers can update them and so that they move implicitly from the host to a device (e.g., a GPU) when the model instance is moved.
With that, the RMSNorm nn.Module subclass is done18.
18
class RMSNorm(nn.Module):
    def __init__(self, d_model, eps = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))
        self.eps = eps
    def forward(self, x):
        var = x.pow(2).mean(dim = -1, keepdim = True)
        norm_x = x * torch.rsqrt(var + self.eps)
        return norm_x * self.weight + self.bias
and the model is now:
class LFM2350M(nn.Module):
    def __init__(self, n_vocab, d_model):
        super().__init__()
        self.embedding = nn.Embedding(
            n_vocab,
            d_model,
            padding_idx=0
        )
        self.norm = RMSNorm(d_model)
    def forward(self, x):
        x = self.embedding(x)
        return self.norm(x)
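As an optional sanity check (an addition here, assuming PyTorch ≥ 2.4, which ships a built-in torch.nn.RMSNorm), the from-scratch layer can be compared against the reference implementation:
# compare the scratch RMSNorm against PyTorch's built-in reference
torch.manual_seed(0)
x = torch.randn(1, 16, d_model)

scratch = RMSNorm(d_model)               # our implementation
builtin = nn.RMSNorm(d_model, eps=1e-5)  # available in PyTorch >= 2.4

# with weight = 1 and bias = 0 (the defaults), both should agree
assert torch.allclose(scratch(x), builtin(x), atol=1e-6)
print("RMSNorm matches torch.nn.RMSNorm")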