Continuous Batching & Dynamic Scheduling
by Marvin Mboya
featuring State-of-the-Art LFM2-350M1
Large Language Models (LLMs) are large autoregressive2 models that, given a prompt, predict the next tokens (words or subwords) until an end token is produced. Through this sequential stochastic decoding, the response to the prompt is generated.
The pioneering paper was published by Google in 20173, revealing the first such network architecture, the Transformer. However, the first “LLM moment” came from OpenAI in 2018 with GPT-1. Since then, frontier labs have trained LLMs on billions and billions of tokens of text data. Scaling laws for Transformer-based models show performance improving with model size, data size, and compute4. Really powerful open-source models have thus been released, such as DeepSeek, Llama, GPT-OSS and Mistral5. However, their sizes limit usability on low-memory, low-compute CPU devices. Companies like NVIDIA have therefore focused on smaller specialized models, Small Language Models (SLMs), which are powerful in agentic systems6. This pioneers running capable language models on low-memory, low-compute devices. This article takes one such SLM, the Liquid Foundational Model LFM2-350M, builds its model graph from scratch in PyTorch, and assembles a CPU inference pipeline that dynamically schedules token generation. By combining conv and KV caching, dynamic scheduling, and ragged batching, the pipeline achieves about 1.2 tokens/second prefill and 45 tokens/second decode. Overall, batched inference over five prompts achieves more than a 16× speedup.
The figure above, courtesy of LiquidAI, shows that the much smaller LFM2 models perform better than bigger SLMs.
1 arXiv:2511.23404
LFM2 Technical Report
Amini et al., 2025
2
autoregressive ∼ previous time-step observations are used
to predict current time-step observation. In LLMs, as more
tokens are predicted, they are continuously appended to the
input to predict even more tokens.
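A minimal sketch of that loop with greedy decoding (encode, model, and eos_id are hypothetical placeholders, not the implementations built later in this article):
ids = encode(prompt)                       # token ids of the prompt
while ids[-1] != eos_id:                   # stop at the end token
    logits = model(ids)                    # scores over the vocabulary
    ids.append(int(logits[-1].argmax()))   # greedily append the next token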
3 arXiv:1706.03762
Attention Is All You Need
Vaswani et al., 2017
4 arXiv:2001.08361
Scaling Laws for Neural Language Models
Kaplan et al., 2020
5 Model Links
DeepSeek-R1: https://huggingface.co/deepseek-ai/DeepSeek-R1
Llama3: https://huggingface.co/meta-llama/Meta-Llama-3-8B
GPT-OSS-120B: https://huggingface.co/openai/gpt-oss-120b
Mistral: https://huggingface.co/mistralai/Mistral-7B-v0.1
6 arXiv:2506.02153
Small Language Models are the Future of Agentic AI
Belcak et al., 2025
Many thanks to Hugging Face’s awesome article on Continuous Batching, written by Reboul, Zucker, and Georges, which motivated this one!
From words to vectors
Given a prompt, “The ruler of a kingdom is a”, an LLM cannot process it in its raw string format, hence the need to convert it into a form compatible with computation.
The prompt is first split into tokens7, and each token is then uniquely mapped to an integer, resulting in a vector of integers known as token ids that an LLM can consume. This process is collectively called encoding. The algorithm used for LFMs is byte-level Byte-Pair Encoding (BBPE)8.
Let’s not build the encoder from scratch, as that’s not the article’s objective. Rather, let’s use LFM2-350M’s packaged encoding from the Hugging Face remote repository.
import os; os.system("pip install -q huggingface_hub")
from huggingface_hub import hf_hub_download

# download LFM2-350M's tokenizer definition
hf_hub_download(
    repo_id   = "LiquidAI/LFM2-350M",
    local_dir = "./",
    filename  = "tokenizer.json"
)
Let’s use the tokenizers package to load the encoding:
os.system("pip install -q -U tokenizers")
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
Now, with the tokenizer instance loaded, note that prompt inputs to LFMs require a chat template before encoding. This is implemented as:
prompt = "The ruler of a kingdom is a"

def wrap_chat(prompt):
    s = (f"<|startoftext|>"
         f"<|im_start|>user\n{prompt}<|im_end|>\n"
         f"<|im_start|>assistant\n")
    return s

wrapped_prompt = wrap_chat(prompt)
Once the prompt is wrapped, the chat tags are split out so that the tags and the plain text can be encoded separately:
SPECIALS = ['<|startoftext|>', '<|im_start|>', '<|im_end|>']
specials_to_ids = {
    t: tokenizer.token_to_id(t) for t in SPECIALS
}

import re; SPLIT_RE = re.compile(r"(<\|[^>]+?\|>)")
split_prompt = SPLIT_RE.split(wrapped_prompt)
# ...
7 token ∼ common chunks of text in a vocabulary that combine with other chunks to form words.
8 arXiv:1909.03341
Neural Machine Translation with Byte-Level Subwords
Wang et al., 2019
# ...
ids = []
for part in filter(None, split_prompt):
    if part in SPECIALS:
        ids.append(specials_to_ids[part])
    else:
        ids.extend(tokenizer.encode(part).ids[1:])
Thus, formally, prompt encoding using the encode function9 is as below:
prompt = "The ruler of a kingdom is a"
encoded_prompt = encode(
    prompt, tokenizer, specials_to_ids
)
# [1, 6, 6423, 708, 1098, 30095, 803, 768, 19662, 856,
#  768, 7, 708, 6, 64015, 708]
which gives the expected encoding10.
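As an optional roundtrip check (not in the original flow; it relies on the tokenizers library’s Tokenizer.decode), the ids can be decoded back to confirm the prompt survives inside the chat template:
# decode the token ids back to text; the chat-templated string should
# contain the original prompt verbatim
decoded = tokenizer.decode(encoded_prompt, skip_special_tokens=False)
print(decoded)
assert prompt in decoded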
Liquid Foundational Models
As SLMs were adopted for specialized tasks and agentic systems, one company, LiquidAI, set out to build efficient general-purpose AI at every scale. The second generation of LFMs, powerful multimodal SLMs, was released for edge devices with open weights11. A technical report describing the end-to-end training pipeline accompanied this release. This article builds on one of these models, the LFM2 350-million-parameter dense model. The model has a hybrid backbone combining gated short convolutions with grouped query attention blocks.
The LFM2 architecture (Fig. 2 in the LFM2 Technical Report) shows the model graph with the hybrid backbone for both dense and MoE variants. LFM2-350M is purely dense.
9 encode a prompt to a vector of integers (token ids) compatible with an LLM
def encode(prompt, tokenizer, s_ids):
    ids = []
    _wrap = wrap_chat(prompt)
    if _wrap in s_ids and "\n" not in _wrap:
        return [s_ids[_wrap]]
    _split = SPLIT_RE.split(_wrap)
    _split = filter(None, _split)
    for part in _split:
        if part in s_ids:
            ids.append(s_ids[part])
        else:
            ids.extend(tokenizer.encode(part).ids[1:])
    return ids
10 asserting the expected encoding
def run_transformers_tokenizer(prompt):
    from transformers import AutoTokenizer
    tokenizer = \
        AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M")
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

assert run_transformers_tokenizer(prompt) == encoded_prompt
11 LiquidAI Models and Technical Report
https://huggingface.co/LiquidAI
LFM2-350M Architecture Layers (diagram). An embeddings layer converts the low-dimensional vector of token ids into a high-dimensional, latent-rich token representation; tensors flow through the graph with shape (batch, seqlen, dim). For each layer i = 0, …, 15, a pre-norm RMSNorm, which ensures uniform gradients at all layers during initialization12, computed as13 $\mathrm{rmsnorm}(x) = \frac{x}{\mathrm{rms}(x)} * w + b$ with $\mathrm{rms}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}$ and $b = \vec{0}$, feeds either a Gated Short Convolution Block (linear projections, a Conv1d, and elementwise B and C gates, followed by a linear output projection) or, if i ∈ {2, 5, 8, 10, 12, 14}, a GQA block14 (Q, K, V linear projections across heads and KV groups, rotary positions applied to Q and K, repeat-interleaved K and V, SDPA, and a linear output projection). Each block is followed by a residual add, a second RMSNorm, and a (Tri) Feedforward Block (Linear, SiLU gate, Linear) with its own residual add. After layer i = 15, a final RMSNorm and a Linear head close the graph.
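To make the layer loop concrete before the individual blocks are built, here is a minimal, non-authoritative sketch of the top-level control flow. Only the layer count (16), the attention-layer indices {2, 5, 8, 10, 12, 14}, the pre-norm placement, and the residual adds are taken from the diagram; LFM2Skeleton, the stub modules, and the bias-free head are hypothetical stand-ins for what the rest of the article builds step by step.
import torch, torch.nn as nn

# hypothetical stand-ins so the skeleton runs end to end; the article
# builds the real RMSNorm, ConvBlock, GQABlock and FeedForward modules later
def _stub(d_model):
    return nn.Identity()
RMSNorm = ConvBlock = GQABlock = FeedForward = _stub

ATTN_LAYERS = {2, 5, 8, 10, 12, 14}   # GQA blocks; every other layer is conv
N_LAYERS = 16                          # i runs from 0 to 15

class LFM2Skeleton(nn.Module):
    def __init__(self, n_vocab=65_536, d_model=1_024):
        super().__init__()
        self.embedding = nn.Embedding(n_vocab, d_model, padding_idx=0)
        self.layers = nn.ModuleList()
        for i in range(N_LAYERS):
            mixer = GQABlock(d_model) if i in ATTN_LAYERS else ConvBlock(d_model)
            self.layers.append(nn.ModuleList([
                RMSNorm(d_model), mixer,                  # pre-norm + token mixer
                RMSNorm(d_model), FeedForward(d_model),   # pre-norm + gated MLP
            ]))
        self.final_norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, n_vocab, bias=False)  # bias-free head is an assumption

    def forward(self, x):                     # x: (batch, seqlen) token ids
        h = self.embedding(x)                 # (batch, seqlen, d_model)
        for norm1, mixer, norm2, ffn in self.layers:
            h = h + mixer(norm1(h))           # residual around the mixer
            h = h + ffn(norm2(h))             # residual around the feedforward
        return self.lm_head(self.final_norm(h))

logits = LFM2Skeleton()(torch.tensor([[1, 6, 6423, 708]]))
print(logits.shape)   # torch.Size([1, 4, 65536])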
12 arXiv:2002.04745
On Layer Normalization in the Transformer Architecture
Xiong et al., 2020
13 arXiv:1910.07467
Root Mean Square Layer Normalization
Zhang & Sennrich, 2019
14 arXiv:2305.13245
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Ainslie et al., 2023
Embeddings
This layer maps the vector of token ids $\in \mathbb{R}^{batch \times seqlen}$ to a higher-dimensional, latent-rich float representation $\in \mathbb{R}^{batch \times seqlen \times d_{model}}$. That is, each token id in the sequence is mapped to a vector in $\mathbb{R}^{d_{model}}$. With a token vocabulary of 65,536 for LFMs, the embedding layer initializes a lookup matrix mapping each token id to a vector of abstract dimension $d_{model}$, chosen as 1024 for LFM2-350M.
Having been one to buy Dr. Raschka’s great book15, I’d also direct you to his in-depth article on Embeddings16!
In brief, consider a tiny vocabulary of 10 tokens, and then a sequence of four tokens whose ids are as shown below:
import torch, torch.nn as nn
seq = torch.tensor([[1, 5, 9, 3]])
Now, choosing $d_{model}$ = 14, the instantiated embedding layer will be:
n_vocab = 10
d_model = 14
embedding = nn.Embedding(n_vocab, d_model)
The first token id (1) is mapped to the second row of the embedding matrix, the second id (5) to the sixth row, and so on. The resulting embedding output is a tensor $\in \mathbb{R}^{batch \times seqlen \times d_{model}}$ with batch = 1, seqlen = 4, $d_{model}$ = 14.
embed_out = embedding(seq)
print(embed_out.shape) # torch.Size([1, 4, 14])
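Since the embedding layer is nothing more than a row lookup into its weight matrix, we can verify the indexing directly (a small check added here, not part of the original listing):
# embedding(seq) is exactly the rows of embedding.weight selected by seq
assert torch.equal(embed_out, embedding.weight[seq])
print(embedding.weight[1, :4])   # start of the row that token id 1 maps to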
Now, let’s implement the Embeddings layer specifically used for
LFM2-350M
n_vocab, d_model = 65_536, 1_024
embedding = nn.Embedding(n_vocab, d_model, padding_idx=0)

encoded_prompt_d = torch.tensor(
    encoded_prompt,
    device = "cpu"   # for now
).unsqueeze(0)       # create batch dim

embed_out = embedding(encoded_prompt_d)
print(embed_out.shape) # torch.Size([1, 16, 1024])
This summarizes the mapping of token ids to higher-dimensional token embeddings17.
15 Build a Large Language Model (From Scratch)
Sebastian Raschka
16 embeddings-and-linear-layers
Sebastian Raschka
17 And the first piece of our model from scratch, given the embeddings layer, is then
class LFM2350M(nn.Module):
    def __init__(self, n_vocab, d_model):
        super().__init__()
        self.embedding = nn.Embedding(
            n_vocab,
            d_model,
            padding_idx=0
        )
    def forward(self, x):
        return self.embedding(x)

model = LFM2350M(n_vocab, d_model)
device = "cpu" # for now
model.to(device)
embed_out = model(encoded_prompt_d)
print(embed_out.shape) # torch.Size([1, 16, 1024])
Pre-norm RMSNorm
As discussed in depth by Zhang & Sennrich13, RMSNorm replaced LayerNorm as a computationally efficient way of re-scaling inputs invariantly. The normalization also gives the model an implicit learning-rate adaptation ability.
Why pre-norm? Building on the brief note in the architecture diagram12, post-norm leaves gradients poorly scaled at initialization, so smaller learning rates (and warm-up) are needed for stability and convergence is slow. Pre-norm produces well-behaved gradients, enabling stable training with faster convergence.
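The difference is simply where the normalization sits relative to the residual connection. A minimal, generic sketch (sublayer and norm are placeholders, not LFM2 modules):
def post_norm_block(x, sublayer, norm):
    # original Transformer ordering: residual add first, then normalize;
    # gradients must pass through every norm on the residual path
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # pre-norm ordering (used here): normalize only the sublayer input,
    # keeping the residual path an identity from input to output
    return x + sublayer(norm(x))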
On building RMSNorm
The equation for $\mathrm{rms}(x)$:

$$\mathrm{rms}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2}}$$
can be implemented in code as:
var = x.pow(2).mean(dim = -1, keepdim = True)
s_var = torch.sqrt(var)
Now, to implement 1/rms(x), let’s use torch’s rsqrt:
rs_var = torch.rsqrt(var)
Now, onto the next part, the overall equation:

$$\mathrm{rmsnorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2} + \epsilon}} * w + b$$
can be implemented in code as:
var_eps = 1e-5
norm_x = x * torch.rsqrt(var + var_eps)
# note that last dim of x is now 1_024
weight = torch.ones(d_model)
bias = torch.zeros(d_model)
rms_norm = norm_x * weight + bias
Note: in the final RMSNorm implementation, weight and bias are wrapped in nn.Parameter so that optimizers can update them and so that they move implicitly from the host to a device (e.g., a GPU) when the model instance is moved.
With that, the RMSNorm nn.Module subclass is done18.
18
class RMSNorm(nn.Module):
    def __init__(self, d_model, eps = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))
        self.eps = eps
    def forward(self, x):
        var = x.pow(2).mean(dim = -1, keepdim = True)
        norm_x = x * torch.rsqrt(var + self.eps)
        return norm_x * self.weight + self.bias
and the model is now:
class LFM2350M(nn.Module):
    def __init__(self, n_vocab, d_model):
        super().__init__()
        self.embedding = nn.Embedding(
            n_vocab,
            d_model,
            padding_idx=0
        )
        self.norm = RMSNorm(d_model)
    def forward(self, x):
        x = self.embedding(x)
        return self.norm(x)
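As an optional sanity check (an addition here, assuming PyTorch ≥ 2.4, which ships a built-in torch.nn.RMSNorm), the from-scratch layer can be compared against the reference implementation:
# compare the scratch RMSNorm against PyTorch's built-in reference
torch.manual_seed(0)
x = torch.randn(1, 16, d_model)

scratch = RMSNorm(d_model)               # our implementation
builtin = nn.RMSNorm(d_model, eps=1e-5)  # available in PyTorch >= 2.4

# with weight = 1 and bias = 0 (the defaults), both should agree
assert torch.allclose(scratch(x), builtin(x), atol=1e-6)
print("RMSNorm matches torch.nn.RMSNorm")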