# engineering insights from humans && AIs


# Don't Sleep on Fast Inference

2025-12-03 | @miguel-rios-berrios

Everyone's talking about the latest frontier models. Claude Opus 4.5. Gemini 3.0 Pro. GPT 5.1. The discourse is all about which model is smarter, which one reasons better, which one will finally pass that one benchmark.

But while everyone's focused on SOTA, something equally important is happening in the fast inference space. And I think most people are sleeping on it.

## The Speed Gap

When you call Claude or ChatGPT, you're getting maybe 80-100 tokens per second. That's fine for chat or async tasks. But what happens when you're building something that needs multiple LLM calls in sequence?

| Scenario | Traditional (80 TPS) | Fast Inference (2,000 TPS) |
| --- | --- | --- |
| Single response (500 tokens) | 6 seconds | 0.25 seconds |
| 4-step agent workflow | 25+ seconds | 1 second |
| 10 research queries | 60+ seconds | 3 seconds |

The math is simple but the implications are huge. At 80 TPS, agents feel slow. At 2,000 TPS, agents feel instant.
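The arithmetic is worth spelling out once. Here's a quick sketch of where the table's numbers come from; the per-step token counts are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope math behind the table above. The per-step token
# counts are illustrative assumptions, not measurements.

def generation_seconds(tokens: int, tps: float) -> float:
    """Time to emit `tokens` output tokens at a given tokens-per-second rate."""
    return tokens / tps

scenarios = {
    "Single response": 500,            # ~500 output tokens
    "4-step agent workflow": 4 * 500,  # assume ~500 tokens per step
    "10 research queries": 10 * 500,
}

for name, tokens in scenarios.items():
    slow = generation_seconds(tokens, 80)     # typical hosted frontier model
    fast = generation_seconds(tokens, 2_000)  # fast-inference provider
    print(f"{name}: {slow:.1f}s at 80 TPS vs {fast:.2f}s at 2,000 TPS")
```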

## Who's Doing This?

I'm aware of three companies in the fast inference space, each with very different approaches. I'm not an expert in chip architecture, but here's what I've learned about how they make this happen:

  • Cerebras built a chip the size of a dinner plate. Literally. Their wafer-scale approach eliminates the memory bottleneck that slows down GPUs. Result: 3,000+ TPS on large models.

  • Groq designed custom silicon specifically for LLM inference. Their LPU (Language Processing Unit) delivers predictable, consistent performance. Result: 500-800+ TPS with rock-solid latency.

  • SambaNova focused on running massive models efficiently. Their architecture can host the full 671B parameter DeepSeek R1 on a single rack, something that would take 40 racks of GPUs. Result: the biggest models, running fast.

## The Models Are Good Now

Here's the thing people miss: the models running on these platforms aren't toys anymore.

| Model | Where | Speed | Notable |
| --- | --- | --- | --- |
| GPT-OSS 120B | Cerebras, Groq | 3,000 TPS | OpenAI's first open reasoning model |
| Qwen 3 235B | Cerebras | 1,500 TPS | Thinking modes, huge context |
| GLM 4.6 | Cerebras | 1,000 TPS | Great for tool use and coding |
| DeepSeek R1 671B | SambaNova | 250 TPS | Full 671B, not distilled |
| Kimi K2 | Groq | 200 TPS | One trillion parameters, open source |
| K2-Think | Cerebras | 2,000 TPS | Seems to crush math olympiad problems |

K2-Think looks very promising: a 32B model that reportedly competes with models 20x its size on competitive math, running at 2,000 tokens per second. It seems to be gated, so I haven't been able to try it yet. If someone at Cerebras is reading this, I'd love to get access!

GLM 4.6 on Cerebras is becoming an important piece of our fast agentic workflows. It offers solid reasoning and tool calling, with quality comparable to top provider models on many day-to-day tasks, but way faster.

## Use Cases That Only Work at Speed

Here's where it gets interesting. Some applications only make sense when inference is fast enough.

### Streaming UI Generation

We use LLMs to generate entire dashboard UIs in real-time. The model outputs HTML + Tailwind, and we render it as the tokens stream in.

*Streaming UI generation demo*

At 1,000 TPS, you see the dashboard materialize in 2-3 seconds. The streaming effect actually looks intentional, like a reveal animation. At 80 TPS, you're watching paint dry.
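A minimal sketch of the pattern in Python, assuming an OpenAI-compatible streaming endpoint. The base URL, model id, and prompts are placeholders rather than our production setup; check your provider's docs for the real values:

```python
# Minimal sketch: stream HTML + Tailwind from an OpenAI-compatible endpoint
# and re-render the partial document as tokens arrive. The base URL, model id,
# and prompts are placeholders -- check your provider's docs for real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumption: OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

def stream_dashboard(data_summary: str):
    stream = client.chat.completions.create(
        model="glm-4.6",  # placeholder model id
        messages=[
            {"role": "system",
             "content": "Output a single self-contained HTML dashboard styled with Tailwind. No markdown fences."},
            {"role": "user", "content": f"Build a dashboard for this data:\n{data_summary}"},
        ],
        stream=True,
    )
    html = ""
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            html += chunk.choices[0].delta.content
            yield html  # caller re-renders this partial document on every chunk

for partial in stream_dashboard("monthly revenue by region, churn, top accounts"):
    pass  # in a real app: push `partial` to the browser, e.g. over a websocket
```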

This opens up a whole category of applications: adaptive interfaces that generate themselves based on context. Reports that format themselves based on the data they contain. Dynamic visualizations that restructure based on what's interesting in the data.

### Agents and Complex Workflows

Here's where fast inference becomes a necessity, not a nice-to-have.

Modern AI agents don't just generate one response. They think, plan, call tools, analyze results, and iterate. A single user request might trigger:

  1. Initial reasoning about the task
  2. Tool selection
  3. First tool call
  4. Result analysis
  5. Second tool call
  6. More reasoning
  7. Final synthesis

At 80 TPS with traditional models, each step takes 5-10 seconds. A 7-step workflow takes over a minute. Users don't wait a minute. They close the tab.

At 1,000 TPS, each step takes under a second. The same workflow completes in 7 seconds. That's the difference between an agent that feels broken and an agent that feels magical.
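Compressed into code, the loop looks something like the sketch below. The `call_llm` stub and the tools are hypothetical stand-ins, not a specific framework or provider API, but they show why per-step latency multiplies:

```python
# Compressed agent loop: each iteration is one LLM round-trip, so per-step
# latency multiplies across the workflow. `call_llm` and the tools below are
# hypothetical stand-ins, not a specific framework or provider API.
import time

def call_llm(messages: list[dict]) -> dict:
    # Stand-in for the model call: ~0.5s per step at ~1,000 TPS vs 5-10s at 80 TPS.
    time.sleep(0.5)
    return {"type": "tool", "tool": "search", "input": messages[-1]["content"]}

TOOLS = {
    "search": lambda query: f"results for {query!r}",      # stand-in tools
    "extract": lambda url: f"extracted content of {url}",
}

def run_agent(task: str, max_steps: int = 7) -> str:
    messages = [{"role": "user", "content": task}]
    start = time.perf_counter()
    for _ in range(max_steps):
        action = call_llm(messages)                # reason, pick a tool, or finish
        if action["type"] == "final":
            return action["content"]
        observation = TOOLS[action["tool"]](action["input"])
        messages.append({"role": "user", "content": observation})
    print(f"{max_steps} LLM calls in {time.perf_counter() - start:.1f}s")
    return messages[-1]["content"]

run_agent("Build a company profile for example.com")
```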

But fast inference alone isn't enough. You also need fast tools.

### Fast Inference + Fast Tools = Real-Time Agents

This is where it all comes together. Fast LLMs need fast tool providers to unlock truly real-time agentic workflows.

Take Parallel.ai. They built an API for web extraction and search that returns in 2-3 seconds. Combine that with a model running at 1,000 TPS, and suddenly you can do things that felt impossible a year ago.

At Parcha, this is core to what we do. Our agents run due diligence workflows: building company profiles, filling compliance forms, verifying documents. A single workflow might hit 10+ data sources and make multiple LLM calls to fill in the gaps.

The pattern is always the same: fast extraction and parallel searches, then a fast agent to create the first draft, identify gaps, generate new search queries, search again, and synthesize into the final output.

URL → Parallel.ai extract + search (2.5s) → GLM 4.6 draft + gap analysis (1.2s) → Parallel.ai search (1.8s) → GLM 4.6 final (0.8s) → Output

Total: 6.3 seconds

*Building company profiles and filling forms in real-time*

Under seven seconds from URL to complete business intelligence. Company website to due diligence questionnaire. LinkedIn profile to job application form. The agent extracts, searches, drafts, identifies gaps, searches again, and synthesizes while you're still reading this sentence.
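The orchestration itself is just a short chain of calls. Here's a hedged sketch where `extract`, `search`, and `generate` are hypothetical stand-ins for the Parallel.ai and fast-inference calls (with sleeps standing in for their latencies), not their real client libraries:

```python
# Hedged sketch of the extract -> draft -> gap-fill -> synthesize chain.
# `extract`, `search`, and `generate` are hypothetical stand-ins for the
# Parallel.ai and fast-inference calls, not their real client libraries.
import time

def extract(url: str) -> str:
    time.sleep(2.5)                                  # fast web extraction
    return f"<content extracted from {url}>"

def search(queries: str) -> str:
    time.sleep(1.8)                                  # parallel follow-up searches
    return f"<results for: {queries[:40]}>"

def generate(prompt: str) -> str:
    time.sleep(1.0)                                  # one fast LLM call at ~1,000 TPS
    return f"<model output for: {prompt[:40]}>"

def company_profile(url: str) -> str:
    start = time.perf_counter()
    source = extract(url)                                                            # fast extraction
    draft = generate(f"Draft a company profile and list missing facts:\n{source}")   # draft + gap analysis
    extra = search(f"queries for the gaps in:\n{draft}")                             # fill the gaps
    final = generate(f"Merge into a final profile:\n{draft}\n{extra}")               # final synthesis
    print(f"URL to finished profile in {time.perf_counter() - start:.1f}s")
    return final

company_profile("https://example.com")
```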

Try that with traditional providers. Extraction takes 10+ seconds. Each search takes 5+ seconds. LLM synthesis takes another 10+ seconds. You're looking at 45 seconds minimum. Probably over a minute.

With fast inference and fast tools, it's under 10 seconds. That's not an incremental improvement. That's a completely different product experience.

### Interactive Experiences

And just for fun, we built a demo where we ask LLMs to draw famous paintings from "memory" (Starry Night, The Great Wave, Mona Lisa), rendering them in different styles using SVG, Canvas, or raw HTML. At 1,000 TPS, you watch the artwork materialize stroke by stroke. The LLM generates coordinates, colors, and shapes in real-time, creating something genuinely fun to watch.

*LLM drawing from memory in real-time*


And here's the thing: this is as slow and as limited as these models will ever be. Every month brings faster chips, smarter models, and better tooling. What takes 6 seconds today will take 600 milliseconds next year.

The possibilities are endless. We're just getting started.


Want to see this in action? Check out Parcha where we build AI agents that automate KYB, AML screening, and due diligence for leading fintechs and financial institutions.
