Don't Sleep on Fast Inference
Everyone's talking about the latest frontier models. Claude Opus 4.5. Gemini 3.0 Pro. GPT 5.1. The discourse is all about which model is smarter, which one reasons better, which one will finally pass that one benchmark.
But while everyone's focused on SOTA, something equally important is happening in the fast inference space. And I think most people are sleeping on it.
The Speed Gap
When you call Claude or ChatGPT, you're getting maybe 80-100 tokens per second. That's fine for chat or async tasks. But what happens when you're building something that needs multiple LLM calls in sequence?
| Scenario | Traditional (80 TPS) | Fast Inference (2,000 TPS) |
|---|---|---|
| Single response (500 tokens) | 6 seconds | 0.25 seconds |
| 4-step agent workflow | 25+ seconds | 1 second |
| 10 research queries | 60+ seconds | 3 seconds |
The math is simple but the implications are huge. At 80 TPS, agents feel slow. At 2,000 TPS, agents feel instant.
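If you want to sanity-check the table, here's the back-of-the-envelope math behind it, assuming roughly 500 output tokens per LLM call and ignoring network and tool overhead:

```ts
// Back-of-the-envelope latency math behind the table above.
// Assumes ~500 output tokens per LLM call; ignores network and tool time.
const TOKENS_PER_STEP = 500;

function workflowSeconds(steps: number, tokensPerSecond: number): number {
  return (steps * TOKENS_PER_STEP) / tokensPerSecond;
}

console.log(workflowSeconds(1, 80));     // 6.25  -> "6 seconds"
console.log(workflowSeconds(1, 2_000));  // 0.25
console.log(workflowSeconds(4, 80));     // 25    -> "25+ seconds" once you add overhead
console.log(workflowSeconds(4, 2_000));  // 1
console.log(workflowSeconds(10, 80));    // 62.5  -> "60+ seconds"
console.log(workflowSeconds(10, 2_000)); // 2.5   -> "~3 seconds"
```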
Who's Doing This?
I'm aware of three companies in the fast inference space, each with very different approaches. I'm not an expert in chip architecture, but here's what I've learned about how they make this happen:
- Cerebras built a chip the size of a dinner plate. Literally. Their wafer-scale approach eliminates the memory bottleneck that slows down GPUs. Result: 3,000+ TPS on large models.
- Groq designed custom silicon specifically for LLM inference. Their LPU (Language Processing Unit) delivers predictable, consistent performance. Result: 500-800+ TPS with rock-solid latency.
- SambaNova focused on running massive models efficiently. Their architecture can host the full 671B-parameter DeepSeek R1 on a single rack, something that would take 40 racks of GPUs. Result: the biggest models, running fast.
The Models Are Good Now
Here's the thing people miss: the models running on these platforms aren't toys anymore.
| Model | Where | Speed | Notable |
|---|---|---|---|
| GPT-OSS 120B | Cerebras, Groq | 3,000 TPS | OpenAI's first open reasoning model |
| Qwen 3 235B | Cerebras | 1,500 TPS | Thinking modes, huge context |
| GLM 4.6 | Cerebras | 1,000 TPS | Great for tool use and coding |
| DeepSeek R1 671B | SambaNova | 250 TPS | Full 671B, not distilled |
| Kimi K2 | Groq | 200 TPS | One trillion parameters, open source |
| K2-Think | Cerebras | 2,000 TPS | Seems to crush math olympiad problems |
K2-Think looks very promising: a 32B model that reportedly competes with models 20x its size on competitive math, running at 2,000 tokens per second. It seems to be gated, so I couldn't try it yet. If someone at Cerebras is reading this, I'd love to get access!
GLM 4.6 on Cerebras is becoming an important piece of our fast agentic workflows. It has solid reasoning and tool-calling capabilities, with quality comparable to the top providers' models on many day-to-day tasks, but it's way faster.
Use Cases That Only Work at Speed
Here's where it gets interesting. Some applications only make sense when inference is fast enough.
Streaming UI Generation
We use LLMs to generate entire dashboard UIs in real-time. The model outputs HTML + Tailwind, and we render it as the tokens stream in.
At 1,000 TPS, you see the dashboard materialize in 2-3 seconds. The streaming effect actually looks intentional, like a reveal animation. At 80 TPS, you're watching paint dry.
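Here's a minimal sketch of that render loop. The `/api/generate-dashboard` route is a hypothetical backend that proxies a fast-inference provider and streams back raw markup as text; this is the shape of the idea, not our production code:

```ts
// Minimal sketch: render LLM-generated HTML as the tokens stream in.
// "/api/generate-dashboard" is a hypothetical backend route that proxies a
// fast-inference provider and streams raw HTML + Tailwind markup as text.
async function streamDashboard(prompt: string, target: HTMLElement): Promise<void> {
  const response = await fetch("/api/generate-dashboard", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let html = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    html += decoder.decode(value, { stream: true });
    // Re-render on every chunk: at 1,000+ TPS this reads as a reveal animation.
    target.innerHTML = html;
  }
}
```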
This opens up a whole category of applications: adaptive interfaces that generate themselves based on context. Reports that format themselves based on the data they contain. Dynamic visualizations that restructure based on what's interesting in the data.
Agents and Complex Workflows
Here's where fast inference becomes a necessity, not a nice-to-have.
Modern AI agents don't just generate one response. They think, plan, call tools, analyze results, and iterate. A single user request might trigger:
- Initial reasoning about the task
- Tool selection
- First tool call
- Result analysis
- Second tool call
- More reasoning
- Final synthesis
At 80 TPS with traditional models, each step takes 5-10 seconds, so a seven-step workflow like the one above can easily take a minute. Users don't wait a minute. They close the tab.
At 1,000 TPS, each step takes under a second. The same workflow completes in 7 seconds. That's the difference between an agent that feels broken and an agent that feels magical.
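The loop itself is simple; speed is the whole game. Here's a rough sketch, with the LLM and tool calls passed in as stand-in callbacks rather than any particular provider's SDK:

```ts
// Sketch of a multi-step agent loop. The llm and tool callbacks are
// hypothetical stand-ins for your inference provider and tool layer.
type AgentStep =
  | { kind: "tool_call"; tool: string; args: unknown }
  | { kind: "final"; answer: string };

async function runAgent(
  task: string,
  llm: (transcript: string[]) => Promise<AgentStep>,
  tool: (name: string, args: unknown) => Promise<string>,
  maxSteps = 7,
): Promise<string> {
  const transcript = [task];

  for (let i = 0; i < maxSteps; i++) {
    // One LLM call per iteration: 5-10 s each at 80 TPS, well under 1 s at 1,000+ TPS.
    const step = await llm(transcript);
    if (step.kind === "final") return step.answer;

    const result = await tool(step.tool, step.args);
    transcript.push(`${step.tool} returned: ${result}`);
  }
  throw new Error("Agent did not finish within the step budget");
}
```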
But fast inference alone isn't enough. You also need fast tools.
Fast Inference + Fast Tools = Real-Time Agents
This is where it all comes together. Fast LLMs need fast tool providers to unlock truly real-time agentic workflows.
Take Parallel.ai. They built an API for web extraction and search that returns in 2-3 seconds. Combine that with a model running at 1,000 TPS, and suddenly you can do things that felt impossible a year ago.
At Parcha, this is core to what we do. Our agents run due diligence workflows: building company profiles, filling compliance forms, verifying documents. A single workflow might hit 10+ data sources and make multiple LLM calls to fill in the gaps.
The pattern is always the same: fast extraction and parallel searches, then a fast agent to create the first draft, identify gaps, generate new search queries, search again, and synthesize into the final output.
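In code, the skeleton looks roughly like this. The `extract`, `search`, and `llm` helpers are hypothetical stand-ins, not Parcha's production code or Parallel.ai's actual API:

```ts
// Sketch of the extract -> draft -> fill-gaps -> synthesize pattern.
// extract, search, and llm are hypothetical helpers, not Parcha's production
// code or Parallel.ai's actual API surface.
type Helpers = {
  extract: (url: string) => Promise<string>;
  search: (query: string) => Promise<string>;
  llm: (prompt: string) => Promise<string>;
};

async function buildProfile(url: string, { extract, search, llm }: Helpers): Promise<string> {
  // 1. Fast extraction of the primary source (~2-3 s with a fast tool provider).
  const page = await extract(url);

  // 2. Fast first draft, then a newline-separated list of gap-filling queries.
  const draft = await llm(`Draft a company profile from this page:\n${page}`);
  const gapQueries = (await llm(`List web search queries, one per line, for facts missing from:\n${draft}`))
    .split("\n")
    .map((q) => q.trim())
    .filter(Boolean);

  // 3. Fill the gaps with searches run in parallel, not one after another.
  const findings = await Promise.all(gapQueries.map((q) => search(q)));

  // 4. One more fast LLM call to synthesize draft + findings into the final output.
  return llm(`Merge into a final profile:\n${draft}\n\nFindings:\n${findings.join("\n")}`);
}
```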
Under seven seconds from URL to complete business intelligence. Company website to due diligence questionnaire. LinkedIn profile to job application form. The agent extracts, searches, drafts, identifies gaps, searches again, and synthesizes while you're still reading this sentence.
Try that with traditional providers. Extraction takes 10+ seconds. Each search takes 5+ seconds. LLM synthesis takes another 10+ seconds. You're looking at 45 seconds minimum. Probably over a minute.
With fast inference and fast tools, it's under 10 seconds. That's not an incremental improvement. That's a completely different product experience.
Interactive Experiences
And just for fun, we built a demo where we ask LLMs to draw famous paintings from "memory" (Starry Night, The Great Wave, Mona Lisa), rendering them in different styles using SVG, Canvas, or raw HTML. At 1,000 TPS, you watch the artwork materialize stroke by stroke. The LLM generates coordinates, colors, and shapes in real-time, creating something genuinely fun to watch.
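If you want to build your own, the trick is roughly this: ask the model for one SVG element per line and append each finished line as it streams in. The `streamLLM` helper below is a hypothetical wrapper around whichever fast-inference provider you use:

```ts
// Sketch of the "draw from memory" demo: ask for one SVG element per line and
// append each finished line, so the painting appears stroke by stroke.
// streamLLM is a hypothetical helper that yields text chunks from a fast provider.
async function drawPainting(
  title: string,
  svgRoot: SVGSVGElement,
  streamLLM: (prompt: string) => AsyncIterable<string>,
): Promise<void> {
  const prompt =
    `Draw "${title}" from memory as SVG. ` +
    `Output one complete SVG element (<path>, <rect>, <circle>, ...) per line, with no wrapper markup.`;

  let buffer = "";
  for await (const chunk of streamLLM(prompt)) {
    buffer += chunk;
    // Flush every complete line (one shape) onto the canvas as soon as it arrives.
    let newline = buffer.indexOf("\n");
    while (newline !== -1) {
      const element = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (element) svgRoot.insertAdjacentHTML("beforeend", element);
      newline = buffer.indexOf("\n");
    }
  }
}
```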
And here's the thing: this is as slow and as limited as these models will ever be. Every month brings faster chips, smarter models, and better tooling. What takes 6 seconds today will take 600 milliseconds next year.
The possibilities are endless. We're just getting started.
Want to see this in action? Check out Parcha, where we build AI agents that automate KYB, AML screening, and due diligence for leading fintechs and financial institutions.