Gemini 3.1 Pro Preview

Claude Sonnet 4.6

Qwen3.5-397B-A17B

GLM-5

Qwen3-Max-Thinking

Qwen3-Coder-Next

Claude Opus 4.6

Step 3.5 Flash

Kimi-K2.5

GLM-4.7-Flash

GLM-4.6V

LFM2.5-1.2B-Instruct

Nemotron-3-Nano-30B-A3B

GLM-4.7

Devstral 2 123B Instruct 2512

Gemini 3 Flash Preview

Rnj-1-Instruct

Ministral 3 8B Instruct

DeepSeek V3.2

Olmo 3 7B Instruct

Claude Opus 4.5

GPT-5.1

Kimi-K2-Thinking

Qwen3-VL-30B-A3B-Instruct

MiniMax-M2

LFM2-8B-A1B

Claude Haiku 4.5

LongCat-Flash-Chat

Granite-4.0-H-Tiny

Claude Sonnet 4.5

GPT-5 Codex

Grok-4-fast-non-reasoning

Seed-OSS-36B

Qwen3-Next-80B-A3B-Instruct

Kimi-K2-Instruct-0905

Claude Opus 4.1 Thinking

Grok Code Fast 1

ERNIE-4.5-21B-A3B

Jamba 1.7

Qwen3-4B-Instruct-2507

GPT-5

gpt-oss-20b

XBai-o4

Qwen3-Coder-30B-A3B-Instruct

Qwen3-30B-A3B-Instruct-2507

Llama-3.3-Nemotron-Super-49B-v1.5

Qwen3-235B-A22B-Thinking-2507

Qwen3-Coder-480B-A35B

Qwen3-235B-A22B-2507

ERNIE-4.5-300B-A47B

Grok-4

Hunyuan-A13B Instruct

Llama-3.1-Nemotron-Ultra-253B-v1

o3-2025-04-16

Gemini 2.5 Flash

MiniMax-M1

Gemini 2.5 Pro

Magistral Medium 2506

DeepSeek-R1 0528-Qwen3-8B

DeepSeek-R1 0528

Sonnet 4 Thinking

Codex Mini

Mistral Medium 3

Qwen3-235B-A22B

Phi-4-reasoning plus

Qwen3

Gemini 2.5 Flash Preview 04-17

o4-mini

GPT-4.1

Llama-3.1-Nemotron-Nano-8B-v1

Grok-3 mini

Gemini 2.5 Pro Experimental 03-25

DeepSeek V3 0324

EXAONE Deep 32B

Olmo 2 32B

Mistral Small 3.1

Reka Flash 3

Command A

Gemma 3

QwQ-32B

GPT-4.5 Preview

Claude 3.7 Sonnet Thinking

R1 1776

Grok-3

Qwen2.5

o3-mini

Mistral-Small-24B-Instruct-2501

R1-Distill

DeepSeek R1-Zero

DeepSeek-R1

QVQ-72B-Preview

DeepSeek V3

Gemini 1206

o1-2024-12-17

Command R7B

Llama 3.3 70B

QwQ-32B-Preview

DeepSeek-R1-Lite-Preview

Claude 3.5 Haiku

Aya Expanse

Granite-3.0-8B-Instruct

Yi-Lightning

Ministral

Llama-3.1-Nemotron-70B-Instruct

Llama 3.2 11B & 90B

Llama-3.1-Nemotron-51B

Qwen2.5-14B-Instruct

Mistral-Small-Instruct-24-09

o1-preview

DeepSeek V2.5

Reflection Llama-3.1 70B

Command R 08-2024

Jamba 1.5

Gemini Pro 1.5 experimental

Claude-3.5-Sonnet

Llama-3-70B-Instruct

claude-3-haiku-20240307

Gemini, Claude, Mistral, GPT-4 Turbo

Earlier first impressions

Dubesor - First Impressions Blog

Initial first model impressions & general vibe, copy-pasted from my discord comments

Gemini 3.1 Pro Preview

2026-02-20

Tested Gemini 3.1 Pro Preview:
Token-efficiency update with claimed improvements in toolcalls and overall agentic reliability.

Verbosity went down ~25% overall, reasoning by 32%. Very light thinker, akin to GPT-5-Mini.
In the same breath, long context pricing has been increased above 200k ctx ($2/12 → $4/18). This same price increase is also applied to Gemini 3 Pro.

Output quality was largely equivalent to Gemini 3 Pro
Produced less complex code, and performed slightly worse on the most complex tasks
Censorship was tighter than on 3 Pro

Vision was top notch, sharing first place with Gemini 3 Flash.

Chess play was noticeably impacted by the reduced reasoning. In contrast to Gemini 3 Pro, an undefeated beast holding 60+ matches without a single loss* (not accounting for 2 human players), Gemini 3.1 Pro already lost 5 times during initial placement matches (0-2 vs 3 Pro). Move accuracy, and legality are still very strong, but the unique sharpness in key moments seems to have been partially lost.

Overall, it's still a strong model and will be significantly cheaper for short-context tasks.
Compared to Gemini 3 Pro Preview however, to me this is a downgrade and feels like an economically driven release. YMMV.

Update: The day after my initial Chess disappointment, I ran several more mirror matches and while blind continuation play favors Gemini 3 Pro, full information reasoning chess actually ended up being a close 3-2 series. All matches are viewable here: Replays

Claude Sonnet 4.6

2026-02-18

Tested Claude Sonnet 4.6:
Sonnet update, promising upgrades to coding, computer use, long-context reasoning, agent planning, knowledge work, and design.

As mentioned in my Opus 4.6 impression, "High effort" is the default and thus was the effort level being tested.

Token use was up +71% (*not on Claude.ai)
least censorship seen by any Claude model yet (*not on Claude.ai)
greatly improved instruction-following
improved STEM performance
generally better front-end results (see some examples)
small gains in many other areas

Chess performance fell between Sonnet-4 and Opus-4, though using the most tokens per move of any non-thinking Claude model.

While Vision saw some improvements compared to Sonnet 4.5, it remains weak & isn't a scope I'd use it for.

The combination of increased verbosity, and same 50% price-inflation as Opus 4.6 on large 200k+ ctx ($3/15 → $6/22.50) leads to quite expensive interactions.
However, at time of testing, it indeed performed like a Opus-level intelligence, ranking 1/5 depending on chosen censor weight.

This was a somewhat unexpected result, and effort as well as reply quality varied (at times heavily) between API and claude.ai implementation. Thus: YMMV.

Update: In comparison, Claude.ai testing (artifacts/code execution/memory/skills/web search OFF, style: normal) Sonnet 4.6 produced 35% fewer tokens, and scored -3% total (more censored, still top3).

Qwen3.5-397B-A17B

Qwen3.5-Plus

2026-02-17

Tested Qwen3.5-Plus / Qwen3.5-397B-A17B:

Latest Alibaba hybrid reasoning MoEs with vision capability.
Technically they have the same underlying model, however the Plus model is API only with enhanced context window 262k > 1M, ctx-variable pricing (3x at >256K), and adaptive tool use features.
I did test each endpoint individually, due to performance being almost identical (as is to be expected), I'll cluster them.

Noticeably, OpenRouter defaulted to Reasoning enabled on Qwen3.5-397B-A17B & disabled on the Qwen3.5-Plus endpoint.
397B-A17B had to be rerun multiple times due to aggressive rate limiting at time of testing.

Thinking ("enable_thinking": True):
With >8x verbosity, upper Qwen3 lvl verbosity
80% of tokens were generated during reasoning.

High general logic
Overall performance Around Kimi-K2.5 Thinking & GLM-4.6

Nonthinking ("enable_thinking": False):

~75% token savings
lack of reasoning impacted instruction adherence and hard coding issues
Generic utility for size is low compared to Qwen3-235B-A22B-Instruct-2507

Chess performance is underwhelming, hovering around 700 MixedElo with accuracy in the 50s. Reasoning multiplied tok/move by ~8.2x which aligns with my observations in generic use.

Vision performance was very strong overall, slotting in just below Gemini 3. On identical inference settings, Qwen3.5-Plus performed slightly worse than the 397B-A17B endpoint, in particular with reasoning enabled and on certain counting tasks, where I could repeatedly observe endless recounting loops.

Overall, decent models with great vision, though I don't think they performed particularly well in creative tasks, but YMMV.

GLM-5

2026-02-12

Tested GLM-5:

Beefier GLM hybrid-reasoning MoE model (355B-A32B → 744B-A40B).

Default/Thinking:
Slightly more verbose than previous GLM models, DeepSeek-R1 0528 level.
76% of tokens were generated during reasoning.

very high general logic and reasoning
I saw no leaps in my STEM & tech tasks
reasonably censored
Unlike 4.7, no reasoning loops encountered

Chess performance wasn't great in a vacuum: 6k tok/move ~780 mixed elo /w 62% accuracy, decent blind legality, around o1-mini. Best among GLM family though.

Nonthinking:

~76% token savings (non-reasoning segments were samey).
negatively impacts logic and maths
was slightly less likely to refuse in censorship testing

Overall, very solid and one of the best open models currently, but YMMV.

Qwen3-Max-Thinking

2026-02-11

Tested Qwen3-Max-Thinking:
Alibaba's API-only reasoning proprietary model.

Excessively overthinks and over-explains everything - example
With 16.4x verbosity (85-15 split), bottom 4 for reasoning efficiency. Reasoning added 770% verbosity, non-reasoning content was up +30%.
API pricing is too high for this much reasoning, bottom line was 2x Opus-4.6 cost
General use due to style and efficiency out of the question. ⚠️Exclaimer galore. Treats user like a toddler
Replies are long-winded, all-encompassing and annoying to read
While severely inefficient, math/coding results were ok
In general use, scored around Qwen3-235B-A22B-Thinking-2507 with better math/coding

Chess testing is not feasable due to inefficiency, as the model generates 30k tokens taking ~20 minutes/move. I conducted 1 match which I had to restart, because I forgot it was still running, between sleep and work days. It eventually won against its weak 680 Elo, non-thinking but usable, counterpart: match

Overall, unusable model. It's extremely slow, inefficient, way too expensive, and has abhorrent vibes/style. Feels like a model solely trained for marketing benchmarks. YMMV.

Qwen3-Coder-Next

2026-02-08

Tested Qwen3-Coder-Next (local, Q4_K_M):

New Qwen3 MoE 80B-A3B "non-thinking" coding model with focus on agentic coding, local deployment and tool usage.

Despite not officially ＜think＞ing, utilizes plenty of chain of thoughts and self-corrections in its responses
Thus not very concise, around 2x Qwen3-Coder-30B-A3B-Instruct
General capability was decent, around oss-120b or Seed-OSS-36B level
Non-agentic coding was rather lackluster though, around ~oss-20b

Chess testing was conducted via API (bf16) and showed very flakey json adherence. Above expected levels of malformed responses were generated, caused by issues such as forgetting to include mandatory move keys. On top of malformed json, nearly 4% of moves were illegal, despite receiving a legal move list, which isn't great. It was also the most verbose qwen-coder at ~1400 tok/move; similar to light thinkers like o1-mini/grok-3-mini.
Chess skill itself was weak; placed around devstral-medium & claude-3.5-haiku lvl

Local inference was quite slow (~9 tok/s on 4090).

Overall, for general use I don't have many complaints since it's competent enough, albeit censored.
As a coding-specific model I found the results to be far too weak in my use and test-cases. However, since I don't utilize agentic coding, YMMV.

Claude Opus 4.6

2026-02-06

Tested Claude Opus 4.6 (default/non-thinking):

Iterative upgrade to Opus, with focus on agentic coding, autonomous works and large context coding.

"High effort" is the new default, meaning the model is wordier, which correlates to my findings: +~77% tok/benchprice. Claude.ai platform seems to default to a lower effort, resulting in more concise replies.

General logic was very good
STEM performance was weaker; minor issues with USdefaultism and estimations
Coding is pretty much saturated with the exception of 1 task
Effort did not translate to better outcome generally, thus poor bang/buck

Vision scored identical to 4.5

Chess performance was disappointing, samey or slightly weaker than 4.5, while generating more tokens (+25-70%). With 800ish elo and 60% accuracy, one of the worst chess players, among SOTA models. Losses to gpt-4.1, gpt-3.5, Opus 4.1, Opus 4.5, and draws against tiny models, e.g. seed-1.6 & nano-30b-a3b. Legality was fine.

Overall, the model tweaks have no notable positive impact in my general usage tasks, in particular in terms of token-efficiency.
Price/performance is rather poor, in particular if you want to make use of the extended context due to price inflation after 200k ctx ($5/25 → $10/37.50).
Extremely incremental, for my use cases. But, it's still an Opus class model and performs as such. YMMV.

Step 3.5 Flash

2026-02-03

Tested Step 3.5 Flash:

Chinese open, mandatory reasoning MoE (196B-A11B).

Very long reasoning-plus chains akin to DeepSeek V3.2 Speciale
88% of tokens were used for reasoning
Decent performance overall, though STEM wasn't great
fairly censored

Chess performance was most verbose ever recorded. tok/move was worst efficiency to date, close to double of Grok-4 or DeepSeek V3.2 Speciale. Blind play was poor: kimi-k2 level while generating 68x more tokens. However, with full information in reasoning Chess, it placed #11 and around gpt-5.2, better than any other current open model.

In Totality, around GLM-4.5/4.6 capability.
No paid providers yet, but due to it's extreme verbosity, any output mtok above ~$0.40 will be too cost-prohibitive.

Provided inference is fast enough, I think this is a model worth checking out. YMMV.

Kimi-K2.5

2026-01-27

Tested Kimi-K2.5:

Newest Kimi model, still MoE 1TA32B, now with hybrid thinking and vision.

Default/Thinking:

similar verbosity as Kimi-K2-Thinking
77% of tokens were used for reasoning
very good instruction following
performed slightly better than Kimi-K2-Thinking

Chess saw improvements compared to Kimi-K2-Thinking (which was very inefficient). Token usage halved, +200 Elo, +12% accuracy and legal play. Performed around gpt-5.1-codex-mini which means top3 open models.

Nonthinking (template "thinking": False):

~75% token savings
Utilizes light cot and self-corrections; produced a lot of code. ~25% more verbose than concise Kimis (0711/0905)
Samey capability in generic tasks
Negative effects on general logic and complex technical tasks

Vision was good overall, around Qwen3-VL-235B-A22B-Instruct / GLM 4.6V, though on counting tasks reasoning failed to exit and tended to recount endlessly.

At $0.60/3, accounting for verbosity, price/performance is rather average.

Overall, it's a decent upgrade to Kimi-K2-Thinking, though I am not convinced by hybrid/non-thinking, but obviously YMMV.

GLM-4.7-Flash

2026-01-23

Tested GLM-4.7-Flash:

Small 30B-A3B MoE model. Due to borked llama.cpp implementations at time of testing, tested API (official Z.ai endpoint).

Default/Thinking:
Similar verbosity as 4.5-Air or Qwen3-14/32B. 83% of generated tokens were used for reasoning.

Dumber than Qwen3-30B-A3Bs, though more coding focus akin Qwen3-Coder-30B-A3B
Neutral style; didn't stand out positively nor negatively
around gpt-oss-20b or Olmo 3 32B lvl

Chess performance was incredibly inefficient, around Gemma 3 27B but 20k+ tok/move, though low sample size for now due to snails pace generation.

Nonthinking:

80% token savings
Stripping away reasoning had massive negative impact on general reasoning and STEM
Found model capability loss too be high to be worth; performed ~Qwen3-8B non-thinking level

Inference speed for the first few days was extremely crippled, mostly between 1-12 tok/s, which for a "lightweight" active 3B model is obviously beyond terrible.
Ignoring 0 usability from crippled throughput, the model didn't perform well enough to be a worthy contender for me, but YMMV.

GLM-4.6V

2026-01-10

Tested GLM-4.6V:

Zhipu AI's Vision model, was already released a month ago but better late than never.
At $0.30/$0.90 almost 60% cheaper than GLM-4.6.
Tested via OpenRouter official Z.ai fp8 endpoint on recommended params (temp 0.8, top_p 0.6, top_k 2, repetition_penalty 1.1)

Default/Thinking:

More verbose than non-vision, partially due to looping issue:
3 tasks cause reproducible finishing loops in reasoning
4x token usage of non-reasoning, which accounts for ~80% of tokens generated
Significantly weaker general intelligence and instruction following
Around MiniMax-M1 level

Chess performance remained in the same ballpark (poor) with slightly higher accuracy.

Nonthinking:

samey verbosity as non-vision counterpart, ~75% token savings, no response issues.
significantly worse at logic tasks and STEM
Overall performed around 4.5-Air

Vision performance between reasoning enabled/disabled was near identical, in my test set; around Opus 4.5.
So, while vision was good, Qwen3-VL-235B-A22B-Instruct might be a better choice.

I wouldn't use this model for non-vision tasks, as it showed regressed text-only capability compared to non-vision counterparts. YMMV, though.

LFM2.5-1.2B-Instruct

2026-01-06

Tested LFM2.5-1.2B-Instruct (local, f16):

Fast phone-sized model. Generates text fast (~280 tok/s on 4090).
Not much to say, since my benchmark is far too hard for it.

It's coherent but obviously very unintelligent. If you can go a tad larger, a model like Gemma 2 2B (1.5y old) offers much more general capability/utility; for this size segment.
Still, fast and tiny, generates text just fine. While it has poor instruction following, I didn't notice any critical flaws. Reasonably concise for this era. I have no use for this, but YMMV.

Nemotron-3-Nano-30B-A3B

2025-12-28

Tested Nemotron-3-Nano-30B-A3B (local, Q4_K_M):

Small open NVIDIA MoE model.
On my setup (24GB VRAM), due to being significantly larger (+~5GB on Q4_K_M), I wasn't able fit the entire model into VRAM which resulted in ~60% slower inference when compared to Qwen's 30B-A3B variants.

Default/Thinking:

average verbose for a thinker, around DeepSeek V3.2 level, 60% of tokens used in reasoning
not particularly smart, pretty mediocre results in most fields

While it cannot play blind (like all small models), reasoning Chess performance was very respectable, though inefficient at 19k tok/move. Beating much larger models, 900 elo @68% accuracy hit way above weight, oss-120b level

Nonthinking (template enable_thinking=False):

same verbosity, minus reasoning ~60% token savings
No reasoning hurts logical deductions and maths
General utility and code quality remained on similar level

Overall, size/performance isn't too interesting. The combination of much slower inference and lower capability than competing param-sized models, excludes it from being a worthy consideration. Style is rather sterile and did not pass my vibe check (subjective, unscored).
Not terrible, not great. YMMV.

MiniMax-M2.1

2025-12-26

Tested MiniMax-M2.1:

Less than 2 months later, upgrade to MiniMax-M2. As always, promises upgrades to agentic coding, tool use, and the usual focus areas.
(you notice correctly, I am a bit tired of these type of releases).

longer reasoning chains, tok use was up +~40%
a bit dumber at logic/common sense
much worse format adherence/strict instruction following
saw no improvements to non-agentic coding, quite mediocre with low attention to detail/intent
fixed excessive "policy" considerations; slightly less censored

While reasoning chains looked cleaner, at recommended params and through official API in 6 of my tasks (4 reasoning 1 STEM 1 Utility) the model exhibited thought loops and repetition issues.

In Chess, tokens generation was fine and roughly halved but performance was weaker than on m2. -100 elo, -5% accuracy, 13x higher illegal move attempts, despite providing legal move list.

Overall, weaker model for general use. The decreased utility, tok usage (other than poor-performance chess), and general lack of non-agentic coding performance makes this completely uninteresting to me.
If you liked MiniMax M2 you might like M2.1. If you use coding agents, worth a look. Not my cup of tea, YMMV.

I'll also stop testing any further short-window iterative "agentic coding" model updates. It's time-consuming and frankly boring as hell to me, doesn't fit my use/test case, and either has no or negative impact on general use.

GLM-4.7

2025-12-25

Tested GLM-4.7:

Released merely 7 weeks after GLM-4.6, Hybrid thinker MoE, with claimed improved gains in agentic coding, complex tasks, and tool use.

Default/Thinking:

Slightly less efficient, +14% overall token use, longer reasoning chains
Minuscule gain in STEM/non-agentic-coding, close to variance
Slightly worse at logic and creative tasks
Slightly more censored, more prone to refuse

No Chess matches could be concluded, as the model was constantly getting stuck in reasoning loops ("Actually, wait, let me reconsider"), while wasting max_tokens per move without ever finishing a response nor providing a finalized final move. This behaviour was consistent across all 6 providers on OpenRouter. The model could not consistently terminate its own looping reasoning. This is a serious and rare failure. Move example: https://dubesor.de/chess/glm-4.7-reasoning-loop.txt

While I did not encounter this issue to this degree in other domains, I don't know the exact cause, but if the model regularly enters a paralysis loop in a closed system with perfect information, it's not improbable it will enter the same loop when facing ambiguous variable names, spaghetti code, or edge cases in programming or other untested domains. Dangerous possibility for agentic use. I'd cap tokens and closely monitor token usage.

Nonthinking:

Less verbose than 4.6-nonthinking, -29% tok
Used about 78% less tokens than with default reasoning enabled
performed slightly worse at strict instruction following/format requirements
otherwise, samey total capability as GLM-4.6 nonthinking while being more tok-efficient

Thinking still is a reasonable improvement in more complex tasks, such as maths and coding.

Overall, probably a slight upgrade for agentic coders (not my test/use case), but worse general utility. The thought loops are a bit spooky but might be an edge case discovery to look out for. YMMV.

Devstral 2 123B Instruct 2512

2025-12-21

Tested Devstral 2 123B Instruct 2512:

Mistral open agentic coding LLM. Non-thinking Instruct model but default behaviour was verbose (2.9x)
My benchmarks don't test tool use/agentic coding, thus results are nice-to-know but not exactly target use case.

Generic use was yappy but overall fine, Mistral Medium 3/3.1 level.
Non-agentic coding performance was pretty mid, often technically just passable but with low attention to detail and dated frontend results. I added some demo pages to shared assets (not part of any testing/scoring).

Chess performance ~600 Elo @51% accuracy was poor. With provided movelist, 98.7% legal output at raw instruction following was okay (around gpt-4.1 nano & gemini 2.5 flash level).

Current API offerings are either free or very cheap, though Mistral's published future pricing at $0.4/$2 would be too high for this caliber model imo.
As stated, not my use-case whatsoever, so YMMV!

Gemini 3 Flash Preview

2025-12-20

Tested Gemini 3 Flash Preview:
Google's newest speedy model, cost-efficient with SOTA performance.

very high capability in all areas tested
best bang/buck among SOTA models
around GPT-5 level, insanely cheaper

Reasoning -vs- non-reasoning:
Default/Medium reasoning accounted for 60% of tokens, which is light (around o1-mini or oss-120b level)
This model did not require any reasoning in the totality of my testing. ("minimal"/off)
There were very small gains in limited areas such as complex reasoning tasks, and maths, but in the vast amount of scenarios non-reasoning performed even or better (e.g. Vision/Creative writing)

Great Vision, and as seems to be the trend among hybrids, more consistently correct results when stripping away its reasoning.

Chess play was excellent and highly token efficient. Outperformed GPT-5 while using 98% fewer tokens. Currently ranked #4 with ~86% accuracy and >1500 mixed elo. Blind play and Internal state tracking were almost error free with 99% legality.

Overall, this is a fantastic model that embarrasses some competition in comparison (mainly OpenAI). It's not quite as Smart as Gemini 3 Pro (obviously) but comes close enough in many scenarios. Obviously YMMV.

GPT-5.2

GPT-5.2 Chat

2025-12-13

Tested GPT-5.2:

Chat:

hyper conciseness of 5.1 Chat was reversed, now same verbosity as GPT-5 Chat
No reasoning traces visible but ~25% of charged output was used in reasoning
Performed around the previous Chat models but substantially worse instruction following

By far the worst Chess player out of all 5-series Chat models, despite using more tokens: -10% accuracy, -300 Elo, +12% illegal play
Vision result were also slightly regressed.
This one is a complete dud, in my testing.

GPT-5.2:

tok-efficiency update, ~62%|32% less tokens generated than GPT-5|GPT5.1
Remains competent performance at basic logic and code

Vision performance was weaker than GPT-5, though using less tokens
In Chess now used significantly less tokens (around Gemini 3 Pro level), though performance dropped noticeably: -12% accuracy, -400 Elo, +13% illegal play.
Internal board state tracking suffered immensely, in particular when compared to GPT-5 in continuation (no choice list provided) where it was 9.4x more likely to attempt illegal moves.

Overall, this release is disappointing. They are trying to make their models more token-efficient but seemingly unable to do so without performance regression. I haven't encountered any field across my entire testing where GPT-5.2 is best or best for bang/buck. But, obviously YMMV.

Rnj-1-Instruct

2025-12-11

Tested Rnj-1-Instruct (local, bf16):

8B model claiming to be "SOTA open weight" for STEM and code.

Replies to most queries in python
asking for non-python code results in python
python quality itself was poor

Chess results were slightly above random-mover (~510 elo, 38% accuracy), though with unacceptable 14% illegal play when providing a legal move list (50% higher illegal play than llama-3-8b).

One of the worst models on my main bench. Even when completely looking past the python overfit issue, it's simply unintelligent with abysmal utility. YMMV.

Ministral 3 8B Instruct

Ministral 3 14B Instruct

2025-12-04

Tested Ministral 3 Instruct:

8B (local, bf16)

Very yappy, +160% token generated compared to last years 2410
Still roughly in the same ballpark, improvements were minuscule
Extremely obnoxious style (subjective, unrated)

Capability around a modern non-thinking 2-4B model. Wouldn't use this other maybe in very niche creative scenarios with heavy system prompts trying to combat its default behaviour.

14B (local, Q8_0)

More concise and far less obnoxious style
much more useful overall

Capability around a modern non-thinking 4-8B model. Still not good, but shines in comparison to its 8B counterpart.

Overall, weak performance for size. The 8B in particular is very unintelligent combined with an obnoxious "personality", the 14B might be worth checking out for some creative use cases. YMMV.

I also ran several tests on the reasoning variants, encountering several inference issues mostly related to reasoning specific template issues and context overflow. Thus I have canned those results for now.

Mistral Large 3 2512

2025-12-03

Tested Mistral Large 3 2512:

Mistrals new "sota" model. Following industry trends, architectural shift from 123B dense to now MoE (675B-A41B).
Not a thinker, though also not very concise, utilizes step by step reasoning, ~x2.38 bench verbosity.

mediocre general reasoning/logic
improvements in math
generally better results at frontend

Chess performance @45% accuracy was poor, currently ~100 elo below Mistral Large 2.
Very basic Vision is supported, though vision results were weak.

Overall, general capability falls around Llama 3.3 70B / DeepSeek V3.2 (nonthinking) level.
Pricing got reduced compared to Large 2, but is still a bit high when compared to similarly performing models.
Didn't feel particularly "large" during testing, and falls behind real sota models, but YMMV.

DeepSeek V3.2

DeepSeek V3.2 Thinking

DeepSeek V3.2 Speciale

2025-12-03

Tested DeepSeek V3.2:

DeepSeek V3.2 (non-thinking):
Performed similarly to V3.1, though did slightly worse in STEM segment and during Censorship probing.
Behaviour, token use, general use etc is samey as before. Chess play saw some improvements (+~6% accuracy).
Price/Performance at time of testing was very good.

DeepSeek V3.2 Thinking:
+180% tokens (71% used in reasoning).
Fairly efficient reasoner (5.9x verbosity). Improved reasoning/logic, and STEM. Instruction following and programming did not benefit in my test scenarios.

DeepSeek V3.2 Speciale:
Basically the "reasoning plus" model. +520% tokens (or more than twice of Thinking), ~93% used in reasoning.
Improvements seen in general logic, STEM, instruction following and programming.

This one behaves less conversational and more factual. Reminded me a bit of R1-Zero. Thus, final replies are concise and contain less fluff.

Best chess player out of all open models tested, ~69% accuracy @1000 elo while using roughly 19k tok/move; near Grok-4 inefficiency. Still not great, but an improvement (+300-500 Elo over previous DeepSeek models).

Overall, I'm still not a fan of the hybrid-thinking approach and think it actively hurts model performance when compared to a dedicated approach (see R1+0324).
Speciale is the more interesting release here imo, but YMMV.

Olmo 3 7B Instruct

Olmo 3 7B Think

Olmo 3 32B Think

2025-11-29

Tested Olmo 3:

Olmo 3 7B Instruct (local, f16)
Fairly yappy unintelligent model, around Gemma 3n E4B.
Broke chess record by placing at the absolute bottom of my chess benchmark, playing significantly worse than random (negative skill), and losing to bottom tier players.

Olmo 3 7B Think (local, Q8_0)
Most verbose model I have ever tested thus far. Unfathomable wasteful thinking. Due to massive amount of context required (24k+ for single turn), couldn't fit f16 on 4090. Almost 23x verbosity for slight gains in logic and math. Weaker than first Qwen3-4B Thinking while wasting massively more tokens on theater reasoning.
In reasoning chess, it used over 130x tokens of 7b-instruct, while playing at essentially random-mover level, though the insane inefficiency forced low sample size.
Unusable model with truly comical token-inefficiency.

Olmo 3 32B Think (local, Q4_K_M)
Also very verbose, but not the same extreme degree (15.8x).
Around distilled DeepSeek-R1 0528-Qwen3-8B level, though more wasteful and terrible coder.
With ~45% accuracy, it was able to play slightly above random chess.

Overall, this family is very weak for size, and unusable when considering token efficiency. I see no realistic use case here, but YMMV.

Claude Opus 4.5

2025-11-25

Tested Claude Opus 4.5 (default/non-thinking, 20251101):

Bit less concise than Opus 4/4.1, +~15% tok during general use. Much cheaper base price ($15/75 → $5/25) resulted in ~60% lower cost during benchmarking.

raw logic & common sense was a bit weaker than other Opus 4 models
instruction following was better
tech-related issues and coding were amazing, clear focus
less issues with over-censoring / false policy violations

I saw huge improvements in vision, from formerly "embarrassing for a SOTA", catching up to around GPT-5 level
Chess performance remained samey, around 65% accuracy near 900 Elo, not top level.

Overall, this model is clearly optimized for coding, and excels at it.
It might be a fair bit smaller than Opus 4/4.1, based on token efficiency, price, and some nuanced common sense observations, but this doesn't hamper the model much. Another positive were some instances where the model asked for clarifications before proceeding, which is quite rare.

Overall, a more affordable Opus model, with a clear focus on coding, instruction following, and vision improvements. YMMV

Gemini 3 Pro Preview

2025-11-19

Tested Gemini 3 Pro Preview:
Newest Google Reasoning SOTA. Slightly more expensive base price than 2.5 Pro ($1.25/10 → $2/12), though more token efficient in general use (-15% tokens), so bottom line cost was in the same ballpark (+~3%). Roughly 74% of generated tokens were used for reasoning.

Highest reasoning/logic/common sense
nice boost to STEM
precise instruction following was only okay
Improvements in tech and coding related tasks
Censorship fairly low, no hard refusals (likely to change when transitioning from preview/experimental versions)

This model is a true upgrade to Gemini 2.5 Pro. No incremental nonsense. There are a plethora of tasks across many domains, where substantial improvements could be observed, i.e. the above mentioned and things such as:

Vision:
Best vision of any model I ever tested thus far. While it didn't ace my challenging vision test, it performed substantially better than any other model.

Chess:
Hugely better chess player, ~+700 Elo, ~89% accuracy, currently ranked #1, 1700+ in both modes simultaneously (reasoning+continuation). Continuation (blind chess with only movetext) was particularly impressive, as this is challenging for reasoning models and the only model on a similar level was the massive deprecated GPT-4.5 Preview. With only 0%|1.8% illegal play it was also the most precise player after 4.5 Preview.
It's also worth mentioning, that for a reasoning model, it was fairly token efficient, only using a small fraction of competing reasoning models.

There isn't too much negative to say about this model, from my testing. I could mention some nitpicks, e.g. similar to 2.5 Pro, it wrote way too many instructions in comments that have no business being included in codeblocks.

Overall, fantastic model, true noticeable upgrade, and excels across many completely varying fields. YMMV.

GPT-5.1

GPT-5.1 Chat

GPT-5.1 Codex Mini

GPT-5.1 Codex

2025-11-15

Tested the GPT-5.1 series:

Codex Mini:

Around Haiku 4.5 or o3-mini capability
Worse coder than either
Not cheap due to less efficient reasoning

Failed to convince me, this one is a miss imo

Codex:

35% less verbose than GPT-5 Codex

Remains similar capability, though less consistent at harder coding tasks

Chat:

hyper concise, 63% less verbose than GPT-5 Chat
This results in lower costs, ~2.5 flash level

As to be expected, bit weaker than GPT-5-Chat, around GPT-5 mini level.
Fast, efficient model.

GPT-5.1:

44% less verbose than GPT-5

Performed worse in coding and instruction following. Poor showing at creative tasks.

Elo still has to fully settle in, but here are some initial chess results from 52 matches:

CHESS performance (provisional)	All/Mixed	Reason	Continue	tok/move	%acc	%legal
GPT-5.1 Codex	1508 (#3)	1619 (#4)	1396 (#6)	11,196	84	100/94
GPT-5.1	1502 (#5)	1559 (#5)	1388 (#7)	5,571	84	100/88
GPT-5.1 Chat	1043 (#18)	934 (#17)	1224 (#15)	232	70	100/87
GPT-5.1 Codex Mini	967 (#23)	1167 (#11)	726 (#42)	4,777	68	100/78

Overall, this release addresses efficiency & model cost and does not improve on raw capability.
Obviously I can only speak for my own testing, and YMMV!

Kimi-K2-Thinking

2025-11-08

Tested Kimi-K2-Thinking:
Long-Chain-of-Thought Reasoning variant of Kimi-K2.
More than quintupled verbosity, though for a reasoning model still slightly below average at 6.07x bench verbosity; GPT-5 level.

Saw slight gains in general intelligence, logic, instruction following and hard coding challenges.
Surprisingly, STEM performance remained samey, though I do include non-math subjects.

Overall, the model performed in the same ballpark as GLM-4.6-Thinking.

While I don't specifically rate for it, it is absolutely worth mentioning that I found its reasoning chains to negatively influence creative writing. In roleplays, casual talk, and other creative tasks it lost a lot of its charm and magic, that the concise Kimi-K2 has. This lead to more clinical approaches and somewhat hamfisted forced replies, which resulted in more generic final outputs. Edit: 0-shot examples

Chess testing revealed reasoning scaling flaws: In reasoning chess (full information, shown to be highly beneficial to reasoning models), it draws to concise Kimi & seeded between K2 and K2-0905 level, staying around 800 Elo w/60% accuracy but generating more than 50x tokens per move. Extremely disappointing.
*On a sidenote, and this is likely only an initial launch problem, that will be solved in the future: I was shocked to see the actual cost the model caused in chess-testing, as the massive >50x token waste was combined with Openrouter's autorouting to the very expensive moonshotai/turbo endpoint.

To conclude, the reasoning is only beneficial for a select number of tasks, such as requiring logical step by step evaluations, or in code-related issues that aren't solvable by concise Kimi. It is not a universal upgrade to every use case, and actively harms some. Smarter, but more generic. YMMV. Price/Performance at point of writing is rather poor, unfortunately.

Qwen3-VL-30B-A3B-Instruct

Qwen3-VL-32B-Instruct

2025-11-08

Tested two more Qwen3-VL-Instruct models:
Qwen3-VL-30B-A3B-Instruct (local, q4km):
Text capability fell between original Qwen3-30B-A3B and Qwen3-30B-A3B-Instruct 2507.
While being labeled an Instruct models, it is very yappy (+40% to +150% more tokens than aforementioned)
Good instruction following and general intelligence with okay vision ability.

Due to very fast inference its best use is probably for bulkwork on simple vision tasks.
Vision was also tested on fp8 and it performed worse than other Qwen3-VL models, though slightly better than its thinking-counterpart.

Qwen3-VL-32B-Instruct (local, q4km):
While more verbose than non-vl 32B (+60% tok), significantly less yappy than VL-30B-A3B-Instruct (-34%).
It was around the same text-capability, though far better on coding issues.

Vision tests on q4 and fp8 were consistently very good, trading blows with Qwen3-VL-8b-Instruct.
While it's significantly smarter than 8B, as is to be expected, I didn't find it it to be noticeably better at raw vision tasks, thus a smaller model might be better bang/buck, depending on the vision task.

Overall, the 32B is the far more interesting vision model, unless you desperately require inference speed. YMMV

Qwen3-VL-8B-Instruct

2025-11-02

Tested Qwen3-VL-8B-Instruct (local, bf16):

Despite being a non-reasoning model, due to thoughts and self-corrections within replies, it is quite verbose and yappy; more than 2x verbosity of Qwen3-8B (non-thinking).
In terms of raw text-capability, it performed around original Qwen3-4B Thinking or Qwen3-14B (non-thinking).

Isolated vision testing was already conducted 3 weeks ago. It performed exceptionally well for size, and like all Qwen3-VL models tested, also consistently outperformed its Thinking-counterpart.
When it comes to open general vision capability, I haven't tested a more capable model other than Qwen3-VL-235B-A22B-Instruct or, in some instances, Qwen3-VL-32B-Instruct.

Overall - great Vision model, and raw text capability, while not amazing, is good enough for the occasional non-vision interaction between image queries. YMMV

MiniMax-M2

2025-10-28

Tested MiniMax-M2:
At 230B-A10B a much smaller reasoning MoE model than the predecessor MiniMax-M1 (456B-A46B).

Despite being significantly smaller, achieves roughly same capability
non-agentic coding was worse, slightly smarter in other areas
much more manageable thought-chains, verbosity was down from 13.9x to 7.7x (8/2 split)
heavy gpt-oss-reek was present in reasoning chains, such as excessive "policy" considerations
This influenced risk-queries, where this model is far more likely to produce refusals

Chess verbosity was still quite high (10k tok/move), but performance improved (+120 Elo, +4% accuracy, -13% illegal play); now around gpt-oss-120b level.

Overall, when directly compared to MiniMax-M1, it is a superior model for most usecases.
When pitted against leading open models (GLM-4.6, Qwen3-235B, Kimi-K2, DeepSeek variants, etc.), it didn't perform on the same level. For general use, it's roughly around Llama 4 Maverick or Qwen3-32B (Thinking) capability.

I don't test agentic use cases nor tool calls, and this model suggests to be heavily trained on gpt-oss-120b outputs, so as always YMMV!

LFM2-8B-A1B

LFM2-2.6B

2025-10-24

Tested 2 new LiquidAI models:

LFM2-2.6B
Small hybrid-thinker, which is meant to initiate ＜think＞ing in complex or multilingual tasks though it only did so in 6% of my queries, among which ⅓ were in very low complexity queries. Thus, it's merely a gimmick imho.
Unable to play chess even when providing a legal move list, attempted to play 102 illegal moves in a 140 move match.
Overall, very weak, around Granite-4.0-H-Tiny level. Official API is overpriced.

LFM2-8B-A1B:
Still a very weak model, but universally "smarter" than 2.6B.
Around Ministral 8B level.
Actually able to play Chess, though poorly ~530 elo @40% accuracy.

Thus, these models are for locally GPU starved. API pricing is a tad too high imo.
The biggest downside of the release of these models is the deprecation of the LFM-7B API endpoint, which at $0.01 mtok was a fantastic default for all types of dev testing.

Ling-1T

2025-10-23

Tested Ling-1T:
Massive MoE with 1T params, 50B active.

Not a thinker, but quite verbose, utilizing chain of thoughts in responses (2.35x verbosity).
Zero character, personality same as a wet towel
Disappointing intelligence across most tasks, very poor size/performance
STEM capability was decent though
Around non-thinking Llama 3.3 Nemotron Super 49B v1.5 capability

Chess performance was laughable at ~600 Elo w/ 45% accuracy, below llama 4 maverick

Overall, this model is completely uninteresting. Far too large, far too low performance and not a shred of uniqueness about it. YMMV.

Claude Haiku 4.5

2025-10-16

Tested Claude Haiku 4.5:
Anthropic's small, flash/mini equivalent model. Price is ⅓ of Sonnet 4/4.5.

Default:
Competent enough in most fields, decent general intelligence for a haiku sized model.
Tech performance was very good, a clear focus area. Webdesign for my own purpose was kinda meh, though.

Chess performance scored within expected, ~700 Elo w/ 58% accuracy, around 3.7 Sonnet level

Overall, it performed a bit better than Gemini 2.5 Flash around Llama 3.3 70B level.
Real pricing is on par with non-thinking variants of 2.5 Flash or GLM-4.5.

Thinking:
x2.1 Token use, 60% spent on reasoning
Unlike larger Claude models, leads to actually consistent improvements in more complicated tasks
Thought chains were counterproductive in creative tasks (e.g. RP, creative writing), partially due to unwanted risk analysis and attempts and using thought chains to plan combating user instructions
Around Grok-3 mini or DeepSeek V3.1

Overall I think this is a decent model for a variety of generic tasks. The $5 mToK is quite steep for this class of model, though it not using too many tokens kinda evens it out to be relatively cost-neutral.
As always, YMMV.

LongCat-Flash-Chat

2025-10-12

Tested LongCat-Flash-Chat:

~560B/27B non-thinking MoE, though it utilizes thinking without explicit tags, at 3.3x verbosity most verbose "non-thinker" I tested thus far.

Overall capability was around Llama 3.3 70B / Llama 4 Maverick, though better coder
As stated, kinda cheats in the "non-thinking" department, baked into responses
Not great size/performance, behind GLM 4.5/4.6 (non-thinking)
Style is verbose and DeepSeek-esque

Chess performance at 52% accuracy 740 Elo was mediocre, around Qwen2.5 72B level.

Overall, I think this model is fairly mediocre. Current API pricing is alright, though not best bang/buck. Due to its relatively poor size/performance and token inefficiency I think there are better alternatives for most use cases, though tech related tasks were solid. YMMV.

Granite-4.0-H-Tiny

Granite-4.0-H-Small

2025-10-04

Tested IBM Granite 4.0:

IBM's newest Mamba-2 MoE models series (32B-A9B, 7B-A1B, 2x 3B), nonthinking, concise

Granite-4.0-H-Tiny (7B-A1B, bf16, local):
intended use case: low latency agentic work & function calling

Worst STEM results I have recorded for this size
Abysmal capability in every field, around Granite 3.0-8B
inference on my 4090 was nice at 80tok/s (~15 on 7950X3D CPU only)
can generate text

Granite-4.0-H-Small (32B-A9B, Q4_K_M, local):
intended use case: Workhorse model for key enterprise tasks like RAG and agents

very weak capability for size, around Gemma 3n E4B level
inference 60tok/s was good (~9 on CPU)
actually somewhat usable for very easy generic tasks

I didn't bother testing the even smaller models.
Overall, testing these models invoked nostalgic feelings. While reading their responses, I was reminded of the very early days of my testing.
Other than nice inference, they feel and behave like ancient models. Very concise, low attention to detail and easily susceptible to all types of even 2023-era jailbreaks.
I cannot see any use for these, outside of hyper-niche RAG implementations, but even so, I doubt there aren't far better models out there. YMMV.

GLM-4.6

2025-10-02

Tested GLM-4.6:

Hybrid thinker MoE, improved token efficiency, reasoning, context window

Default/Thinking:

More efficient reasoning (7/3 split), token use -19%, turning it into an average verbose reasoning model
Despite this, improvements in STEM, general reasoning, and coding.
Sycophancy can lead to hallucinations, noticed riddle overfit
No improvements in chess

Nonthinking:

More efficient token use, -22%
Capability remained around 4.5 nonthinking, while being more efficient

Overall, far better CoT scaling than on 4.5, and less token spam without loss of capability is appreciated.
This is quite a substantial improvement for a 2 month gap, but YMMV.

Claude Sonnet 4.5

2025-09-30

Tested Claude Sonnet 4.5:
Sonnet update, focusing on coding, agentic workflows, and tool calling.

Token use was up +15%
non-agentic coding was slightly more consistent
Noticeable weaker performance in STEM and some common sense & creative tasks
Similar to Opus, increased safety leading to false-positive policy violations

I found generated frontend often quite generic and creatively a step back when directly compared to 3.7 and Sonnet 4.
Encountered multiple instances of false-positive policy violations during my standard demo page creation of UI and RP.

Vision remains a major family weakness, scoring at Gemma 3 4/12B level, placing ~30 ranks below SOTA models.
Chess placements were very weak, below Sonnet 4 and currently at 52% accuracy ~700 Elo, Llama 3.3 70B level.

I am disappointed in the general ability of this model as it's either mostly samey or a step back.
Overall, this might be a great update for agentic coding, which is not my use- nor test case. Thus YMMV.

GPT-5 Codex

2025-09-28

Tested GPT-5 Codex:
Coding specific GPT-5 version with emphasis on agentic coding.

used 43% less tokens than gpt-5 in my general purpose benchmark (73% tokens spent on reasoning)
roughly same performance as gpt-5, though stem/math performance was weaker
saw no improvements in non-agentic coding tasks
vision testing scored between gpt-5 and gpt-5-chat, thus for vision tasks gpt-5 might be preferable

In Chess testing it generated ~18k tokens per move, though sometimes racking up 50-70k reasoning tokens. It excelled in reasoning chess, placing 10-0-0 with 96% avg. accuracy, vastly outperforming gpt-5 and beating the strongest competition; currently #1 with a substantial 150 Elo lead.

Thus, some of its coding optimizations might be surprisingly beneficial in seemingly unrelated areas.
At same API pricing as gpt-5, the biggest draw could be the decreased token use, though that heavily depends on the use case and exact environment.
Obviously YMMV.

Grok-4-fast-non-reasoning

Grok-4-fast-reasoning

2025-09-21

Tested Grok-4-fast:
Cost-efficient xAI model at $0.20/$0.50 mTok, in 2 variants.

-non-reasoning:

very cheap, benchprice around 2.5 Flash Lite or 4o-mini
not the smartest, performed around oss-20b / 2.5 flash lite / gpt-5 nano level
strength clearly in lighter tasks with great price & speed efficiency

-reasoning:

2/3 tokens were used for reasoning (unprovided), +180% token use, 60% less tokens than Grok-4
reasoning significantly improved model logic & output quality in most cases
Similar to Grok-4, sometimes spends thousands of tokens on calculations, only providing a single number with no context.
Repeatedly refused several of my utility tasks, false-identifying them as jailbreak attempts
benchprice was around grok-3 mini or DeepSeek V3 0324
performed much stronger than non-reasoning, around GLM-4.5 or 2.5 Flash thinking models

Chess was obviously not as strong as Grok-4 (but unfathomably cheaper); still placed in top 10, around gpt-5-nano level, with 77% accuracy.

Overall, the gains from reasoning here are very worthwhile.
Some weird quirks with the refusals, which is very unlike xAI, but good model in terms of bang/buck and much higher usability than the unfathomably verbose & expensive Grok-4.
Grok-3 mini still stands out as a potentially better alternative though, but YMMV.

Qwen3-Coder-Flash

Qwen3-Coder-Plus

2025-09-20

Tested Qwen3-Coder-Flash & Qwen3-Coder-Plus :
API only proprietary, non-thinking Qwen3 coders.

Flash:

Performed around Qwen3-Coder-32B-Instruct level
tiered Pricing is too high for this caliber model

Plus:

Performed around Qwen3-Coder-480B-A35B in coding specific tasks, otherwise weaker
Mid-tier coder being charged like SOTA
It performed decently but beaten by a plethora of alternatives.

I don't have much to add since these models are completely uninteresting to me.
Being proprietary and not performing peak, combined with poor bang/buck makes them instantly skippable, but of course YMMV.

Seed-OSS-36B

2025-09-17

Tested Seed-OSS-36B (local, Q4_K_M):
Long-cot thinker, roughly around Qwen3-Thinking verbosity

Similar performance to models such as Qwen3-32B Thinking, Phi-4-reasoning-plus, Qwen3-30B-A3B-Thinking-2507
Decent showing in all areas I test, though not excelling at any
For a reasoning model, high general utility for a variety of multipurpose tasks
Good instruction following.

Chess probing, considering size, was comparatively good, better than comparable models, around kimi-k2-0905 level

On a 4090/24GB VRAM, this quant was slightly too large, as I couldn't fit all layers, with context, so a smaller Q4 would have been preferable in retrospect.

Overall, I think this is a decent model that deserves a look in particular if you want to try a non-qwen model. YMMV.

Qwen3-Next-80B-A3B-Instruct

Qwen3-Next-80B-A3B-Thinking

2025-09-13

Tested Qwen3-Next-80B-A3B:
Two distinct MoE models, with and without thinking.

Instruct:
Non-thinking, same verbosity as Qwen3-235B-A22B-Instruct-2507 (2.3x)
Good model for most tasks, around o3-mini on non-coding, some weaknesses in instruction following and creative work

Thinking:
Very verbose thinker (80% reasoning), ~4.5x token use of Instruct, near Grok-4 level.
Despite using massively more tokens, not universally better results. Reasoning tokens were wasted reflecting on unwanted risk analysis.
It is worth noting that Alibabas own API offering ($6 output multiplied by reasoning at time of writing) should be avoided as it's massively overpriced and resulted in GPT-5 level costs.

In Chess probing thinking averaged a slightly higher accuracy (~+7%), though both variants played poorly (620-740 Elo).
Interestingly, this application revealed the massive drawback and diminishing returns of Thinking.
Here is a 77 move reasoning match between the 2 variants, that ended in a draw. I'll let the numbers speak for itself:

	avg. tok/move	avg. move latency	game cost
Qwen3-Next-80B-A3B-Instruct	223	1.2s	$0.025
Qwen3-Next-80B-A3B-Thinking	9652 (43.3x)	68.2s (56.8x)	$0.743 (30.3x)

Overall, they performed around Llama 3.3 Nemotron Super 49B or near GLM-4.5 level.
I think that the Instruct model is the vastly superior choice for many users and use cases, but obviously YMMV.

Qwen3-Max

2025-09-10

Tested Qwen3-Max:
Alibaba's API-only Non-thinking proprietary model

Much smarter than Qwen2.5-Max, though quite yappy, token use was up +26%
High common sense, good overall reasoning
Good coder. though frontend and UX can be hit or miss; also added various example pages
Generic utility is mediocre, flaws in instruction following and censorship
Initial Chess performance was slightly stronger, though still very weak (750 Elo, 58% accuracy, 1 win in 10 matches, lost to 2.5-max)

Overall, this model performed slightly better than Qwen3-235B-A22B-Instruct-2507, though at much higher cost (~8x).
Though its larger size is noticeable in nuanced tasks and common sense scenarios.
Compared to Kimi-K2, its style isn't really my cup of tea, though that's subjective and YMMV!

Kimi-K2-Instruct-0905

2025-09-05

Tested Kimi-K2-Instruct-09-05 update:

Bit more wordy, token use was up +13%.
Overall Performance same as before, within my environment
Initial Chess performance, it slightly edged out against 0711, though still within variance

In other smaller random tests, it performed largely as before.

Thus, targeted improvements at tool calling and context have no noticeable impact within the majority of my testing scenarios. YMMV.

Claude Opus 4.1 Thinking

2025-08-30

Tested Claude Opus 4.1 Thinking:
Performed as expected, I am not gonna repeat everything I already laid out in detail in the Claude 4 and 4.1 impressions.

Compared to 4 Thinking +10% token use.
Reasoning chain still bad bang for buck, the model doesn't require it, and you're paying double price for either no or minuscule improvement in rare edge cases.

Other than minor flaws, easily fixed with follow-up, zero mistakes in anything coding I throw at models during testing.
Still great model/family, performance in line with expected. Reasoning unneeded.

Grok Code Fast 1

2025-08-28

Tested Grok Code Fast 1:
Reasoning model optimized for agentic coding. Put through my general use case testing regardless.
2/3 tokens were used for reasoning. Raw reasoning is not provided, only summary. This makes real delivered mtok quite high (almost 8x grok-3 mini)

While not its intended use, completely usable for generic use.

Around GLM-4.5 Air Thinking capability
Mediocre in everything, decent -albeit not great- coding ability
Achieved same results as the much cheaper grok-3 mini in my coding related tasks
Frontend results are okay, visually a bit like more dated models

Initial chess probing placed it at ~940 Elo / ~67% accuracy - around gpt-oss-20b or gpt-4.1-mini level

I can't comment on it's agentic coding as it's neither something I utilize nor test for.
Clearly this model isn't the right use case for me (manual, non-agentic coding), thus especially in this case: YMMV!

DeepSeek V3.1

2025-08-22

Tested DeepSeek V3.1:
Hybrid model, that supports light thinking

Non-Thinking:
Same verbosity as V3 0324
Comparatively, smarter overall, but performed noticeably weaker in coding tasks

Thinking:
+125% token use. 64% of tokens were spent on reasoning.
This is very light reasoning, ~45% less verbosity than R1 0528
Compared to non-thinking, the thinking did very little if anything to improve final response quality. In fact, it was mostly even or slightly worse on some tasks.
During evaluation, it reminded me a lot of Sonnet 4 thinking in terms of reasoning token benefits.
Thus, enabling thinking proved highly ineffective in the totality of my testing.

Chess performance remained poor (~650 starting Elo), around V3 level.

Overall, compared to V3 0324 this is a small upgrade, except for (non-tool) coding where it's a noticeable downgrade imo. (example demo pages available)
Compared to R1 0528, the model lacks behind severely in general intelligence and is not a replacement.

Imo, for general use case, nonthinking DeepSeek V3.1 is a good option.
Overall, I was rather disappointed with the hybrid performance, so I'm not sure it's the right approach - but YMMV

ERNIE-4.5-21B-A3B

2025-08-18

Tested ERNIE-4.5-21B-A3B (local, Q6_K):
Small, non-thinking Baidu MoE model.

Just like the larger model, unintelligent overall
Not very useful except for light math tasks
Fast inference locally (~160tok/s on 4090)

This one is another miss imo, YMMV.

Jamba 1.7

2025-08-15

Tested Jamba 1.7 (Mini & Large):

Large (399B) performs like a modern 5B model
Mini (52B) performs sub Mixtral-8x7b-Instruct-v0.1 (even smaller, released in 2023)

Still don't support basic response format (such as structured table).
Strangely, the older 1.5 models performed better.

API pricing is outdated.
They feel and perform absolutely ancient, as if we traveled back in time 1.5+ years ago.
Consequently, I see no use case for this family. YMMV.

Mistral Medium 3.1

2025-08-14

Tested Mistral Medium 3.1 (aka Mistral-Medium-2508):

Compared to Medium 3, used 11% more tokens

Improvements in code, and reasoning
However, did worse in my STEM segment. And first time reproducible refusal.
Chess skill still poor ~600 Elo
same total capability in my environment

Price/performance is samey, still not state of the art performance at 8x lower cost.
Uninteresting release imho, in particular because it's API only with no price changes.
YMMV.

Qwen3-4B-Instruct-2507

Qwen3-4B-Thinking-2507

2025-08-11

Tested Qwen3-4B-2507 (local, bf16):

Instruct:
"Non-thinker", though it uses the methods of 2507 models = simply not wrapping it thoughts into tags
+84% token use compared to Qwen3-4B
Better at math, instruction following, and programming compared to initial 4B
Around Qwen3-8B level performance

Thinking:
Very verbose thinker (10.64x verbosity, Grok-4 level).
+25% token use compared to Qwen3-4B thinking
700% token of Qwen3-4b non-thinking.
Saw improvements in capability across all fields.
Censorship was quite high, making this model very restricted without jailbreaks or context injection.
Very impressive capability; around nonthinking GLM-4.5 Air (100B/12B active).

Overall, even though these models are very verbose, the performance for this size (4B!) is currently unmatched, though YMMV!

GPT-5

GPT-5 Chat

GPT-5 Mini

GPT-5 Nano

2025-08-09

Sorry for the delay, I spent the past 2 days doing almost nothing but testing, running numbers, retesting, and verifying. My current benchmarking suite got quite comprehensive and can overwhelm me on multi-model releases.

Tested the GPT-5 series:

Nano:

Ultra verbose thinker, to the point where it was well over 5 times slower in completing responses than GPT-5 Chat or twice as slow as GPT-5 mini.
In terms of capability, it performed around Gemini 2.5 Flash Lite (thinking OFF) level, while comparatively being 4x as expensive and 11x slower

It would either need to be massively reduced in reasoning (tanks capability) or cut in price by at least 10x to be remotely viable.

Mini:

Much more concise thinker

Very solid small model, around o4-mini-high or Llama 3 405B capability.

Chat:
This is the model used in ChatGPT. Non-thinking, though documentation claims it has reasoning token support, but I wasn't able to get it to produce any reasoning in any responses

Very chatty and slightly more permissive. Default personality ends every single reply with a forced follow up question. I find its general style a bit annoying (subjective/unrated).
In code tasks, showed some laziness and delivered sometimes very plain, unimpressive results (see examples in my demo pages)
Around Claude 3.7 Sonnet capability. Weaker at math tasks due to missing reasoning chains. Very high general utility for generic use (akin to 4o)
Strong performance at continuation chess (not as strong as GPT-4.5 ~~but better than other OpenAI models~~*)

GPT-5:
Note, that I do not have API access due to heavy restrictions, thus all testing had to be done manually 1by1. For more info, see my explanations on the o3 impression.

Reasonable thought lengths (default/medium), with same reasoning allocation as ~Grok-4
Performed around Gemini 2.5 Pro level, although with slightly less common sense but higher STEM performance.
Much more enthusiastic during coding, though backend and consistency is noticeably beaten by Claude 4.
Improvements in my vision bench, dethroning Gemini 2.5
Overall it performed strong; obviously not for general use- I see strongest use case in academia or STEM.
I prefer its more neutral, non-cringe style, (akin to GPT-4 Turbo) a lot over the Chat model.

Here are some stats for comparison:

Model	Cost	Time (h:m)	Verbosity	Think	Score	Level	Vision	Chess
GPT-5 Nano	$0.36	2:05	10.26x	88%	47.9% (#84)	Gemini 2.5 Flash Lite	#17	1121 (#11)
GPT-5 Mini	$0.73	1:08	4.09x	68%	65.4% (#28)	o4-mini-high	#7	1189 (#7)
GPT-5 Chat	$1.28	0:22	1.41x	0%	71.9% (#12)	Claude 3.7 Sonnet	#2	1279 (#4)
GPT-5	$5.34	2:51	6.03x	79%	76.4% (#7)	Gemini 2.5 Pro	#1	n/a

This is everything I got right now, and obviously as always - YMMV!

gpt-oss-20b

gpt-oss-120b

2025-08-06

Tested GPT-OSS:

We're going to do a very powerful open source model [...] better than any current open source model out there.

120B (5.1B active):
concise thinker, akin to o1-mini verbosity, 3/5 reasoning split

around 4.1-mini & GLM-4.5 Air capability
okay for STEM/math and light programming tasks
underwhelming performance, a bit smarter than 20B
poor style, very censored
weak chess player, initial performance around gemma 2 27B level, ~56% accuracy

20B (3.6B active):
concise thinker, though longer thoughts, 5/3 reasoning split

around Llama-3.1-Nemotron-51B & 4o-mini capability
okay for STEM, math, and easy tasks
almost as smart as the 120B, though more cooperative and fun to use
okay chess player, initial performance around gpt-4.1-mini ~69% Accuracy

Both models are very fast to inference but underwhelming open models that get beat by a plethora of competing models (e.g. Llama-3.3-Nemotron-Super-49B, Qwen3-30B-A3B, GLM-4.5, etc.)
The 120B is obsolete on arrival, in terms of capability and behaviour. Between the two, the 20B is more interesting imo. Might be okay for fast math workloads, though that's outside my use case.
Weak models imo, but YMMV!

Claude Opus 4.1

2025-08-05

Tested Claude Opus 4.1 (default/non-thinking, 20250805):

Concise model, pricey but not overly so when accounting for token use (~30% more expensive than Gemini 2.5 Pro, ~50% cheaper than Grok-4 in testing).

Saw no improvements in logic/common sense
Saw no improvements in vision (in fact performed worse on 1 task)
Slight improvements in STEM & math consistency
Slight improvements in code, debugging
More consistent refusals and less willing to tackle risque topics (akin to Opus 4 thinking after risk evals in thought chains)
Initial Chess performance was same and remained underwhelming for a SOTA ~830 Elo / ~64% accuracy

Fantastic model with some good minor improvements. If it wasn't for refusing several of my tasks, this model comes close to exhausting my main classic benchmark.
Risk aversion reduces its utility for many creative tasks.
Bar none best coder, very cooperative and easy to work with.

I have also published demo pages as well as added its responses to various small experiments.
I like Opus 4.1 a lot, other than the censorship / false-positive refusals. Overall, fantastic model.

XBai-o4

2025-08-04

Tested XBai-o4 (Q4_K_M, local):
Verbose 32B reasoning model, with 7.73x token use ~around GLM-4.5 / Qwen3 Thinking verbosity
Marketing claims it outperforms o3-mini and even Claude Opus 4 (spoiler: it doesn't)

Math was solid but everything else showed weaker performance than current competing models
Performed around QwQ-32B / DeepSeek R1 0528 Qwen3-8B / Magistral Small 2506 level

This model enters a size segment that is completely saturated and doesn't outperform alternatives.
Utterly uninteresting model to me. Also I cannot respect outlandish marketing claims.
YMMV, though..

Qwen3-Coder-30B-A3B-Instruct

Qwen3-30B-A3B-Thinking-2507

2025-08-02

Tested Qwen3-Coder-30B-A3B-Instruct (Q4_K_M, local):
As expected, did worse in non-code related tasks, but to my surprise actually even scored slightly lower than instruct in my tech segment. Naturally, my testing isn't coding focused, but in code tasks it misunderstood the objective that the non-coder didn't misunderstand, lowering the outcome usefulness.
However, I don't use nor test agentic workflows, nor IDE integrations, so my test setting might be the wrong environment for this model type.

Tested Qwen3-30B-A3B-Thinking-2507 (Q4_K_M, local):
x3.2 token of instruct counterpart, verbosity was identical to Qwen3-235B-A22B-Thinking-2507

Was better at following instructions or sticking to instructions
Coding results were weaker, overall intelligence of the model did not benefit from extra token chains
Very censored models in testing, akin to older Claude models

Overall, didn't like either of these two, for my use cases.
I'd rather stick to 30B-A3B-Instruct-2507, which strikes a much better balance of inference bang for buck, and general intelligence.

That's just me though, and YMMV!

Qwen3-30B-A3B-Instruct-2507

2025-07-30

Tested Qwen3-30B-A3B-Instruct-2507 (Q4_K_M, local)
Nonthinker, though 1.56x tokens of Qwen3-30B-A3B (non-thinking).
Very smart model for size, punches well above its weight.
Not great at following precise instructions (e.g. formatting adherence)
Hyper fast at 130+ tok/s on my 4090.

Top model, daily driver candidate for me, but depending on use case, YMMV!

GLM-4.5

GLM-4.5-Air

2025-07-30

Tested GLM-4.5 (358B - 32B active):

Very verbose hybrid thinker MoE, thinking can be disabled by passing "enable_thinking": False

Default/Thinking:
Verbosity on R1 0528 level, overall performed around Llama 3.3 70B level
Price/Performance not great since benchprice was ~Claude Sonnet 4 level but didn't deliver on the same level
Weird CoT-scaling, sometimes performs dumber than non-thinker, overall weak tokens/performance ratio
Encountered Overthinking issues, imprecision especially in my STEM segment.
Chess play was weak, ~qwen3-235b-a22b level

Nonthinking:
64% less tokens, overall performance same level.
Performed almost at thinking level overall, far better value

Air: (106B - 12B active)
Default/Thinking:
Around Qwen2.5 max capability, bench price almost same as 4.5 (non-thinking).
CoT-tokens scaled better than the larger model
Chess play was weak, ~gpt4.1-nano level

Nonthinking:
69% less tokens, around Claude-3.5 Haiku capability.
Same use cases as Llama 4 Scout, though much better at coding

Overall
The family is a bit weird and inconsistent in terms of performance and token use.
If you want a concise model, Kimi-K2 is a better option.
If you don't care about token spam, Qwen3-235B-A22B variant models are smarter.

Style/vibe (not rated) isn't my personal taste, though I am a bit tired of excessive token spam.

As always, and in particular due to varying performance between tasks, YMMV!

Llama-3.3-Nemotron-Super-49B-v1.5

2025-07-28

Tested Llama-3.3-Nemotron-Super-49B-v1.5 (local, Q4_K_M):
Just like v1, a hybrid thinker, though it now reasons by default and Reasoning can be turned off by using /no_think in system prompt.

Reasoning ON (default):

very verbose, 2x tokens of v1 thinking, 6.7x more tokens than v1.5 no_think
performance gains in most areas
stronger at specialized tasks (e.g. maths, coding..) than v1
more risk analysis lead to more censored responses

Reasoning OFF (/no_think):

token savings of ~85%
around v1 level, but more censored

For consumer grade hardware like mine (4090), the default mode is simply not feasible for use due to extremely long thought chains. Some single responses took ~45 minutes to generate! I got around 4.6 tok/s with partial offloading, which is far too slow to support 10k+ thought chains.

Unfortunately, depending on the task, reasoning off is not necessarily an upgrade to v1.
Model is still good, and if you can inference the thinking model it's probably an upgrade, however for my setup v1.5 is not an upgrade.

Try it out on your own hardware and use cases, because as always: YMMV!

Qwen3-235B-A22B-Thinking-2507

2025-07-26

Tested Qwen3-235B-A22B-Thinking-2507 (API, fp8, Alibaba recommended params)

Due to inconsistent and lower than expected performance, multiple retests were conducted (including Alibaba's own offerings).

As a verbose reasoning model, it averaged 70/30 reasoning split and used

x0.85 tokens of Qwen3-235B-A22B Thinking
x3.6 tokens of Qwen3-235B-A22B-Instruct-2507

Unlike initial Qwen3 family, thinking was not beneficial in censorship testing.
It performed noticeably worse in my STEM tasks.
Other areas were fine.

In chess, while drawing in direct competition to Instruct, comparatively it was very inefficient at ⌀7.5k tokens/move (40x!).
Though games took ages, results were slightly better in terms of achieved Elo & accuracy; around command-a / qwq-32b level.

Current pricing is all over the place ranging from $0.13-0.70 input and $0.30-8.40 output.
Regardless, there is a hefty premium on the model, plus added verbosity, somewhere in the ballpark of ~15x more expensive than non-thinking Qwen3-235B-A22B-Instruct-2507 during testing.

This model is weird in that it wasn't a raw upgrade to previous models nor non-thinking counterparts. To me, it's somewhat of a dud.

Give it a go and do your own testing, because - YMMV!

Gemini 2.5 Flash Lite

2025-07-24

Stable release. This was a bit of wasted time, as it performed identical to the 06-17 preview I tested a bit over a month ago. Some reply variance as is expected but same model.
Current API speed actually lower than last month (far below the 200 tok/s). Otherwise, same capability: Fast small model for simple generic tasks.

Qwen3-Coder-480B-A35B

2025-07-23

Tested Qwen3-Coder-480B-A35B:

As expected from a coding focused model - most concise Qwen3 model

46 % less tokens than DeepSeek V3 0324
While competent for general use, too, performed best in STEM (math) and coding obviously.

During creation of demo pages and further probing, it showcased several obvious weaknesses such as producing buggy collision, glaring UI oversights in multiple projects, and in general required error correction that was not necessary on models such as DeepSeek V3 0324.

For a massive, coding specialized model I personally was not convinced by its coding results, combined with the quite poor price/performance on current API offerings.

However, as always - YMMV!

Qwen3-235B-A22B-2507

2025-07-23

Tested Qwen3-235B-A22B-2507:

Not a thinking model, though it can contain similar chain-of-thought in its responses without the thought tags.

75% less tokens than Qwen3-235B-A22B thinking.
45% more tokens than Qwen3-235B-A22B non-thinking.

While I saw no notable differences in coding, STEM or censorship, it performed slightly better in my reasoning segment.
Chess testing was on a very similar low level (~Claude 3.7 Sonnet), though in mirror matches it lost to it's thinking counterpart (low sample size for now).

I personally like getting samey performance without the ultra verbosity, thus it's an upgrade in my book - but YMMV.

ERNIE-4.5-300B-A47B

2025-07-14

Tested ERNIE-4.5-300B-A47B:
Tried testing this 2 weeks ago, but API was riddled with issues, thus another attempt:

Non-thinking Baidu MoE model that is quite verbose (almost 2x token use of Kimi-K2).

Very dry model roughly on par with Qwen3-32B NOTHINK or the original Llama-3-70b-Instruct
Not very smart, subpar results in all tested fields
Vibe/Style (unscored) is really poor imo
very restricted model with high censorship

Chess was already tested 2 weeks ago, ~600 Elo with 48% accuracy - on gpt-4.1 nano level

This model was a lot weaker than I anticipated. Combined with the lame style, I can't think of any use case where I would want to use it.
However, as always: YMMV.

Kimi-K2-Instruct

2025-07-13

Tested Kimi-K2-Instruct:
Very large non-reasoning MoE (1T params, 32B active); in fact so large I had to update my UI slider.
It's a very concise model (11% token use of Qwen3-235B-A22B), which helps the very slow inference offerings (around 13 tok/s at time of testing)

Competent in all tested areas, smartest open non-reasoning model
Good prose and general style/vibes (unscored), although quite risk-averse
Not the strongest at debugging but usually good frontend results (some demo pages added here)
Around Grok-3 & Qwen3-235B-A22B (thinking on) performance

Chess probing wasn't noteworthy; around Llama 3.3 70B level with 55% move accuracy.

This one is definitely worth checking out imo but YMMV.

Grok-4

2025-07-10

Tested Grok-4:
I have run and published full testing on everything I have, including the core benchmark, chess, vision, token rates, demo pages, small experiments, etc.

Very verbose reasoning model, much more so than Grok-3 mini-high, around QwQ level with a 4/1 reasoning split. The reasoning tokens are hidden.

Smarter than Grok-3, though coding and in particular web-design was weaker in places
On multiple math tasks and repeatably, provided just a single number in its response with zero explanations, despite using 20k+ tokens on thought chain
Very good at following instructions and high general utility
Among the least censored models I have tested
Vision performance was decent (not as good as Gemini 2.5 but on par with o3).

Chess:
#1 in reasoning mode (full information), beating the highest rated models (o4-mini/codex-mini)
#3 in continuation mode (raw movetext), losing to GPT-4.5 and 3.5 Turbo Instruct
Currently at ~90% move accuracy, though low amount of games - placement and Elo have yet to settle in.

spent a ton of tokens even on opening book moves, averaging a cost of $0.27 per move!

The model was among the most expensive to test, with a bench price exceeding Opus 4 Thinking and hovering around GPT-4.5 level! Overall, a nice additional SOTA model, although the relatively lackluster code performance was disappointing to me.
But as always - YMMV!

Hunyuan-A13B Instruct

2025-07-10

Tested Hunyuan-A13B Instruct (local, Q4_K_M):
This is a Tencent 80B/13B MoE model. By default, this model reasons on every input. This can be disabled manually by setting enable_thinking=False in system prompt or prepending /no_think to queries. (using /think supposedly forces the model to reason, however I encountered no scenario where this was needed)

Default (thinking):
With 5.93x verbosity is akin to original DeepSeek-R1, though with a slightly smaller 75/25 reasoning split.

core intelligence seems mediocre but lacks in common sense scenarios, and attention to detail
very dry in creative tasks
weak programmer
around Qwen3-4B (Thinking) or Qwen2.5-14B (non-thinker) capability

Non-thinking (enable_thinking=False):

token savings of ~80%
output generally a bit weaker, in particular in terms of instruction following, which wasn't strong to begin with
lack of thinking was more likely to cause hard refusals
around Gemma2 9B capability

On my system (24GB VRAM + 64GB DDR5) inference was quite slow at 9tok/s.
I had higher hopes for this model, but YMMV.

Llama-3.1-Nemotron-Ultra-253B-v1

2025-07-08

Tested Llama-3.1-Nemotron-Ultra-253B-v1:

Too large for my machine, thus utilized Nebius AI Studio API, which claims to serve fp8.

This model has 2 modes, the reasoning mode (enabled by using detailed thinking on in system prompt), and the non-reasoning mode (detailed thinking off).

Reasoning OFF:

Good at STEM, math
Subpar for size in most everything else
Around DeepSeek v2.5 level capability

Reasoning ON:

+240% token usage overall, with a 82/18 reason split the final replies were 40% shorter
Improvements in general Logic and math
Worse for non-English queries, responding in English despite query being a different language
Multiple times, falsely claimed I asked for python code when asking for non-python code
Occasional hallucinations in thought chains (such as real-time visiting links)
Mode changes behaviour; far more likely to produce refusals.

Chess performance, like most llama models, was poor with an avg. accuracy of 52%, similar level to DeepSeek V3. Lost twice and drew the rest during testing. (all match replays available, as always)

Other than for math tasks, this model is surprisingly weak, considering its size. The 3.3-Nemotron-Super-49B I tested locally at only Q4_K_M performed either stronger or equal on most tasks. This would make more sense for a large model running at very low precision, but I have to work with the information I am given.

Bang for buck, while average overall, is poor for a llama variant model.

I personally won't utilize this model, but maybe you can run it yourself and achieve a stronger implementation, thus: YMMV!

Gemma 3n E4B it

2025-06-28

Tested Gemma 3n E4B it (local, fp16):

small multimodal local model, though I tested text only (due to lacking llama.cpp implementation)
capability falls between 4B & 9B Gemma models
I saw no hard refusals, though disclaimers and nagging that is present in whole family remains
It's a nice fast, small multipurpose model that can be used for easy tasks in anything except code

Not exactly required for my use cases, but a nice alternative small model. YMMV.

o3-2025-04-16

2025-06-28

Tested o3-2025-04-16:

Why did it take me 2.5 months to finally test and add this model? Well..

As of today, API access is still gated behind an organization verification, which consists of an 3rd party ID Check mandating consent to processing biometric information (access to camera, ID, facial scans). Needless to say, this is completely ridiculous and nothing I would ever consent to for any model.

2 weeks ago the raw mTok pricing was reduced by 80% from $10/40 to $2/8. I did some lower volume manual testing previously, and the old pricing was not economically feasible for any type of usage (e.g., making a singular chess move cost me almost half a dollar).

That being said, with barriers and increased time-commitment, I did finally manually test the entirety of the model in my test-suit (OR, unlimited maxtoken reasoning):

Verbosity was very low, with 3.22x of a non-thinker and a 2/1 reasoning split.
Capability was rather lackluster and roughly on par with o1-2024-12-17
Logical reasoning was just fine; I noticed it dismissing or glossing over crucial key details in multiple instances
STEM (in particular math) was good, but I did notice critical flaws in legal advice, such as naming correct & relevant court judgment, but concluding the opposite of its ruling.
In my tech/coding segment, it performed weaker than a plethora of other models; frontend design was particularly unimpressive with weak UI and incorporating unasked images consisting of broken imgur links
It showcased weak intent-recognition and would execute literally, which makes iterative workflows more painful than using a cooperative model (e.g. Claude).

In my vision testing, it scored below Gemini 2.5, o4-mini and GPT-4.5, on par with Gemini 2.5 Flash Lite Preview.

Chess is not possible for me to test in bulk without proper API access, however I was able to painfully conduct 2 matches in continuation manually, where it lost to the strongest opponents in GPT-3.5 Turbo Instruct & GPT-4.5 Preview, averaging 83.5% move accuracy

Overall, o3 performed fine but unimpressive. However, it's also reasonably affordable now with my benchprice hovering around o4-mini-high level.

It's a shame the API access is so restricted and invasive, however no model, especially not this one, warrants it imo.

However, that's just my own testing and 2cents, and always YMMV!

Gemini 2.5 Flash

2025-06-22

Tested Gemini 2.5 Flash (stable release):

Reasoning off:

Roughly the same verbosity as 2.5 Flash Lite, 2.35x verbosity of a concise non-thinker
About 2.0 Flash capability
Quite expensive for this class of model, there are better bang/buck contenders (e.g. 4.1-mini or older flash models)

Reasoning on:

Reasonably concise thinker (6.37x verbosity), 270% tok use of non-thinking variant, with a 2/3 reasoning split, meaning the final output was slightly more concise than non-reasoning
Gains were observed in Reasoning/Logic tasks and STEM segments.
I saw no significant improvements in tech/code, instruct following or overall utility when reasoning is enabled
Raw thoughts are hidden and only a step summary is provided (useless to me)
Quite expensive for this class of model, raw bench price was the same as 4o-latest, and far more expensive as a similar-level grok-3-mini

Both variants were overall the weakest of all 2.5 Flash snapshots tested (04-17 being the strongest).
Vision was retested and scored identical to older snapshots.

I feel like the price/performance here isn't quite right, so I personally won't be utilizing this model at this pricepoint.

However, and as always - YMMV!

Mistral Small 3.2

2025-06-21

Tested Mistral Small 3.2 24B Instruct 2506 (local, Q6_K):
This is a fine-tune of Small 3.1 2503, and as expected, overall performs in the same realm as its base model.

more verbose (+18% tokens)
noticed slightly lower common sense, was more likely to approach logic problems in a mathematical manner
saw minor improvements in technical fields such as STEM & Code
acted slightly more risque-averse
saw no improvements in instruction following within my test-suite (including side projects, e.g. chess move syntax adherence)
Vision testing yielded an identical score

Since I did not have issues with repetitive answers in my testing of the base models, I cannot make comments on claimed improvements in that area.
Overall, it's a fine-tune that has the same TOTAL capability with some shifts in behaviour, and personally I prefer 3.1, but depending on your own use case or encountered issues, obviously YMMV!

MiniMax-M1

2025-06-19

Tested MiniMax-M1:
At 456B too large to run local, and as a ultra-verbose reasoning model and slow inference speed via API, found this model to be unusable for any real work.
With 92/8 reasoning split, this model spent most of its time thinking, sometimes exhausting all 40k max tokens without giving a single reply token.

In terms of capability, I found it to be competent at my tech and coding tasks, while producing fairly average results in other areas; around Qwen2.5 Max level.

I place this model in the same category as Phi-4-reasoning-plus or, to an extend, Mistral Magistral, not really usable. But, YMMV!

Gemini 2.5 Pro

Gemini 2.5 Flash Lite Preview 06-17

2025-06-19

(Re-)Tested Gemini 2.5 Pro:

More akin to 03-25 than 05-06 in my testing, meaning less code-focused and better performance for general utility
Very good common sense (only beaten by Opus 4)
Hidden thought-chains on all platforms is understandable from a business standpoint, but a huge loss for average users, losing on the very valuable additional insights
With a ~6.44x token verbosity, and useless thought summaries, real cost for displayed tokens is quite high (more than 200% of Sonnet 4)
Out of the four 2.5 Pro snapshots I tested (Previews/Experimental), was the most censored one
Code was good, but I saw some outcome UI-, and verbose code commentary issues, which makes this less appealing to me as a coding model

Overall, generally just as strong in total, still a great SOTA model
As always, and depending on use case - YMMV!

Tested Gemini 2.5 Flash Lite (Preview 06-17):

verbosity at 2.25x samey as 2.5 Flash models, which means it's a bit yappy (twice the token use of 4.1-Nano)
hyper fast model (generally 200+ tok/s), which makes it great for bulkwork
Around DeepSeek V2.5/Qwen2.5 72B/4omini level capability, very versatile and good general utility, good instruction following
price/performance is good, but not great when compared to older models (1.5 flash 002, 2.0 flash lite, etc.)

Overall, found it to be quite competent, versatile and at fast inference a good option for simple general tasks.

Magistral Medium 2506

Magistral Small 2506

2025-06-11

Tested Mistral Magistral 2506:

I used the recommended/default settings and the included recommended chat template.

Magistral Small 2506 (local, Q6_K):

13x token use of non-thinking 2503
saw slight gains in logic tasks, and most noticeably STEM (in particular math)
General usability obviously decreases significantly, not a general utility model
saw no improvements in my coding segments

Magistral Medium 2506 (API only model):

8x token use of Mistral Medium 3 (which was already on the verbose side)
combined with the random price hike, was roughly 20x the bottom line price
Improvements were seen in reasoning, and coding problems
General usability is far lower than non-thinking Mistral Models
In the 2 reasoning chess matches against Mistral Medium (20x cheaper), it lost both times

These models are the 2nd most verbose I ever tested (x15.4 tok rate), only Phi-4-reasoning-plus produced more tokens.
I don't understand the premium upcharge on the thinking Medium model, as this also gets multiplied by the massive token use
The gains overall are minuscule and on areas outside of generic use. In terms of bang for buck, or inference for buck, very poor.
I did not enjoy reading the thought chains, they are quite mundane.

Overall, these models scores slightly higher purely on numbers, but are completely not comparable to general use models. I don't see any reason to use them for my personal use cases.

However, test them yourself because as always - YMMV!

dots.llm1.inst

2025-06-08

Tested dots.llm1.inst (142B MoE | 14B active):

Rednote open source non-thinking model, that utilized Chain of thoughts, totalling ~2x token verbosity overall.
I utilized the dots demo on huggingface (temp=0.7, top_p=0.8) - too large for my machine, no API yet.
It was relatively uncensored, though it still suffers from Chinese specific censoring & political propaganda.
In terms of formatting, I saw minor issues with Chinese characters (rarely), emoji loops, and overall subpar instruction following.
Code was fine on easier problems, frontend results are rather minimalistic though.

Overall, it's not a poor model, performance was around Llama-3.1-Nemotron-51B or Qwen2.5 72B level (though weaker coder).
Since it might be competing against Llama 4 Scout (109B, 17B active), I found it to be smarter in direct comparison but less versatile, and losing out in format-crucial workflows.

Overall, worth checking out locally if you happen to have a monster machine, and as always: YMMV!

DeepSeek-R1 0528-Qwen3-8B

2025-05-31

Tested DeepSeek-R1 0528-Qwen3-8B:

This took way longer than expected, I encountered many issues with local testing, ranging from degraded replies, inconsistent results, thought loops, and symptoms of minor brain damage in certain tasks.
I tried several quants (bf16) from unsloth, bartowski, lmstudio,.. and used recommended inference parameters (0.6 temp, 0.95 topp), template variations, along with high context (16k & 32k) with and without repeat penalties and limited response length, but no matter what combination I tried (and I ran a ton of tests) there were signs of degradation in every test. Instead of trashing my results and calling it a day I decided to instead test NovitaAI's API implementation as they seem to have gotten rid of problems I wasn't able to, thus:

API Results:

Very verbose, even more so than DeepSeek-R1 0528 and Qwen3 Thinking models, though not quite QwQ level. 81% tokens were used for reasoning.
Did extremely well in Reason & general Logic
Non-math STEM performance was weaker
Instruction following and prompt adherence was fairly bad
For code I found it annoying as it generated "solutions" that ignored instructions or dismissed restrictions.

While the results are overall fantastic for size (8B performing on ~60B level with brute force thought chains), I didn't vibe with this models utility and general usability, it feels like a model created for benchmarking, not for general use.

But maybe I am just annoyed with all those hours wasted on busted local testing..
Either way, as always: YMMV!

DeepSeek-R1 0528

2025-05-29

Tested DeepSeek-R1 0528:

As seems to be the trend with newer iterations, more verbose than R1 (+42% token usage, 76/24 reasoning/reply split)
Thus, despite low mTok, by pure token volume real bench cost a bit more than Sonnet 4.
I saw no notable improvements to reasoning or core model logic.
Biggest improvements seen were in math with no blunders across my STEM segment.
Tech was samey, with better visual frontend results but disappointing C++
Similarly to the V3 0324 update, I noticed significant improvements in frontend presentation.
In the 2 matches against it former version (these take forever!) I saw no chess improvements, despite costing ~48% more in inference.

Overall, around Claude Sonnet 4 Thinking level.
DeepSeek remains having the strongest open models, and this release increases the gap to alternatives from Qwen and Meta.

To me though, in practical application, the massive token use combined/multiplied with the very slow inference excludes this model from my candidate list for any real usage, within my use cases. It's fine for a few queries, but waiting on exponentially slower final outputs isn't worth it, in my case. (e.g. a single chess match takes hours to conclude).

However, that's just me and as always: YMMV!

Example front-end showcases improvements (identical prompt, identical settings, 0-shot - NOT part of my benchmark testing):
CSS Demo page R1 | CSS Demo page 0528
Steins;Gate Terminal R1 | Steins;Gate Terminal 0528
Benchtable R1 | Benchtable 0528
Mushroom platformer R1 | Mushroom platformer 0528
Village game R1 | Village game 0528

Sonnet 4 Thinking

Opus 4 Thinking

2025-05-24

Tested Claude 4 Thinking (budget 16k, though the max it ever used was 6k):

Sonnet 4 Thinking:

Overall output usage was merely ~2x compared to default (significantly reduced from the 7.44x I recorded for 3.7 in February), with a 50/50 reasoning split.
In most cases, final outputs were of same quality compared to non-thinking.
In certain reasoning and creative tasks it performed consistently worse than non-thinking (e.g. due to overthinking, pondering about reasons to not adhere to user query)
In scenarios where thought chains would be immensely helpful, e.g. precise calculations, the model simply assumed its own rounded numbers to be correct without any further verification in thought chains, leading to false results.

Thus, in my observation the Reasoning feels 'slapped on', and didn't improve performance
Often, it spent reason tokens engaging in self-reflection about potential risks or policy violations, thus introducing unwanted risk analysis that can lead to more conservative (and ultimately worse) responses. Weirdly, the affected tasks weren't even probing for censorship (thus not part of that category).

Opus 4 Thinking:

Only used 65% more tokens than without thinking, making it the most concise thinker I have ever tested (4/10 reason split)
Same observed risk analysis within thought chains, leading to more often refuse doing a harmless task than without thinking
Saw actual benefits though on very hard math or very hard coding problems

So from what I have seen, I will practically never enable thinking on Sonnet 4, unless I am interested in reading the chains.
Opus 4 can be worth a shot for very hard problems.
The base models are very strong and clearly not native reasoning models.

But this is just my own testing & opinion derived from what I observed during multi hour testing/comparing. Can obviously vary between use cases, so: YMMV!

Claude 4 Sonnet

Claude 4 Opus

2025-05-23

Tested Claude 4 (default/non-thinking, Opus & Sonnet, 20250514):

Ended up topping my ranks (#1 & #2)
Very high reason, logic and common sense
quite concise models (16% token use of reason models such as 2.5 Pro)
highly competent in most areas tested, though Opus had more slip ups in math related tasks
Great coders, but Sonnet is probably the better choice in most cases (bang 4 buck)
Noticed improvements in back-end tasks and debugging
Saw no improvements in Vision
Chess: competent opening moves, then blunder all pieces even in hugely winning positions (14 draws, 1 loss in 15 matches, with zero secured wins)

Opus in particular seems to have additional guardrails, enforced by API, as I received some usage policy violation warnings on harmless queries (e.g. my Steins;Gate demo pages). This issue was not present on Claude Sonnet 4.

I have also uploaded some demo pages onto my shared assets.
Pricing on Opus with little benefit in most scenarios means I won't be utilizing it much, though.
I'll check out performance with reasoning in the coming days, too.
Overall, impressive models. As always, YMMV!

Codex Mini

2025-05-16

Tested Codex Mini
Obviously not a general purpose model, but out of curiosity I like to test specialized models in my general environment regardless:

General capability around GPT-4.1 and between o3-mini <> o3-mini high
In the few (non-agentic) coding tasks I have, it performed on o3-mini high level
Overall token verbosity (x5.91) was around R1 level (slightly lower than o4-mini high), with a 3/1 split between reasoning and output tokens.
Real bottom line cost was a bit lower than o3-mini-high and a bit higher than o4-mini-high
Did well in my small Vision Test (in fact on top of OpenAI lineup, barely edging out against o4 Mini due to comparable weighted ratings), though still behind Google models

Overall vibes, it felt like an o4-mini model with code focused system instructions.
This model doesn't fit within my personal use case/work flow, so obviously take my findings with a grain of salt, and as always - YMMV!

Mistral Medium 3

2025-05-08

Tested Mistral Medium 3

Non-reasoning model, but baked in chain of thoughts, resulted in overall x2.08 token verbosity.
Supports basic vision (but quite weak, similar to Pixtral 12B in my vision bench)
Capability was quite mediocre, placing it between Mistral Large 1 & 2, similar level as Gemini 2.0 Flash or 4.1 Mini
Bang for buck is meh, cost efficiency is lower than it's competing field

Overall, found this model fairly average, definitely not "SOTA performance at 8X lower cost" as claimed in their marketing.
But of course, as always -YMMV!

Gemini 2.5 Pro Preview 05-06

2025-05-06

Checked out the new Gemini 2.5 Pro Preview 05-06 update (prev 03-25):

Did slightly worse at my reasoning segment (particularly and reproducible the same 3 tasks), same STEM, slightly improved instr. follow, tech was better on 1 issue (but worse on 1 issue compared to Exp).
Overall, same overall capability in my environment, shifted more towards coding, as the blog post suggests.
Token use (and thus price) was up +17% (+11% on non-reason and +21% on reason tokens).

Example front-end showcases comparisons (identical prompt, identical settings, 0-shot - NOT part of my benchmark testing):
CSS Demo page exp/03-25 | CSS Demo page 05-06 :thumbsup:
Steins;Gate Terminal exp/03-25 | Steins;Gate Terminal 05-06
Benchtable exp/03-25 | Benchtable 05-06
Mushroom platformer exp/03-25 | Mushroom platformer 05-06
Village game 03-25 | Village game 05-06 :thumbsdown:

Overall, minor observable change in my environment /small test set, YMMV! the extra token use is a major bummer though..

Qwen3-235B-A22B

Qwen3-1.7B

2024-05-04

Tested Qwen3-235B-A22B (API fp8) & Qwen3-1.7B (local bf16):

1.7B:

/no_think is nothing special, performs as expected, a tiny model for the GPU starved
default/thinking performs around a decent 7B model, quite usable for easy tasks
not my use case, but probably the best option besides small Gemma models, if you cannot fit the 4B

Qwen3-235B-A22B

/no_think verbosity at x1.57, yet only 16% of default mode
performed slightly worse in Reasoning and STEM subjects
around Llama 3.3 70B level, but better coder

thinking (default mode):

616% token usage compared to non-thinking mode
very capable model across the board, almost DeepSeek-R1 level, beating out Llama 405B
impressive STEM performance (not just math but other STEM subjects, too)
Extremely cost efficient, decent vibes, TOP model
The trend of thought-chains aiding sensitive topics continued with these 2 models, too

YMMV!

Phi-4-reasoning plus

2025-05-02

Tested Phi-4-reasoning plus (14B, local, Q8_0):

In terms of overall capability, roughly on par with Qwen3-32B.
I already thought QwQ was unusable for general use, but this one takes the cake in terms of sheer token verbosity: By far the most verbose model I ever tested, almost 20x token usage of a traditional model.
Quite dry and soulless responses overall
Not a model for general use, clearly optimized for benchmarks (math, note that my STEM includes non-math topics)
Ok model to run a few tests or benchmarks, insanely inadequate inference requirements, not reasonable for general use

As always, just my own testing, YMMV!

Qwen3

2025-05-01

Tested Qwen3 (4B, 8B, 14B, 32B, 30B-A3B):

non-thinking (mode enable_thinking=False, /no_think):

still relatively verbose when compared to a traditional non-thinking model (~45% more token usage)
very good all-rounders for size, mostly best in slot for their sizes/non-thought
Good overall utility for a variety of tasks, not recommended for precise maths or programming.
More prone to flat out refuse requests

thinking (default mode enable_thinking=True, /think):

very verbose, but not extremely so (x7.85 token usage puts it among the more verbose tested, but not as extreme as QwQ and o3-mini-high)
Huge gains in math (in particular rounding), as well as Coding
Less prone to flat out refuse requests, thought-chains were beneficial in Censorship testing
Extremely performant overall, dominating in my sub 49B rankings
In fact, 4B and MoE 3.3B were so performant for size (usually tiny models struggle at my test-suit), that I suspected test-leakage and ran multiple re-tests

All models were tested locally, rough inference speeds on my 4090/24GB VRAM:
Qwen3-30B-A3B Q4_K_M: 130 tok/s
Qwen3-4B bf16: 83 tok/s
Qwen3-14B Q8_0: 50 tok/s
Qwen3-8B bf16: 50 tok/s
Qwen3-32B Q4_K_M: 28 tok/s

30B-A3B (insane speed!⚡️) will definitely be utilized as a daily driver by me for all types of random non-crucial tasks.

This was just in my tested use cases, as always, YMMV!

GLM-4-32B-0414

GLM-Z1-32B-0414

2025-04-23

Tested GLM-4-32B-0414 & GLM-Z1-32B-0414 (local Q4_K_M):

GLM-4-32B-0414

non-thinking model, still fairly verbose (x1.62 tok in my testing)
Good overall utility for a variety of tasks
similar overall capability as Qwen2.5 32B
more competition in this non-reasoning size segment (e.g. Mistral Small 3.1, Gemma 3 27B)

GLM-Z1-32B-0414

reasoning model, requires a large context window (minimum 16k in my testing)
nowhere near as verbose as QwQ (x6 instead of x10 token usage), thus higher general usability
capability didn't quite reach QwQ level, but overall 2nd best for models under 49B.
I noticed syntax errors in less popular languages (e.g. swift)
I prefer it over QwQ simply because of its less excessive token spam

This is just my testing and my use cases. As always, YMMV!

Gemini 2.5 Flash Preview 04-17

2025-04-18

Tested Gemini 2.5 Flash Preview 04-17:

Fairly verbose, fast, cheap model that is competent in all tested areas.
Improvements from 2.0 flash, except my coding tasks, where it did slightly worse
Around GPT-4.1 level overall

Thinking:

increased output base price ($0.6 > $3.5) combined with ~3.42x token usage (74.3% reasoning tokens), leads to a much higher inference price, overall almost 20x than non-thinking.
Biggest improvements were seen in reasoning, analytical conclusions, and coding
Counterintuitively, it did consistently worse with thinking on my STEM tasks
Around DeepSeek-R1 & Grok-3 level overall

Due to some inconsistencies observed during testing, I reran my benchmark several times on the Thinking variant. While it is overall far stronger than non-thinking (and far more expensive), it also produced less consistent results compared to non-thinking in some areas.

As always, YMMV!

o4-mini

o4-mini-high

2025-04-17

Tested o4-mini & o4-mini-high:

o4-mini:

Quite concise for a long-CoT-reasoning model (only ~3.2x token verbosity compared to a traditional model).
Real inference cost was almost identical to 3.7 Sonnet (non-thinking).
Performance was roughly in line with o3-mini-high.

o4-mini-high:

Roughly 156% more thinking tokens, translates to inference cost&delay x2.
Comparatively minuscule improvements, in certain areas (very hard code & reasoning).
Not universally better in every scenario, even when disregarding cost increase.
Roughly on par with Grok-3 (non-thinking).

Overall, in my environment, this models feels like a small upgrade to o3-mini, in some scenarios.
The effective cost is a bit lower, which is an upside.
Not too impressive in my testing, but as always, depending on your own use case: YMMV!

Granite-3.3-8B-Instruct

2025-04-16

Tested Granite-3.3-8B-Instruct (f16):

Actually did a bit worse overall than the Granite 3.0 8B Instruct (Q8) I tested 6 months ago.
Not the absolute worst, but just utterly uninteresting and beaten by a plethora of other models in the same size segment in pretty much all tested fields.

GPT-4.1

2025-04-14

Tested GPT-4.1 series:

GPT-4.1 Nano:
Cheap tiny model, roughly comparable to Qwen2.5 14B.
Substantially beaten on price & performance by e.g. Googles flash models.

GPT-4.1 Mini:
Versatile fast model, roughly comparable to Gemini 2.0 flash (but more expensive).
Quite a solid coder, and performed on par with the larger model in my STEM segment.

GPT-4.1:
"flagship" of the series, roughly as strong Llama 3.3 70B (but weaker STEM) & DeepSeek V3 0324 (but weaker coder).
Behind 7 other OpenAI models in my testing.
The "Maverick" type model of OpenAI.

All models are non-reasoning models and not very verbose, when compared to other recent model releases (1.15x / 1.23x / 1-35x token verbosity as size increases in testing).
All models, including Nano, are fairly competent coders! though none excel at my backend testing
None of these were particularly good in my STEM segment.

I have also added 0-shot examples for UI impressions and simplistic game design for each model on my shared assets (NOT part of any scoring, just for additional curiosity/comparison).

As always, YMMV!

Llama-3.1-Nemotron-Nano-8B-v1

2025-04-13

Tested Llama-3.1-Nemotron-Nano-8B-v1 (bf16):

This model has 2 modes, the reasoning mode (enabled by using detailed thinking on in system prompt), and the default mode (detailed thinking off).

Default behaviour:

Despite not officially ＜think＞ing, about 2x verbose as base model
Weak performance across the board, terrible instruction following/prompt adherence
About the same capability of a 3B model, with added verbosity

Reasoning mode:

Not always ＜think＞ing, despite system instructions as per nvidia documentation
minor improvements in logic, some improvements in STEM related tasks
terrible instruction following/prompt adherence. Low utility

Both variants perform significantly below base Llama 3.1 8B and have far less general utility.
Very poor model imo. But as always: YMMV!

Grok-3 mini

2025-04-12

Tested Grok-3 mini:

default reasoning:

Near identical token use to o3-mini (medium), 132% more token use than the non-thinking Grok-3
Good performance in all tested areas, around o3-mini level, not far behind Grok-3
better instruction following than Grok-3
better price/performance for most tasks than Grok-3

high reasoning:

65% more token use than default reasoning (labeled as "low" but I would say is more akin to "medium reasoning")
same overall smartness, but gains stability in math and instruction following
not recommended for areas outside of the above, as I saw certain task even produce worse results, for higher price. (e.g. some C++ issues not present on default thinking).

Also retested & updated current Grok-3 due to observed deviations since 2 months ago, scored slightly higher (+~1.5%) .
As always: YMMV!

Llama 4 Scout

Llama 4 Maverick

2025-04-06

Tested Meta's new Llama 4 Scout & Llama 4 Maverick:

Llama 4 Scout: (109B MoE)

Not a reasoning model, but quite yappy (x1.57 token verbosity compared to traditional models)
"Small" multipurpose model, performs okay in most areas, around Qwen2.5-32B / Mistral Small 3 24B capability
Utterly useless in producing anything code.
Price/Performance (at current offerings) is okay but not too enticing when compared to stronger models such as Gemini 2.0 flash

Llama 4 Maverick: (402B MoE)

Smarter, more concise model.
Weaker than Llama 3.1 405B, performed decent in all areas, exceptional in none, performed around Llama 3.3 70B / DeepSeek V3 capability.
Workable but fairly unimpressive coding results, archaic frontend.

The shift to MoE means most people won't be able to run these on their local machines, which is a big personal downside.
Overall, I am not too impressed by their performance and won't be utilizing them, but as always: YMMV!

Gemini 2.5 Pro Experimental 03-25

2025-03-27

Tested Gemini 2.5 Pro Experimental 03-25:

Average-verbose reasoning model with around 5.4x token use of a traditional model, clocking in around DeepSeek-R1 level token usage. Far less verbose than models such as o3-mini-high or Sonnet Thinking.

#1 Reasoning/Logic segment, surpassing GPT-4.5 Preview
#1 in Code segment, surpassing GPT-4.5 Preview
STEM and math were competent, but nowhere near top, in my testing
Overall utility for miscellaneous casual tasks, where fine, but not outstanding

I really enjoyed testing this model. It's very capable, but still shows flaws in certain areas. As always: YMMV!

DeepSeek V3 0324

2025-03-24

Tested DeepSeek V3 0324:

More verbose than previous V3 model, lengthier CoT-type responses resulted in total token verbosity of +31.8%
Slightly smarter overall. Better coder. Most noticeable difference were a hugely better frontend and UI related coding tasks

This was merely in my own testing, as always: YMMV!

Example frontend showcases comparisons (identical prompt, identical settings, 0-shot - NOT part of my benchmark testing):

CSS Demo page DeepSeek V3
CSS Demo page DeepSeek V3 0324

Steins;Gate Terminal DeepSeek V3
Steins;Gate Terminal DeepSeek V3 0324

Benchtable DeepSeek V3
Benchtable DeepSeek V3 0324

Mushroom platformer DeepSeek V3
Mushroom platformer DeepSeek V3 0324

EXAONE Deep 32B

2025-03-23

Tested EXAONE Deep 32B (local, Q4_K_M):

Yet another long-cot reasoner. Stumbles around with thoughts and delivers unimpressive results, even when compared to non-reasoning models less than half its size.
Was utterly useless in anything code related. This one is very lame, and weak imho, there are at least a dozen far better options at that size.
As always: YMMV!

Llama-3.3-Nemotron-Super-49B-v1

2025-03-22

Tested Llama-3.3-Nemotron-Super-49B-v1 (local, Q4_K_M):

This model has 2 modes, the reasoning mode (enabled by using detailed thinking on in system prompt), and the default mode (detailed thinking off).

Default behaviour:

Despite not officially ＜think＞ing, can be quite verbose, using about 92% more tokens than a traditional model.
Strong performance in reasoning, solid in STEM and coding tasks.
Showed some weaknesses in my Utility segment, produced some flawed outputs when it came to precise instruction following
Overall capability very high for size (49B), about on par with Llama 3.3 70B. Size slots nicely into 32GB or above (e.g. 5090).

Reasoning mode:

Produced about 167% more tokens than the non-reasoning counterpart.
Counterintuitively, scored slightly lower on my reasoning segment. Partially caused by overthinking or more likelihood to land at creative -but ultimately false- solutions. There have also been instances where it reasoned about important details, but failed to address these in its final reply.
Improvements were seen in STEM (particularly math), and higher precision instruction following.

This has been 3 days of local testing, with many side-by-side comparisons between the 2 modes.
While the reasoning mode received a slight edge overall, in terms of total weighted scoring, the default mode is far more feasible when it comes to token efficiency and thus general usability.

Overall, very good model for its size, wasn't too impressed by its 'detailed thinking', but as always: YMMV!

Olmo 2 32B

2025-03-19

Tested Olmo 2 32B Instruct (API/bf16):

Performs around a modern 10B model
Okay for general questions but rather weak in any specialized field (math, code, etc.)
quite vanilla/sterile

This model is quite poor size/performance overall. Outclassed by models such as Nemo 12B and Phi-4 14B.
Subjective Vibe Checks not passed (not rated),
Uninteresting model imho, but YMMV.

Mistral Small 3.1

2025-03-17

Tested Mistral Small 3.1 (API):

not much to say, it's pretty much identical to Mistral Small 3 (within margin of error & minute precision/quantization differences)
you get multi-modality.

I found no underlying text-capability differences.

Jamba 1.6

2025-03-15

Tested Jamba 1.6 (Mini & Large):

Literally worse than the 1.5 Models I tested 7 months ago.
The models cannot even produce a simplistic table!
They are completely coherent, but unintelligent and feel ancient.

The "large" model gets beaten by local ~15B models in terms of raw capability, and the pricing is completely outdated.
The Mini model performed slightly above Ministral 3B.

These models are very bad imho.

As always: YMMV!

Reka Flash 3

2025-03-15

Tested Reka Flash 3 (21B, Q8):

This one is yet another long-CoT reasoning model (~5.32x token verbosity compared to a traditional model).
I did decent in my coding segment (don't use this for frontend webdesign though! looks terrible).
It has low general utility due to extreme verbosity and subpar instruction following.
In other categories, it performed okay-ish for size.

Outclassed by models such as Mistral Small 3, Gemma 3 12B, Phi-4 14B in most scenarios.

As always: YMMV!

Command A

2025-03-13

Tested Command A (03-2025):

Significant upgrade to Command R+ 08-2024
Feels a bit dated for its size (111B) when compared to models such as Llama 3.3 70B
Surprising performance in my tech and code segment, where it delivered consistently good results
Less censored than most other models, easy to steer

As for their marketing claims about being on par or better than 4o and DeepSeek V3: certainly not for general use, but it did perform on par in my coding segments.

This is obviously a model geared at enterprise, RAG, and agentic works, but it will still be useful to risk writing and similar creative work.

As always: YMMV!

Gemma 3

2025-03-13

Tested Gemma 3 (local, Q5, Q8, bf16):

27B: Better in STEM, particularly math, poor coder, disappointing reasoning compared to Gemma 2 & 12B
12B: Slightly better in almost everything compared to Gemma 2 9B, equivalent performance to 27B in many tasks
4B: Comparable to Gemma 2 2B, found it less versatile but a tiny bit smarter in certain cases.

Family as a whole:
Hard refusals have been significantly reduced. You now have to live with large segments containing legal and warning disclaimers though..
Multi-modality & image inputs: My testing does NOT test any multimodal functionality, so do keep an eye on benchmarks that do.

For my use case, as someone who barely ever requires image input, these models are a bit disappointing in terms of raw text capability. but:
As always, YMMV!

R1 1776

2025-03-07

Ran a full retest of R1 1776, after perplexity claims to have fixed their implementation.

Higher quality chain of thoughts, in particular in long context, fixed degradation
Thus, gains in all tested areas, compared to initial implementation
Still falls short when compared to DeepSeek-R1
Core model remains identical with same issues such as still censored Chinese areas and propaganda

Tldr; Recent fixes improved the thought chains and thus outcome significantly, doesn't quite reach R1 level, in my testing.
As always,YMMV!

QwQ-32B

2025-03-06

Tested QwQ-32B (local, Q4_K_M):

best in size, except for coding
extremely verbose (avg. ~10x output tokens compared to traditional model, more verbose than any other long-cot-model I ever tested)
more effective thought chains than r1 distill versions of Qwen2.5-32B
terrible at all webdesign tests I threw at it
Smartest sub 70B by brute force token chains

This is a smart model, but for me the extreme verbosity and inference required excludes it from becoming a daily driver.
The good outcomes feel brute forced with cot, and the verbosity is borderline ridiculous.
Good if for complex STEM related subjects or reasoning tasks.
Not useful for coding.

As always, YMMV!

GPT-4.5 Preview

2025-02-28

Tested GPT-4.5 Preview:

Very expensive model obviously with the highest raw price yet, but actually a bit cheaper than o1 if you account for hidden thought-chains. Model is also fairly concise, with reply lengths slightly below median non-thinking models.
Highest common sense of all models I have ever tested (~130+)
STEM, Coding, and other professional tasks were good, but not super impressive. Attention to detail (haystack tests, bug spotting) was very good, though.
Vibe, style etc. I do not specifically test for but I found the model to be fairly standard, at least in my collected queries.

While the model is advertised to be good at conversation, creativity and natural conversation, I don't see how casual conversations with this model is a feasible use case, considering the outrageous price. I will personally use it as an agent with decision making that requires common sense (e.g. a judge or critical analyst).

As always - YMMV!

Claude 3.7 Sonnet Thinking

2025-02-25

Tested Claude 3.7 Sonnet Thinking (Budget 16k, though the max it ever used was 9k):

Overall output usage was ~7.44x compared to normal (signficantly more expensive).
The vast majority of cases, final outputs were of identical quality compared to non-thinking.
In certain reasoning and creative tasks it performed consistently worse than non-thinking (e.g. due to overthinking, pondering about reasons to not adhere to user query)
In rare specific queries (most consistently in hard code and hard math), it performed slightly better.

I know it feels counterintuitive how it can perform below non-thinking on e.g. Reason, but I have retested all differentiating results a multitude of times, and the differences were reproducible & consistent.

For my use case, the thinking mode will remain deactivated 99% of the time, unless I have a very specific issue that non-thinking cannot solve, then it might be worth giving a thinking budget a try. For the average user, I doubt that using it is wise considering cost-effectiveness.

However, as always, just my own testing. YMMV!

Claude 3.7 Sonnet

2025-02-24

Tested Claude 3.7 Sonnet (non-thinking, claude-3-7-sonnet-20250219):

Smarter & better overall, biggest improvement imho was it's far less aggressive Nanny-behaviour (still not uncensored but big improvement!).
It's frontend dev skills (which was already great) was taken up a notch and produces even better results. Flaws were rather in backend and debugging.

Overall, fantastic model. I'll check out the different thinking options over the next few days (though I have a feeling it won't lead to very cost-efficient improvements)
As always, YMMV!

3 simple frontend UI comparisons between 3.5 and 3.7 (short query prompt, 0 shot) - NOT PART OF MY TESTING; JUST FUN COMPARISON:

CSS DEMO:
https://dubesor.de/assets/shared/UIcompare/Sonnet3.5.1.html
https://dubesor.de/assets/shared/UIcompare/Sonnet3.7UI.html

STEINS;GATE TERMINAL:
https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Claude%203.5%20Sonnet%20new.html
https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Claude%203.7%20Sonnet.html

LLM BENCHTABLE MOCKUP:
https://dubesor.de/assets/shared/LLMBenchtableMockup/Claude%203.5%20Sonnet%203.1%20cents.html
https://dubesor.de/assets/shared/LLMBenchtableMockup/Claude%203.7%20Sonnet%2017.9%20cents.html

R1 1776

2025-02-24

Tested R1 1776 (Perplexity post-trained to remove Chinese censorship):

Reasoning showed strong signs of degradation, leading to worse results in all tested areas.
Math, formatting and code related tasks were more strongly affected than pure Logic tasks.
Ironically, the only few Chinese censor tests I have (and have had for a long time) still produced 100% censored and propagandistic answers.

Whether the degradation is due to the post-training, or how the model is implemented, I do not know. But I do know that it isn't on R1 level.
As always, YMMV.

Grok-3

2025-02-19

Tested the current Grok-3:

Reasoning was similar to Grok-2 in my environment, but I saw large improvements in STEM, general utility and Coding (On a sidenote, UX design was hit or miss, sometimes phenomenal, sometimes poor, so a bit inconsistent.)
It's still fairly uncensored, and very wordy model (non-reasoning but produces large responses, roughly 2.25x as GPT-4o-latest which is already wordier than the average traditional model.)
I found it to be a little less cringe-inducing than Grok-2 (subjective, unrated).
Overall, very capable model but not the best at any field I test.
As always - YMMV!

chatgpt-4o-latest

2025-02-16

Tested current 'chatgpt-4o-latest' (time stamp 2025-02-16), and compared to results from 4 months ago:

about 1-4.7% better on my test set, depending how refusals are weighted
more prone to censor in risk topics, lower utility in risk-deemed RP
slightly improved capability across different segments, math, logic, coding, ...
slightly altered behaviour/styling, more emojis by default, more casual tone in certain settings
overall, slightly better for most use cases, most capable non-thinking model, other than 4-Turbo
As always, YMMV!

Qwen2.5

2025-02-02

Tested the new Qwen2.5 models (also updated the price since it changed just 1 day after my testing):

Qwen2.5-Turbo - cheap model, roughly equivalent to GPT3.5 Turbo
Qwen2.5-Plus - mid model, roughly equivalent to Qwen2.5-72B
Qwen2.5-Max - large model, roughly equivalent to Mistral Large 2

Traditional models who are competent enough for their weight & price.
Not the most interesting models to me but as always - YMMV!

o3-mini

2025-02-01

Tested o3-mini (default):
Minor improved reasoning results over o1-mini, strong coder. Did slightly worse at my STEM segment and anything deemed not safe.
If you don't care for censorship or refusals, or use it solely for coding, it's slightly better than o1.
Overall, for general use not a noticeable capability upgrade. New pricing will make it more affordable though (thought token impact calculations still withstanding on my part).
As always, YMMV.

Mistral-Small-24B-Instruct-2501

2025-01-31

Tested Mistral-Small-24B-Instruct-2501 aka Mistral Small 3:

Saw improvement over the previous Mistral Small in most areas, except for code where it actually performed slightly lower (retests were already done, but my segment is rather small, so do take it with a grain of salt).
It ends up among the best choice for sub50B models along Gwen2.5 32B and Gemma 2 27B. Mistral always has lower inherent censorship, so it should perform well for roleplay and similar creative tasks.
Very useful model size for local inference (16GB VRAM+). Good universal model overall.

Qwen2.5-Max

2025-01-30

Tested Qwen2.5-Max:
It's a traditional, competent but overall uninteresting model. Slightly smarter than the vastly cheaper 72B, it's pricing strategy at $10/30 mTok is severely outdated. Compared to completing models with similar capability such as 72B or DeepSeek V3. I saw minor improvements in maths and formatting, but none in core logic or coding related tasks.
This model is also quite dry and doesn't pass my vibe check, rather weak conversationalist and bad for RP. For me, considering the price, it's a pass.

As always, YMMV!

R1-Distill

2025-01-22

Locally tested R1-Distill-Llama-8B, R1-Distill-Qwen-14B, R1-Distill-Qwen-32B

32B was decent, the smaller distilled models were weaker than base. Overall I would use the non-thinkers in my own use case, as the benefit (or lack thereof) is not worth it for local compute, imho. For the smaller models, makes them less usable with lack of benefit. Also the token spam is not really desired for local use, at least in my use cases. Llama was particularly impacted, gaining minute capability in reason and stem, but sacrificing almost all utility and coding. For me, after several hours of testing, the distilled models aren't really attractive. as always, YMMV.

Tested 4 Distill models by now, all statements are in my testing, ymmv:

8B - weaker than base (ranked #107 -vs- #84)
14B - weaker than base (ranked #80 -vs- #62)
32B - slightly better than base (ranked #54 -vs #59)
70B - weaker than base, due to bloated thoughts - pretty much unusable locally (ranked #26 -vs- #18)

DeepSeek R1-Zero

2025-01-26

Tested R1-Zero (fp8):
highly capable model, a little bit messier and less conventional than R1, less aligned/filtered. Loses out in formatting and thus coding, but is a highly capable model overall. probably not as consumer-friendly as R1, but my testing probes mostly raw capability.
As always, YMMV!

DeepSeek-R1

2025-01-20

Tested DeepSeek-R1:
This model is extremely capable for a non-proprietary model, and the first to truly successfully challenge the top SOTA models (more so than DeepSeek V3).
It will not be the most efficient model for every use case, as its long-chain-of-thought reasoning response (can be ignored by user) is very verbose, with an average ratio of 4.7:1 in my testing. Some of it's programming thoughts were even breaking my previous response storage method, due to surpassing 32k chars. So, it can be extremely verbose in its "reasoning_content" response. The final answer ("content") is fairly concise in comparison.

Compared to R1-Lite-preview it did not suffer from false refusals or chinese output issues.
It outperformed Llama 3.1 405B in almost all tasks except for general utility (roleplay, concise formatting, etc. which is to be expected).

Overall a fantastic long-cot model. Adding in the cost factor, in terms of cost effectiveness, it blows o1 completely out of the water. Plus you get to actually see every token you pay for, if you desire.

as always - YMMV!

MiniMax-01

2025-01-17

Tested the model MiniMax-01 in my bench environment. Results were around WizardLM-2 8x22B or Llama3.0 70B level. It was pretty mediocre in most tested fields, cost/performance was top 40%, not expensive and neither particularly cheap for capability.

There are some minor quirks with Chinese output or lack of format adhering but not to an unusable degree. Overall pretty meh model to me.

As always - YMMV!

QVQ-72B-Preview

2024-12-31

Tested QVQ-72B-Preview (bf16):
Surprisingly bad model, despite being a long CoT reasoning model, it did not impress in reasoning tasks.
It did OK in STEM-related tasks, but was useless in programming and anything requiring it to follow instructions. It fails to provide full working code segments most of the time and thinks about poorly formatted snippets.
It's also very censored and refused a lot of tasks unjustified.

It placed #79 in my current environment; right next to Llama 3.1 8B and Ministral 8B, and for a 72B model that should be quite telling.
It also could not deliver on vibe, style or character.

I see zero use case for this model; there are far better options, and far better long chain-of-thought models out there.

DeepSeek V3

2024-12-25

DeepSeek V3 - Thoroughly tested the new capability, I was fortunate to still have very recent 2.5 datasets (due to being late on 1210) for direct output comparisons.

Strong STEM & code, solid instruction following and general utility, arguably minimally improved reasoning.
Overall, 3rd most capable OpenSource model (behind Llama 3.1 405B & Llama 3.3 70B) in my testing. As for proprietary, roughly on o1-mini level.

The biggest flaw for this model in my testing is clearly its reasoning expert, more specifically anything in areas requiring critical thinking and applying common sense, where it blunders a lot and consistently.

As always, YMMV!

GPT-4 Turbo

2024-12-22

Decided to retest GPT-4 Turbo (OR identifier 'openai/gpt-4-turbo') after 10 months with most recent adjustments, still holds up very well in comparison to most other state of the art models, in fact beating them in most scenarios on pure substance.

It's a bit dry, doesn't have the best formatting nor style/vibe, but gets the job done.
Compared to my testing back in February 2024, something clearly changed though, as the model behaviour was slightly different, e.g. I ran into reproducible refusals that were definitely not present beforehand. Combined with slightly weaker performance (but still strong) I suspect this is caused by the changing of system prompt into more convoluted legal coverage or similar back-end alterations over time.

Still, it's better in this regard than o1. It will remain my go to for when cheaper more efficient models don't cut it for a problem.

Gemini 1206

Gemini 2.0 Flash Experimental

Gemini 2.0 Flash Thinking Experimental

2024-12-21

Tested the 3 recent experimental Gemini models (Gemini Experimental 1206, Gemini 2.0 Flash Experimental, Gemini 2.0 Flash Thinking Experimental).

All 3 were not major improvements in terms of total capability, but are slightly different in terms of behaviour. The thinking model obviously is used for reasoning but introduces too much noise which suffers its coding/instruct follow ability.

o1-2024-12-17

2024-12-18

Tested the full o1 via API:

Compared to o1-preview it used slightly fewer invisible thought-tokens, and is undoubtedly much better at STEM (particularly math), multiple unjustified refusals tanked its utility in my cases, this model is clearly not designed for tasks such as e.g. summarization or agentic personas.
I was not impressed by its coding. I required multiple reiterations and restating info that was present in the task already, wasting a TON of money on non-exceptional results. This model is also fairly censored and steers away from any potentially controversial subjects, even if harmless in the context. This model is more akin to Claude models than what I am used to from OpenAI in terms of overcautiousness.

tldr; Fantastic STEM capability, great reasoning, not too impressive in other areas from my testing.
Unfathomably expensive, obviously, because the invisible tokens inflate actual pricing to around $190 mTok across my testing.

Command R7B

2024-12-15

Tested Command R7B (12-2024) - Around Granite 3.0 8B / Qwen2.5-7B level, with decent STEM performance, poor reasoning and terrible coding. There are stronger options in that size category (LLama 3.1, Ministral, etc.)

Price/Performance is OK, but again there are much better options even for bang4buck.

Phi-4

2024-12-13

Tested Phi-4 (14B), it's a decent model around Nemo 12B & Qwen2.5 14B level, with decent reasoning, very good STEM capability but lackluster code & instruct following. default vibe is very neutral and quite sterile, as expected from a microsoft model.

Llama 3.3 70B

2024-12-07

Checked out Llama 3.3 70B locally (Q4_K_M):
Strongest open model after Llama 3.1 405B, saw big improvements in reasoning and STEM-related tasks, compared to 3.1
It did not do particularly well in my coding-related tasks, though. Due to this outlier, had to retest this segment multiple times.

Overall, very capable general use model

QwQ-32B-Preview

2024-11-28

Ran QwQ-32B-Preview (Q4_K_M) through my own benchmark, have to say I am disappointed, had higher hopes. I's outputs are often annoyingly formatting (e.g. no proper distinction between thinking/reasoning loop and true final output.), often no full code blocks but snippets despite instructed to, terrible instruct follow overall, reasoning was poorer than vanilla Qwen2.5 32B.

Math/Stem got boosted by Reasoning Loops in a positive way - everything else was rather poor in comparison.
Vibe check is failed, the style of this model annoys me, and high refusal rate. Also it claimed to be developed by OpenAI in my self-description collections.

I ranked it #62 (between Jamba 1.5 Mini and GPT-3.5 Turbo). Was hoping for a more fun model for my 100th model anniversary, oh well.

Marco-o1

2024-11-24

Tested Marco-o1 (fp16) and it's capability was pretty much exactly how I expect a 7B model to be.

The thinking is a nice gimmick, but it didn't yield better results in reasoning, as the model was unable to outthink bad thinking. It did help in math related tasks, though.
For any generic utility tasks such as instruct following, summarization, etc. I found it to be borderline unusable. The model sometimes had crucial information within its tags while omitting it entirely from the tags, meaning you cannot effectively filter out tags without info loss.
Fun and quirky to play around with sure, but not groundbreaking in my testing.

DeepSeek-R1-Lite-Preview

2024-11-22

DeepSeek-R1-Lite-Preview:

Performed a bit worse than DeepSeek V2.5 overall, partially due to uncalled for refusals in completely generic tasks.
However, it has the highest reasoning skills of all DeepSeek models (slightly higher than o1-mini).
Overall, it placed on a similar level to Grok-2 mini and Claude 3.5 Haiku in my testing.

If this model is in the 50-60B range it would be very impressive if they can iron out the refusal behaviour.
Can be quirky fun to use due to its ramblings, which are hit or miss.

Claude 3.5 Haiku

2024-11-04

Just checked out Claude 3.5 Haiku, very unexpected results..

In my own small-scale test it showcased:

By far the least censored Claude model (other than Claude-1), very different refusal/censor behaviour when compared to old haiku or Sonnets & Opus.
Roughly 2x capability of Claude 3 Haiku
Did better on my small subset of code related tasks than 3.5 Sonnet
STEM was pretty identical
Some flaws in utility/misc tasks (terrible roleplayer)
Reasoning still pretty weak but huge gains compared to the previous iteration
Pricing is too high, when competing with models such as 4o-mini or Gemini 1.5 Pro 002

Not rated but subjective vibe check: very concise model that seems to love putting nearly everything into list format.
AS ALWAYS - YMMV!

Aya Expanse

2024-10-27

Cohere Aya Expanse testing has concluded for me, very weak for their size compared to the competition, not worth the storage space imo.

Aya Expanse 8B (f16) - failed pretty much everything and was around L3.2 3B capability.
Aya Expanse 32B (Q4_K_M) - weaker than even Gemma 2 9B & Nemo 12B in my testing. It would be OK as like a 12B model due to being fairly uncensored. Gets absolutely stomped by Qwen2.5

Claude 3.5 Sonnet 20241022

2024-10-22

Tested the new 3.5 Sonnet.
After all is done and accounted for, it jumped ranks from #15 > #7 with slightly less prudishness (still much higher than the competition).
I saw massive gains in tasks labeled for Reasoning (suspiciously high gains, I need to investigate this further). A slight dip in prompt adherence and code. I scrutinized and retested all tech-related coding tasks a total of 6 times, ended up running 18 queries PER TASK in that particular label to exclude any random outliers. The results were consistently delivering the same outcome, though.
Good improvements as a whole.

Granite-3.0-8B-Instruct

2024-10-21

Granite-3.0-8B-Instruct (Q8_0). Not terrible, not great. While my bench is too hard for small models, and doesn't catch the minute differences for them, it still gives a rough expected performance ballpark. But if I add to that the vibe check (not tested nor depicted) - utterly uninteresting model, won't stay on my drive.

Yi-Lightning

2024-10-22

My 2 cents on the Yi-Lightning models:
Tested around Llama 3.1 70B level, and Llama 3.0 70B for lite model for me.
Reasoning labeled tasks was pretty dead even between them, not their strong suit.
Pretty good at STEM&Maths, and better at following instructions than Qwen models. Fairly uncensored. Competent but did not reach a top spot, unlike say the current arena ranking. it has good style tho, so that would gain a fair amount of votes.

Ministral

2024-10-19

I was quite impressed by Ministral 3B, 8B on the other hand was a barely noticeable improvement in the vast majority of cases. here are some neighbouring performers.

Mistral 8B =~ Llama 3.1 8B
Mistral 3B =~ Llama 3.0 8B

sucks that the 3B model is not local, would be good to run on the side, it's definitely the more interesting model here but usage is so limited by this.

As always, YMMV!

Inflection 3

2024-10-18

Tested Inflection 3 Models:

Productivity one is better not just in performance but also in style imho. Capability range is around Gemma 2 9B (Pi) and Nemo12b/Qwen14B (Productivity). The models are too expensive for what they offer, ranking near the bottom of my price/performance calculations.

Llama-3.1-Nemotron-70B-Instruct

2024-10-16

Comparing Llama-3.1-Nemotron-70B-Instruct to the vanilla Llama-3.1 Model and the best performing competitor at that size, Qwen2.5-72B.
Overall a pretty substantial improvement, I saw biggest gains in STEM related questions. It's also a pretty consistent model and didn't really blunder anything too terribly.

Llama 3.2 11B & 90B

2024-09-28

tested the new 3.2 vision models (text capability), comparing them to their non-vision brethren. 90B was slightly smarter, and 11B was about even with 8B. I do not bench for vision, but tested them a little bit for myself anyways, it's ok for a first iteration vision but not what I would personally use compared to other vision models.

Llama-3.2 1B & 3B

2024-09-27

Gave Llama-3.2-1B and Llama-3.2-3B a spin. My testing isn't very good for such tiny models, as the testset is too hard for such models, but I wanted to try anyways. I found Gemma 2B it to be vastly superior to Llama-3.2-3B

Llama-3.1-Nemotron-51B

2024-09-23

Llama-3.1-Nemotron-51B tested, very impressive for it's size, being toe to toe with it's 70B brother, outperforming it in math but losing out on reasoning and misc tasks. Great model! As always - YMMV!

Qwen2.5-14B-Instruct

2024-09-18

Finished testing Qwen2.5-14B-Instruct on Q8 - best overall local model sub70B I have tested thus far. Barely beat the former champion despite being half as big. As is the issue with most Chinese models, its not very good at sticking to strict instructions and has general prompt adherence issues, but other than that it's a very capable model.

Mistral-Small-Instruct-24-09

2024-09-17

At 22B, another great sub 70B option, joining the ranks of the likes of Nemo & Gemma 27B.
Decent coder, good at math, and fairly unrestricted out of the box.
Kinda flopped in the logic department during my testing, so if you want something to solve riddles this isn't the right model.

o1-mini

2024-09-16

finally finished my o1 testing. this was a long and expensive ride. o1-mini completely wiped the floor with o1-preview in anything math related, not even close. rest is not big enough of a difference to justify the pricing. as always, YMMV

o1-preview

2024-09-15

full o1-preview results are in on my own smallscale-testing!

Highest reasoning I have tested thus far (outside of unreleased models), partially embarrassing math skills, okayish utility (bad for RP tho), not impressed by coding, most censored OpenAI model I have ever tested.

This was very expensive and time-consuming, due to usage caps, and the fact that this time around I also had to track invisible token usage, for the true mTok cost... working on mini next, so far I like it a lot better in terms of price/performance and it seems to do less wasteful tokenwaste thinking.
As always, YMMV!

DeepSeek V2.5

2024-09-10

Put DeepSeek V2.5 through my benchmark. Very similar total capability to DeepSeek-V2, with improvement in math and programming but slight decrease in reasoning and prompt adherence. Very good model for the price/size. As always - YMMV!

Reflection Llama-3.1 70B

2024-09-06

model review of Reflection (local Quant, thus not as powerful as fp16 (once the bugs are ironed out), but still very useful data for general ballpark.
In terms of Llama variations:

Reasoning: Very good, as was expected. Only beaten by 405B
STEM: Still good for its size. Reflection can cause math to introduce additional inaccuracies.
General utility: Bad, the baked in reflection counter-acts user instructions to the point where the reflecting part actively tries to combat user instructions
Code: on par with L3 70B, it's large thinking/reflecting segments seems to cause context poisoning issues, lowering the end result code quality
Censorship: same as L3 70b and L3.1 8B.

Command R 08-2024

Command R+ 08-2024

2024-08-31

New Command R models, more efficient but minor "improvements" overall, the API introduced safety guardrails, which technically can be removed in terminal if you access model directly, but will be unable to be turned off on providers that do not allow for manipulating the safety parameter.

If we compare to competition in similar size brackets, e.g. R+ to Mistral Large and R to Gemma 2 27B, the performance is underwhelming.

Command R 08-2024

2024-08-30

Command R 08-2024 testing (Q4_K_M); slightly better than old model, but , at least in my testing, for its size not good enough compared to the competition of smaller size. Weirdly enough it blundered my entire code section.

Jamba 1.5

2024-08-23

Tested the Jamba 1.5 models. they are decent-ish but pretty underwhelming for their size.
Jamba 1.5 Large with a gigantic 399B size punches in the same league as old L3-70B, and the mini version is roughly equivalent to the over 5 times smaller Gemma 2 9B

Llama 3.1 405B Instruct

2024-08-12

in lieu of https://x.com/aidan_mclau/status/1822830757137596521 I redid the entire bench on bf16. also reran results 3 additional times if they differed between versions. most notable difference is that I got 0 refusals this time (4 reproducible refusals on fp8), and the reasoning is higher with minute discepancies in math and 1 programming task.
Meta changing kv heads a few days ago, and API outputs retroactively being bugged as I noticed 2 days ago doesn't help discrepancies.

Gemini Pro 1.5 experimental

2024-08-03

Gemini Pro 1.5 experimental is quite the step up. (in before the compliance gets nerfed after experimental phase).

Claude-3.5-Sonnet

2024-06-21

Obligatory 3.5 sonnet graph. my own bench, as always ymmv.
Better than Opus in almost every tested way, except for programming and censorship tasks. Still inferior to OpenAI in reasoning and critical thinking tasks. Passed ~57% of my tasks, with double the fail rate of GPT-4 Turbo.
as always ymmv.

Llama-3-70B-Instruct

2024-04-19

Tested llama 3 today. Benched Around Mistral medium level. Good reasoning, A terrible programmer though, missed every bug hunting task, and every needle in haystack task. But a big improvement over llama-2 overall, also far more lenient in terms of refusals.

Command R+

2024-04-15

Today I ran Command R+ through all 80 of my benchmarks. The results are far lower than arena rankings, putting its results around Claude-3 Sonnet and "Mixtral-8x7b-Instruct-v0.1"-level

claude-3-haiku-20240307

2024-03-24

I finally had the time to run all tests through Haiku, so here are the 4 recent claude models together. Haiku is the only cost effective claude model. It comes at a miniscule 1.67% the price of Opus and performs well in STEM and small generic tasks, but sucks heavily at reasoning skills.

Gemini, Claude, Mistral, GPT-4 Turbo

2024-03-09

A few other interesting findings:

Gemini (both pro and ultra) are very prone to unnecessary refusal, even refusing tasks that are not even remotely questionable.,
Mistral and OpenAI models almost never refuse anything, even my tasks that are specifically designed to be risky. (Claude-1 belonged to this camp),
Sonnet is such a weird model. In my testing, it performed better than Opus on tasks that have extremely high difficulty (>83%) yet somehow manages to give completely moronic answers to the easiest questions: https://i.imgur.com/4VeZ5vB.png).,
Out of all tested models, Claude 2.1 scored highest on prompt adherence (sticking to prompt instructions),
Opus seems significantly better in STEM and math, but did not deliver better results in programming over sonnet.,
GPT-4 Turbo has the highest reasoning skills, bar none.,
The best models sometimes fail to do the simplest tasks, that the worst models easily do, such as ending a sentence with a specific word, or excluding certain things.,
GPT-4 Turbo was the only model that consistently gets easy to medium tasks correct, whereas other models sometimes fail at even the simplest of tasks.

claude-3-opus-20240229

claude-3-sonnet-20240229

2024-03-05

Opus seems very similar to mistral large model in terms of performance. lower reasoning, better math, more censored. Overall very similar to mistral large. Not as good as gpt-4 by any metrics I was able to test.
And claude 3-sonnet is a very average model, around mixtral level, weaker than gemini ultra even.

Earlier first impressions

2023/12 - 2024/08

Earlier first impressions between 2023 to Aug 2024 are scattered across different servers & messages, with screenshots or short in nature, too timeconsuming to find to copy right now.

No matching entries found