Wolfram LLM Benchmarking Project

Using Wolfram Language to benchmark the performance of major LLMs

As major users and analyzers of large language model (LLM) technology, we've been continually tracking the performance of LLMs. This project involves releasing our ongoing results, initially for a specific well-characterized code generation task.

The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language. These exercises have been done online by millions of humans, and we've developed effective tools for determining functional correctness of code, which we're now applying to LLMs.
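The two result columns in the table can be read as two separate pass rates over the same set of exercises. This page does not describe the exact scoring procedure, so the Python sketch below is only an assumption about how such rates might be tallied. The `ExerciseResult` record, the `pass_rates` helper, and the sample outcomes are all hypothetical, and the rule that code must parse before it can count as functionally correct is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    """Outcome of one exercise for one model (hypothetical record layout)."""
    exercise_id: str
    syntax_ok: bool       # generated code parses as Wolfram Language
    functional_ok: bool   # generated code does what the exercise asks


def pass_rates(results):
    """Return (syntax %, functionality %) over all exercises, rounded to 0.1."""
    if not results:
        raise ValueError("no results to score")
    total = len(results)
    syntax = sum(r.syntax_ok for r in results)
    # Assumption: code that fails to parse cannot count as functionally correct.
    functional = sum(r.syntax_ok and r.functional_ok for r in results)
    return round(100 * syntax / total, 1), round(100 * functional / total, 1)


# Made-up outcomes for illustration only; not taken from the benchmark.
sample = [
    ExerciseResult("ex-1", True, True),
    ExerciseResult("ex-2", True, False),
    ExerciseResult("ex-3", False, False),
    ExerciseResult("ex-4", True, True),
]
print(pass_rates(sample))
```

Under this reading, a model can score near 100% on syntax while scoring far lower on functionality, which matches the pattern seen across most rows of the table.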

Last Updated: February 04, 2025

Vendor	Model	Correct Syntax	Correct Functionality
Qwen	Qwen2.5-Max (2025-01-25)	100.0%	57.3%
DeepSeek	DeepSeek-R1	99.6%	55.1%
OpenAI	o1-preview (2024-09-12)	99.7%	52.2%
Anthropic	Claude 3.5 Sonnet (20241022)	99.7%	51.9%
OpenAI	o1 (2024-12-17)	100.0%	51.0%
DeepSeek	DeepSeek V3 671B	99.9%	50.6%
Meta	Llama 3.1 405B Instruct	99.7%	50.5%
OpenAI	GPT-4	99.8%	49.7%
Qwen	Qwen2.5-Plus (2025-01-25)	99.7%	49.6%
Google	Gemini Exp (1206)	99.6%	49.3%
Qwen	Qwen2.5 72B Instruct Q4_K_M	99.9%	49.0%
Nexusflow	Athene v2 72B Q4_K_M	99.7%	48.5%
OpenAI	GPT-4o (2024-11-20)	100.0%	47.8%
OpenAI	GPT-4o (2024-08-06)	99.9%	47.7%
Mistral AI	Mistral Large 2 (2411)	99.7%	47.2%
Mistral AI	Mistral Large 2 (2407)	99.7%	47.2%
OpenAI	GPT-4o (2024-05-13)	100.0%	46.2%
OpenAI	GPT-4 Turbo	99.8%	46.2%
Google	Gemini 2.0 Flash Thinking Exp (01-21)	99.4%	46.0%
Meta	Llama 3.1 70B Instruct	99.9%	45.8%
OpenAI	o3-mini (2025-01-31) (medium)	98.8%	45.6%
Google	Gemini Pro Exp (0827)	99.7%	45.5%
xAi	Grok-2 (1212)	99.7%	45.4%
Google	Gemini 1.5 Pro (002)	99.3%	45.1%
xAi	Grok-2 Beta	99.6%	44.4%
Anthropic	Claude 3 Opus	99.4%	44.4%
Qwen	Qwen2.5 32B Instruct	99.9%	44.3%
Qwen	Qwen2 Math 72B Instruct Q4_K_M	98.7%	44.3%
Mistral AI	Mistral Large 2 (2407) Q4_0	99.6%	43.8%
NVIDIA	Llama 3.1 Nemotron 70B Instruct Q4_K_M	99.6%	43.8%
Anthropic	Claude 3.5 Sonnet (20240620)	99.7%	43.7%
Google	Gemini 2.0 Flash Exp	99.3%	43.6%
OpenAI	o1-mini (2024-09-12)	99.1%	43.5%
Google	Gemini Exp (1114)	98.7%	43.4%
Google	Gemini 2.0 Flash Thinking Exp	99.4%	43.2%
Google	Gemini Exp (1121)	99.3%	43.2%
Meta	Llama 3.2 Vision 90B Instruct Q4_K_M	99.1%	43.0%
DeepSeek	DeepSeek-Coder-V2 236B	99.3%	42.5%
Qwen	Qwen2.5 Coder 32B Instruct	99.4%	42.1%
Falcon LLM	Falcon3 10B	99.1%	41.8%
Qwen	Qwen2.5 Coder 14B Instruct	99.6%	41.4%
Meta	Llama 3.1 70B Instruct Q8_0	99.4%	41.3%
Qwen	QwQ 32B Preview	98.5%	41.3%
DeepSeek	DeepSeek V2.5 236B Instruct Q5_1	99.9%	41.0%
Google	Gemini 1.5 Pro (001)	99.0%	40.8%
Meta	Llama 3.3 70B Instruct	99.6%	40.7%
Mistral AI	Pixtral Large (2411)	98.7%	40.5%
Qwen	Qwen2 72B Instruct Q8_0	99.6%	40.2%
Meta	Llama 3 70B Instruct	99.7%	39.6%
Google	text-unicorn-001	99.6%	39.5%
Meta	Llama 3.1 70B Instruct Q4_0	97.3%	39.4%
Falcon LLM	Falcon3 7B	99.6%	38.9%
Qwen	Qwen2.5 14B Instruct	99.7%	38.5%
OpenAI	GPT-3.5 Turbo	99.0%	38.5%
Anthropic	Claude 3.5 Haiku (20241022)	98.7%	38.5%
Ai2	Tulu 70B	100.0%	38.2%
Mistral AI	Mistral Large 1 (2402)	98.4%	38.2%
Google	Gemini 1.5 Flash (002)	98.7%	37.6%
Qwen	Qwen2.5-Turbo (2024-11-01)	99.3%	37.0%
Google	Gemini Flash Exp (0827)	99.0%	37.0%
Meta	Code Llama 34B	99.7%	36.1%
Mistral AI	Mixtral 8x22B	98.2%	36.1%
NVIDIA	Nemotron-4 340B Instruct	98.2%	35.6%
Yi	Yi 1.5 34B	100.0%	35.3%
Nexusflow	Athene 70B Q4_K_M	98.7%	35.1%
Cohere	Command R plus 104B Q4_K_M	99.0%	34.9%
Mistral AI	Codestral	97.5%	34.4%
Mistral AI	Mistral Small (2409)	98.7%	33.8%
Google	Gemini 1.5 Flash (001)	98.5%	33.8%
Falcon LLM	Falcon3 Mamba 7B	97.9%	32.9%
Google	Gemma 2 27B Instruct	99.0%	31.9%
Google	Gemma 2 27B Instruct Q4_0	99.0%	31.0%
Qwen	Qwen2.5 Coder 7B Instruct	99.3%	30.2%
AIDC-AI	Marco-o1 7B	98.8%	30.0%
Meta	Code Llama 13B	98.7%	30.0%
Qwen	Qwen2.5 7B Instruct	98.2%	30.0%
Mistral AI	Mistral Small	97.5%	29.9%
Qwen	Qwen2 57B Instruct Q4_K_M	97.6%	29.7%
Yi	Yi 1.5 9B	98.8%	29.5%
Databricks	DBRX 132B Instruct Q4_0	99.7%	29.4%
Cohere	Aya Expanse 32B	98.5%	29.2%
Qwen	Qwen2 7B Instruct	98.8%	29.1%
Sea AI Lab	Sailor2 20B	98.5%	28.8%
Anthropic	Claude 2.1	96.6%	28.5%
DeepSeek	DeepSeek V2 236B Instruct Q4_K_M	99.1%	28.3%
LG AI Research	EXAONE 3.5 32B Instruct	96.3%	28.3%
Anthropic	Claude 2	87.3%	28.2%
Anthropic	Claude 3 Sonnet	98.7%	27.8%
OpenAI	GPT-4o mini	94.2%	27.6%
Mistral AI	Mistral Medium	88.4%	27.6%
Cohere	Command R 35B Q4_K_M	99.4%	27.4%
DeepSeek	DeepSeek-Coder 7B	92.1%	27.3%
Falcon LLM	Falcon3 3B	98.1%	27.1%
Qwen	Qwen2 7B Instruct Q4_0	99.3%	26.8%
DeepSeek	DeepSeek-Coder 33B	92.8%	26.2%
Meta	Code Llama 7B	97.2%	26.0%
Google	Gemini 1.5 Flash 8B	96.7%	26.0%
Mistral AI	Mathstral	98.4%	25.8%
Mistral AI	Codestral Mamba	98.2%	25.5%
IBM	Granite Code 8B Instruct	93.4%	25.4%
Meta	Llama 3.1 8B Instruct	97.6%	25.3%
Google	text-bison-002	98.1%	25.1%
Groq	Llama 3 Groq Tool Use 70B	92.3%	24.9%
Anthropic	Claude 3 Haiku	98.4%	24.8%
IBM	Granite 3.0 8B Instruct	98.1%	24.7%
Google	code-bison-002	97.8%	24.6%
IBM	Granite 3.1 8B Instruct	99.0%	24.5%
Google	code-gecko-002	98.1%	24.5%
DeepSeek	DeepSeek-Coder-V2 16B Instruct	98.8%	24.4%
Meta	Llama 3.1 70B Instruct Q2_0	93.0%	24.4%
Salesforce	xLAM 7B	89.7%	24.4%
Microsoft	Phi-4	98.4%	24.3%
Google	Gemini 1.0 Pro (002)	94.5%	24.2%
IBM	Granite Code 20B Instruct	98.8%	24.1%
Qwen	Qwen2 Math 7B Instruct	93.0%	24.1%
Meta	Llama 3.2 Vision 11B Instruct	97.6%	23.9%
Yi	Yi 1.5 6B	98.7%	23.6%
Nexusflow	NexusRaven v2 13B	96.9%	23.4%
Google	CodeGemma 7B instruct v1.1	95.4%	23.4%
Yi	Yi Coder 9B Base	96.1%	23.3%
Qwen	Qwen2 Math 7B Instruct Q4_0	90.9%	23.2%
DeepSeek	DeepSeek-Coder 6.7B	89.1%	22.8%
Qwen	Qwen2.5 Coder 3B Instruct	98.1%	22.2%
Meta	Llama 3 8B Instruct	97.0%	22.1%
Ai2	Tulu 8B	97.8%	21.8%
Google	CodeGemma 7B Q4_0	87.6%	21.8%
Cognitive Computations	Dolphin 2.9.4 Llama 3.1 8B Q8_0	99.1%	21.3%
Mistral AI	Mistral Nemo	97.6%	21.3%
IBM	Granite Code 3B Instruct	93.9%	21.1%
Google	Gemma 2 9B Instruct	97.0%	21.0%
Qwen	Qwen2.5 3B Instruct	97.6%	20.7%
Google	Gemma 2 9B Instruct Q4_0	97.3%	20.6%
Upstage	Solar Pro	95.4%	20.1%
Groq	Llama 3 Groq Tool Use 8B	96.0%	19.7%
IBM	Granite Code 34B Instruct	98.5%	19.5%
Yi	Yi Coder 9B Chat	96.7%	19.4%
Mistral AI	Ministral 8B (2410)	97.0%	19.2%
Falcon LLM	Falcon Mamba 7B	94.2%	19.2%
Google	code-bison-001	95.8%	19.1%
LG AI Research	EXAONE 3.5 7.8B Instruct	95.1%	19.1%
Cognitive Computations	Dolphin 3 Llama 3.1 8B	95.1%	18.8%
Ai2	Olmo2 13B Instruct	99.1%	18.6%
Microsoft	Phi-3 Small 128k	94.6%	18.5%
Microsoft	Phi-3 Medium	93.6%	18.4%
Qwen	Qwen2.5 Coder 1.5B Instruct	96.7%	18.2%
Cognitive Computations	Dolphin 2.9.1 Yi 1.5 34B Q4_K_M	94.2%	18.2%
Mistral AI	Ministral 3B (2410)	95.7%	17.9%
Qwen	Qwen2.5 1.5B Instruct	99.0%	17.4%
Upstage	Solar Mini	96.3%	17.4%
Microsoft	Phi-3 Mini	89.6%	17.2%
Qwen	Qwen2 Math 1.5B Instruct	91.1%	17.0%
IBM	Granite 3.1 3B MoE Instruct	96.4%	16.9%
Cohere	Aya Expanse 8B	91.2%	16.7%
Sea AI Lab	Sailor2 8B	90.3%	16.7%
LG AI Research	EXAONE 3.0 7.8B Instruct Q5_K_M	92.4%	16.5%
IBM	Granite 3.1 2B Instruct	97.5%	16.3%
IBM	Granite 3.0 2B Instruct	96.0%	16.3%
IBM	Granite 3.0 3B MoE Instruct	96.9%	16.1%
LG AI Research	EXAONE 3.5 2.4B Instruct	93.9%	16.0%
Nous Research	Nous-Hermes-2-Mixtral-8x7B-DPO	79.9%	15.6%
Ai2	Olmo2 7B Instruct	91.3%	15.5%
OpenChat	openchat3.5	88.2%	15.2%
Cohere	Aya 23 35B	96.9%	14.8%
Mistral AI	Pixtral 12B	93.1%	14.3%
Hugging Face	SmolLM2 1.7B Instruct	96.7%	13.6%
Aleph Alpha	Luminous-supreme	81.8%	13.2%
Meta	Llama 3.2 3B Instruct	94.2%	12.4%
Microsoft	Phi-3.5 Mini Instruct	79.4%	11.6%
Falcon LLM	Falcon3 1B	94.2%	11.5%
Falcon LLM	Falcon 40B Instruct Q4_0	93.0%	11.1%
Yi	Yi Coder 1.5B Base	88.2%	11.1%
Aleph Alpha	Luminous-supreme-control-20230501	86.8%	10.9%
InternLM	InternLM2.5 20B Q4_0	46.2%	10.9%
IBM	Granite 3.1 1B MoE Instruct	95.5%	10.7%
Meta	Llama 2 13B	91.4%	10.7%
IBM	Granite 3.0 1B MoE Instruct	82.2%	10.7%
DeepSeek	DeepSeek-Coder 1.3B Instruct	64.3%	10.7%
Qwen	Qwen2.5 0.5B Instruct	93.9%	10.5%
Google	Gemma 2 2B Instruct	92.4%	10.4%
Aleph Alpha	Luminous-extended	77.7%	10.0%
Hugging Face	SmolLM 1.7B Instruct	97.3%	9.1%
Qwen	Qwen2.5 Coder 0.5B Instruct	95.7%	9.1%
Aleph Alpha	Luminous-supreme-control-20240215	56.3%	9.1%
Google	CodeGemma 2B v1.1	86.9%	8.5%
Mistral AI	Mistral Tiny	78.5%	8.2%
Hugging Face	SmolLM2 360M Instruct	92.2%	7.9%
Google	CodeGemma 2B	87.5%	7.7%
Aleph Alpha	Luminous-extended-control-20240215	69.2%	7.5%
Aleph Alpha	Luminous-base	62.4%	7.0%
Meta	Llama 3.2 1B Instruct	85.8%	6.7%
Aleph Alpha	Luminous-base-control-20240215	76.3%	6.6%
OpenChat	openchat3.6 8B	89.7%	6.3%
Hugging Face	SmolLM 360M Instruct	94.3%	5.4%
Replit	replit-code-v1_5-3B	40.5%	4.2%
Meta	Llama 2 7B	26.4%	3.7%
Sea AI Lab	Sailor2 1B	83.9%	3.3%
Falcon LLM	Falcon 7B	45.3%	3.3%
NVIDIA	Minitron 4B	7.3%	2.8%
Hugging Face	SmolLM 135M Instruct	91.7%	2.2%
Hugging Face	SmolLM2 135M Instruct	82.3%	1.3%
Salesforce	xLAM 1B	22.8%	0.3%

This table and previous versions are available in computable form in the Wolfram Data Repository.
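As a rough illustration of treating the table as data, the Python sketch below parses a handful of rows taken from the table on this page and picks each vendor's best-scoring model by functionality. It does not access the Wolfram Data Repository. The helper names `parse_rows` and `best_per_vendor` are made up for this example, and the four-column tab-separated layout (vendor, model, syntax, functionality) is assumed from the table header.

```python
import csv
import io

# A few rows taken from the table (tab-separated: vendor, model, syntax, functionality).
ROWS = """\
Qwen\tQwen2.5-Max (2025-01-25)\t100.0%\t57.3%
DeepSeek\tDeepSeek-R1\t99.6%\t55.1%
OpenAI\to1-preview (2024-09-12)\t99.7%\t52.2%
OpenAI\to1 (2024-12-17)\t100.0%\t51.0%
DeepSeek\tDeepSeek V3 671B\t99.9%\t50.6%
"""


def parse_rows(text):
    """Parse tab-separated rows into dicts with numeric percentage fields."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    parsed = []
    for vendor, model, syntax, functionality in reader:
        parsed.append({
            "vendor": vendor,
            "model": model,
            "syntax": float(syntax.rstrip("%")),
            "functionality": float(functionality.rstrip("%")),
        })
    return parsed


def best_per_vendor(rows):
    """Map each vendor to its row with the highest functionality score."""
    best = {}
    for row in rows:
        current = best.get(row["vendor"])
        if current is None or row["functionality"] > current["functionality"]:
            best[row["vendor"]] = row
    return best


rows = parse_rows(ROWS)
for vendor, row in best_per_vendor(rows).items():
    print(f"{vendor}: {row['model']} ({row['functionality']}%)")
```

The same grouping could be applied to the full table to compare vendors, or the `syntax` and `functionality` fields could be subtracted to see how much syntactically valid output fails the functional check.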

Find out how Wolfram Language can enhance your LLM results.

For LLM developers: contact us for the dataset and tools or to arrange for your LLM to be included.
