Wolfram LLM Benchmarking Project

Using Wolfram Language to benchmark the performance of major LLMs

As major users and analyzers of large language model (LLM) technology, we've been continually tracking the performance of LLMs. In this project we're releasing our ongoing results, starting with a specific, well-characterized code generation task.

The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language. These exercises have been done online by millions of humans, and we've developed effective tools for determining functional correctness of code, which we're now applying to LLMs.
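To make the two percentages reported below concrete, here is a minimal sketch of how such scores could be tallied from per-exercise outcomes. The record layout and function name are hypothetical and are not the project's actual tooling. The sketch also assumes that both scores are fractions of all exercises, and that a response with a syntax error cannot count as functionally correct.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    """Outcome of one exercise for one model (hypothetical record layout)."""
    exercise_id: str
    syntax_ok: bool        # the generated code parsed as valid Wolfram Language
    functionally_ok: bool  # the generated code behaved as the exercise requires


def summarize(results):
    """Return (correct-syntax %, correct-functionality %), rounded to 0.1."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    syntax = sum(r.syntax_ok for r in results)
    # Assumption: a response with a syntax error is never counted as functional.
    functional = sum(r.syntax_ok and r.functionally_ok for r in results)
    return round(100 * syntax / total, 1), round(100 * functional / total, 1)


# Four made-up outcomes, purely for illustration.
sample = [
    ExerciseResult("ex-1", True, True),
    ExerciseResult("ex-2", True, False),
    ExerciseResult("ex-3", False, False),
    ExerciseResult("ex-4", True, True),
]
print(summarize(sample))  # (75.0, 50.0)
```

Under these assumptions the functionality score can never exceed the syntax score, which matches every row of the table.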

Last Updated: February 04, 2025

Vendor | Model | Correct Syntax | Correct Functionality
Qwen | Qwen2.5-Max (2025-01-25) | 100.0% | 57.3%
DeepSeek | DeepSeek-R1 | 99.6% | 55.1%
OpenAI | o1-preview (2024-09-12) | 99.7% | 52.2%
Anthropic | Claude 3.5 Sonnet (20241022) | 99.7% | 51.9%
OpenAI | o1 (2024-12-17) | 100.0% | 51.0%
DeepSeek | DeepSeek V3 671B | 99.9% | 50.6%
Meta | Llama 3.1 405B Instruct | 99.7% | 50.5%
OpenAI | GPT-4 | 99.8% | 49.7%
Qwen | Qwen2.5-Plus (2025-01-25) | 99.7% | 49.6%
Google | Gemini Exp (1206) | 99.6% | 49.3%
Qwen | Qwen2.5 72B Instruct Q4_K_M | 99.9% | 49.0%
Nexusflow | Athene v2 72B Q4_K_M | 99.7% | 48.5%
OpenAI | GPT-4o (2024-11-20) | 100.0% | 47.8%
OpenAI | GPT-4o (2024-08-06) | 99.9% | 47.7%
Mistral AI | Mistral Large 2 (2411) | 99.7% | 47.2%
Mistral AI | Mistral Large 2 (2407) | 99.7% | 47.2%
OpenAI | GPT-4o (2024-05-13) | 100.0% | 46.2%
OpenAI | GPT-4 Turbo | 99.8% | 46.2%
Google | Gemini 2.0 Flash Thinking Exp (01-21) | 99.4% | 46.0%
Meta | Llama 3.1 70B Instruct | 99.9% | 45.8%
OpenAI | o3-mini (2025-01-31) (medium) | 98.8% | 45.6%
Google | Gemini Pro Exp (0827) | 99.7% | 45.5%
xAi | Grok-2 (1212) | 99.7% | 45.4%
Google | Gemini 1.5 Pro (002) | 99.3% | 45.1%
xAi | Grok-2 Beta | 99.6% | 44.4%
Anthropic | Claude 3 Opus | 99.4% | 44.4%
Qwen | Qwen2.5 32B Instruct | 99.9% | 44.3%
Qwen | Qwen2 Math 72B Instruct Q4_K_M | 98.7% | 44.3%
Mistral AI | Mistral Large 2 (2407) Q4_0 | 99.6% | 43.8%
NVIDIA | Llama 3.1 Nemotron 70B Instruct Q4_K_M | 99.6% | 43.8%
Anthropic | Claude 3.5 Sonnet (20240620) | 99.7% | 43.7%
Google | Gemini 2.0 Flash Exp | 99.3% | 43.6%
OpenAI | o1-mini (2024-09-12) | 99.1% | 43.5%
Google | Gemini Exp (1114) | 98.7% | 43.4%
Google | Gemini 2.0 Flash Thinking Exp | 99.4% | 43.2%
Google | Gemini Exp (1121) | 99.3% | 43.2%
Meta | Llama 3.2 Vision 90B Instruct Q4_K_M | 99.1% | 43.0%
DeepSeek | DeepSeek-Coder-V2 236B | 99.3% | 42.5%
Qwen | Qwen2.5 Coder 32B Instruct | 99.4% | 42.1%
Falcon LLM | Falcon3 10B | 99.1% | 41.8%
Qwen | Qwen2.5 Coder 14B Instruct | 99.6% | 41.4%
Meta | Llama 3.1 70B Instruct Q8_0 | 99.4% | 41.3%
Qwen | QwQ 32B Preview | 98.5% | 41.3%
DeepSeek | DeepSeek V2.5 236B Instruct Q5_1 | 99.9% | 41.0%
Google | Gemini 1.5 Pro (001) | 99.0% | 40.8%
Meta | Llama 3.3 70B Instruct | 99.6% | 40.7%
Mistral AI | Pixtral Large (2411) | 98.7% | 40.5%
Qwen | Qwen2 72B Instruct Q8_0 | 99.6% | 40.2%
Meta | Llama 3 70B Instruct | 99.7% | 39.6%
Google | text-unicorn-001 | 99.6% | 39.5%
Meta | Llama 3.1 70B Instruct Q4_0 | 97.3% | 39.4%
Falcon LLM | Falcon3 7B | 99.6% | 38.9%
Qwen | Qwen2.5 14B Instruct | 99.7% | 38.5%
OpenAI | GPT-3.5 Turbo | 99.0% | 38.5%
Anthropic | Claude 3.5 Haiku (20241022) | 98.7% | 38.5%
Ai2 | Tulu 70B | 100.0% | 38.2%
Mistral AI | Mistral Large 1 (2402) | 98.4% | 38.2%
Google | Gemini 1.5 Flash (002) | 98.7% | 37.6%
Qwen | Qwen2.5-Turbo (2024-11-01) | 99.3% | 37.0%
Google | Gemini Flash Exp (0827) | 99.0% | 37.0%
Meta | Code Llama 34B | 99.7% | 36.1%
Mistral AI | Mixtral 8x22B | 98.2% | 36.1%
NVIDIA | Nemotron-4 340B Instruct | 98.2% | 35.6%
Yi | Yi 1.5 34B | 100.0% | 35.3%
Nexusflow | Athene 70B Q4_K_M | 98.7% | 35.1%
Cohere | Command R plus 104B Q4_K_M | 99.0% | 34.9%
Mistral AI | Codestral | 97.5% | 34.4%
Mistral AI | Mistral Small (2409) | 98.7% | 33.8%
Google | Gemini 1.5 Flash (001) | 98.5% | 33.8%
Falcon LLM | Falcon3 Mamba 7B | 97.9% | 32.9%
Google | Gemma 2 27B Instruct | 99.0% | 31.9%
Google | Gemma 2 27B Instruct Q4_0 | 99.0% | 31.0%
Qwen | Qwen2.5 Coder 7B Instruct | 99.3% | 30.2%
AIDC-AI | Marco-o1 7B | 98.8% | 30.0%
Meta | Code Llama 13B | 98.7% | 30.0%
Qwen | Qwen2.5 7B Instruct | 98.2% | 30.0%
Mistral AI | Mistral Small | 97.5% | 29.9%
Qwen | Qwen2 57B Instruct Q4_K_M | 97.6% | 29.7%
Yi | Yi 1.5 9B | 98.8% | 29.5%
Databricks | DBRX 132B Instruct Q4_0 | 99.7% | 29.4%
Cohere | Aya Expanse 32B | 98.5% | 29.2%
Qwen | Qwen2 7B Instruct | 98.8% | 29.1%
Sea AI Lab | Sailor2 20B | 98.5% | 28.8%
Anthropic | Claude 2.1 | 96.6% | 28.5%
DeepSeek | DeepSeek V2 236B Instruct Q4_K_M | 99.1% | 28.3%
LG AI Research | EXAONE 3.5 32B Instruct | 96.3% | 28.3%
Anthropic | Claude 2 | 87.3% | 28.2%
Anthropic | Claude 3 Sonnet | 98.7% | 27.8%
OpenAI | GPT-4o mini | 94.2% | 27.6%
Mistral AI | Mistral Medium | 88.4% | 27.6%
Cohere | Command R 35B Q4_K_M | 99.4% | 27.4%
DeepSeek | DeepSeek-Coder 7B | 92.1% | 27.3%
Falcon LLM | Falcon3 3B | 98.1% | 27.1%
Qwen | Qwen2 7B Instruct Q4_0 | 99.3% | 26.8%
DeepSeek | DeepSeek-Coder 33B | 92.8% | 26.2%
Meta | Code Llama 7B | 97.2% | 26.0%
Google | Gemini 1.5 Flash 8B | 96.7% | 26.0%
Mistral AI | Mathstral | 98.4% | 25.8%
Mistral AI | Codestral Mamba | 98.2% | 25.5%
IBM | Granite Code 8B Instruct | 93.4% | 25.4%
Meta | Llama 3.1 8B Instruct | 97.6% | 25.3%
Google | text-bison-002 | 98.1% | 25.1%
Groq | Llama 3 Groq Tool Use 70B | 92.3% | 24.9%
Anthropic | Claude 3 Haiku | 98.4% | 24.8%
IBM | Granite 3.0 8B Instruct | 98.1% | 24.7%
Google | code-bison-002 | 97.8% | 24.6%
IBM | Granite 3.1 8B Instruct | 99.0% | 24.5%
Google | code-gecko-002 | 98.1% | 24.5%
DeepSeek | DeepSeek-Coder-V2 16B Instruct | 98.8% | 24.4%
Meta | Llama 3.1 70B Instruct Q2_0 | 93.0% | 24.4%
Salesforce | xLAM 7B | 89.7% | 24.4%
Microsoft | Phi-4 | 98.4% | 24.3%
Google | Gemini 1.0 Pro (002) | 94.5% | 24.2%
IBM | Granite Code 20B Instruct | 98.8% | 24.1%
Qwen | Qwen2 Math 7B Instruct | 93.0% | 24.1%
Meta | Llama 3.2 Vision 11B Instruct | 97.6% | 23.9%
Yi | Yi 1.5 6B | 98.7% | 23.6%
Nexusflow | NexusRaven v2 13B | 96.9% | 23.4%
Google | CodeGemma 7B instruct v1.1 | 95.4% | 23.4%
Yi | Yi Coder 9B Base | 96.1% | 23.3%
Qwen | Qwen2 Math 7B Instruct Q4_0 | 90.9% | 23.2%
DeepSeek | DeepSeek-Coder 6.7B | 89.1% | 22.8%
Qwen | Qwen2.5 Coder 3B Instruct | 98.1% | 22.2%
Meta | Llama 3 8B Instruct | 97.0% | 22.1%
Ai2 | Tulu 8B | 97.8% | 21.8%
Google | CodeGemma 7B Q4_0 | 87.6% | 21.8%
Cognitive Computations | Dolphin 2.9.4 Llama 3.1 8B Q8_0 | 99.1% | 21.3%
Mistral AI | Mistral Nemo | 97.6% | 21.3%
IBM | Granite Code 3B Instruct | 93.9% | 21.1%
Google | Gemma 2 9B Instruct | 97.0% | 21.0%
Qwen | Qwen2.5 3B Instruct | 97.6% | 20.7%
Google | Gemma 2 9B Instruct Q4_0 | 97.3% | 20.6%
Upstage | Solar Pro | 95.4% | 20.1%
Groq | Llama 3 Groq Tool Use 8B | 96.0% | 19.7%
IBM | Granite Code 34B Instruct | 98.5% | 19.5%
Yi | Yi Coder 9B Chat | 96.7% | 19.4%
Mistral AI | Ministral 8B (2410) | 97.0% | 19.2%
Falcon LLM | Falcon Mamba 7B | 94.2% | 19.2%
Google | code-bison-001 | 95.8% | 19.1%
LG AI Research | EXAONE 3.5 7.8B Instruct | 95.1% | 19.1%
Cognitive Computations | Dolphin 3 Llama 3.1 8B | 95.1% | 18.8%
Ai2 | Olmo2 13B Instruct | 99.1% | 18.6%
Microsoft | Phi-3 Small 128k | 94.6% | 18.5%
Microsoft | Phi-3 Medium | 93.6% | 18.4%
Qwen | Qwen2.5 Coder 1.5B Instruct | 96.7% | 18.2%
Cognitive Computations | Dolphin 2.9.1 Yi 1.5 34B Q4_K_M | 94.2% | 18.2%
Mistral AI | Ministral 3B (2410) | 95.7% | 17.9%
Qwen | Qwen2.5 1.5B Instruct | 99.0% | 17.4%
Upstage | Solar Mini | 96.3% | 17.4%
Microsoft | Phi-3 Mini | 89.6% | 17.2%
Qwen | Qwen2 Math 1.5B Instruct | 91.1% | 17.0%
IBM | Granite 3.1 3B MoE Instruct | 96.4% | 16.9%
Cohere | Aya Expanse 8B | 91.2% | 16.7%
Sea AI Lab | Sailor2 8B | 90.3% | 16.7%
LG AI Research | EXAONE 3.0 7.8B Instruct Q5_K_M | 92.4% | 16.5%
IBM | Granite 3.1 2B Instruct | 97.5% | 16.3%
IBM | Granite 3.0 2B Instruct | 96.0% | 16.3%
IBM | Granite 3.0 3B MoE Instruct | 96.9% | 16.1%
LG AI Research | EXAONE 3.5 2.4B Instruct | 93.9% | 16.0%
Nous Research | Nous-Hermes-2-Mixtral-8x7B-DPO | 79.9% | 15.6%
Ai2 | Olmo2 7B Instruct | 91.3% | 15.5%
OpenChat | openchat3.5 | 88.2% | 15.2%
Cohere | Aya 23 35B | 96.9% | 14.8%
Mistral AI | Pixtral 12B | 93.1% | 14.3%
Hugging Face | SmolLM2 1.7B Instruct | 96.7% | 13.6%
Aleph Alpha | Luminous-supreme | 81.8% | 13.2%
Meta | Llama 3.2 3B Instruct | 94.2% | 12.4%
Microsoft | Phi-3.5 Mini Instruct | 79.4% | 11.6%
Falcon LLM | Falcon3 1B | 94.2% | 11.5%
Falcon LLM | Falcon 40B Instruct Q4_0 | 93.0% | 11.1%
Yi | Yi Coder 1.5B Base | 88.2% | 11.1%
Aleph Alpha | Luminous-supreme-control-20230501 | 86.8% | 10.9%
InternLM | InternLM2.5 20B Q4_0 | 46.2% | 10.9%
IBM | Granite 3.1 1B MoE Instruct | 95.5% | 10.7%
Meta | Llama 2 13B | 91.4% | 10.7%
IBM | Granite 3.0 1B MoE Instruct | 82.2% | 10.7%
DeepSeek | DeepSeek-Coder 1.3B Instruct | 64.3% | 10.7%
Qwen | Qwen2.5 0.5B Instruct | 93.9% | 10.5%
Google | Gemma 2 2B Instruct | 92.4% | 10.4%
Aleph Alpha | Luminous-extended | 77.7% | 10.0%
Hugging Face | SmolLM 1.7B Instruct | 97.3% | 9.1%
Qwen | Qwen2.5 Coder 0.5B Instruct | 95.7% | 9.1%
Aleph Alpha | Luminous-supreme-control-20240215 | 56.3% | 9.1%
Google | CodeGemma 2B v1.1 | 86.9% | 8.5%
Mistral AI | Mistral Tiny | 78.5% | 8.2%
Hugging Face | SmolLM2 360M Instruct | 92.2% | 7.9%
Google | CodeGemma 2B | 87.5% | 7.7%
Aleph Alpha | Luminous-extended-control-20240215 | 69.2% | 7.5%
Aleph Alpha | Luminous-base | 62.4% | 7.0%
Meta | Llama 3.2 1B Instruct | 85.8% | 6.7%
Aleph Alpha | Luminous-base-control-20240215 | 76.3% | 6.6%
OpenChat | openchat3.6 8B | 89.7% | 6.3%
Hugging Face | SmolLM 360M Instruct | 94.3% | 5.4%
Replit | replit-code-v1_5-3B | 40.5% | 4.2%
Meta | Llama 2 7B | 26.4% | 3.7%
Sea AI Lab | Sailor2 1B | 83.9% | 3.3%
Falcon LLM | Falcon 7B | 45.3% | 3.3%
NVIDIA | Minitron 4B | 7.3% | 2.8%
Hugging Face | SmolLM 135M Instruct | 91.7% | 2.2%
Hugging Face | SmolLM2 135M Instruct | 82.3% | 1.3%
Salesforce | xLAM 1B | 22.8% | 0.3%

This table and previous versions are available in computable form in the Wolfram Data Repository.
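As an illustration of what a computable form of the table allows, the sketch below takes five rows copied from the table above and reports each vendor's best functionality score. The tuple layout and function name are assumptions made for this example only; they don't reflect the actual schema of the Wolfram Data Repository entry.

```python
# (vendor, model, correct syntax %, correct functionality %)
# Five rows copied from the February 04, 2025 table.
ROWS = [
    ("Qwen", "Qwen2.5-Max (2025-01-25)", 100.0, 57.3),
    ("DeepSeek", "DeepSeek-R1", 99.6, 55.1),
    ("OpenAI", "o1-preview (2024-09-12)", 99.7, 52.2),
    ("OpenAI", "o1 (2024-12-17)", 100.0, 51.0),
    ("DeepSeek", "DeepSeek V3 671B", 99.9, 50.6),
]


def best_per_vendor(rows):
    """Map each vendor to its (model, functionality %) with the highest score."""
    best = {}
    for vendor, model, _syntax, functionality in rows:
        if vendor not in best or functionality > best[vendor][1]:
            best[vendor] = (model, functionality)
    return best


for vendor, (model, score) in best_per_vendor(ROWS).items():
    print(f"{vendor}: {model} ({score}%)")
```

The same grouping could be applied to earlier versions of the table to follow how a vendor's best score changes over time.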

Find out how Wolfram Language can enhance your LLM results.

For LLM developers: contact us for the dataset and tools or to arrange for your LLM to be included.