WebDev Arena Leaderboard

WebDev Arena is an open-source benchmark evaluating AI capabilities in web development, developed by LMArena.

Leaderboard

| Rank | Model | Organization | Arena Score | 95% CI | Votes | License |
|---|---|---|---|---|---|---|
| | claude-3-5-sonnet-20241022 | | 1252.97 | +5.67 / -5.83 | 15,727 | Proprietary |
| #2 | DeepSeek-R1 | DeepSeek | 1210.88 | +9.32 / -11.26 | 3,539 | MIT |
| | o3-mini-2025-01-31-high | | 1161.38 | +32.58 / -35.18 | 342 | Proprietary |
| | claude-3-5-haiku-20241022 | | 1139.00 | +5.66 / -6.27 | 10,172 | Proprietary |
| | o3-mini-2025-01-31 | | 1110.15 | +8.54 / -9.42 | 3,310 | Proprietary |
| | gemini-2.0-pro-exp-02-05 | | 1109.54 | +8.61 / -10.75 | 4,543 | Proprietary |
| | o1-2024-12-17 | | 1053.81 | +7.09 / -6.05 | 8,376 | Proprietary |
| | o1-mini-2024-09-12 | | 1053.69 | +5.48 / -6.47 | 12,871 | Proprietary |
| | gemini-2.0-flash-thinking-exp-01-21 | | 1038.53 | +16.84 / -17.27 | 1,064 | Proprietary |
| | gemini-2.0-flash-thinking-exp-1219 | | 1026.87 | +6.85 / -5.82 | 8,010 | Proprietary |
| | gemini-exp-1206 | | 1025.29 | +4.69 / -5.56 | 12,099 | Proprietary |
| | gemini-2.0-flash-exp | | 986.97 | +6.83 / -5.59 | 14,482 | Proprietary |
| #12 | qwen-max-2025-01-25 | | 981.29 | +11.67 / -12.81 | 2,702 | Proprietary |
| #13 | DeepSeek-V3 | DeepSeek | 966.28 | +7.92 / -5.88 | 7,478 | DeepSeek |
| | gpt-4o-2024-11-20 | | 964.00 | +4.91 / -5.70 | 13,838 | Proprietary |
| | qwen-2.5-coder-32b-instruct | | 904.14 | +6.04 / -5.58 | 12,497 | Apache 2.0 |
| | gemini-1.5-pro-002 | | 894.65 | +7.08 / -5.73 | 12,155 | Proprietary |
| | llama-v3.1-405b-instruct | | 813.69 | +19.07 / -14.97 | 1,117 | Llama 3.1 |
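
Each 95% CI entry is a pair of offsets above and below the Arena Score. As a rough illustration of the arithmetic, the sketch below turns a score and its "+x / -y" string into lower and upper bounds and checks whether two intervals overlap. The helper names `ci_bounds` and `intervals_overlap` are hypothetical, and this is not a description of how the leaderboard itself assigns ranks.

```python
import re


def ci_bounds(score: float, ci: str) -> tuple[float, float]:
    """Convert a score and a '+x / -y' offset string into (lower, upper)."""
    match = re.fullmatch(r"\s*\+([\d.]+)\s*/\s*-([\d.]+)\s*", ci)
    if match is None:
        raise ValueError(f"unrecognized CI format: {ci!r}")
    plus, minus = float(match.group(1)), float(match.group(2))
    return score - minus, score + plus


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Scores and offsets copied from the leaderboard entries above.
first = ci_bounds(1252.97, "+5.67 / -5.83")
second = ci_bounds(1210.88, "+9.32 / -11.26")
third = ci_bounds(1161.38, "+32.58 / -35.18")
fourth = ci_bounds(1139.00, "+5.66 / -6.27")

print(intervals_overlap(first, second))
print(intervals_overlap(third, fourth))
```

The third entry has only 342 votes, which goes with its much wider interval; that interval fully contains the fourth entry's interval, while the first two intervals do not touch.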

More Statistics for WebDev Arena (Overall)

Confidence Interval for Model Strength

Figure 1: Confidence intervals on model rating. Axes: Model, Rating (ticks from 800 to 1200). Models in plotted order: claude-3-5-sonnet-20241022, deepseek-r1, o3-mini-2025-01-31-high, claude-3-5-haiku-20241022, o3-mini-2025-01-31, gemini-2.0-pro-exp-02-05, o1-2024-12-17, o1-mini-2024-09-12, gemini-2.0-flash-thinking-exp-01-21, gemini-2.0-flash-thinking-exp-1219, gemini-exp-1206, gemini-2.0-flash-exp, qwen-max-2025-01-25, deepseek-v3, gpt-4o-2024-11-20, qwen-2.5-coder-32b-instruct, gemini-1.5-pro-002, llama-v3.1-405b-instruct.

Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)

Figure 2: Bar chart of average win rate per model.

| Model | Average win rate |
|---|---|
| claude-3-5-sonnet-20241022 | 0.69 |
| deepseek-r1 | 0.63 |
| claude-3-5-haiku-20241022 | 0.51 |
| gemini-2.0-pro-exp-02-05 | 0.49 |
| o3-mini-2025-01-31-high | 0.46 |
| o3-mini-2025-01-31 | 0.46 |
| gemini-2.0-flash-thinking-exp-... | 0.38 |
| o1-2024-12-17 | 0.37 |
| o1-mini-2024-09-12 | 0.37 |
| gemini-exp-1206 | 0.37 |
| gemini-2.0-flash-thinking-exp-... | 0.35 |
| gemini-2.0-flash-exp | 0.32 |
| gpt-4o-2024-11-20 | 0.27 |
| qwen-max-2025-01-25 | 0.26 |
| deepseek-v3 | 0.25 |
| qwen-2.5-coder-32b-instruct | 0.19 |
| gemini-1.5-pro-002 | 0.18 |
| llama-v3.1-405b-instruct | 0.13 |
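
One plausible reading of "assuming uniform sampling and no ties" is that each model's average win rate is the unweighted mean of its pairwise win fractions against every other model, with tied battles excluded beforehand. The sketch below shows that arithmetic on a made-up three-model matrix. The function name `average_win_rates`, the model labels, and the numbers are all hypothetical, and the page does not confirm that this is the exact computation behind Figure 2.

```python
def average_win_rates(win_fraction: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each model's win fraction over all opponents, weighted equally.

    win_fraction[a][b] is the fraction of non-tied a-vs-b battles won by a.
    """
    averages = {}
    for model, row in win_fraction.items():
        # Skip any self-pairing; every other opponent counts once.
        rates = [rate for opponent, rate in row.items() if opponent != model]
        averages[model] = sum(rates) / len(rates)
    return averages


# Hypothetical pairwise win fractions (not taken from the leaderboard).
example = {
    "model-a": {"model-b": 0.6, "model-c": 0.8},
    "model-b": {"model-a": 0.4, "model-c": 0.7},
    "model-c": {"model-a": 0.2, "model-b": 0.3},
}

for model, rate in average_win_rates(example).items():
    print(f"{model}: {rate:.2f}")
```

Weighting every opponent equally means a model's average does not depend on how often each pairing was actually sampled.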

Fraction of Model A Wins for All Non-tied A vs. B Battles

Figure 3: Heatmap over the same 18 models on both axes. Plot title: Predicted Win Rate Using Elo Ratings for Model A in an A vs. B Battle. Axes: Model B, Model A. Color scale from 0 to 0.8.
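
For readers unfamiliar with Elo-style predictions, the standard Elo formula gives the expected win probability of A over B as 1 / (1 + base^((R_B - R_A) / scale)). The sketch below applies it to two Arena Scores from the leaderboard. The conventional values scale = 400 and base = 10 are assumptions here, since the page does not state which constants it uses, and `predicted_win_rate` is a hypothetical helper name.

```python
def predicted_win_rate(rating_a: float, rating_b: float,
                       scale: float = 400.0, base: float = 10.0) -> float:
    """Elo expected score for A against B (ties ignored).

    scale=400 and base=10 are the conventional Elo constants, assumed here.
    """
    return 1.0 / (1.0 + base ** ((rating_b - rating_a) / scale))


# Arena Scores copied from the leaderboard: top entry vs. DeepSeek-R1.
top_score = 1252.97
deepseek_r1_score = 1210.88

p = predicted_win_rate(top_score, deepseek_r1_score)
print(f"predicted win rate of the top entry over DeepSeek-R1: {p:.3f}")
```

Under these assumed constants, a gap of about 42 rating points corresponds to a predicted win rate only modestly above one half, and the two directions of a matchup always sum to 1.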

Battle Count for Each Combination of Models (without Ties)

Figure 4: Heatmap over the same 18 models on both axes. Plot title: Battle Count for Each Combination of Models. Axes: Model B, Model A. Color scale from 0 to 1500.