vLLM 2024 Analysis

This notebook explores vLLM's usage data in 2024. The data covers 2024-04-01 through 2024-12-15. The collection is only a small sample of users, non-identifiable, and typically opt-out. Real usage is likely much larger than what is shown here, so we focus on relative trends.

Over these 259 days, we recorded a total of 534,333,432 hours of compute running on vLLM, which is equivalent to about 85,960 GPUs running non-stop!
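The equivalence is simple arithmetic; here is the back-of-the-envelope calculation, using only the numbers quoted above:

```python
# Back-of-the-envelope check: total GPU hours -> equivalent always-on GPUs.
total_gpu_hours = 534_333_432
days = 259                    # 2024-04-01 through 2024-12-15
hours_in_window = days * 24   # 6,216 hours

equivalent_gpus = total_gpu_hours / hours_in_window
print(f"{equivalent_gpus:,.0f} GPUs running non-stop")  # ~85,960
```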

Key takeaways

  • Usage: 10x growth in 6 months.
  • Model: Llama and Qwen dominate.
  • Hardware: Diversity of H100, A100, and inference chips; we are seeing a small AMD trend.
Total GPU hours by Week
Week-over-Week Change in GPU Hours (%)
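As a rough sketch of how these two series could be reproduced, the snippet below aggregates per-report GPU hours into weekly totals and computes the week-over-week change. The dataframe schema (a `date` timestamp and a `gpu_hours` column) and the sample rows are assumptions for illustration, not the actual telemetry format.

```python
import pandas as pd

# Made-up sample rows; in practice this would be the collected usage reports.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-04-03", "2024-04-10", "2024-04-11", "2024-04-18"]),
    "gpu_hours": [1200.0, 1500.0, 900.0, 3100.0],
})

# Total GPU hours per week.
weekly = df.set_index("date")["gpu_hours"].resample("W").sum()

# Week-over-week change in percent.
wow_change = weekly.pct_change() * 100

print(weekly)
print(wow_change.round(1))
```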
Breakdown by Hardware Vendor
[Chart: breakdown by gpu_vendor (AMD, Ascend, NVIDIA)]
Usage by Major GPU Type
[Chart: usage by gpu_unit (4090, A100, A10G, A800, H100, H20, H800, L20, L4, L40, V100)]
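Both the vendor and the GPU-type charts are simple group-bys over the same records. A minimal sketch is below; the column names (`gpu_vendor`, `gpu_unit`, `gpu_hours`) are taken from the chart legends, and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration only.
df = pd.DataFrame({
    "gpu_vendor": ["NVIDIA", "NVIDIA", "AMD", "NVIDIA", "Ascend"],
    "gpu_unit":   ["H100", "A100", "MI300X", "H100", "910B"],
    "gpu_hours":  [5000.0, 2000.0, 300.0, 7000.0, 150.0],
})

# Share of total GPU hours by hardware vendor.
by_vendor = df.groupby("gpu_vendor")["gpu_hours"].sum().sort_values(ascending=False)
vendor_share = by_vendor / by_vendor.sum() * 100

# Same breakdown at the GPU model level.
by_unit = df.groupby("gpu_unit")["gpu_hours"].sum().sort_values(ascending=False)

print(vendor_share.round(1))
print(by_unit)
```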
Usage by Model Architecture in Serving
[Chart: usage by model_architecture (DeepseekV2ForCausalLM, Gemma2ForCausalLM, LlamaForCausalLM, MixtralForCausalLM, MllamaForConditionalGeneration, Qwen2ForCausalLM, Qwen2VLForConditionalGeneration)]
Usage by Entrypoints
[Chart: usage by context (BATCH, INTEGRATION, SERVING)]
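The entrypoint split and the serving-only model breakdown above can both be derived from the `context` field. A minimal sketch, with assumed column names and made-up sample rows:

```python
import pandas as pd

# Made-up sample rows; "context" marks how vLLM was launched.
df = pd.DataFrame({
    "context": ["SERVING", "SERVING", "BATCH", "INTEGRATION", "SERVING"],
    "model_architecture": [
        "LlamaForCausalLM", "Qwen2ForCausalLM",
        "LlamaForCausalLM", "MixtralForCausalLM", "LlamaForCausalLM",
    ],
    "gpu_hours": [4000.0, 1500.0, 800.0, 200.0, 2500.0],
})

# GPU hours split across entrypoints.
by_entrypoint = df.groupby("context")["gpu_hours"].sum()

# Model-architecture breakdown restricted to serving deployments,
# as in the "Usage by Model Architecture in Serving" chart.
serving = df[df["context"] == "SERVING"]
by_arch = serving.groupby("model_architecture")["gpu_hours"].sum().sort_values(ascending=False)

print(by_entrypoint)
print(by_arch)
```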
Llama's TP Size Trend
[Chart: usage by model_architecture_tp (LlamaForCausalLM-1, LlamaForCausalLM-2, LlamaForCausalLM-4, LlamaForCausalLM-8)]
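The TP-size series can be built by pivoting Llama deployments on their tensor-parallel size over time. A minimal sketch, again with assumed column names and made-up rows:

```python
import pandas as pd

# Made-up Llama deployment records with their tensor-parallel (TP) size.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-08-01",
                            "2024-08-02", "2024-11-01", "2024-11-02"]),
    "tp_size": [1, 8, 4, 1, 8, 2],
    "gpu_hours": [100.0, 800.0, 400.0, 300.0, 1600.0, 250.0],
})

# Monthly GPU hours per TP size, i.e. the "LlamaForCausalLM-{tp}" series.
trend = (df.groupby([pd.Grouper(key="date", freq="MS"), "tp_size"])["gpu_hours"]
           .sum()
           .unstack("tp_size", fill_value=0.0))

# Normalize each month to percentages to see the shift between TP sizes.
trend_pct = trend.div(trend.sum(axis=1), axis=0) * 100
print(trend_pct.round(1))
```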
Increasing Percentage of vLLM Deployments with Quantization
[Chart: share of deployments by quantization (awq, awq_marlin, bitsandbytes, fbgemm_fp8, fp8, gguf, gptq, gptq_marlin, groupwise-quant, null)]
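The quantization trend is the share of deployments whose `quantization` field is non-null (the null bucket in the chart corresponds to unquantized deployments). A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up sample rows; None corresponds to the "null" (unquantized) bucket.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-08", "2024-10-01",
                            "2024-10-08", "2024-10-15"]),
    "quantization": [None, "awq", "fp8", None, "gptq_marlin"],
})

# Percentage of deployments per week that report any quantization method.
quantized_share = (df.set_index("date")["quantization"].notna()
                     .resample("W").mean() * 100)

print(quantized_share.round(1))
```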