vLLM 2024 Analysis

This notebook explores vLLM's usage data in 2024. The data covers 2024-04-01 through 2024-12-15. The collection is only a small sample of users, non-identifiable, and typically opt-out. Real usage is likely much larger than what is shown here, so we focus on relative trends.

Over these 259 days, we recorded a total of 534,333,432 hours of compute running on vLLM, which is equivalent to about 85,960 GPUs running non-stop!
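The equivalence is simple arithmetic; here is the back-of-the-envelope calculation, using only the numbers quoted above:

```python
# Back-of-the-envelope check: total GPU hours -> equivalent always-on GPUs.
total_gpu_hours = 534_333_432
days = 259                    # 2024-04-01 through 2024-12-15
hours_in_window = days * 24   # 6,216 hours

equivalent_gpus = total_gpu_hours / hours_in_window
print(f"{equivalent_gpus:,.0f} GPUs running non-stop")  # ~85,960
```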

Key takeaways

  • Usage: 10x growth in 6 months.
  • Model: Llama and Qwen dominate.
  • Hardware: Diversity of H100, A100, and inference chips; we are seeing a small AMD trend.
Total GPU hours by Week
Week-over-Week Change in GPU Hours (%)
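As a rough sketch of how these two series could be reproduced, the snippet below aggregates per-report GPU hours into weekly totals and computes the week-over-week change. The dataframe schema (a `date` timestamp and a `gpu_hours` column) and the sample rows are assumptions for illustration, not the actual telemetry format.

```python
import pandas as pd

# Made-up sample rows; in practice this would be the collected usage reports.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-04-03", "2024-04-10", "2024-04-11", "2024-04-18"]),
    "gpu_hours": [1200.0, 1500.0, 900.0, 3100.0],
})

# Total GPU hours per week.
weekly = df.set_index("date")["gpu_hours"].resample("W").sum()

# Week-over-week change in percent.
wow_change = weekly.pct_change() * 100

print(weekly)
print(wow_change.round(1))
```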
Breakdown by Hardware Vendor
[Chart: breakdown by gpu_vendor (AMD, Ascend, NVIDIA)]
Usage by Major GPU Type
[Chart: usage by gpu_unit (4090, A100, A10G, A800, H100, H20, H800, L20, L4, L40, V100)]
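Both the vendor and the GPU-type charts are simple group-bys over the same records. A minimal sketch is below; the column names (`gpu_vendor`, `gpu_unit`, `gpu_hours`) are taken from the chart legends, and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration only.
df = pd.DataFrame({
    "gpu_vendor": ["NVIDIA", "NVIDIA", "AMD", "NVIDIA", "Ascend"],
    "gpu_unit":   ["H100", "A100", "MI300X", "H100", "910B"],
    "gpu_hours":  [5000.0, 2000.0, 300.0, 7000.0, 150.0],
})

# Share of total GPU hours by hardware vendor.
by_vendor = df.groupby("gpu_vendor")["gpu_hours"].sum().sort_values(ascending=False)
vendor_share = by_vendor / by_vendor.sum() * 100

# Same breakdown at the GPU model level.
by_unit = df.groupby("gpu_unit")["gpu_hours"].sum().sort_values(ascending=False)

print(vendor_share.round(1))
print(by_unit)
```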
Usage by Model Architecture in Serving
[Chart: usage by model_architecture (DeepseekV2ForCausalLM, Gemma2ForCausalLM, LlamaForCausalLM, MixtralForCausalLM, MllamaForConditionalGeneration, Qwen2ForCausalLM, Qwen2VLForConditionalGeneration)]
Usage by Entrypoints
[Chart: usage by context (BATCH, INTEGRATION, SERVING)]
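The entrypoint split and the serving-only model breakdown above can both be derived from the `context` field. A minimal sketch, with assumed column names and made-up sample rows:

```python
import pandas as pd

# Made-up sample rows; "context" marks how vLLM was launched.
df = pd.DataFrame({
    "context": ["SERVING", "SERVING", "BATCH", "INTEGRATION", "SERVING"],
    "model_architecture": [
        "LlamaForCausalLM", "Qwen2ForCausalLM",
        "LlamaForCausalLM", "MixtralForCausalLM", "LlamaForCausalLM",
    ],
    "gpu_hours": [4000.0, 1500.0, 800.0, 200.0, 2500.0],
})

# GPU hours split across entrypoints.
by_entrypoint = df.groupby("context")["gpu_hours"].sum()

# Model-architecture breakdown restricted to serving deployments,
# as in the "Usage by Model Architecture in Serving" chart.
serving = df[df["context"] == "SERVING"]
by_arch = serving.groupby("model_architecture")["gpu_hours"].sum().sort_values(ascending=False)

print(by_entrypoint)
print(by_arch)
```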
Llama's TP Size Trend
[Chart: usage by model_architecture_tp (LlamaForCausalLM-1, LlamaForCausalLM-2, LlamaForCausalLM-4, LlamaForCausalLM-8)]
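The TP-size series can be built by pivoting Llama deployments on their tensor-parallel size over time. A minimal sketch, again with assumed column names and made-up rows:

```python
import pandas as pd

# Made-up Llama deployment records with their tensor-parallel (TP) size.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-08-01",
                            "2024-08-02", "2024-11-01", "2024-11-02"]),
    "tp_size": [1, 8, 4, 1, 8, 2],
    "gpu_hours": [100.0, 800.0, 400.0, 300.0, 1600.0, 250.0],
})

# Monthly GPU hours per TP size, i.e. the "LlamaForCausalLM-{tp}" series.
trend = (df.groupby([pd.Grouper(key="date", freq="MS"), "tp_size"])["gpu_hours"]
           .sum()
           .unstack("tp_size", fill_value=0.0))

# Normalize each month to percentages to see the shift between TP sizes.
trend_pct = trend.div(trend.sum(axis=1), axis=0) * 100
print(trend_pct.round(1))
```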
Increasing Percentage of vLLM Deployments with Quantization
[Chart: share of deployments by quantization (awq, awq_marlin, bitsandbytes, fbgemm_fp8, fp8, gguf, gptq, gptq_marlin, groupwise-quant, null)]
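The quantization trend is the share of deployments whose `quantization` field is non-null (the null bucket in the chart corresponds to unquantized deployments). A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up sample rows; None corresponds to the "null" (unquantized) bucket.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-08", "2024-10-01",
                            "2024-10-08", "2024-10-15"]),
    "quantization": [None, "awq", "fp8", None, "gptq_marlin"],
})

# Percentage of deployments per week that report any quantization method.
quantized_share = (df.set_index("date")["quantization"].notna()
                     .resample("W").mean() * 100)

print(quantized_share.round(1))
```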