Benchmarking Llama 3.1 8B Inference with vLLM on Azure GPU and CPU VMs
Cormac Garvey evaluates the inference performance and cost-efficiency of Llama 3.1 8B using vLLM across Azure GPU and CPU virtual machines, offering actionable benchmarks and deployment strategies for enterprise AI workloads.
Benchmarking Llama 3.1 8B Inference with vLLM on Azure GPU and CPU VMs
Introduction
This report presents a comparative analysis of the inference performance of the Llama 3.1 8B large language model using vLLM, evaluated across a range of Azure ND-series GPU and CPU virtual machines. It expands on previous work by benchmarking not only throughput and latency but also the cost-efficiency associated with deploying large language models (LLMs) in enterprise settings.
Key focus areas include:
- Performance across chat, document classification, and code generation AI workloads
- Resource utilization (particularly KV cache effectiveness)
- Cost per token metrics, based on Azure’s region-specific pricing
Benchmark Environment
Benchmarks Used:
- Tool: Hugging Face Inference Benchmarker
- Profiles: Chat (share_gpt_turns.json), Classification (classification.json), Code Generation (github_code.json)
- Model: meta-llama/Llama-3.1-8B-Instruct (FP16 precision, 14.9 GiB)
- Inference Engine: vLLM with tuned parameters (e.g., gpu_memory_utilization=0.9, max_num_seqs=1024, chunked prefill and prefix caching enabled)
Hardware:
- GPUs: ND-H100-v5, ND-H200-v5, HD-A100-v4 (both 80GB and 40GB) on HPC Ubuntu 22.04, PyTorch 2.7.0 + CUDA 12.8
- CPUs: HPC-class Ubuntu 22.02 VMs
Results Summary
Throughput & Latency
- H200 GPU: Top performance across all workloads (highest prompt/generation throughput).
- H100 GPU: Close second, especially strong on classification and code generation.
- A100/A100_40G: Lagging, especially in classification (↓ throughput due to memory and cache constraints).
KV Cache Analysis
- H200/H100: High hit rates (up to 99% in classification), efficient cache use, minimal request queuing except for code generation (where hit rates drop).
- A100_40G: High cache usage, very low hit rates for classification and code gen; higher server queuing observed.
Cost Efficiency
- Chat workloads: A100 40G best value.
- Classification: H200 shines.
- Code Generation: H100 delivers optimum efficiency.
CPU vs. GPU
- CPUs struggle to deliver acceptable throughput or latency for Llama 3.1 8B. Even advanced CPU VMs (HB176-96_v4) are an order of magnitude slower than GPUs. Only small models (≤1B parameters) may be feasible on CPU for light workloads.
Optimization Tips
- Enable AVX512 support on CPU VMs if available
- Fit model on a single socket if possible, or use tensor parallelism for split inference
- Use CPU core/thread pinning to optimize vLLM independent thread performance
- Allocate sufficient CPU memory for KVCache (important for larger models)
Conclusion
Hardware choice on Azure directly impacts both AI inference speed and cost. The H200 GPU is the overall top performer, followed by the H100, while the A100/A100_40G provide budget alternatives mainly for chat-type tasks. CPU VMs are currently not viable for high-throughput LLM inference. Practitioners planning enterprise AI deployments on Azure should match GPU choice to their workload’s performance and cost needs.
References and Resources
- Hugging Face Inference Benchmarker: https://github.com/huggingface/inference-benchmarker
- Llama 3.1 8B Model: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
- vLLM Engine: https://github.com/vllm-project/vllm
- Azure ND-Series GPU Docs: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series
- Azure Pricing Calculator: https://azure.microsoft.com/en-us/pricing/calculator
- CPU vLLM Install: https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html
Appendix: vLLM on CPU VMs
- Clone:
git clone https://github.com/vllm-project/vllm.git vllm_source
- Adjust Dockerfiles as needed (see content for ENTRYPOINT details)
- Build with AVX512 flags and required Docker targets
- Launch vLLM serve/benchmark with proper environment variables (kv cache, thread binding, Hugging Face token, etc.)
- See full Docker commands and environment variables within the appendix details.
For deeper graphs, benchmarks datasets, and full tuning parameters, refer to the links above or the original evaluation content.
This post appeared first on “Microsoft Tech Community”. Read the entire article here