Azure ND GB300 v6: Achieving Over 1 Million Tokens/sec on Llama2 70B Inference
Hugo Affaticati and Mark Gitau detail Azure ND GB300 v6 VMs’ record-breaking throughput for Llama2 70B inference, sharing technical benchmarks and a step-by-step Azure deployment guide.
Authors: Mark Gitau (Software Engineer), Hugo Affaticati (Senior Cloud Infrastructure Engineer)
Overview
Microsoft Azure’s ND GB300 v6 virtual machines, built on NVIDIA’s Blackwell architecture (GB300 NVL72), deliver a breakthrough in AI inference throughput. In formal benchmarking, a full NVL72 rack of 18 VMs achieved a record aggregate of 1,100,948 tokens/sec (over 1.1 million) on Llama2 70B under the MLPerf Inference v5.1 Offline scenario, a 27% improvement over the previous Azure ND GB200 v6 result.
Key Hardware and Benchmark Details
- Cloud Platform: Microsoft Azure
- VM Instance SKU: ND_GB300_v6
- System Configuration: 18 × ND GB300 v6 VM instances in one NVL72 rack
- GPU: 4 × NVIDIA GB300 per VM (72 total)
- GPU Memory: 189,471 MiB per GPU
- GPU Power Limit: 1,400 W
- Storage: 14 TB local NVMe RAID per VM
- Inference Engine: NVIDIA TensorRT-LLM
- Benchmark Harness: MLCommons MLPerf Inference v5.1 (Offline scenario)
- Model: Llama2-70B (quantized to FP4 precision)
 
Performance Metrics (Table 1)
| Metric | Throughput (tokens/sec) |
|---|---|
| Total aggregated (18 VMs) | 1,100,948.3 |
| Maximum single VM | 62,803.9 |
| Minimum single VM | 57,599.1 |
| Average single VM | 61,163.8 |
| Median single VM | 61,759.1 |
Compared to previous generations:
- Azure ND GB200 v6: 865,000 tokens/sec aggregate
- NVIDIA DGX H100: ~3,066 tokens/sec per GPU
- ND H100 v5 VM: roughly one-fifth the per-GPU throughput of ND GB300 v6 (1,100,948.3 tokens/sec ÷ 72 GPUs ≈ 15,291 tokens/sec per GPU, about 5× the ~3,066 tokens/sec per GPU of the Hopper generation)
 
Technical Highlights
- GEMM TFLOPS: ND GB300 v6 achieves 2.5× the GEMM TFLOPS per GPU of ND H100 v5
- Memory Bandwidth: 7.37 TB/s of HBM bandwidth, at 92% efficiency
- NVLink C2C: 4× faster CPU-to-GPU transfer speeds
- NCCL Communication: improved GPU interconnect performance (see the sanity-check sketch below)
- FP4 Precision: quantization delivers fast inference while preserving accuracy
- Software Stack: NVIDIA TensorRT-LLM, MLPerf Inference v5.1, containerized workflows
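The interconnect claims above can be sanity-checked independently of the MLPerf run. A minimal sketch using NVIDIA's nccl-tests; this assumes the nccl-tests binaries are built inside the benchmarking container, which is not part of the steps below:
# Measure all-reduce bandwidth across the 4 GPUs of a single VM,
# sweeping message sizes from 8 B to 128 MB
./all_reduce_perf -b 8 -e 128M -f 2 -g 4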
 
Detailed Benchmark Replication Guide
Prerequisites
- Access to an Azure ND GB300 v6 VM (a quick verification sketch follows this list)
- Basic familiarity with Docker, Python, and ML benchmarking
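Before running anything, it is worth confirming that the VM matches the configuration listed earlier. A minimal sketch; the expected values come from the hardware table above, and exact nvidia-smi output formatting may differ by driver version:
# Confirm 4 GPUs per VM with the expected memory size and power limit
nvidia-smi --query-gpu=index,name,memory.total,power.limit --format=csv
# Expect 4 rows, each reporting roughly 189471 MiB of memory and a 1400 W power limit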
 
Step 1: Clone the Benchmarking Guide
git clone https://github.com/Azure/AI-benchmarking-guide.git && cd AI-benchmarking-guide/Azure_Results
Step 2: Download Models & Datasets
- Create directories for models, data, and preprocessed_data (a minimal sketch follows this list)
- Download the Llama2 70B model and datasets as described in the repo instructions
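The post does not prescribe an exact directory layout; a minimal sketch, assuming the scratch path exported in Step 5 (/work) is the parent of the three directories the MLPerf harness expects:
# Scratch layout used by the harness (see MLPERF_SCRATCH_PATH in Step 5)
mkdir -p /work/models /work/data /work/preprocessed_data
# Place the Llama2 70B checkpoint and datasets here per the repo instructions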
 
Step 3: Setup and Build Container
mkdir build && cd build
git clone https://github.com/NVIDIA/TensorRT-LLM.git TRTLLM
cd TRTLLM
# Edit Makefile lines 135, 136 to set SOURCE_DIR and CODE_DIR paths (an illustrative sed follows this block)
make -C docker build
make -C docker run
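The Makefile edit noted in the comment above can also be scripted. This is illustrative only: the line numbers (135-136) come from the post, but the surrounding Makefile content depends on the TensorRT-LLM revision you check out, and the destination paths shown are placeholders; inspect the file before changing it.
# Inspect the lines around the SOURCE_DIR and CODE_DIR definitions first
sed -n '130,140p' Makefile
# Then point them at your local checkout (placeholder paths; adjust to your layout)
sed -i '135s|=.*|= /path/to/AI-benchmarking-guide/Azure_Results/1M_ND_GB300_v6_Inference|' Makefile
sed -i '136s|=.*|= /path/to/your/code/dir|' Makefile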
Step 4: Install TensorRT-LLM and MLPerf Dependencies
cd 1M_ND_GB300_v6_Inference/build/TRTLLM
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --cuda_architectures "103-real" --no-venv --clean
pip install build/tensorrt_llm-1.1.0rc6-cp312-cp312-linux_aarch64.whl
make clone_loadgen && make build_loadgen
git clone https://github.com/NVIDIA/mitten.git ./build/mitten && pip install build/mitten
pip install -r docker/common/requirements/requirements.llm.txt
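Before moving to Step 5, a quick sanity check that the wheel installed correctly; the version printed should match the wheel built above:
# Verify the TensorRT-LLM Python package imports and report its version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"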
Step 5: Link Datasets and Run Benchmark
export MLPERF_SCRATCH_PATH=/work
export SYSTEM_NAME=ND_GB300_v6
make link_dirs
make run_llm_server RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"
make run_harness RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"
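When the harness finishes, each VM writes a standard MLPerf LoadGen summary. A minimal check of run validity and per-VM throughput; the log location under the build tree varies (hence the find), and the exact metric lines depend on the LoadGen configuration:
# Locate the LoadGen summary and extract validity and throughput
find . -name mlperf_log_summary.txt -exec grep -H -E "Result is|Tokens per second" {} +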
- Log files for all 18 runs: see “Benchmark Results”
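To reproduce the aggregate numbers in Table 1 from the 18 per-VM results, the individual throughput values can be combined with standard shell tools. A minimal sketch, assuming the 18 per-VM tokens/sec values have been collected one per line into per_vm_tokens.txt (a hypothetical file name, not produced by the harness):
# Sum, min, max, and mean of the 18 per-VM throughput values
awk 'NR==1{min=$1; max=$1} {s+=$1; if($1<min)min=$1; if($1>max)max=$1} END{printf "total=%.1f min=%.1f max=%.1f mean=%.1f\n", s, min, max, s/NR}' per_vm_tokens.txt
# Median: middle of the sorted list (for 18 values, the mean of the 9th and 10th entries)
sort -n per_vm_tokens.txt | awk '{a[NR]=$1} END{m=(NR%2)?a[(NR+1)/2]:(a[NR/2]+a[NR/2+1])/2; printf "median=%.1f\n", m}'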
 
Notes on Results
- [1] Unverified MLPerf® Inference v5.1 result; see MLCommons for validation protocols
- [2] Comparison is against the verified NVIDIA DGX H100 result, ID 4.1-0043
 
Summary
Azure ND GB300 v6 enables enterprise AI inference at a new scale, driven by advances across both the hardware and the ML software stack. By following the guide above, practitioners can replicate this benchmark using the open-source MLPerf harness and TensorRT-LLM on Azure VMs.
For questions, analysis, or further details, refer to the original benchmarking guide or reach out to Hugo Affaticati via the Azure HPC Blog.
This post appeared first on the Microsoft Tech Community.