Record-Breaking AI Inference Performance with Azure ND Virtual Machines
In this Microsoft Ignite 2025 session, Hugo Affaticati and Nitin Nagarkatte demonstrate how Azure ND Virtual Machines deliver record-setting AI inference throughput and efficiency through engineering innovations across the compute stack.
Speakers: Hugo Affaticati, Nitin Nagarkatte
Event: Microsoft Ignite 2025, Breakout Session BRK180 (Intermediate Level)
Introduction
This session highlights Azure’s achievements in AI inference performance, with the ND GB200 v6 and ND GB300 v6 Virtual Machines reaching 865,000 and 1.1 million tokens per second, respectively. These results stem from optimizations across the compute stack, from low-level GPU kernels such as GEMM through attention mechanisms to multi-node scaling.
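The session's kernel-level work is not published in this recap; for orientation, here is a minimal sketch of the kind of GEMM throughput micro-benchmark used to characterize low-level kernel performance, assuming PyTorch on a CUDA-capable VM. All shapes, names, and iteration counts are illustrative.

```python
import time
import torch

def gemm_tflops(m: int, n: int, k: int, iters: int = 50) -> float:
    """Measure achieved TFLOP/s for a half-precision GEMM of shape (m, k) x (k, n)."""
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    # Warm up so kernel selection/autotuning does not skew the timing.
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2.0 * m * n * k * iters  # each multiply-add counts as 2 FLOPs
    return flops / elapsed / 1e12

if __name__ == "__main__":
    # Shape loosely modeled on a transformer feed-forward projection.
    print(f"{gemm_tflops(8192, 8192, 8192):.1f} TFLOP/s")
```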
Key Topics Covered
- Performance Milestones:
  - ND GB200 v6: 865K tokens/sec
  - ND GB300 v6: 1.1M tokens/sec
- Deep Stack Optimization Techniques:
  - Custom GPU kernel engineering (GEMM)
  - Advanced attention mechanisms for transformer models (see the attention sketch after this list)
  - Efficient multi-node scaling for distributed inference (a tensor-parallel sketch also follows)
- LLAMA Benchmarks:
  - Used to measure and validate throughput gains
  - Demonstrate the impact of hardware and model architecture co-design
- Practical Benefits:
  - Faster time-to-value for AI solutions
  - Lower cost per token for inference workloads (worked arithmetic after this list)
  - Robust, production-ready infrastructure for enterprise-scale AI
  - Enables customers to deploy ultra-high-throughput inference
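The attention optimizations themselves are internal to the serving stack; as a reference point, the sketch below shows the baseline scaled dot-product attention that such kernels accelerate, written in plain PyTorch. Shapes are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Baseline attention: softmax(Q K^T / sqrt(d)) V.
    q, k, v: (batch, heads, seq, head_dim) tensors."""
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Optimized serving stacks replace this with fused kernels
# (flash-attention-style tiling) that avoid materializing the full
# (seq x seq) score matrix; PyTorch exposes one fused path as
# torch.nn.functional.scaled_dot_product_attention.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 8, 128, 64])
```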
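The recap does not spell out the multi-node recipe; the following is a hedged sketch of one common pattern, a column-sharded (tensor-parallel) linear layer built on torch.distributed, with rendezvous assumed to be handled by a launcher such as torchrun. All sizes are illustrative.

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """Each rank computes its column shard of y = x @ W, then shards are
    concatenated in rank order with all_gather to reconstruct the full output."""
    local_out = x @ w_shard  # (batch, out_features // world_size)
    shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, local_out)
    return torch.cat(shards, dim=-1)

if __name__ == "__main__":
    # Launch with e.g.: torchrun --nproc_per_node=2 this_script.py
    dist.init_process_group("gloo")  # use "nccl" on GPU nodes
    torch.manual_seed(0)  # identical weight and input on every rank
    w = torch.randn(512, 2048)
    shard = w.chunk(dist.get_world_size(), dim=1)[dist.get_rank()]
    y = column_parallel_linear(torch.randn(4, 512), shard)
    print(dist.get_rank(), y.shape)  # every rank sees (4, 2048)
```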
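The lower-cost-per-token claim reduces to simple arithmetic: cost per token = hourly VM price / (tokens per second × 3600). The sketch below applies this to the session's reported throughput figures; the hourly price is a placeholder, not a published Azure rate.

```python
def cost_per_million_tokens(vm_price_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return vm_price_per_hour / tokens_per_hour * 1_000_000

# Hypothetical $100/hour price; throughputs are the session's reported aggregates.
for name, tps in [("ND GB200 v6", 865_000), ("ND GB300 v6", 1_100_000)]:
    print(name, f"${cost_per_million_tokens(100.0, tps):.4f} per 1M tokens")
```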
Infrastructure Overview
- Three Pillars of Azure AI Infrastructure:
  - Compute: High-throughput, AI-optimized VMs
  - Network: High-bandwidth, low-latency interconnects
  - Storage: Performance and reliability for data-intensive workloads
- Origins and Evolution:
  - Azure’s foundational role in AI since 2019
  - Expansion during the ChatGPT era (2022 onward)
  - Growing ecosystem of AI use cases and supercomputing resources
Session Highlights
- Real-world performance demonstrations using LLAMA
- Balancing throughput and latency for efficient AI inference (see the batching sketch after this list)
- Insights into hardware/software co-innovation
- Direct connection opportunities with Azure engineers for deeper technical discussions
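Balancing throughput and latency in inference serving usually comes down to batching policy: larger decode batches raise aggregate tokens/sec but add per-request delay. The toy cost model below illustrates the tradeoff; all timing constants are invented for illustration.

```python
def step_latency_ms(batch_size: int, base_ms: float = 8.0, per_seq_ms: float = 0.5) -> float:
    """Toy cost model: a decode step pays a fixed overhead plus a small
    per-sequence cost once the batch starts to saturate compute."""
    return base_ms + per_seq_ms * batch_size

for batch in (1, 8, 32, 128):
    lat = step_latency_ms(batch)
    throughput = batch / (lat / 1000)  # tokens/sec, assuming 1 token/seq/step
    print(f"batch={batch:4d}  step={lat:6.1f} ms  ~{throughput:8.0f} tok/s")
```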
Further Resources
- Explore Azure supercomputing and AI infrastructure resources
Conclusion
This session gives practitioners a deep technical look at how Azure scales inference capabilities, optimizing every layer for fast, cost-effective AI workloads. The content is aimed at enterprise developers, data scientists, and AI infrastructure architects seeking actionable strategies for deploying AI at production scale.