Record-Breaking AI Inference Performance with Azure ND Virtual Machines

In this Microsoft Events session, Hugo Affaticati and Nitin Nagarkatte demonstrate how Azure ND Virtual Machines deliver record-setting AI inference throughput and efficiency, drawing on engineering innovations showcased at Ignite 2025.


Speakers: Hugo Affaticati, Nitin Nagarkatte
Event: Microsoft Ignite 2025, Breakout Session BRK180 (Intermediate Level)

Introduction

This session highlights Azure's achievements in AI inference performance, with the ND GB200 v6 and ND GB300 v6 Virtual Machines reaching 865,000 and 1.1 million tokens per second, respectively. These results stem from optimizations across the entire compute stack, from low-level GPU kernels such as GEMM to sophisticated attention mechanisms and multi-node scaling solutions.
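Headline figures like these are typically reported as aggregate decode throughput: total tokens generated divided by wall-clock time across all concurrent requests. A minimal sketch of that measurement, using a hypothetical `generate` stand-in rather than Azure's actual benchmark harness:

```python
import time

def tokens_per_second(generate, prompts):
    """Aggregate decode throughput: total generated tokens / wall-clock time."""
    start = time.perf_counter()
    total_tokens = sum(len(generate(p)) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Hypothetical stand-in for a real inference endpoint; a production
# benchmark would issue concurrent requests to the model server instead.
def fake_generate(prompt):
    return list(range(128))  # pretend the model emitted 128 tokens

if __name__ == "__main__":
    tps = tokens_per_second(fake_generate, ["example prompt"] * 10)
    print(f"{tps:,.0f} tokens/s")
```

Real harnesses (e.g., MLPerf Inference-style runs) additionally control for batch size, sequence lengths, and warm-up, which strongly affect the reported number.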

Key Topics Covered

- Infrastructure Overview
- Session Highlights
- Further Resources

Conclusion

This session gives practitioners a deep technical look at how Azure scales inference, optimizing every layer of the stack for accelerated, cost-effective AI workloads. It is aimed at enterprise developers, data scientists, and AI infrastructure architects seeking actionable strategies for deploying AI at production scale.