Fast and Flexible Inference for Open-Source AI Models at Scale with Azure
Microsoft Events presents a detailed session from Ignite 2025 on deploying and scaling open-source AI models with Azure Container Apps and AKS. The speakers demonstrate practical strategies for running custom and open-source (OSS) models efficiently.
Session Overview
Presented at Microsoft Ignite 2025, this session explores how to deploy and operationalize open-source AI models at scale on Azure. The speakers, Mehrdad Abdolghafari, Cary Chai, and Sachi Desai, demonstrate how to leverage hybrid architectures and cloud technologies for high-performance and cost-efficient AI workloads.
Key Topics Covered
- Hybrid Model Architecture: Strategies for building AI solutions that span on-premises and cloud deployments, maintaining control over data boundaries.
- LLM Agents: Implementing large language models as agents for dynamic, data-driven applications.
- GPU-Intensive Workloads: Approaches for workloads requiring intense GPU computation, such as physics simulations and video processing.
- Docker Compose and Cloud Deployment: Simplifying the deployment of AI agents via Docker Compose and Azure Container Apps (a Compose sketch follows this list).
- Live Demos: Dashboard generation, log streaming visualization, and real-time testing of model inference.
- AKS for Scale and Security: Leveraging Azure Kubernetes Service for scalable LLM operations, security, cost optimization, and advanced AI support.
- Workload Scheduling: Enhanced scheduling and configuration for diverse AI workloads, ensuring efficient resource allocation.
- Inference Traffic Management: Managing AI inference traffic with the Kubernetes Gateway API, demonstrated in an Ignite preview (a routing sketch follows this list).
- Enterprise Implementation Example (RBC): RBC’s CI/CD pipeline for secure GPU resource provisioning in a compliance-driven AI factory.
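As a rough illustration of the Compose-based flow called out above, the following sketch pairs a minimal docker-compose.yml for an OSS model server with the Azure CLI command that translates it into Container Apps. The image, model name, resource group, and environment names are placeholders, not details from the session.

```yaml
# docker-compose.yml -- hypothetical single-service setup for an OSS model server.
# Image, port, and model identifier are illustrative placeholders.
services:
  inference:
    image: ghcr.io/example/oss-model-server:latest
    ports:
      - "8080:8080"
    environment:
      - MODEL_NAME=example-oss-model
```

```bash
# Translate the compose file into Azure Container Apps resources
# (requires the containerapp CLI extension; names are placeholders).
az containerapp compose create \
  --compose-file-path docker-compose.yml \
  --resource-group my-rg \
  --environment my-aca-env
```

Each service in the compose file becomes its own container app, with the port mapping surfaced as ingress, which is what makes this a low-friction path from a local agent stack to a managed cloud deployment.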
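The inference traffic preview demonstrated in the session builds on the Kubernetes Gateway API. The sketch below sticks to the stable upstream HTTPRoute resource, splitting requests between two hypothetical model backends by header; the Ignite preview layers inference-aware routing on top of primitives like this. Gateway, service, and header names are all assumptions.

```yaml
# Hypothetical HTTPRoute steering inference traffic by a model-selection header.
# The Gateway referenced below is assumed to already exist.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-routes
spec:
  parentRefs:
    - name: inference-gateway          # assumed Gateway resource
  rules:
    - matches:
        - headers:
            - name: x-model            # clients pick a model via this header
              value: llama-large
      backendRefs:
        - name: llama-large-svc        # Service fronting the large-model pods
          port: 8080
    - backendRefs:                     # everything else falls through to the small model
        - name: llama-small-svc
          port: 8080
```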
Technical Insights
- Serverless GPU Solutions: Utilize Azure’s serverless GPU options for fast, cost-effective inferencing, both locally and in cloud environments.
- Azure Container Apps: Streamlined deployment and management of containerized AI workloads.
- AKS: Fine-grained control over scaling, security, and operational costs for LLMs and other advanced AI models.
- Workload Configuration: Dynamic workload management and scheduling to optimize GPU utilization for both small and large workloads (a scheduling sketch follows this list).
- CI/CD for AI Infrastructure: Establish secure, automated CI/CD pipelines for GPU resources, aligned with compliance and governance requirements (a provisioning sketch also follows this list).
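To make the scheduling discussion concrete, below is a minimal sketch of the standard pattern for pinning an LLM server onto an AKS GPU node pool: an nvidia.com/gpu resource limit, a nodeSelector on the pool label, and a toleration for the sku=gpu:NoSchedule taint commonly applied to AKS GPU pools. The pool name and image are assumptions, not session specifics.

```yaml
# Hypothetical Deployment pinned to a GPU node pool on AKS.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 1
  selector:
    matchLabels: { app: llm-server }
  template:
    metadata:
      labels: { app: llm-server }
    spec:
      nodeSelector:
        agentpool: gpupool             # assumed GPU node pool name
      tolerations:
        - key: sku                     # match the taint that keeps CPU workloads off GPU nodes
          operator: Equal
          value: gpu
          effect: NoSchedule
      containers:
        - name: server
          image: ghcr.io/example/oss-model-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1        # one GPU, exposed by the NVIDIA device plugin
```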
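And as a hedged sketch of the provisioning step such a pipeline might automate, the command below adds an autoscaled, tainted GPU node pool to an existing cluster. The resource names, VM size, and scaling bounds are placeholders; the flags are standard az aks nodepool add options.

```bash
# Hypothetical pipeline step: provision a GPU node pool that scales to zero when idle.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name gpupool \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --node-taints sku=gpu:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 4
```

In a compliance-driven setup like the one RBC describes, a step like this would run inside an approved pipeline rather than by hand, so every GPU pool is provisioned with the mandated taints, sizes, and policies.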
Speaker Highlights
- Mehrdad Abdolghafari
- Cary Chai
- Sachi Desai
Real-World Example
- RBC: Building Canada’s largest AI factory within compliance boundaries, accelerating provisioning for inference workloads.
Chapters Summary
- Hybrid model architectures and agent use cases
- Physics/video workload GPU acceleration
- Cloud deployment via Docker Compose
- Real-time dashboard testing
- AKS integration for AI scale/security/cost
- Workload scheduling and Gateway APIs
- CI/CD for secure resource management
- Large-scale compliant AI operations