Fast and Flexible Inference for Open-Source AI Models at Scale with Azure
Microsoft Events presents a detailed session from Ignite 2025 on deploying and scaling open-source AI models with Azure Container Apps and AKS. The speakers demonstrate practical strategies for running custom and open-source (OSS) models efficiently.
Session Overview
Presented at Microsoft Ignite 2025, this session explores how to deploy and operationalize open-source AI models at scale on Azure. The speakers, Mehrdad Abdolghafari, Cary Chai, and Sachi Desai, demonstrate how to leverage hybrid architectures and cloud technologies for high-performance and cost-efficient AI workloads.
Key Topics Covered
- Hybrid Model Architecture: Strategies for building AI solutions that span on-premises and cloud deployments, maintaining control over data boundaries.
- LLM Agents: Implementing large language models as agents for dynamic, data-driven applications.
- GPU-Intensive Workloads: Approaches for workloads requiring intense GPU computation, such as physics simulations and video processing.
- Docker Compose and Cloud Deployment: Simplifying the deployment of AI agents via Docker Compose and Azure Container Apps (a Compose sketch follows this list).
- Live Demos: Dashboard generation, log streaming visualization, and real-time testing of model inference.
- AKS for Scale and Security: Leveraging Azure Kubernetes Service for scalable LLM operations, security, cost optimization, and advanced AI support.
- Workload Scheduling: Enhanced scheduling and configuration for diverse AI workloads, ensuring efficient resource allocation.
- Inference Traffic Management: Managing AI inference traffic with the Kubernetes Gateway API, demonstrated in an Ignite preview (a routing sketch follows this list).
- Enterprise Implementation Example (RBC): RBC’s CI/CD pipeline for secure GPU resource provisioning in a compliance-driven AI factory.
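As a rough illustration of the Compose-based flow called out above, the following sketch pairs a minimal docker-compose.yml for an OSS model server with the Azure CLI command that translates it into Container Apps. The image, model name, resource group, and environment names are placeholders, not details from the session.

```yaml
# docker-compose.yml -- hypothetical single-service setup for an OSS model server.
# Image, port, and model identifier are illustrative placeholders.
services:
  inference:
    image: ghcr.io/example/oss-model-server:latest
    ports:
      - "8080:8080"
    environment:
      - MODEL_NAME=example-oss-model
```

```bash
# Translate the compose file into Azure Container Apps resources
# (requires the containerapp CLI extension; names are placeholders).
az containerapp compose create \
  --compose-file-path docker-compose.yml \
  --resource-group my-rg \
  --environment my-aca-env
```

Each service in the compose file becomes its own container app, with the port mapping surfaced as ingress, which is what makes this a low-friction path from a local agent stack to a managed cloud deployment.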
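The inference traffic preview demonstrated in the session builds on the Kubernetes Gateway API. The sketch below sticks to the stable upstream HTTPRoute resource, splitting requests between two hypothetical model backends by header; the Ignite preview layers inference-aware routing on top of primitives like this. Gateway, service, and header names are all assumptions.

```yaml
# Hypothetical HTTPRoute steering inference traffic by a model-selection header.
# The Gateway referenced below is assumed to already exist.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-routes
spec:
  parentRefs:
    - name: inference-gateway          # assumed Gateway resource
  rules:
    - matches:
        - headers:
            - name: x-model            # clients pick a model via this header
              value: llama-large
      backendRefs:
        - name: llama-large-svc        # Service fronting the large-model pods
          port: 8080
    - backendRefs:                     # everything else falls through to the small model
        - name: llama-small-svc
          port: 8080
```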
Technical Insights
- Serverless GPU Solutions: Utilize Azure’s serverless GPU options for fast, cost-effective inferencing, both locally and in cloud environments.
- Azure Container Apps: Streamlined deployment and management of containerized AI workloads.
- AKS: Fine-grained control over scaling, security, and operational costs for LLMs and other advanced AI models.
- Workload Configuration: Dynamic workload management and scheduling to optimize GPU utilization for both small and large workloads (a scheduling sketch follows this list).
- CI/CD for AI Infrastructure: Establish secure, automated CI/CD pipelines for GPU resources, aligned with compliance and governance requirements (a provisioning sketch also follows this list).
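To make the scheduling discussion concrete, below is a minimal sketch of the standard pattern for pinning an LLM server onto an AKS GPU node pool: an nvidia.com/gpu resource limit, a nodeSelector on the pool label, and a toleration for the sku=gpu:NoSchedule taint commonly applied to AKS GPU pools. The pool name and image are assumptions, not session specifics.

```yaml
# Hypothetical Deployment pinned to a GPU node pool on AKS.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 1
  selector:
    matchLabels: { app: llm-server }
  template:
    metadata:
      labels: { app: llm-server }
    spec:
      nodeSelector:
        agentpool: gpupool             # assumed GPU node pool name
      tolerations:
        - key: sku                     # match the taint that keeps CPU workloads off GPU nodes
          operator: Equal
          value: gpu
          effect: NoSchedule
      containers:
        - name: server
          image: ghcr.io/example/oss-model-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1        # one GPU, exposed by the NVIDIA device plugin
```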
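And as a hedged sketch of the provisioning step such a pipeline might automate, the command below adds an autoscaled, tainted GPU node pool to an existing cluster. The resource names, VM size, and scaling bounds are placeholders; the flags are standard az aks nodepool add options.

```bash
# Hypothetical pipeline step: provision a GPU node pool that scales to zero when idle.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name gpupool \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --node-taints sku=gpu:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 4
```

In a compliance-driven setup like the one RBC describes, a step like this would run inside an approved pipeline rather than by hand, so every GPU pool is provisioned with the mandated taints, sizes, and policies.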
Speaker Highlights
- Mehrdad Abdolghafari
- Cary Chai
- Sachi Desai
Real-World Example
- RBC: Building Canada’s largest AI factory within compliance boundaries, accelerating provisioning for inference workloads.
Chapters Summary
- Hybrid model architectures and agent use cases
- Physics/video workload GPU acceleration
- Cloud deployment via Docker Compose
- Real-time dashboard testing
- AKS integration for AI scale/security/cost
- Workload scheduling and Gateway APIs
- CI/CD for secure resource management
- Large-scale compliant AI operations