Scale agentic AI from on-device to cloud orchestration | BRKSP92
Karthik Vijayan, Colin Helms, Imran Sheik Mohamed, and Jayneel Vora present a Microsoft Build 2026 breakout on designing agentic AI systems that run across client devices, edge environments, and the cloud.
Overview
Modern AI systems often span multiple environments rather than running as a single model in one place. This session explores how agentic AI workloads operate across client, edge, and cloud through three demos:
- Real-time on-device agents (including on-device reasoning and NPU activity)
- Distributed inference across edge systems
- Enterprise-scale multi-agent orchestration on Azure Kubernetes Service (AKS) with Intel Xeon
The session also focuses on practical guidance for deciding where to place inference, reasoning, and orchestration to balance responsiveness, scale, and efficiency.
Session chapters (from the video)
- 00:00 — Introduction to distributed AI systems across cloud, edge, and client
- 03:04 — Live AI client demo showing on-device reasoning and NPU activity
- 07:17 — Overview of Intel Core Ultra Series 3 processors with integrated Arc graphics
- 07:48 — Shared GPU and fast LPDDR5X memory enabling AI workloads
- 14:21 — Demo begins: running a sandboxed script with PowerShell
- 16:01 — Demonstration of model speed and token processing performance
- 22:01 — OpenFlow sends execution results to an SGLang LLM engine for summarization
- 24:16 — Demo of auto-scaling mini pods and instance replicas based on CPU load
- 25:02 — Multi-agent workflow and nightly job automation within a single VM setup
Key themes
Placing AI capabilities across environments
- How to think about splitting responsibilities between:
  - On-device reasoning (latency/responsiveness)
  - Edge inference (locality and distributed capacity)
  - Cloud orchestration (coordination and scale)
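As a rough illustration of the split above, the following Python sketch routes a task to a tier based on its latency budget, data locality, and coordination needs. The `Task` fields, the `place` function, and the 100 ms threshold are all hypothetical and chosen for illustration; the session does not prescribe specific rules or numbers.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DEVICE = "device"  # on-device reasoning (latency/responsiveness)
    EDGE = "edge"      # edge inference (locality and distributed capacity)
    CLOUD = "cloud"    # cloud orchestration (coordination and scale)


@dataclass
class Task:
    name: str
    latency_budget_ms: int   # how quickly a response is needed
    needs_local_data: bool   # data should stay close to where it is produced
    agents_involved: int     # number of agents that must coordinate


def place(task: Task, device_latency_ms: int = 100) -> Tier:
    """Pick a tier for a task.

    The rule order and the default threshold are assumptions made for
    this sketch, not figures from the session.
    """
    if task.agents_involved > 1:
        return Tier.CLOUD    # multi-agent coordination goes to the cloud
    if task.latency_budget_ms <= device_latency_ms:
        return Tier.DEVICE   # tight latency budget stays on the device
    if task.needs_local_data:
        return Tier.EDGE     # locality matters more than latency here
    return Tier.CLOUD        # default to the tier with the most capacity


if __name__ == "__main__":
    for t in (
        Task("inline suggestion", 50, False, 1),
        Task("camera analytics", 500, True, 1),
        Task("nightly report", 60_000, False, 4),
    ):
        print(f"{t.name}: {place(t).value}")
```

In practice the rules would likely also weigh model size, power, and cost; the point of the sketch is only that placement can be expressed as an explicit, testable policy.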
Orchestrating multi-agent systems on AKS
- Using Azure Kubernetes Service (AKS) as the platform for enterprise-scale orchestration
- Scaling behavior demonstrated via:
  - Pods and replicas
  - Autoscaling based on CPU load
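The session summary does not say which autoscaler the demo uses. Assuming it is the standard Kubernetes Horizontal Pod Autoscaler, the replica count follows the documented rule desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that arithmetic; the function name, parameter names, and replica bounds are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the Horizontal Pod Autoscaler replica calculation.

    Assumes CPU utilization is averaged across pods and expressed as a
    percentage of the requested CPU. Bounds here are example values.
    """
    ratio = current_cpu_pct / target_cpu_pct
    # Skip scaling when usage is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 replicas at 90% CPU with a 50% target scale out to 4.
    print(desired_replicas(2, 90, 50))
    # 4 replicas at 10% CPU with a 50% target scale in to 1.
    print(desired_replicas(4, 10, 50))
```

The real autoscaler also applies stabilization windows and readiness checks, which this sketch leaves out.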
Performance and hardware considerations
- Observing on-device NPU activity during real-time reasoning
- Hardware context called out in the session:
  - Intel Core Ultra Series 3 processors with integrated Arc graphics
  - Shared GPU and LPDDR5X memory characteristics
  - Intel Xeon for cloud/cluster scenarios
Demo workflow elements
- Running a sandboxed script with PowerShell
- Measuring model speed and token processing performance
- Sending execution results through OpenFlow to an SGLang LLM engine for summarization
- Multi-agent workflow and nightly job automation within a single VM setup
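For the "model speed and token processing performance" item above, one common way to quantify it is time to first token plus overall tokens per second over a streamed response. The sketch below is a minimal, engine-agnostic version: `measure_throughput` and its return keys are hypothetical names, and the token stream is simulated rather than coming from the model or tooling used in the demo.

```python
import time
from typing import Callable, Iterable, Optional


def measure_throughput(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report simple speed metrics.

    `stream` can be any iterable that yields tokens as they are
    generated; `clock` is injectable so the function can be tested.
    """
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = clock()  # timestamp of the first token
        count += 1
    end = clock()

    elapsed = end - start
    return {
        "tokens": count,
        "time_to_first_token_s": (
            None if first_token_at is None else first_token_at - start
        ),
        "tokens_per_second": count / elapsed if elapsed > 0 else 0.0,
    }


def fake_stream(n: int, delay_s: float = 0.01):
    """Stand-in generator that simulates a model emitting n tokens."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_throughput(fake_stream(20)))
```

Numbers from a harness like this depend heavily on prompt length, batch size, and hardware, so they are best compared only across runs with the same setup.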