Build multimodal agents that reason, interact, and take action | DEM330
Henk Boelman live-codes a real-time, voice-first multimodal agent in Azure AI Foundry using the Voice Live API, showing how to combine speech input, model reasoning, and speech output, then connect the agent to external tools via MCP so it can take real actions.
Overview
What this session builds
- A real-time agent with a conversational avatar (voice-first UX)
- An end-to-end flow that unifies:
  - Speech-to-text (STT)
  - Model reasoning
  - Text-to-speech (TTS)
- Tool-using behavior so the agent can take actions by calling tools via MCP
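The end-to-end flow listed above can be pictured as a single conversational turn passing through three stages. The following is a minimal Python sketch of that shape only. It does not call the Voice Live API, and every name in it (`transcribe`, `reason`, `synthesize`, `handle_turn`, `TurnResult`) is a hypothetical stand-in with stubbed behavior, not something shown in the session.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnResult:
    """Everything produced while handling one user utterance."""
    transcript: str
    reply_text: str
    reply_audio: bytes


def transcribe(audio: bytes) -> str:
    # Hypothetical STT stub: treats the audio bytes as UTF-8 text.
    return audio.decode("utf-8")


def reason(transcript: str) -> str:
    # Hypothetical model stub: echoes the transcript back.
    return f"You said: {transcript}"


def synthesize(text: str) -> bytes:
    # Hypothetical TTS stub: encodes the reply text as bytes.
    return text.encode("utf-8")


def handle_turn(
    audio: bytes,
    stt: Callable[[bytes], str] = transcribe,
    model: Callable[[str], str] = reason,
    tts: Callable[[str], bytes] = synthesize,
) -> TurnResult:
    """Run one turn: speech in, reasoning, speech out."""
    transcript = stt(audio)
    reply_text = model(transcript)
    reply_audio = tts(reply_text)
    return TurnResult(transcript, reply_text, reply_audio)
```

Passing the three stages in as callables keeps the sketch testable with stubs. A unified real-time API would presumably replace all three with one streaming session, which is the simplification the session attributes to the Voice Live API.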
Key technologies and concepts
- Azure AI Foundry (Microsoft Foundry)
  - Used as the platform for building and wiring up the agent
- Voice Live API
  - Demonstrated as the API that unifies STT, reasoning, and TTS for real-time voice interactions
- MCP (Model Context Protocol)
  - Used to connect the agent to tools so it can perform real actions (tool calling)
- Multimodal agent patterns
  - Reusable patterns for building voice-first, tool-using agents with expressive avatars
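To make the MCP tool-calling step more concrete: MCP messages are JSON-RPC 2.0, and a tool is invoked with the `tools/call` method, passing the tool `name` and its `arguments`. The sketch below builds such a request and reads the text parts of a result. The helper names (`build_tool_call`, `extract_text`) and the `set_light` tool in the usage note are assumptions made for illustration; the session summary does not say which tools were used.

```python
import json
from itertools import count
from typing import Any

# Monotonic request ids so each response can be matched to its request.
_request_ids = count(1)


def build_tool_call(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def extract_text(response: dict[str, Any]) -> str:
    """Join the text parts of a tools/call result, raising on a tool error."""
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool reported an error")
    return "".join(
        part["text"] for part in result["content"] if part.get("type") == "text"
    )
```

For example, `json.dumps(build_tool_call("set_light", {"room": "kitchen", "on": True}))` yields the payload a client would send for a hypothetical `set_light` tool. In practice an MCP client library would handle transport and session setup, so this only shows the message shape.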
Resources
- Foundry Discord: https://aka.ms/build/foundrydiscord
Session context
- Microsoft Build 2026 session: DEM330
- Track: Agents & apps
- Format: Demo (advanced), live coding (no slides)