Build multimodal agents that reason, interact, and take action | DEM330

Uploaded: 2026-06-10T20:16:56+00:00

Jun 10, 2026 by Henk Boelman

Henk Boelman live-codes a real-time, voice-first multimodal agent in Azure AI Foundry using the Voice Live API, showing how to combine speech input, model reasoning, and speech output, then connect the agent to external tools via MCP so it can take real actions.

Overview

What this session builds

A real-time agent with a conversational avatar (voice-first UX)
An end-to-end flow that unifies:
- Speech-to-text (STT)
- Model reasoning
- Text-to-speech (TTS)
Tool-using behavior so the agent can take actions by calling tools via MCP
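The flow listed above can be pictured as one conversational turn: audio in, transcript, model decision, optional tool call, audio out. The following is a minimal conceptual sketch only, not the session's code and not the Voice Live API. Every name in it (`speech_to_text`, `reason`, `text_to_speech`, `run_turn`, `ModelDecision`, the `set_light` tool) is a hypothetical stand-in, and the "audio" is just UTF-8 bytes so the example runs without any service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ModelDecision:
    """Hypothetical result of the reasoning step: a reply plus an optional tool call."""
    reply: str
    tool_name: Optional[str] = None
    tool_args: Optional[dict] = None


def speech_to_text(audio: bytes) -> str:
    # Stand-in for STT: the fake "audio" is simply UTF-8 encoded text.
    return audio.decode("utf-8")


def reason(transcript: str) -> ModelDecision:
    # Stand-in for model reasoning: a keyword rule instead of a real model.
    if "light" in transcript.lower():
        return ModelDecision(
            reply="Turning on the light.",
            tool_name="set_light",
            tool_args={"on": True},
        )
    return ModelDecision(reply=f"You said: {transcript}")


def text_to_speech(text: str) -> bytes:
    # Stand-in for TTS: returns the reply as bytes instead of synthesized audio.
    return text.encode("utf-8")


def run_turn(audio_in: bytes, tools: Dict[str, Callable[..., str]]) -> bytes:
    """Run one turn: STT, reasoning, optional tool call, then TTS."""
    transcript = speech_to_text(audio_in)
    decision = reason(transcript)
    reply = decision.reply
    if decision.tool_name is not None:
        # The agent "takes action" by invoking the named tool with its arguments.
        result = tools[decision.tool_name](**(decision.tool_args or {}))
        reply = f"{reply} ({result})"
    return text_to_speech(reply)
```

In a real voice-first agent these three stages would stream and overlap rather than run strictly in sequence, which is the kind of integration the session attributes to the Voice Live API.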

Key technologies and concepts

Azure AI Foundry (Microsoft Foundry)
- Used as the platform for building and wiring up the agent
Voice Live API
- Demonstrated as the API that unifies STT, reasoning, and TTS for real-time voice interactions
MCP (Model Context Protocol)
- Used to connect the agent to tools so it can perform real actions (tool calling)
Multimodal agent patterns
- Reusable patterns for building voice-first, tool-using agents with expressive avatars
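To make the MCP item above more concrete: MCP messages are JSON-RPC 2.0, and a client discovers tools with the `tools/list` method and invokes one with `tools/call`. The sketch below only builds those request messages with the Python standard library. It does not open a connection or perform the MCP initialization handshake, and the helper names and the `set_light` tool are hypothetical, not taken from the session.

```python
import json
from itertools import count
from typing import Optional

# Monotonic request ids; JSON-RPC uses the id to match responses to requests.
_request_ids = count(1)


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request; params is omitted when not given."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def list_tools_request() -> str:
    # Ask the MCP server which tools it exposes.
    return jsonrpc_request("tools/list")


def call_tool_request(name: str, arguments: dict) -> str:
    # Ask the MCP server to run one tool with the given arguments.
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})
```

For example, `call_tool_request("set_light", {"on": True})` yields a message whose `method` is `tools/call` and whose `params` carry the tool name and arguments. In practice an MCP client library or the agent platform would normally produce and send these messages for you.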

Resources

Foundry Discord: https://aka.ms/build/foundrydiscord

Session context

Microsoft Build 2026 session: DEM330
Track: Agents & apps
Format: Demo (advanced), live coding (no slides)