Build realtime multimodal agents with LiveKit and Azure | ODSP937

Jesse Hall demonstrates how to build a low-latency, real-time voice agent using LiveKit’s real-time infrastructure together with Azure models for speech-to-text, LLM reasoning, and text-to-speech, with a focus on audio quality, resilience, and turn detection for multimodal agent experiences.

Overview

The session walks through an end-to-end architecture for a real-time, multimodal (voice-first) agent where latency and network conditions are first-class concerns. It focuses on how LiveKit’s real-time stack fits together with Azure-hosted AI models for speech-to-text, LLM reasoning, and text-to-speech.

The goal is to produce a completed voice agent that can answer questions (demonstrated with questions about Microsoft Build).

Key requirements for real-time multimodal agents

The talk highlights practical constraints that show up quickly in voice agents, including latency, network conditions, audio quality, resilience, and turn detection.

Architecture approach: cascaded pipeline

The session frames the solution as a cascaded pipeline, where the agent experience is composed from multiple specialized components: speech-to-text, LLM reasoning, and text-to-speech.
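The cascaded flow can be sketched in a few lines of Python. This is a minimal illustration, not the session's code: the `CascadedPipeline` class and the stub stages are hypothetical names, and the stubs stand in for the Azure-hosted speech-to-text, LLM, and text-to-speech models a real agent would call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures. In a real agent each stage would call a
# hosted model; here they are plain functions so the sketch runs offline.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class CascadedPipeline:
    """Chains three specialized stages: audio -> text -> reply text -> audio."""

    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)    # 1. transcribe the user's speech
        reply_text = self.llm(transcript)  # 2. generate a text reply
        return self.tts(reply_text)        # 3. synthesize the reply as audio


# Stub stages (assumptions for illustration only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.handle_turn(b"When is Build?")
```

Because each stage sits behind a plain callable, any one of them can be swapped without touching the other two, which is the main appeal of a cascaded design.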

LiveKit real-time infrastructure and WebRTC

The session gives a zoomed-out view of how real-time connectivity is achieved for global users over WebRTC.

Agent SDK and implementation entry point

The session introduces the LiveKit Agents SDK and its supported languages, then moves into the core implementation.

Voice activity detection (VAD) and cost optimization

The session calls out voice activity detection (VAD) as a key building block, with a role in keeping costs down.
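To make the idea concrete, here is a minimal sketch of how a VAD gate can cut the amount of audio passed downstream. It uses a simple energy threshold, which is a stand-in chosen for illustration and not the detector shown in the session; production VADs are typically trained models. The function names, the 16-bit PCM frame format, and the threshold value are all assumptions.

```python
import math
import struct
from typing import Iterable, List


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit little-endian PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def speech_frames(frames: Iterable[bytes], threshold: float = 500.0) -> List[bytes]:
    """Keep only frames whose energy meets the (assumed) speech threshold.

    Frames below the threshold are treated as silence and dropped, so a
    downstream speech-to-text stage would receive less audio to process.
    """
    return [frame for frame in frames if rms(frame) >= threshold]


# Two silent frames around one loud frame (synthetic data for illustration).
silence = struct.pack("<4h", 0, 0, 0, 0)
speech = struct.pack("<4h", 4000, -4000, 4000, -4000)

kept = speech_frames([silence, speech, silence])
```

In this toy input only one of the three frames survives the gate. A fixed energy threshold is fragile in noisy rooms, which is one reason real agents rely on a model-based detector instead.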

Configuration: LiveKit + Azure environment variables

The walkthrough includes configuring runtime settings for LiveKit and Azure via environment variables.
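A small loader that fails fast on missing settings is one way to handle this kind of configuration. The sketch below is an assumption-laden example: the variable names in `REQUIRED_VARS` are illustrative placeholders, not the list shown in the session, and `load_settings` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping

# Assumed variable names, for illustration only. Check the LiveKit and Azure
# documentation for the exact names your SDK and plugin versions expect.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "AZURE_SPEECH_KEY",
    "AZURE_SPEECH_REGION",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` once at startup surfaces a misconfigured deployment immediately, before the agent tries to join a room. Passing a plain dictionary instead of `os.environ` makes the check easy to test.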

Session chapters (from the video)