Build realtime multimodal agents with LiveKit and Azure | ODSP937

Uploaded: 2026-06-03T14:10:15+00:00

Jun 3, 2026 by Jesse Hall

Jesse Hall demonstrates how to build a low-latency, real-time voice agent using LiveKit’s real-time infrastructure together with Azure models for speech-to-text, LLM reasoning, and text-to-speech, with a focus on audio quality, resilience, and turn detection for multimodal agent experiences.

Overview

The session walks through an end-to-end architecture for a real-time, multimodal (voice-first) agent where latency and network conditions are first-class concerns. It focuses on how LiveKit’s real-time stack fits together with Azure-hosted AI models for:

Speech-to-text (STT)
Large language model (LLM) reasoning
Text-to-speech (TTS)

The end result is a working voice agent that answers spoken questions; the demo uses questions about Microsoft Build.

Key requirements for real-time multimodal agents

The talk highlights practical constraints that show up quickly in voice agents:

Low latency end-to-end (speech in → response out)
Network resilience and global connectivity
Clean audio handling
Turn detection (knowing when the user is done speaking)
Cost considerations (especially around continuous audio processing)
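
To make the turn-detection requirement concrete, here is a minimal sketch that treats a turn as finished once a fixed stretch of silence follows speech. It is an illustration only, not the detector used in the session: the function name `find_end_of_turn`, the 20 ms frame size, and the 600 ms silence threshold are all assumptions.

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed duration of one audio frame, in milliseconds


def find_end_of_turn(
    speech_flags: Iterable[bool], min_silence_ms: int = 600
) -> Optional[int]:
    """Return the index of the frame where the turn is considered finished.

    speech_flags holds one boolean per frame (True = speech detected).
    Returns None if the speaker never pauses long enough.
    """
    silence_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            # Only count silence that comes after the user has started talking.
            silence_ms += FRAME_MS
            if silence_ms >= min_silence_ms:
                return index
    return None
```

A short threshold makes the agent answer sooner but risks cutting the user off mid-sentence; a long one does the opposite, which is why turn detection and latency are listed together.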

Architecture approach: cascaded pipeline

The session frames the solution as a cascaded pipeline, where the agent experience is composed from multiple specialized components:

Real-time audio transport and session management (LiveKit)
STT for converting audio to text (Azure)
LLM for generating responses (Azure-hosted model)
TTS for converting responses back to audio (Azure)
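
The cascaded shape described above can be sketched as three swappable stages chained together. This is a toy model under stated assumptions, not the session's code: `CascadedPipeline` and the `fake_*` stages are hypothetical stand-ins, and no LiveKit or Azure call is made.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedPipeline:
    """Chains STT -> LLM -> TTS; each stage is any callable with the right shape."""

    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)    # audio -> text
        reply_text = self.llm(transcript)  # text -> text
        return self.tts(reply_text)        # text -> audio


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Because each stage is passed in separately, one component can be replaced without touching the others. This sketch waits for each stage to finish; a low-latency agent would typically stream partial results between stages instead.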

LiveKit real-time infrastructure and WebRTC

The session steps back to show how users around the world connect to the agent in real time:

Voice is treated as a real-time modality, with WebRTC as the underlying audio transport.
The design emphasizes scalability requirements typical of real-time applications.

Agent SDK and implementation entry point

The session introduces the LiveKit Agents SDK and its supported languages, then moves into the core implementation:

agent.py is presented as the core logic file for the assistant.
The agent is configured to set behavior and greet users when a session starts.
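
The pattern of setting behavior and greeting on session start can be illustrated with a small stand-in. This is not the LiveKit Agents SDK API and not the contents of the session's agent.py: `AssistantSketch`, `on_session_start`, and `run_session` are hypothetical names, and the `say` method only records text where a real agent would synthesize and publish audio.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssistantSketch:
    """Hypothetical stand-in for an agent: instructions plus a greeting."""

    instructions: str  # behavior prompt that would be given to the LLM
    greeting: str      # first thing the agent says when a session starts
    spoken: List[str] = field(default_factory=list)

    async def say(self, text: str) -> None:
        # A real agent would send this text to TTS and publish the audio.
        self.spoken.append(text)

    async def on_session_start(self) -> None:
        await self.say(self.greeting)


async def run_session(agent: AssistantSketch) -> List[str]:
    await agent.on_session_start()
    return agent.spoken


assistant = AssistantSketch(
    instructions="You are a voice assistant that answers questions about Microsoft Build.",
    greeting="Hi! Ask me anything about Microsoft Build.",
)
spoken = asyncio.run(run_session(assistant))
```

Greeting first means the user hears the agent immediately after connecting, which also confirms that the audio path works before the first question.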

Voice activity detection (VAD) and cost optimization

The session calls out voice activity detection (VAD) as a key building block:

VAD detects when speech is present in the incoming audio.
The session ties it to cost optimization: downstream processing can be skipped when nobody is speaking.
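
The cost argument can be sketched with a simple energy gate: only frames loud enough to count as speech are passed on for further processing. The session does not say which VAD it uses, and production VADs are usually trained models rather than a fixed threshold, so the energy approach, the function names, and the threshold of 500 are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def is_speech(frame: Sequence[int], threshold: float = 500.0) -> bool:
    # Assumed threshold; real systems tune this or use a trained model.
    return rms(frame) >= threshold


def frames_to_process(
    frames: Sequence[Sequence[int]], threshold: float = 500.0
) -> List[Sequence[int]]:
    """Keep only frames classified as speech, so silence is never sent downstream."""
    return [frame for frame in frames if is_speech(frame, threshold)]
```

If most of a call is silence, gating like this means only a fraction of the audio reaches the speech-to-text stage.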

Configuration: LiveKit + Azure environment variables

The walkthrough includes configuring runtime settings via environment variables, including:

LiveKit configuration values
Azure configuration values for STT/LLM/TTS usage
An OpenAI API key (as part of the Azure integration setup described in the session)
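
A minimal sketch of loading these settings and failing early when one is missing. The variable names below are assumptions chosen for illustration; the exact names expected by the session's project may differ, so check its sample configuration before relying on them.

```python
import os
from typing import Dict, List, Mapping

# Assumed variable names; confirm against the project's own configuration file.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "AZURE_SPEECH_KEY",
    "AZURE_SPEECH_REGION",
    "OPENAI_API_KEY",
)


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """List required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise before the agent starts."""
    missing = missing_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Validating at startup surfaces a missing key as one clear error instead of a failed request in the middle of a conversation.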

Session chapters (from the video)

Overview of Topics: LiveKit, Cascaded Pipeline, Azure Integration
Voice as a Real-Time Modality
Scalability Requirements for Real-Time Applications
Introduction to the Agents SDK and Supported Languages
Voice Activity Detection and Cost Optimization
Zoomed-Out View: Global User Connectivity via WebRTC
Introduction to agent.py – core logic for the assistant
Setting agent behavior and greeting users upon session start
Configuring LiveKit and Azure environment variables including OpenAI API key