Build realtime multimodal agents with LiveKit and Azure | ODSP937
Jesse Hall demonstrates how to build a low-latency, real-time voice agent using LiveKit’s real-time infrastructure together with Azure models for speech-to-text, LLM reasoning, and text-to-speech, with a focus on audio quality, resilience, and turn detection for multimodal agent experiences.
Overview
The session walks through an end-to-end architecture for a real-time, multimodal (voice-first) agent where latency and network conditions are first-class concerns. It focuses on how LiveKit’s real-time stack fits together with Azure-hosted AI models for:
- Speech-to-text (STT)
- Large language model (LLM) reasoning
- Text-to-speech (TTS)
The goal is a complete voice agent that can answer questions; the demo uses questions about Microsoft Build.
Key requirements for real-time multimodal agents
The talk highlights practical constraints that show up quickly in voice agents:
- Low latency end-to-end (speech in → response out)
- Network resilience and global connectivity
- Clean audio handling
- Turn detection (knowing when the user is done speaking)
- Cost considerations (especially around continuous audio processing)
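To make the turn-detection requirement concrete, here is a minimal sketch of silence-based endpointing: the turn is treated as finished once the user has spoken and then stayed silent for a set duration. The function name, the 20 ms frame length, and the 500 ms silence window are illustrative assumptions, not values from the session, and real turn detectors are usually more sophisticated than a fixed timeout.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    speech_flags: Iterable[bool],
    frame_ms: int = 20,         # assumed frame length
    min_silence_ms: int = 500,  # assumed silence window that ends a turn
) -> Optional[int]:
    """Return the index of the frame where the turn ends, or None if it has not ended."""
    heard_speech = False
    silence_ms = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            # Any speech resets the silence counter.
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            # Only count silence that follows speech, so leading silence is ignored.
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return index
    return None
```

A shorter silence window makes the agent respond faster but increases the chance of cutting the user off mid-sentence, which is the trade-off behind treating turn detection as its own requirement.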
Architecture approach: cascaded pipeline
The session frames the solution as a cascaded pipeline, where the agent experience is composed from multiple specialized components:
- Real-time audio transport and session management (LiveKit)
- STT for converting audio to text (Azure)
- LLM for generating responses (Azure-hosted model)
- TTS for converting responses back to audio (Azure)
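The stages listed above can be sketched as three functions chained together. This is only a shape illustration under stated assumptions: `CascadedPipeline` and the `fake_*` functions are hypothetical stand-ins, not LiveKit or Azure APIs, and a real pipeline would stream audio and text between stages rather than pass whole buffers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedPipeline:
    """Hypothetical container wiring STT -> LLM -> TTS in sequence."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)    # audio -> text
        reply_text = self.llm(transcript)  # text -> response text
        return self.tts(reply_text)        # response text -> audio


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Because each stage is just a callable, any one component can be swapped without touching the others, which is the main appeal of the cascaded design.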
LiveKit real-time infrastructure and WebRTC
A zoomed-out view shows how real-time connectivity is achieved for users around the world:
- WebRTC is used as the underlying real-time modality for audio transport.
- The design emphasizes scalability requirements typical of real-time applications.
Agent SDK and implementation entry point
The session introduces the LiveKit Agents SDK and its supported languages, then moves into the core implementation:
- agent.py is presented as the core logic file for the assistant.
- The agent is configured to set behavior and greet users when a session starts.
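The structure described above (an agent with configured behavior that greets the user when the session starts) can be sketched without the SDK. Everything here is a hypothetical stand-in: `Assistant`, `FakeSession`, and their methods are not the LiveKit Agents API, and the instruction and greeting strings are invented for illustration.

```python
import asyncio
from typing import List


class Assistant:
    """Hypothetical agent definition: behavior instructions plus a greeting."""

    def __init__(self, instructions: str, greeting: str) -> None:
        self.instructions = instructions
        self.greeting = greeting


class FakeSession:
    """Hypothetical stand-in for a real-time session; records what would be spoken."""

    def __init__(self) -> None:
        self.spoken: List[str] = []

    async def say(self, text: str) -> None:
        # A real session would synthesize speech and publish it to the room.
        self.spoken.append(text)

    async def start(self, agent: Assistant) -> None:
        # Greet the user as soon as the session starts.
        await self.say(agent.greeting)


def run_demo() -> List[str]:
    agent = Assistant(
        instructions="You answer questions about Microsoft Build.",
        greeting="Hi! Ask me anything about Microsoft Build.",
    )
    session = FakeSession()
    asyncio.run(session.start(agent))
    return session.spoken
```

In the real SDK, the session object would also carry the STT, LLM, TTS, and VAD components; this sketch keeps only the start-and-greet flow.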
Voice activity detection (VAD) and cost optimization
The session calls out voice activity detection (VAD) as a key building block:
- VAD helps determine when speech is present.
- It is discussed in the context of cost optimization (avoiding unnecessary processing when there is no speech).
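As a rough illustration of the cost argument, the sketch below flags each audio frame as speech or silence using a simple energy threshold and counts how many frames would need downstream processing. The energy-threshold technique is swapped in for simplicity and is not the VAD used in the session; production VADs are typically model-based. The frame size (320 samples, which is 20 ms at 16 kHz) and the threshold are assumed values.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_speech(
    samples: Sequence[int],
    frame_size: int = 320,     # assumed: 20 ms at 16 kHz
    threshold: float = 500.0,  # assumed energy threshold
) -> List[bool]:
    """Return one flag per frame: True where the frame's energy suggests speech."""
    flags = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_rms(frame) >= threshold)
    return flags


def frames_to_process(flags: Sequence[bool]) -> int:
    """Number of frames that would be forwarded to STT when silence is skipped."""
    return sum(flags)
```

If only the flagged frames are forwarded for transcription, the amount of audio processed scales with how much the user actually speaks rather than with the length of the session.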
Configuration: LiveKit + Azure environment variables
The walkthrough includes configuring runtime settings via environment variables, including:
- LiveKit configuration values
- Azure configuration values for STT/LLM/TTS usage
- An OpenAI API key (as part of the Azure integration setup described in the session)
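A small startup check can catch missing settings before the agent connects. The variable names below are assumptions chosen for illustration; the exact names depend on the LiveKit and Azure integrations in use and should be taken from their documentation.

```python
import os
from typing import List, Mapping, Sequence

# Assumed variable names, for illustration only.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "AZURE_SPEECH_KEY",
    "AZURE_SPEECH_REGION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
)


def missing_vars(
    env: Mapping[str, str],
    required: Sequence[str] = REQUIRED_VARS,
) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```

Passing the environment in as a mapping keeps the check easy to test and avoids reading `os.environ` at import time.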
Session chapters (from the video)
- Overview of Topics: LiveKit, Cascaded Pipeline, Azure Integration
- Voice as a Real-Time Modality
- Scalability Requirements for Real-Time Applications
- Introduction to Agent's SDK and Supported Languages
- Voice Activity Detection and Cost Optimization
- Zoomed-Out View: Global User Connectivity via WebRTC
- Introduction to Agent.py – core logic for the assistant
- Setting agent behavior and greeting users upon session start
- Configuring LiveKit and Azure environment variables including OpenAI API key