Run AI SREs without burning token budgets | ODSP928

Uploaded: 2026-06-03T14:08:02+00:00

Jun 3, 2026 by Natan Yellin

Natan Yellin breaks down why AI SRE-style investigations can cost around $2 per alert and shows practical ways to reduce LLM token spend so enterprises can run AI investigations across high alert volumes without blowing budgets.

Overview

This Microsoft Build 2026 session focuses on the economics of using LLMs for SRE-style alert investigations at enterprise scale, and the optimizations that can make “AI investigations on every alert” financially viable.

Naive cost model for large alert volumes

The session starts with a simple cost-per-investigation model (roughly “$2 per alert”) and explains why that becomes prohibitive when applied to enterprise alert volumes.
It includes an annualized estimate, on the order of hundreds of thousands of dollars for a large enterprise, to illustrate the scale of the problem.
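
The arithmetic behind this kind of estimate can be sketched as follows. The $2 figure is the approximate per-alert cost quoted in the session; the daily alert volume is an assumption chosen only for illustration, and the function name is hypothetical.

```python
# Approximate per-investigation cost quoted in the session.
COST_PER_INVESTIGATION_USD = 2.00


def annual_investigation_cost(
    alerts_per_day: float,
    cost_per_alert: float = COST_PER_INVESTIGATION_USD,
    days_per_year: int = 365,
) -> float:
    """Yearly spend if every alert triggers one LLM investigation."""
    return alerts_per_day * days_per_year * cost_per_alert


# ASSUMPTION (not from the session): 500 alerts per day.
annual = annual_investigation_cost(alerts_per_day=500)
print(f"${annual:,.0f} per year")  # $365,000 per year
```

Under that assumed volume the naive model lands in the "hundreds of thousands of dollars" range the session describes; a different volume scales the result linearly.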

Optimization: using cheaper models

The presenter discusses reducing cost by switching to lower-cost models.
The session compares higher-cost and lower-cost model options, including Opus and DeepSeek.
It highlights the trade-offs, especially that lower cost can come with reduced accuracy and quality.
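
One way to reason about this trade-off is to compare cost per investigation alongside cost per correct investigation. The sketch below is illustrative only: the model names, per-token prices, accuracy figures, and token counts are all hypothetical placeholders, not numbers from the session or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    input_usd_per_mtok: float   # price per million input tokens (hypothetical)
    output_usd_per_mtok: float  # price per million output tokens (hypothetical)
    accuracy: float             # fraction of investigations judged correct (hypothetical)


def cost_per_investigation(m: ModelOption, input_tokens: int, output_tokens: int) -> float:
    """Token cost of a single investigation in USD."""
    total = input_tokens * m.input_usd_per_mtok + output_tokens * m.output_usd_per_mtok
    return total / 1_000_000


def cost_per_correct_result(m: ModelOption, input_tokens: int, output_tokens: int) -> float:
    """Token cost divided by accuracy: average spend per correct investigation."""
    return cost_per_investigation(m, input_tokens, output_tokens) / m.accuracy


# ASSUMPTION: placeholder options and workload, for illustration only.
premium = ModelOption("premium-model", 15.0, 75.0, accuracy=0.90)
budget = ModelOption("budget-model", 0.5, 2.0, accuracy=0.70)

for m in (premium, budget):
    raw = cost_per_investigation(m, input_tokens=100_000, output_tokens=5_000)
    adjusted = cost_per_correct_result(m, input_tokens=100_000, output_tokens=5_000)
    print(f"{m.name}: ${raw:.3f} per run, ${adjusted:.3f} per correct run")
```

This metric only counts token spend. It ignores the operational cost of acting on a wrong investigation, which is where the accuracy and quality concerns raised in the session would show up.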

Optimization: LLM-native grouping instead of deterministic rules

The session contrasts deterministic, rule-based grouping with “LLM-native” grouping approaches.
The goal is to reduce redundant investigations and focus LLM work where it provides the most value.
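
The contrast can be sketched as two grouping functions: one keyed on fixed fields, one that asks a judge whether two alerts belong to the same incident. This is a minimal sketch, not the presenter's implementation. In an LLM-native setup the judge would be a model call; here `stub_judge` is a hypothetical stand-in so the example runs offline, and the alert fields are invented for illustration.

```python
from typing import Callable

Alert = dict[str, str]


def group_by_rule(alerts: list[Alert], keys: tuple[str, ...]) -> list[list[Alert]]:
    """Deterministic grouping: alerts with identical values for `keys` share a group."""
    groups: dict[tuple[str, ...], list[Alert]] = {}
    for alert in alerts:
        groups.setdefault(tuple(alert.get(k, "") for k in keys), []).append(alert)
    return list(groups.values())


def group_by_judge(
    alerts: list[Alert], same_incident: Callable[[Alert, Alert], bool]
) -> list[list[Alert]]:
    """Judge-based grouping: compare each alert with one representative per group."""
    groups: list[list[Alert]] = []
    for alert in alerts:
        for group in groups:
            if same_incident(group[0], alert):
                group.append(alert)
                break
        else:  # no existing group matched
            groups.append([alert])
    return groups


def stub_judge(a: Alert, b: Alert) -> bool:
    # ASSUMPTION: placeholder for an LLM call that would read both alerts
    # and decide whether they describe the same underlying incident.
    return a["service"] == b["service"]


alerts = [
    {"name": "HighLatency", "service": "checkout", "text": "p99 latency high"},
    {"name": "PodCrashLooping", "service": "checkout", "text": "pod restarting, OOMKilled"},
    {"name": "DiskFull", "service": "logging", "text": "disk usage 95%"},
]

print(len(group_by_rule(alerts, keys=("name",))))  # 3 groups, 3 investigations
print(len(group_by_judge(alerts, stub_judge)))     # 2 groups, 2 investigations
```

Each group would then get one investigation instead of one per alert. A real judge call costs tokens too, so the grouping step only pays off when it is much cheaper than the investigations it avoids.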

Optimization: reusing cached context windows

The presenter covers reusing cached context windows to reduce repeated token spend.
The emphasis is on minimizing repeated context ingestion across investigations, so the LLM spends fewer tokens re-processing the same material.
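
The effect of reusing a shared context prefix can be estimated with a simple input-cost model. The sketch below assumes a provider that bills cached prefix tokens at a fraction of the normal input price; the price, the ratio, and the token counts are hypothetical, and real pricing may add details (such as a surcharge for writing the cache or a cache expiry) that this simplification ignores.

```python
def input_cost_usd(
    prefix_tokens: int,
    unique_tokens: int,
    investigations: int,
    usd_per_mtok: float,
    cached_price_ratio: float = 1.0,
) -> float:
    """Input-token cost for investigations that share a common context prefix.

    cached_price_ratio is the fraction of the normal input price charged for
    prefix tokens served from cache (1.0 means no caching benefit).
    The first investigation always pays full price for the prefix.
    """
    if investigations <= 0:
        return 0.0
    prefix_billed = prefix_tokens * (1 + cached_price_ratio * (investigations - 1))
    unique_billed = unique_tokens * investigations
    return (prefix_billed + unique_billed) * usd_per_mtok / 1_000_000


# ASSUMPTION: illustrative workload and pricing, not figures from the session.
workload = dict(prefix_tokens=80_000, unique_tokens=20_000,
                investigations=100, usd_per_mtok=3.0)

without_cache = input_cost_usd(**workload)
with_cache = input_cost_usd(**workload, cached_price_ratio=0.1)
print(f"no cache: ${without_cache:.2f}, cached prefix: ${with_cache:.2f}")
```

The saving grows with the share of each prompt that is stable across investigations, which is why this approach favors putting shared material (tool descriptions, runbooks, environment context) at the start of the prompt and alert-specific details at the end.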

Why cost optimization changes the operating model

The session argues that if per-alert investigation cost is brought down enough, it becomes feasible to run AI investigations on all alerts (not just a subset), changing how first-line triage can be handled at scale.