Using autonomous SRE to move from alerts to action | OD800
Vyom Nagrani and Deepthi Chelupati explore how autonomous SRE can turn incident signals into concrete actions, combining AI-agent workflows with observability, SLO practices, and governance controls.
Overview
The session focuses on modern SRE workflows in which AI agents do more than detect and summarize incidents: they can also propose and execute mitigations under defined constraints.
Key themes covered include:
From alerts to action with autonomous remediation
- Using AI agents to progress incidents from detection to diagnosis and mitigation.
- Reducing downtime by catching issues earlier and responding faster when incidents occur.
SLO management and deep observability
- Positioning SLOs as a control surface for reliability work.
- Using deep observability signals to support diagnosis, validation of hypotheses, and decision-making during incidents.
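One common way to use an SLO as a control surface is to track how fast the error budget is being consumed. The sketch below is illustrative only and is not taken from the session: the function names and the escalation threshold of 10.0 are assumptions chosen for the example.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent exactly on schedule;
    higher values mean it will run out before the SLO window ends.
    """
    budget = error_budget(slo_target)
    if total == 0:
        return 0.0  # no traffic, so no budget was consumed
    return (failed / total) / budget


def should_escalate(failed: int, total: int, slo_target: float,
                    threshold: float = 10.0) -> bool:
    """Hypothetical policy: escalate when the burn rate reaches the threshold."""
    return burn_rate(failed, total, slo_target) >= threshold
```

For example, 200 failed requests out of 10,000 against a 99.9% target gives a burn rate of roughly 20, which this hypothetical policy would escalate.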
Agent workflow testing via PR-based changes
- A demo workflow in which the agent is tested by opening a new pull request and monitoring the automated analysis that follows.
- Using PR activity as a structured way to validate agent behavior and outcomes.
Guardrails and human oversight
- Automated mitigation is framed as operating with guardrails and explicit human oversight.
- Discussion of how teams can keep control while still benefiting from autonomous execution.
Custom logic and organizational adaptation
- Extending the workflow with custom logic to fit organizational requirements.
- Adapting agent behavior to local processes and constraints.
Live simulation: diagnosing a credential desync
- A simulated incident scenario focused on diagnosing a credential desynchronization issue.
- Root cause confirmation is highlighted as part of the workflow.
Data-driven reasoning and use of past knowledge
- The agent applies prior knowledge and available data to identify patterns.
- Validating hypotheses is treated as a first-class step, not just generating a guess.
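The idea of validating hypotheses against data, rather than reporting the first guess, can be sketched as follows. This is a minimal illustration under stated assumptions, not the agent's actual logic: the `Hypothesis` type, the evidence keys (`store_secret_version`, `service_secret_version`, `auth_failures`, `cert_days_remaining`), and both checks are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple

Evidence = Dict[str, int]


class Hypothesis(NamedTuple):
    name: str
    check: Callable[[Evidence], bool]  # returns True only if the data supports it


def confirmed_causes(hypotheses: List[Hypothesis], evidence: Evidence) -> List[str]:
    """Keep only the hypotheses whose check passes against the collected evidence."""
    return [h.name for h in hypotheses if h.check(evidence)]


# Hypothetical checks for a credential desync scenario.
HYPOTHESES = [
    Hypothesis(
        "credential desync",
        # The service holds a different secret version than the store,
        # and authentication failures are actually being observed.
        lambda e: e["store_secret_version"] != e["service_secret_version"]
        and e["auth_failures"] > 0,
    ),
    Hypothesis(
        "expired certificate",
        lambda e: e["cert_days_remaining"] <= 0,
    ),
]
```

A hypothesis that fails its check is dropped rather than reported, which keeps unconfirmed guesses out of the root-cause summary.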
Governance model: review vs autonomous run modes
- Governance approach that supports different operating modes:
  - Review mode (human review before actions)
  - Autonomous run mode (agent executes within constraints)
- Command validation hooks are discussed as a control mechanism.
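The interaction between run modes and a command validation hook can be sketched as below. This is an assumption-laden illustration, not the product's API: `RunMode`, `Decision`, `validate_command`, `decide`, and the allow/block lists are all hypothetical names and policies invented for the example.

```python
import shlex
from enum import Enum


class RunMode(Enum):
    REVIEW = "review"          # a human approves before anything runs
    AUTONOMOUS = "autonomous"  # the agent runs validated commands itself


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs_approval"
    REJECT = "reject"


# Hypothetical policy: which binaries may run and which tokens are never allowed.
ALLOWED_BINARIES = {"kubectl", "systemctl"}
BLOCKED_TOKENS = {"delete", "--force"}


def validate_command(command: str) -> bool:
    """Validation hook: accept only allow-listed binaries without blocked tokens."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # malformed quoting, treat as invalid
        return False
    if not tokens or tokens[0] not in ALLOWED_BINARIES:
        return False
    return not any(token in BLOCKED_TOKENS for token in tokens[1:])


def decide(command: str, mode: RunMode) -> Decision:
    """The hook runs in both modes; the mode only affects commands that pass it."""
    if not validate_command(command):
        return Decision.REJECT
    if mode is RunMode.REVIEW:
        return Decision.NEEDS_APPROVAL
    return Decision.EXECUTE
```

In this sketch the hook applies regardless of mode, so switching to autonomous run mode removes the approval step but never widens what is allowed to execute.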
Monitoring and evaluation of the system
- Monitoring capabilities through session insights, dashboards, and evaluation metrics.
- Focus on measuring outcomes and behavior to ensure reliability and safety of autonomous operations.
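Evaluation metrics of this kind could be aggregated from per-session records along the following lines. The `Session` fields and the three metrics are assumptions made for illustration; the session does not specify which metrics the dashboards expose.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Session:
    resolved: bool               # did the agent's mitigation resolve the incident?
    needed_human_override: bool  # did a human have to step in or reverse an action?
    minutes_to_mitigate: float   # meaningful only when resolved is True


def summarize(sessions: List[Session]) -> Dict[str, Optional[float]]:
    """Aggregate hypothetical outcome and safety metrics over agent sessions."""
    if not sessions:
        return {"resolution_rate": None, "override_rate": None,
                "mean_minutes_to_mitigate": None}
    resolved = [s for s in sessions if s.resolved]
    return {
        "resolution_rate": len(resolved) / len(sessions),
        "override_rate": sum(s.needed_human_override for s in sessions) / len(sessions),
        # Average only over resolved sessions, so failures do not skew the time.
        "mean_minutes_to_mitigate": (
            mean(s.minutes_to_mitigate for s in resolved) if resolved else None
        ),
    }
```

Tracking the override rate alongside the resolution rate gives a rough view of both outcome quality and how often human oversight had to intervene.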
Session context
- Microsoft Build 2026 session: OD800
- Level: Advanced
- Topic area: Developer tools & frameworks