Automating On-Call Runbooks with Azure SRE Agent for Incident Response
In this post, dchelupati details how Azure SRE Agent can automate the tedious on-call runbook execution process for DevOps and SRE teams, enabling faster, more reliable incident diagnostics.
Author: dchelupati
When production throws 500 errors at 3am, the repetitive, manual runbook tasks involved in incident response can be error-prone and exhausting. dchelupati shares a solution using Azure SRE Agent to automate these diagnostics, reducing time to discovery and letting engineers focus on decisive action.
The Problem with Manual Runbooks
- Incident response often involves following documented steps: querying metrics, logs, requests, and errors.
- This manual process is tedious, repetitive, and prone to human error, especially during late-night on-call emergencies.
Azure SRE Agent + Runbook Automation Overview
- The Azure SRE Agent can read markdown runbooks, execute diagnostic steps (such as az monitor metrics, Log Analytics and App Insights queries), and compile a summary email for responders.
- Reduces terminal context switching and the chance of missing key troubleshooting steps.
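To make "reading a markdown runbook" concrete, here is a minimal Python sketch that pulls the fenced commands and queries out of a runbook, each tagged with the heading it sits under. It does not show how Azure SRE Agent works internally; `Step`, `parse_runbook`, and the sample runbook text are hypothetical names and content invented for illustration.

```python
from dataclasses import dataclass

# Built programmatically so this example nests cleanly inside markdown.
FENCE = "`" * 3


@dataclass
class Step:
    title: str     # nearest preceding heading in the runbook
    language: str  # fence info string, e.g. "bash" or "kql"
    body: str      # the command or query text inside the fence


def parse_runbook(markdown: str) -> list[Step]:
    """Collect every fenced block in a runbook as an ordered list of steps."""
    steps: list[Step] = []
    title = ""
    language = None  # None means we are currently outside a fence
    buffer: list[str] = []
    for line in markdown.splitlines():
        if language is None:
            if line.startswith("#"):
                title = line.lstrip("#").strip()
            elif line.startswith(FENCE):
                language = line[len(FENCE):].strip()
                buffer = []
        elif line.startswith(FENCE):
            # Closing fence: emit the step and return to prose mode.
            steps.append(Step(title, language, "\n".join(buffer)))
            language = None
        else:
            buffer.append(line)
    return steps


# Hypothetical runbook content, assumed for this example only.
RUNBOOK = "\n".join([
    "# HTTP 500 triage",
    "## Check failed requests",
    FENCE + "kql",
    "requests | where success == false | summarize count() by resultCode",
    FENCE,
    "## Check app logs",
    FENCE + "bash",
    "az containerapp logs show --name <app> --resource-group <rg>",
    FENCE,
])

if __name__ == "__main__":
    for step in parse_runbook(RUNBOOK):
        print(f"[{step.language}] {step.title}: {step.body}")
```

Keeping each command under its own heading is what makes a runbook easy to walk in order, whether the reader is a person or an automated agent.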
Runbook Components Used
- az monitor metrics — Checking resource health and usage
- Log Analytics queries — Investigating error and exception patterns
- App Insights — Reviewing failed requests, stack traces, and correlation IDs
- az containerapp logs — Accessing app revision logs and configuration
- All steps are written in standard markdown, including CLI and KQL queries.
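As a rough illustration of the four components listed above, the Python sketch below builds the corresponding Azure CLI invocations as argument lists without running them. The helper names, placeholder values, and the sample KQL are assumptions made for this example; the flags reflect the Azure CLI as commonly documented, so confirm them with `--help` on your installed version before putting them in a runbook.

```python
import shlex


def metrics_cmd(resource_id: str, metric: str) -> list[str]:
    """Resource health and usage via az monitor metrics."""
    return ["az", "monitor", "metrics", "list",
            "--resource", resource_id, "--metrics", metric,
            "--output", "json"]


def log_analytics_cmd(workspace_id: str, kql: str) -> list[str]:
    """Error and exception patterns via a Log Analytics query."""
    return ["az", "monitor", "log-analytics", "query",
            "--workspace", workspace_id, "--analytics-query", kql,
            "--output", "json"]


def app_insights_cmd(app: str, kql: str) -> list[str]:
    """Failed requests and stack traces via an App Insights query.

    This command group typically needs the application-insights CLI extension.
    """
    return ["az", "monitor", "app-insights", "query",
            "--app", app, "--analytics-query", kql,
            "--output", "json"]


def containerapp_logs_cmd(name: str, resource_group: str) -> list[str]:
    """Container app logs via az containerapp logs."""
    return ["az", "containerapp", "logs", "show",
            "--name", name, "--resource-group", resource_group]


# Sample KQL against the App Insights requests table (illustrative only).
FAILED_REQUESTS_KQL = (
    "requests | where success == false | summarize count() by resultCode"
)

if __name__ == "__main__":
    # Print a shell-safe command line that could be pasted into a runbook.
    print(shlex.join(app_insights_cmd("<app-name>", FAILED_REQUESTS_KQL)))
```

Building the commands as lists and quoting them with `shlex.join` keeps the KQL pipe characters from being interpreted by the shell when the line is pasted into a terminal.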
Automation Process Steps
- Create SRE Agent: Use the Azure portal to spin up an agent (no resource group needed for most scenarios).
- Assign Roles: (Optional) Grant the agent the Reader role so it can run az commands, if your runbooks target Azure resources.
- Load Runbooks: Add markdown runbook files to the agent’s knowledge base.
- Connect Communication: Integrate with Outlook to get emailed findings.
- Configure Subagent: Create a subagent with instructions to find and execute appropriate runbooks, collect evidence, and send summaries.
- Set Up Incident Trigger: Connect incident management tools like PagerDuty, ServiceNow, or Azure Monitor alerts to trigger the workflow.
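The subagent's job in the steps above (find a matching runbook, collect evidence, send a summary) can be sketched in plain Python. This is a toy model under stated assumptions, not the agent's actual behavior: the keyword-overlap matching, the `Runbook` and `Evidence` types, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Runbook:
    name: str
    keywords: set[str]  # terms that suggest this runbook applies


@dataclass
class Evidence:
    step: str    # which runbook step produced this
    output: str  # condensed result of the command or query


def pick_runbook(incident_title: str, runbooks: list[Runbook]) -> Optional[Runbook]:
    """Choose the runbook whose keywords overlap most with the incident title."""
    words = set(incident_title.lower().split())
    best = max(runbooks, key=lambda rb: len(rb.keywords & words), default=None)
    if best is None or not (best.keywords & words):
        return None  # nothing matched; leave the incident to a human
    return best


def summarize(incident_title: str, runbook: Runbook, evidence: list[Evidence]) -> str:
    """Assemble a plain-text summary suitable for an email body."""
    lines = [f"Incident: {incident_title}", f"Runbook: {runbook.name}", "Findings:"]
    for number, item in enumerate(evidence, start=1):
        lines.append(f"{number}. {item.step}: {item.output}")
    return "\n".join(lines)


# Hypothetical data, assumed for this example only.
RUNBOOKS = [
    Runbook("high-cpu", {"cpu", "throttling"}),
    Runbook("http-5xx-triage", {"500", "5xx", "http"}),
]

if __name__ == "__main__":
    title = "HTTP 500 errors on checkout API"
    chosen = pick_runbook(title, RUNBOOKS)
    if chosen is not None:
        found = [Evidence("Check failed requests", "placeholder result text")]
        print(summarize(title, chosen, found))
```

Returning `None` when nothing matches is a deliberate choice in this sketch: running an unrelated runbook would produce a confident-looking summary of the wrong evidence.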
Flexibility and Platform Agnosticism
- Works with on-prem, hybrid, or other cloud environments: simply include the right diagnostic steps in your runbook.
- The agent executes whatever is in your markdown file, regardless of platform.
Benefits Highlighted
- Reduction in MTTR: Get concise analysis before even starting your investigation.
- Consistent Execution: Automated process means no missed or forgotten steps.
- Documented Evidence: All queries and results are preserved for future postmortems.
- Empowered Decision-Making: Spend your mental energy on solutions, not data collection.
Conclusion
Automating runbook execution with Azure SRE Agent transforms incident management, making it faster and reducing on-call stress for engineers. If you maintain runbooks for diagnostics, consider connecting them to an SRE Agent and letting automation handle your next 3am alert.
Try it: Convert your runbooks to markdown, add them to SRE Agent, and let automation do the grunt work.
Published: Dec 20, 2025
This post appeared first on “Microsoft Tech Community”.