In this post, dchelupati explains how Azure SRE Agent can automate on-call runbook execution for DevOps and SRE teams, making incident diagnostics faster and more reliable.

Automating On-Call Runbooks with Azure SRE Agent for Incident Response

Author: dchelupati

When production throws 500 errors at 3am, working through a runbook by hand is exhausting and error-prone. The approach described here uses Azure SRE Agent to run those diagnostics automatically, so findings arrive sooner and engineers can focus on deciding what to do.

The Problem with Manual Runbooks

  • Incident response often involves following documented steps: querying metrics, logs, requests, and errors.
  • This manual process is tedious, repetitive, and prone to human error, especially during late-night on-call emergencies.

Azure SRE Agent + Runbook Automation Overview

  • The Azure SRE Agent can read markdown runbooks, execute diagnostic steps (such as az monitor metrics, Log Analytics and App Insights queries), and compile a summary email for responders.
  • This reduces context switching between terminals and lowers the chance of skipping a key troubleshooting step.

Runbook Components Used

  • az monitor metrics: Checking resource health and usage
  • Log Analytics queries: Investigating error and exception patterns
  • App Insights: Reviewing failed requests, stack traces, and correlation IDs
  • az containerapp logs: Accessing app revision logs and configuration
  • All steps are written in standard markdown, including CLI and KQL queries.
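
To make the idea of a machine-readable markdown runbook concrete, here is a minimal sketch of how such a file could be split into executable steps. It does not show how Azure SRE Agent parses runbooks internally. The runbook text, the `Step` class, `parse_runbook`, and the `$APP_RESOURCE_ID` variable are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

FENCE = "`" * 3  # built at runtime so the sample runbook can contain code fences

# Hypothetical runbook; the resource ID variable and query are placeholders.
RUNBOOK = f"""# HTTP 500 errors on checkout-api

## Step 1: Check request metrics
{FENCE}bash
az monitor metrics list --resource "$APP_RESOURCE_ID" --metric Requests
{FENCE}

## Step 2: Summarize failed requests
{FENCE}kusto
requests | where success == false | summarize count() by resultCode
{FENCE}
"""


@dataclass
class Step:
    title: str     # nearest "## " heading above the code block
    language: str  # fence info string, e.g. "bash" or "kusto"
    body: str      # the command or query text


def parse_runbook(markdown: str) -> list[Step]:
    """Collect every fenced block together with the heading that precedes it."""
    steps: list[Step] = []
    title = ""
    lines = iter(markdown.splitlines())
    for line in lines:
        if line.startswith("## "):
            title = line[3:].strip()
        elif line.startswith(FENCE):
            language = line[len(FENCE):].strip() or "text"
            body = []
            for inner in lines:  # consume lines until the closing fence
                if inner.startswith(FENCE):
                    break
                body.append(inner)
            steps.append(Step(title, language, "\n".join(body)))
    return steps


steps = parse_runbook(RUNBOOK)
for step in steps:
    print(f"[{step.language}] {step.title}")
```

Keeping the language tag on each step lets a runner decide whether a block is a CLI command or a KQL query before executing it.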

Automation Process Steps

  1. Create SRE Agent: Use the Azure portal to spin up an agent (no resource group needed for most scenarios).
  2. Assign Roles (optional): Grant the agent the Reader role so it can run az commands if your runbooks target Azure resources.
  3. Load Runbooks: Add markdown runbook files to the agent’s knowledge base.
  4. Connect Communication: Connect Outlook so the agent can email its findings.
  5. Configure Subagent: Create a subagent with instructions to find and execute appropriate runbooks, collect evidence, and send summaries.
  6. Set Up Incident Trigger: Connect incident management tools like PagerDuty, ServiceNow, or Azure Monitor alerts to trigger the workflow.
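
The subagent's job in steps 5 and 6 (pick a runbook, run its steps, collect evidence, send a summary) can be sketched roughly as below. This is an illustrative model of the workflow, not the agent's actual implementation or API. `Runbook`, `Evidence`, `select_runbook`, `handle_incident`, the keyword-overlap matching, and the stub runner are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Runbook:
    name: str
    keywords: set[str]  # hypothetical matching hints, not an agent feature
    steps: list[str]


@dataclass
class Evidence:
    step: str
    output: str


def select_runbook(incident_title: str, runbooks: list[Runbook]) -> Optional[Runbook]:
    """Pick the runbook whose keywords overlap most with the incident title."""
    words = set(incident_title.lower().split())
    best = max(runbooks, key=lambda rb: len(rb.keywords & words), default=None)
    if best is None or not (best.keywords & words):
        return None
    return best


def handle_incident(
    incident_title: str,
    runbooks: list[Runbook],
    run_step: Callable[[str], str],
) -> str:
    """Run every step of the matching runbook and compile a plain-text summary."""
    runbook = select_runbook(incident_title, runbooks)
    if runbook is None:
        return f"Incident: {incident_title}\nNo matching runbook found."
    evidence = [Evidence(step, run_step(step)) for step in runbook.steps]
    lines = [f"Incident: {incident_title}", f"Runbook: {runbook.name}", ""]
    for number, item in enumerate(evidence, start=1):
        lines.append(f"{number}. {item.step}")
        lines.append(f"   {item.output}")
    return "\n".join(lines)


runbooks = [
    Runbook("http-500-errors", {"http", "500", "errors"},
            ["Check request metrics", "Summarize failed requests"]),
    Runbook("high-cpu", {"cpu", "throttling"}, ["Check CPU metrics"]),
]


def fake_runner(step: str) -> str:
    # Stand-in for real execution; returns canned text instead of running anything.
    return f"(stub output for '{step}')"


summary = handle_incident("HTTP 500 errors on checkout-api", runbooks, fake_runner)
print(summary)
```

Passing the runner in as a function keeps the orchestration testable without touching real infrastructure. In practice the summary text would be handed to whatever email integration is configured.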

Flexibility and Platform Agnosticism

  • Works with on-prem, hybrid, or other cloud environments: include the right diagnostic steps in your runbook.
  • The agent executes whatever is in your markdown file, regardless of platform.
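
The platform-agnostic point comes down to this: a runbook step is just a command, so the same runner can call az, kubectl, or an in-house script. A minimal sketch of such a runner follows, assuming steps arrive as argument lists. `run_step` and its result dictionary are hypothetical and do not reflect how Azure SRE Agent executes commands.

```python
import subprocess
import sys


def run_step(argv: list[str], timeout_s: float = 30.0) -> dict:
    """Run one diagnostic command and capture its output as evidence."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except FileNotFoundError:
        # The tool named in the runbook is not installed on this host.
        return {"command": argv, "exit_code": None, "stdout": "",
                "stderr": f"{argv[0]}: command not found"}
    except subprocess.TimeoutExpired:
        return {"command": argv, "exit_code": None, "stdout": "",
                "stderr": f"timed out after {timeout_s}s"}
    return {"command": argv, "exit_code": done.returncode,
            "stdout": done.stdout, "stderr": done.stderr}


# Any executable works; the Python interpreter stands in for a real diagnostic tool.
result = run_step([sys.executable, "-c", "print('disk ok')"])
print(result["exit_code"], result["stdout"].strip())
```

Recording a failure as evidence rather than raising means one missing tool does not stop the remaining steps from running.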

Benefits Highlighted

  • Reduction in MTTR: Get concise analysis before even starting your investigation.
  • Consistent Execution: Automated process means no missed or forgotten steps.
  • Documented Evidence: All queries and results are preserved for future postmortems.
  • Empowered Decision-Making: Spend your mental energy on solutions, not data collection.

Conclusion

Automating runbook execution with Azure SRE Agent transforms incident management, making it faster and reducing on-call stress for engineers. If you maintain runbooks for diagnostics, consider connecting them to an SRE Agent and letting automation handle your next 3am alert.

Try it: Convert your runbooks to markdown, add them to SRE Agent, and let automation do the grunt work.


Published: Dec 20, 2025

This post appeared first on “Microsoft Tech Community”.