saziz_msft introduces a hands-on guide to building self-healing production infrastructure with Azure SRE Agent, focusing on autonomous detection, remediation, and reporting of performance issues.

Fix It Before They Feel It: Proactive Reliability with Azure SRE Agent

Overview

Discover how to achieve higher reliability and autonomous remediation for your cloud solutions using the Azure SRE Agent. This agent leverages AI techniques to detect, diagnose, and fix production issues with no human intervention, ensuring minimal user impact and fast recovery.

Demo and Key Features

Presented at .NET Day 2025, the blog features a live demo where intentionally degraded code is deployed in production. Azure SRE Agent detects the anomaly, rolls back to a healthy state, communicates the incident, and generates insightful reports—all automatically.

Capabilities

Proactive Baseline Learning: Learns application response norms and stores historical metrics
Real-time Anomaly Detection: Monitors production for deviations from baselines
Autonomous Remediation: Performs Azure CLI operations (slot swaps) upon alert triggers
Cross-platform Notification: Updates Microsoft Teams and GitHub Issues during incidents
Incident Reporting: Generates daily summaries of incident response and resolution metrics

Demo Architecture

Application Layer:
- .NET 9 Web API hosted on Azure App Service
- Application Insights for monitoring
- Azure Monitor Alerts for trigger events
Azure SRE Agent:
- 3 sub-agents (AvgResponseTime, DeploymentHealthCheck, DeploymentReporter)
- Knowledge Store for metric baselines
External Integrations:
- GitHub for code tracking and issue automation
- Microsoft Teams for real-time comms
- Outlook for report delivery

Walkthrough: Demo Flow

Deploy infrastructure and applications using provided PowerShell scripts
Configure sub-agents in Azure SRE Agents Portal (set up triggers, schedules)
Run the demo:
- Simulate bad code in production
- Agent evaluates response times via Application Insights
- If >20% degradation, triggers health check, Azure slot swap to rollback
- Posts incident updates to Teams, creates GitHub issue, sends email summary

Detailed setup and demo instructions

Setting Up

Prerequisites:

Azure subscription with Contributor access
Azure CLI (az login)
.NET 9.0 SDK
PowerShell 7.0+

Key Scripts:

1-setup-demo.ps1 — Deploys base infrastructure, App Service, Insights, baseline app versions
2-run-demo.ps1 — Runs the anomaly + remediation demo

Performance Toggle Example (ProductsController.cs):

private const bool EnableSlowEndpoints = false; // false = fast, true = slow

Production: EnableSlowEndpoints = false (~50ms response)
Staging: EnableSlowEndpoints = true (~1500ms response)

Technology Stack

.NET 9.0 Web API
Azure App Service
Application Insights and Log Analytics
Azure DevOps/Bicep Infrastructure
Azure SRE Agent + PowerShell

Resources

Conclusion

Implementing autonomous, proactive remediation with Azure SRE Agent reduces MTTD and MTTR, ensuring high availability and robust user experience.

This post appeared first on “Microsoft Tech Community”. Read the entire article here