Use AI to Achieve Operational Excellence with the Well-Architected Framework practices
Microsoft Developer features Boris Scholl and Niels Buit discussing how to apply AI to Azure Well-Architected Framework Operational Excellence practices, focusing on observability, troubleshooting, automation, and the risks (non-determinism, security, privacy) that architects need to manage.
Overview
This episode explores how AI can improve Operational Excellence (“day 2” operations) using practices from the Azure Well-Architected Framework (WAF). It focuses on practical areas such as observability, troubleshooting, and automation, along with the operational risks that come with non-deterministic systems.
What you’ll learn
- How AI fits into the Operational Excellence pillar of the Well-Architected Framework
- Where AI helps most today:
  - Observability and telemetry analysis
  - Dependency mapping
  - Proactive issue detection
  - Summarization and troubleshooting support
- What to watch out for:
  - Non-determinism and operational risk
  - Hallucinations and root-cause analysis (RCA) accuracy
  - Security, privacy, and governance concerns
- Why and how to keep a human in the loop for critical decisions
- How WAF can shift from a static checklist to a more “living” framework when combined with AI and evaluation patterns
Key themes discussed
AI for day 2 operations (not just building systems)
The conversation distinguishes between:
- Running AI systems (operationally managing AI workloads)
- Using AI to improve operations (making production operations more effective)
It emphasizes that operational excellence is often more critical after launch than during initial build.
Observability as the best starting point
Observability is highlighted as the most practical entry point for AI in operations, including:
- Using telemetry to detect emerging issues earlier
- Building dependency graphs and understanding system relationships
- Enabling predictive failure detection
The episode points to OpenTelemetry, the vendor-neutral standard for traces, metrics, and logs, as a way to start collecting this telemetry.
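To make the dependency-graph idea concrete, here is a minimal Python sketch that derives service-to-service edges from trace spans. It is not taken from the episode. The record layout (`span_id`, `parent_id`, `service`) is a simplified, hypothetical stand-in for the span ID, parent span ID, and service name that OpenTelemetry traces carry, and the sample data is invented for illustration.

```python
from collections import defaultdict


def build_dependency_graph(spans):
    """Return {caller_service: [callee_service, ...]} derived from span records.

    Each span is a dict with hypothetical keys: span_id, parent_id, service.
    An edge is recorded whenever a span's parent belongs to a different service.
    """
    service_by_span = {s["span_id"]: s["service"] for s in spans}
    graph = defaultdict(set)
    for span in spans:
        parent = span.get("parent_id")
        # Skip root spans and spans whose parent was not captured.
        if parent is None or parent not in service_by_span:
            continue
        caller = service_by_span[parent]
        callee = span["service"]
        if caller != callee:  # ignore calls that stay inside one service
            graph[caller].add(callee)
    return {caller: sorted(callees) for caller, callees in graph.items()}


# Invented sample trace: frontend calls orders and catalog; orders calls payments.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "orders"},
    {"span_id": "c", "parent_id": "b", "service": "orders"},
    {"span_id": "d", "parent_id": "c", "service": "payments"},
    {"span_id": "e", "parent_id": "a", "service": "catalog"},
]

print(build_dependency_graph(SPANS))
```

A graph like this is the kind of structure an AI assistant could be given as context when reasoning about which downstream services a failure may affect.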
Automation, simulation, and remediation
AI opportunities discussed include:
- Supporting engineers during incidents (summaries, suggestions)
- Exploring simulation scenarios
- Moving toward safer auto-remediation patterns, where appropriate
Risk management: non-determinism, hallucinations, and guardrails
The episode calls out the trade-offs of non-deterministic AI behavior in production operations:
- AI output can be inconsistent across runs
- Hallucinations can create risk in RCA and summarization
- Guardrails and “fail fast” thinking matter
It also references evaluation patterns like AI-as-a-judge to assess output quality.
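One simple guardrail against inconsistent output is to sample the model several times and only accept an answer when enough runs agree. The Python sketch below illustrates that repeated-sampling check. It is a simpler relative of the AI-as-a-judge pattern, not an implementation of it, and it is not described in the episode in this form. The `generate` callable, the threshold, and the canned answers are all assumptions; `fake_model` stands in for a real model call.

```python
from collections import Counter
from itertools import cycle


def consistency_check(generate, prompt, runs=5, min_agreement=0.7):
    """Call `generate` several times and accept only a sufficiently common answer.

    `generate` is any callable taking a prompt and returning a string
    (hypothetical; in practice it would wrap a model call).
    """
    answers = [generate(prompt) for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / runs
    return {
        "answer": top_answer,
        "agreement": agreement,
        "accepted": agreement >= min_agreement,
    }


# Stand-in for a non-deterministic model: four matching answers and one outlier.
_canned = cycle([
    "db-connection-pool-exhausted",
    "db-connection-pool-exhausted",
    "dns-failure",
    "db-connection-pool-exhausted",
    "db-connection-pool-exhausted",
])


def fake_model(prompt):
    return next(_canned)


result = consistency_check(fake_model, "Summarize the likely root cause")
print(result)
```

Agreement across runs does not prove an answer is correct, so a rejected or low-agreement result is best treated as a signal to escalate to an engineer rather than as a final verdict.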
Human-in-the-loop for critical incidents
For high-impact operational actions, the discussion stresses keeping humans involved, especially where decisions affect availability, security, or data integrity.
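A human-in-the-loop rule like this can be expressed as an approval gate in front of automated actions. The following Python sketch is one possible shape, under stated assumptions: the impact categories mirror the three areas named above, while the function names, action names, and the `approver` callback are hypothetical and would map to a real approval workflow in practice.

```python
# Impact areas that always require a human decision (taken from the text above).
HIGH_IMPACT = {"availability", "security", "data_integrity"}


def requires_approval(affects):
    """Return True if the action touches any high-impact area."""
    return bool(set(affects) & HIGH_IMPACT)


def execute(action_name, affects, approver=None):
    """Run an action only if it is low impact or a human approves it.

    `approver` is a hypothetical callback that returns True when a human
    has signed off; with no approver, high-impact actions are blocked.
    """
    if requires_approval(affects):
        if approver is None or not approver(action_name):
            return "blocked"
    # A real system would perform the remediation here.
    return "executed"


print(execute("clear-temp-files", affects=[]))
print(execute("fail-over-database", affects=["availability"]))
print(execute("fail-over-database", affects=["availability"],
              approver=lambda name: True))
```

Blocking by default when no approver is present keeps the failure mode safe: an AI suggestion can never trigger a high-impact change on its own.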
Security, privacy, and governance
Security and privacy considerations are covered as part of making AI-driven operations safe and trustworthy, alongside governance practices.
Timeline (from the video description)
- 00:06 – AI and Operational Excellence in the Well-Architected Framework
- 00:48 – Running AI vs Using AI to Improve Operations
- 01:21 – Why Day Two Operations Matter More Than Day One
- 02:05 – Observability, Deployment, and Troubleshooting
- 02:40 – Why Observability Is the Best Starting Point for AI
- 03:38 – Non-Deterministic AI and Operational Risk
- 04:05 – Real-World AI Use in Engineering Operations
- 04:39 – AI Opportunities: Simulation and Auto-Remediation
- 05:00 – Getting Started with OpenTelemetry
- 05:48 – What AI Enables in Observability Today
- 06:24 – Dependency Graphs and Predictive Failure Detection
- 07:01 – Human-in-the-Loop for Critical Incidents
- 08:00 – Fail Fast, Guardrails, and Responsible AI
- 08:50 – Hallucinations, RCA Accuracy, and Summarization Risk
- 09:29 – AI-as-a-Judge and Evaluation Patterns
- 10:17 – Security and Privacy Considerations
- 11:30 – How WAF Helps Navigate AI Tradeoffs
- 12:19 – Making WAF a Living, AI-Driven Framework
- 13:15 – Key Takeaways and Final Thoughts
Resources
- Azure Well-Architected Framework: https://learn.microsoft.com/azure/well-architected/
- Operational Excellence (WAF): https://learn.microsoft.com/en-us/azure/well-architected/operational-excellence/
- Try Azure for free: https://aka.ms/AzureFreeTrialYT
Speakers
- Boris Scholl: https://www.linkedin.com/in/bscholl/
- Niels Buit: https://www.linkedin.com/in/nielsbuit/