Use AI to Achieve Operational Excellence with the Well-Architected Framework practices

Microsoft Developer features Boris Scholl and Niels Buit discussing how to apply AI to Azure Well-Architected Framework Operational Excellence practices, focusing on observability, troubleshooting, automation, and the risks (non-determinism, security, privacy) that architects need to manage.

Overview

This episode explores how AI can improve Operational Excellence (“day 2” operations) using practices from the Azure Well-Architected Framework (WAF). It focuses on practical areas like observability, troubleshooting, and automation, and on the operational risks that come with non-deterministic systems.

What you’ll learn

Key themes discussed

AI for day 2 operations (not just building systems)

The conversation distinguishes between using AI to build systems and using AI to operate them after launch (day 2).

It emphasizes that operational excellence is often more critical after launch than during initial build.

Observability as the best starting point

Observability is highlighted as the most practical entry point for AI in operations.

OpenTelemetry is referenced as a way to get started.

Automation, simulation, and remediation

AI opportunities discussed include automation, simulation, and remediation.

Risk management: non-determinism, hallucinations, and guardrails

The episode calls out the trade-offs of non-deterministic AI behavior in production operations, including hallucinations and the need for guardrails.

It also references evaluation patterns like AI-as-a-judge to assess output quality.
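As a rough illustration of the AI-as-a-judge pattern, the sketch below passes each candidate answer through a separate judge and accepts it only above a score threshold. This is not taken from the episode: `Verdict`, `stub_judge`, `accept`, the `svc-` naming convention, and the 0.8 threshold are all hypothetical, and the stub uses a simple string check where a real setup would presumably call a second model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: float  # 0.0 (unusable) to 1.0 (fully acceptable)
    reason: str


# Hypothetical judge signature: (incident_context, candidate_answer) -> Verdict
Judge = Callable[[str, str], Verdict]


def _services(text: str) -> set:
    """Collect tokens that look like service names (assumed 'svc-' prefix)."""
    return {w.strip(".,") for w in text.split() if w.startswith("svc-")}


def stub_judge(context: str, answer: str) -> Verdict:
    """Stand-in for a model-based judge (illustration only).

    Penalizes answers that cite services absent from the incident context,
    which is one simple signal of a hallucinated recommendation.
    """
    cited = _services(answer)
    if not cited:
        return Verdict(0.5, "answer cites no services")
    unknown = cited - _services(context)
    score = 1.0 - len(unknown) / len(cited)
    if unknown:
        return Verdict(score, f"unknown services: {sorted(unknown)}")
    return Verdict(score, "all cited services appear in context")


def accept(context: str, answer: str, judge: Judge, threshold: float = 0.8) -> bool:
    """Accept the answer only if the judge scores it at or above the threshold."""
    return judge(context, answer).score >= threshold


context = "Alerts: svc-checkout latency high, svc-payments error rate up."
grounded = "Restart svc-checkout and inspect svc-payments."
ungrounded = "Scale svc-inventory to fix the outage."

print(accept(context, grounded, stub_judge))
print(accept(context, ungrounded, stub_judge))
```

Keeping the judge behind a plain function signature means the stub can later be swapped for a model call without changing the acceptance logic.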

Human-in-the-loop for critical incidents

For high-impact operational actions, the discussion stresses keeping humans involved, especially where decisions affect availability, security, or data integrity.
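One way to picture a human-in-the-loop gate is shown below: low-impact actions proposed by an AI assistant run automatically, while high-impact ones require explicit approval first. This is a minimal sketch under assumptions, not the approach described in the episode; `Impact`, `ProposedAction`, `execute_with_gate`, and the approver callback are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Impact(Enum):
    LOW = 1   # e.g. collecting extra diagnostics
    HIGH = 2  # e.g. anything touching availability, security, or data integrity


@dataclass
class ProposedAction:
    description: str
    impact: Impact


def execute_with_gate(
    action: ProposedAction,
    run: Callable[[ProposedAction], None],
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Run low-impact actions directly; ask a human before high-impact ones."""
    if action.impact is Impact.HIGH and not approve(action):
        return "rejected"
    run(action)
    return "executed"


executed: List[str] = []


def run_action(action: ProposedAction) -> None:
    # Placeholder for the real remediation step.
    executed.append(action.description)


def deny_all(action: ProposedAction) -> bool:
    # Stand-in for a human approver who declines every request.
    return False


low = ProposedAction("collect heap dump", Impact.LOW)
high = ProposedAction("fail over primary database", Impact.HIGH)

print(execute_with_gate(low, run_action, deny_all))
print(execute_with_gate(high, run_action, deny_all))
```

In practice the `approve` callback might open a ticket or page an on-call engineer; the point of the sketch is only that the high-impact path cannot proceed without that decision.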

Security, privacy, and governance

Security and privacy considerations are covered as part of making AI-driven operations safe and trustworthy, alongside governance practices.

Timeline (from the video description)

Resources

Speakers