Weekly Azure Roundup: Day-2 Ops, Secure Compute, Hybrid Control

Azure stories clustered around day-2 reality: better operational signal across container and hybrid estates, tighter compute security without breaking automation, and platform moves like leaving Heroku or running AI inference on-cluster. The framing across these posts is mostly about controlled transitions instead of all-at-once rewrites.

SRE, incident response, and observability getting more “context-aware”

Building on last week’s platform engineering thread, Azure reliability tooling this week emphasized that AI-assisted operations only work when grounded in your environment. Azure SRE Agent’s onboarding flow reflects that. Provisioning the runtime (managed identity, Application Insights, Log Analytics) is straightforward, but the UI pushes you to attach context so answers are not generic. That includes connecting code (health probes, auth flows, history), logs with known schemas, incident sources, scoped Azure resources, and team knowledge files. The walkthrough is clear about outcomes: deployment-state summaries via Azure CLI plus logs, diagnosing 401s with app-specific checks, generating/triaging Azure Monitor incidents, finding RBAC/Log Analytics permission gaps, and classifying GitHub issues. It ends with a practical “done” definition: the agent can handle a live incident with app-specific, data-backed guidance. (It also notes SRE Agent GA as of March 10, 2026, with creation at sre.azure.com.) Azure Managed Grafana 12 shipped investigation and access-control improvements. The main change is current-user Entra auth for supported Azure data sources (Azure Monitor, Azure Data Explorer, Azure Monitor Managed Service for Prometheus), so queries can run under the signed-in user’s permissions instead of a shared identity. This supports least privilege and auditing while still supporting Managed Identity and Service Principal for automation. Logs exploration also improves with a faster Explore flow and a query builder for iterating Azure Monitor Logs queries without writing KQL from scratch, plus a higher record limit (up to 30,000 per query). Metrics querying improves for Prometheus + OpenTelemetry users with better OTLP/histogram support and an OTel mode to reduce difficult label joins.

Cloud-native operations: isolated AKS inference and clearer container supply signals

Following last week’s hybrid/disconnected focus (Arc Gateway for Kubernetes GA, sovereign-cloud governance training), teams running AI inference under strict network controls got a blueprint for air-gapped AKS. The constraint is straightforward: with no egress (no public LBs, NAT, public IPs on node subnets, and subnet design blocking default outbound), common LLM images fail because they download weights at startup. The guide offers two patterns: build “fat” images with weights baked in and push to private ACR (example uses az acr build and HF_TOKEN), or pre-download artifacts outside the cluster and expose them via private storage (for example, Azure Files over NFS) mounted into vLLM/NIM pods. It also highlights ACR artifact cache to stage upstream images so deployments do not depend on public registries. For GPUs, it points to a managed GPU node pool preview (prereqs preinstalled and lifecycle-managed) as an operations simplifier, with NVIDIA GPU Operator air-gapped mode as the alternative for tighter version control. Validation stays operational: internal service IPs, calling an OpenAI-compatible /v1/chat/completions endpoint from inside the private network, and verifying the model identifier matches baked/mounted artifacts. Azure Container Registry also added proactive health monitoring to help separate “our pipeline broke” from “the registry is degraded.” ACR now runs SLI-based auto-detection across auth/push/pull and emits Azure Service Health events when a region is degraded. Events include tracking IDs, impacted regions/resources, and mitigation status (automated remediation vs engineer response). The practical benefit is faster correlation when CI/CD fails or Kubernetes pulls time out, and it integrates with standard paging via Service Health alert rules/action groups (PagerDuty/Opsgenie/ServiceNow/webhooks), with ARM/Bicep guidance for standardizing alerting. Microsoft’s KubeCon Europe 2026 lineup post is not a product change, but it shows where Azure’s cloud-native guidance is trending: AI agents for platform operations (HolmesGPT), shared GPU inference scheduling (including Kueue), multi-cloud inference patterns, and staples like Istio/networking, OpenTelemetry, supply chain tooling (Notary/ORAS/Ratify), confidential containers, Terraform operations, and fleet management.

Compute security and contracts: Confidential VM workflows, plus an API response change to watch

This continues last week’s compliance/governance thread (sovereign/hybrid controls, policy-as-code thinking), but with a compute-primitives focus: repeatable Confidential VM workflows, plus a small API contract change that can still break automation. The Confidential VM custom image workflow shows how to build a hardened Windows golden image once, publish via Azure Compute Gallery for versioned/cross-region rollout, then choose disk encryption posture at deployment time. One image can serve PMK and CMK because encryption is applied via OS disk configuration (SecurityEncryptionType) and, for CMK, a Disk Encryption Set wired to Key Vault/Managed HSM. It also calls out a pipeline constraint: to publish an image supporting Confidential security types, you must use a Source VHD, so you export a generalized OS disk to a storage-account VHD before creating an image version. The guide covers common time sinks like Sysprep failures due to BitLocker (decrypt and retry) and includes PowerShell for -SecurityType "ConfidentialVM", Secure Boot, vTPM, and confidential OS encryption on Gen2 Windows Server 2022 Azure Edition images. Azure Migrate guidance extends this into a private, governed runbook: private endpoints for Azure Migrate and staging storage, Private DNS zones, ExpressRoute/S2S VPN, Disk Encryption Sets with CMK, and attestation-gated key release via Managed HSM so keys release only after the VM proves it booted in an expected confidential state. Planning details include supported SKUs across AMD SEV-SNP and Intel TDX (for example, DCasv6/ECasv6 and DCesv6/ECesv6), Gen2 + UEFI + Secure Boot + vTPM requirements, disk compatibility constraints (including a noted ⇐128 GB OS disk limit for full confidential disk encryption support, with workarounds), and OS support caveats. The runbook is structured as nine phases from appliance setup through test migration, cutover, and post-migration policy enforcement, with day-2 governance via Azure Policy, Azure Monitor, and Defender for Cloud (optionally Sentinel). Azure Compute also flagged a REST API contract change: with api-version=2025-11-01, VM/VMSS responses will always return a non-null properties.securityProfile.securityType. If you omit or send null on create/update, responses return "Standard". Explicit "TrustedLaunch" and "ConfidentialVM" stay the same. Provisioning behavior does not change, but tooling that treats null as distinct must update to treat "Standard" as the default for the new API version.

App platform and API patterns: Heroku exits, and APIM as a BFF without another microservice

This week’s “controlled transition” framing echoes last week’s migration guides (container runtime shifts, ExpressRoute gateway migration): sequence changes so teams can keep shipping while platforms move. Heroku migration guidance continues to sharpen into a move-in-slices playbook. Azure App Service is framed as the closest landing zone for Heroku-style apps/APIs, with Azure Container Apps for teams ready for containers with serverless traits like scale-to-zero. It calls out continuity for common languages (.NET/Java/Node.js/Python) and Docker, then recommends incremental modernization: rehost first, then evolve toward microservices and event-driven patterns using Dapr and KEDA. For data, Azure Database for PostgreSQL is positioned as the natural step for Heroku Postgres users, aligning with last week’s Postgres-as-bridge trend. Delivery centers on GitHub integration and GitHub Actions, with operations via Azure Monitor and Azure SRE Agent, and AI feature delivery via Azure AI Foundry (Microsoft Foundry), including MCP tool-calling once apps stabilize. A separate guide shows implementing the BFF / Curated API pattern directly in Azure API Management policies, avoiding a separate aggregation service when the main need is fan-out plus response shaping. The examples use <wait for="all"> with parallel <send-request>, store results in variables, then build a combined payload with policy expressions (C#-style) and JObject (Newtonsoft.Json), returning via <return-response>. It also covers semantics (200 when all succeed; consider 206/207 for partial success) and a practical cutoff: if orchestration logic gets too complex or divergent, a coded BFF (Functions or Container Apps) is easier to maintain.

Other Azure News

The MSSQL extension v1.40 for VS Code added workflow updates: Edit Data (grid edits in-editor), better object search for large schemas, DACPAC export/import for packaging and promoting schema, and flat file (.csv/.txt) import for quick loads. This reduces trips to SSMS/Azure Data Studio and supports the “move in slices” migration theme by making schema/data validation and repeatable promotion easier during platform transitions.