Weekly Azure Roundup: Managed Identity, Observability, DNS

Azure's updates this week leaned toward making production operations less brittle, continuing last week's theme of controlled transitions and day-two readiness. Identity continues shifting away from long-lived secrets, ops tooling continues emphasizing “observe first, automate safely,” and app hosting continues smoothing runtime upgrades and practical deployment paths. Architecture guidance stayed grounded in scale realities: DNS as a hard dependency in private-first designs and DR choices aligned to real RTO/RPO needs.

Managed identities and least-privilege automation across Azure platforms

Azure Red Hat OpenShift (ARO) reached an identity/governance milestone: Azure Managed Identities (for platform/operators) and Microsoft Entra Workload Identity (for pods/apps) are now GA. It fits last week's identity/governance thread by moving ARO components off one broadly permissioned service principal and onto multiple user-assigned managed identities, each mapped to dedicated built-in ARO roles so RBAC can be scoped per component. Provisioning is supported via portal (including an all-in-one flow), ARM/Bicep, and az aro CLI (Azure CLI 2.84.0+). For workloads, Workload Identity uses OIDC federation to map a Kubernetes service account to a user-assigned managed identity so pods can request short-lived tokens for narrowly scoped access (Key Vault, Storage, Azure SQL) without in-cluster secrets. A key transition note remains: legacy service-principal clusters cannot migrate in-place. You need a new managed-identity cluster and then migrate workloads. The same move away from secrets shows up in App Service CI/CD guidance. A walkthrough shows deploying to Azure Web Apps from Azure DevOps using an ARM service connection authenticated via a user-assigned managed identity (UAMI). The flow is simple: assign Website Contributor on the app (or resource group), create the service connection with UAMI auth, and verify who deployed via App Service audit logs (AppServiceAuditLogs), where the initiator is the UAMI object ID. It also notes that setup-time interactive sign-in may appear in logs, so “who set it up” can differ from “which identity deployed.” Hybrid onboarding got a similar least-privilege update with a new Azure Arc onboarding role for Ansible. It matches last week's “standardize the baseline” message by letting teams onboard Arc-enabled servers through existing Ansible playbooks with more tightly scoped permissions, which helps with fleet-scale onboarding.

Kubernetes and App Service operations: tracing, alert automation, and simpler deployments

AKS troubleshooting got a service-mesh tracing pattern that reinforces last week's “constraints plus observability” message. The guide correlates Istio/Envoy access logs with Application Insights using W3C Trace Context. It enables Istio Telemetry in the managed Istio system namespace, then uses EnvoyFilter resources (inbound/outbound) to emit structured JSON logs to stdout so they land in Log Analytics (ContainerLogV2). The practical win is correlation: parse the traceparent header from Envoy logs, extract the 32-hex trace id, and match it to App Insights operation_Id via KQL to follow one request across mesh hops and app telemetry without adding another tracing stack. For incident response, Azure SRE Agent's Azure Monitor integration continued last week's day-two focus. It ingests fired alerts through the Azure Alerts Management REST API (“Get all alerts”), handles multiple alert types, and routes matches into Incident Response Plans in “review” or “autonomous” mode. The demo (AKS Node.js app failing Redis auth due to a bad secret) shows polling every 60s, acknowledging alerts, investigating logs, fixing the secret by retrieving the right Redis key, rolling pods, and resolving. A key nuance is merging: repeated firings from the same rule merge into one active thread (7-day window) rather than creating new incidents, which interacts with whether a rule uses auto-resolve. The advice to keep auto-resolve OFF for persistent failures (bad creds, CrashLoopBackOff-like issues) matters if you want one investigation per ongoing problem. App Service got two developer-facing improvements that reduce friction at opposite ends of the workflow, building on last week's hosting direction. PHP 8.5 is now supported on App Service for Linux in all public regions, including better fatal error backtraces and language features like the pipe operator (|>). For quick deployment, App Service for Linux now offers a simpler zip deployment flow in Kudu/SCM: drag-and-drop a zip, preview contents, choose whether to run a server-side build, and track Upload/Build/Deployment progress with logs and runtime log links. It targets quick tests and initial setup, while still pointing teams to CI/CD for repeatability.

Reliability and private networking: DNS pitfalls, multi-region choices, and Key Vault continuity

A hub-spoke incident write-up reinforced a point that extends last week's private networking items: DNS is Tier-0, especially with custom hub DNS, centralized egress (firewall/NVA), Private Endpoints (OpenAI, AI Search, Key Vault, Storage), and Azure Container Apps (ACA) with internal ingress. The symptoms looked like inconsistent platform behavior (ACA startup/scaling issues, Terraform stalls, endpoint reachability gaps, secret/auth problems) but came down to DNS gaps: custom resolvers not forwarding to 168.63.129.16, missing conditional forwarders for privatelink zones (privatelink.openai.azure.com, privatelink.vaultcore.azure.net, privatelink.search.windows.net, privatelink.blob.core.windows.net), incomplete private DNS zone links across VNets/subscriptions, and missed ACA internal ingress DNS needs (private DNS zone for the environment domain plus wildcard/apex records pointing to the static IP). As with last week's DNS reconciliation pattern, remediation required no application changes, only consistent zone management and correct forwarding/linking. Azure multi-region resilience guidance continues last week's reliability framing by focusing on matching patterns to requirements: Availability Zones for in-region HA, regional BCDR across paired or non-paired regions based on capacity/latency/residency, and active/active when you can handle the operational complexity. It reinforces that regional failover is often customer-orchestrated, so testing failover/failback and aligning data replication/consistency are still on you. It distinguishes Azure Site Recovery (VM mobility/failover) from Azure Backup (restore-based protection) and points to the Resiliency agent in Azure Copilot (preview) as emerging tooling for coverage gaps and drills. Key Vault continuity got a practical warning that fits last week's “do not assume day-2 is handled” theme. Paired-region replication improves survivability but does not guarantee continuity for applications that need writes during outages. During Microsoft-managed regional failover, the vault becomes read-only (reads and crypto ops continue, but writes/updates/rotation/cert renewal stop). If you need rotation and deterministic failover/testing, the recommended pattern is multiple independent vaults across regions with customer-managed replication and failover, supported by Terraform reference architectures including event-based sync (variants for private-endpoint-only regulated environments vs simpler public setups).

Other Azure News

Provisioning workflows continue to standardize around repeatability, echoing last week's deterministic dev/test and predictable rollout theme. An Azure Developer CLI guide walks through the azd loop (azd init, azd auth login, azd up, azd show, azd down --force --purge) and the project structure (azure.yaml, infra/, .azure/) so teams can move from local code to reproducible deployments without portal-driven glue.