Weekly Azure Roundup: Resilient Routing and Fabric-First Data Moves

This week's Azure story split into two threads: keeping platforms resilient as infrastructure evolves (edge routing, registries, ingress, DR, monitoring, hybrid networking), and modernizing data estates into Fabric/OneLake, where migration assistants, governance, and real-time pipelines are becoming standard building blocks. Both threads continue last week's “controlled transitions” framing: change traffic layers, registry behavior, or data platforms in phases, with clearer signals and fewer surprise support boundaries.

AKS ingress modernization: moving off Ingress NGINX to Application Gateway for Containers (AGC)

Ingress NGINX now has explicit support boundaries: upstream ingress-nginx is best-effort until March 2026, then stops shipping releases, bug fixes, and security patches. AKS Application Routing gets only critical security patches until November 2026 (no new features or general fixes). Azure's recommended replacement is Application Gateway for Containers (AGC), a managed L7 load balancer for AKS (GA late 2024) with WAF support added November 2025.

Architecturally, AGC moves the traffic engine out of the cluster: the data plane/control resource lives in Azure (frontends, delegated subnet association, auto-generated FQDNs), while an in-cluster ALB Controller watches Gateway API objects (Gateway/HTTPRoute) and AGC policy CRDs and programs the Azure-side gateway. Migration guidance emphasizes phased adoption: AGC supports both Gateway API (preferred) and the legacy Ingress API, enabling incremental cutover while aiming to land on Gateway/HTTPRoute. The guidance outlines two operating models: “Bring Your Own” (platform teams manage the Azure AGC resource with IaC and configure the controller with a fixed resource ID) and controller-managed provisioning (a Kubernetes ApplicationLoadBalancer CR drives the Azure-side lifecycle), which is simpler to start with but more coupled to cluster objects.

Some teams will hit prerequisites: AGC requires Azure CNI or Azure CNI Overlay, so Kubenet clusters may need a network plugin migration, and workload identity must be enabled for controller auth. This mirrors last week's “dependency-first” sequencing: networking and identity often need to be correct before higher-level platform changes behave predictably. Benefits are practical: less node resource use (data plane outside the cluster), fewer ingress proxy patch chores, faster convergence during scale events by routing to pod IPs via EndpointSlice, and integrated WAF that may simplify “WAF in front of AKS” designs.

Microsoft's AGC Migration Utility (Jan 2026) is positioned as the first step: run it in “files mode” (manifests directory) or “cluster mode” (reads live Ingress), generate Gateway API YAML plus a coverage report, and use the report categories (completed/warning/not-supported/error) to find NGINX annotation dependencies that need redesign. The recommended flow: install/configure AGC + ALB Controller, generate resources for BYO (AGC resource ID) or managed (subnet ID), validate in non-prod, run NGINX and AGC in parallel, cut DNS over to the AGC frontend FQDN, then decommission the old controller to avoid multiple reconcilers. The guidance is explicit about what cannot be automated: NGINX snippets and Lua do not translate cleanly, TLS and DNS cutover require manual validation/updates, and GitOps pipelines may need refactoring because kinds and manifests change when moving from Ingress to Gateway API.
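To make the Ingress-to-Gateway API shift concrete, here is a minimal Python sketch of the kind of translation involved. It is not the AGC Migration Utility and does not reproduce its output; the function name, the report wording, and the gateway name are hypothetical. It assumes a simple Ingress manifest loaded as a dict, emits one HTTPRoute per Ingress rule (so hostnames stay scoped to their own paths), and flags NGINX annotations and non-portable path types for manual review instead of guessing.

```python
from typing import Any

# Ingress pathType -> Gateway API path match type.
# ImplementationSpecific has no portable equivalent, so it is reported instead.
PATH_TYPES = {"Prefix": "PathPrefix", "Exact": "Exact"}
NGINX_PREFIX = "nginx.ingress.kubernetes.io/"


def ingress_to_httproutes(
    ingress: dict[str, Any], gateway_name: str
) -> tuple[list[dict[str, Any]], list[str]]:
    """Return (httproutes, report) for a simple Ingress manifest dict."""
    meta = ingress.get("metadata", {})
    name = meta.get("name", "ingress")
    namespace = meta.get("namespace", "default")
    report: list[str] = []

    # Controller-specific annotations (rewrites, snippets, ...) need redesign.
    for key in sorted(meta.get("annotations") or {}):
        if key.startswith(NGINX_PREFIX):
            report.append(f"not-supported: {key} needs manual redesign")

    routes: list[dict[str, Any]] = []
    for index, rule in enumerate(ingress.get("spec", {}).get("rules", [])):
        route_rules = []
        for path in rule.get("http", {}).get("paths", []):
            match_type = PATH_TYPES.get(path.get("pathType"))
            service = path.get("backend", {}).get("service", {})
            port = service.get("port", {}).get("number")  # named ports are skipped
            if match_type is None or port is None:
                report.append(f"warning: path {path.get('path')!r} skipped")
                continue
            route_rules.append({
                "matches": [{"path": {"type": match_type, "value": path.get("path", "/")}}],
                "backendRefs": [{"name": service["name"], "port": port}],
            })
        if not route_rules:
            continue
        spec: dict[str, Any] = {"parentRefs": [{"name": gateway_name}], "rules": route_rules}
        if rule.get("host"):
            spec["hostnames"] = [rule["host"]]
        routes.append({
            "apiVersion": "gateway.networking.k8s.io/v1",
            "kind": "HTTPRoute",
            "metadata": {"name": f"{name}-{index}", "namespace": namespace},
            "spec": spec,
        })
    return routes, report
```

In practice the report lines are the interesting output: anything flagged there is a candidate for the manual redesign work the guidance calls out (snippets, Lua, rewrites), while the generated routes are only a starting point for non-prod validation.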

Edge and routing resiliency: Azure Front Door patterns and Front Door’s faster RTO work

October 2025's Front Door incidents are shaping Azure guidance: treat “global edge routing is unavailable” as a real failure mode rather than planning only for origin-region outages. This aligns with last week's day-2 readiness theme: design for diagnosability and runbook-driven failover when the control plane is unhealthy.

Field lessons describe DNS-steered fallback patterns where Azure Traffic Manager in “Always Serve” mode is the escape hatch when the AFD control plane or DNS resolution fails. A common trade-off is WAF consistency during failover: if policy must remain consistent, teams often keep a regional Application Gateway (WAF) path ready, sometimes requiring a runbook step to switch an AppGW IP config to public because Traffic Manager targets must be public. For stricter uptime targets (and heavy CDN dependence), a multi-CDN pattern (AFD + Akamai, for example) reduces single-provider edge dependency, with advice like keeping a small steady-state split (for example, 90/10) to keep caches warm and avoid cache-miss storms during cutover.

Front Door engineering updates also aim to reduce recovery time after edge restarts. Part 2 focuses on bounding RTO at scale by changing config recovery: durable per-tenant translated config artifacts (FlatBuffers memory-mapped from disk) persist across restarts; per-tenant validation occurs on load; and only failing tenant entries are evicted and retranslated instead of invalidating everything. The second lever is scaling recovery by active tenants using ML-informed lazy loading: workers preload predicted warm tenants per edge site while keeping hostmaps for correctness, so cold tenants load on first request. Timelines are concrete: config propagation reduced from ~45 minutes to ~20 minutes (target ~15 minutes by end of April 2026), RTO targeting <10 minutes worst case by April 2026, and a “micro-cellular” tenant isolation redesign targeted for June 2026.

For critical services behind AFD, the takeaway is to (1) implement tested DNS failover runbooks and plan for cache-miss surges, and (2) track Front Door recovery behavior so SLO expectations match platform reality.
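The config-recovery idea described above (persisted per-tenant artifacts, validation on load, evicting only failing entries, and lazy loading of cold tenants) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Front Door's implementation: the class and field names are hypothetical, a dict stands in for memory-mapped files on disk, a SHA-256 checksum stands in for whatever per-tenant validation the platform actually performs, and hostmaps are omitted.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Artifact:
    payload: bytes   # translated config as persisted across restarts
    checksum: str    # SHA-256 hex digest recorded when the artifact was written


def is_valid(artifact: Artifact) -> bool:
    """Assumed validation: the payload still matches its recorded checksum."""
    return hashlib.sha256(artifact.payload).hexdigest() == artifact.checksum


@dataclass
class EdgeConfigCache:
    disk: dict[str, Artifact]              # stand-in for on-disk artifacts
    translate: Callable[[str], bytes]      # slow path: rebuild one tenant's config
    loaded: dict[str, bytes] = field(default_factory=dict)
    retranslated: list[str] = field(default_factory=list)

    def _load(self, tenant: str) -> bytes:
        artifact = self.disk.get(tenant)
        if artifact is None or not is_valid(artifact):
            # Evict and rebuild only this tenant; every other entry is untouched.
            payload = self.translate(tenant)
            artifact = Artifact(payload, hashlib.sha256(payload).hexdigest())
            self.disk[tenant] = artifact
            self.retranslated.append(tenant)
        self.loaded[tenant] = artifact.payload
        return artifact.payload

    def recover(self, predicted_warm: list[str]) -> None:
        """After a restart, preload only the tenants predicted to be active."""
        for tenant in predicted_warm:
            self._load(tenant)

    def get(self, tenant: str) -> bytes:
        """Cold tenants are loaded lazily on their first request."""
        if tenant in self.loaded:
            return self.loaded[tenant]
        return self._load(tenant)
```

The design point the sketch tries to show is that recovery cost scales with the number of warm or failing tenants rather than with the total tenant count, which is what bounds RTO as the tenant population grows.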

Container supply chain reliability: health-aware ACR geo-replication failover

This continues last week's ACR thread: after adding deeper health monitoring and Service Health communication so teams can correlate CI/CD and pull failures, ACR is now using that signal to change routing during regional issues. ACR geo-replication now fails over based on whether a region can serve end-to-end registry operations, not just whether a proxy returns 200 OK.

Previously, the global endpoint (for example, contoso.azurecr.io) relied on Traffic Manager performance routing with shallow probes, which could keep sending traffic to a “green” reverse proxy while dependencies (storage, auth, caching, metadata) were degraded, causing pulls and pushes to return 500s despite green probes. ACR now wires Health Monitor into Traffic Manager via a deep health endpoint that rolls up dependency health and marks replicas unhealthy when they cannot serve real registry requests. Importantly for operations, this is evaluated per registry, not as a blanket region flag. Health Monitor checks the registry's actual backing resources in that region (including feature-specific dependencies like metadata search), so one registry may reroute while another stays local.

For DevOps teams, the change is transparent (no hostname/config changes), but behavior is still DNS-based and measured in minutes: probe cadence (~30s), failure thresholds, TTLs, and client DNS caching all matter. It reinforces two active-active realities: replication is eventual (so immediate cross-region pulls after a push can return not-found, and retries help), and pushes during failover can fail mid-operation if DNS re-resolution occurs, so publish steps should be idempotent and retryable. The practical result is fewer manual interventions (like az acr replication update) during regional issues and fewer “it's up but pulls fail” incidents.
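The “idempotent and retryable” advice above can be sketched as a small retry wrapper with exponential backoff and jitter. This is a generic pattern, not an ACR SDK or CLI feature: the exception class and function names are hypothetical, and the caller is assumed to map retryable conditions (5xx responses, not-found shortly after a push, a connection reset during DNS re-resolution) onto that exception and to pass an operation that is safe to run more than once.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientRegistryError(Exception):
    """Hypothetical marker for failures worth retrying (5xx, early not-found, resets)."""


def with_retries(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run an idempotent operation, retrying transient failures with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientRegistryError:
            if attempt == attempts:
                raise  # budget exhausted: surface the last failure
            # Exponential backoff capped at max_delay, with full jitter so many
            # CI jobs do not retry in lockstep while DNS is still converging.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Because failover is DNS-based and measured in minutes, the total retry budget (attempts and max_delay) likely needs to span probe thresholds and TTLs rather than a few seconds; the defaults here are placeholders, not recommendations.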

Azure compute for GPU workloads: NCv6 refresh and GA transition

Azure's NCv6 GPU VM family is moving from preview toward GA in the coming weeks, shifting to SLA-backed production readiness and changing the available sizes. NCv6 uses NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs with Intel Xeon 6 “Granite Rapids” 6900P-series CPUs, targeting graphics/VDI, CAD/CAE, and generative AI inference. The refreshed lineup expands from three preview sizes to seven sizes across two sub-families (General Purpose, Compute Optimized) and adds fractional GPU options (1/2 and 1/4 GPU). Those fractional SKUs help right-size inference endpoints and interactive visualization where full GPUs are unnecessary, scaling down vCPU, memory, temp disk/NVMe, and networking accordingly. In the context of last week's guidance on placing inference where networking and isolation allow it (including isolated AKS patterns and the managed GPU node pools preview), smaller shapes can also help place GPUs closer to constrained workloads without full-GPU cost. Some top-end shapes increase vCPU counts (for example, 288 vs 256) to align with the CPU architecture and improve high-end VDI. GA regions start with West US 2 and Southeast Asia, with more planned across Q3 2026 (East US, West Europe, East US 2, North Europe, South Central US, Germany West Central, West US, Korea Central). Teams testing preview sizes should plan for shape changes as new sizes replace prior preview offerings.

Microsoft Fabric + OneLake modernization: migrations, governance, real-time pipelines, and platform controls

Fabric updates this week focused on making modernization repeatable: assess-first migration assistants, safer cutover defaults, expanded governance/security, and more ready-to-use building blocks for real-time and monitoring. This mirrors last week's migration posture (move in slices, parallel validation, controlled cutovers) with more platform defaults that support phased transitions.

Migration assistants expanded across common Azure data-estate entry points. For Azure Data Factory and Synapse, public-preview assistants assess compatibility, convert pipelines into Fabric Data Factory equivalents (including linked services → Fabric connections), and intentionally disable triggers after migration so teams can validate without accidentally starting schedules/events. Synapse Spark migrations similarly move artifacts (Spark pools, notebooks, jobs) into Fabric Data Engineering and map lake databases via OneLake shortcuts without moving data, so you can validate in parallel before cutover. On SQL, Fabric is pushing guided in-portal migration for SQL Server into SQL database in Fabric (preview), combining schema checks, remediation guidance (including Copilot help), and data copy in one flow. The broader roundup also reiterates DACPAC schema import and compatibility improvements (expanded compatibility levels, more T-SQL, full-text search, more ALTER DATABASE) to reduce application changes.

OneLake is becoming the connect-without-copying layer. Shortcuts and mirroring expand to more sources (SharePoint Lists preview, Azure Monitor mirroring via shortcuts preview, Dremio preview), while mirroring is GA for Oracle and SAP Datasphere. Shortcut transformations are GA (including conversion to Delta), and there is a preview Excel-to-Delta transformation to reduce notebook glue code.

Governance and security are expanding alongside: workspace-level IP firewall rules are GA, and Outbound Access Protection (OAP) is GA across more items (including shortcuts and mirrored databases). OneLake Security is expected to reach GA soon with a unified permission model (including RLS/CLS) intended to follow data across Spark, Power BI, and Fabric agent experiences, with APIs planned so third-party engines can integrate with OneLake enforcement instead of rebuilding auth.

For real-time patterns, Fabric Eventstreams adds DeltaFlow (preview) to turn Debezium-style CDC feeds from operational databases (Azure SQL DB/MI, SQL Server on Azure VMs, PostgreSQL) into analytics-ready streaming tables. Eventstreams can handle schema registration, flatten CDC payloads into table-shaped outputs, enrich with CDC metadata, manage schema evolution, and auto-create/update destination tables. This helps teams route to Eventhouse and query with KQL without maintaining custom CDC connectors and transforms.

Visualization also advanced: Maps GA in Real-Time Intelligence adds ontology-based modeling through Fabric IQ, new geospatial connections (Planetary Computer Pro imagery, WMS/WMTS raster), and scheduled refresh for vector tile sets. Eventhouse gets built-in workspace monitoring dashboard templates so teams can start with prebuilt KQL and visuals for Eventhouse and Power BI semantic model operations (DirectQuery, XMLA), then customize.

Fabric Data Factory's FabCon recap delivered more platform controls: OAP to restrict pipeline/dataflow/copy destinations; Key Vault integration for the VNet Gateway GA; on-prem gateway auto-updates; broader identity support (GA service principal + workspace identity across activities); and Copy Job improvements (connectors, incremental copy options including more watermark types and query extraction, truncate-on-full-copy to avoid duplicates). Two developer-facing additions stand out: Copy Job audit columns (row-level metadata like extraction time, job/run IDs, and incremental window bounds for lineage/compliance), and a Fabric Data Factory MCP Server (preview) exposing Data Factory operations (pipelines, dataflows, Power Query M authoring and execution, gateway discovery/health) to MCP-compatible tools like GitHub Copilot and Claude, continuing last week's agent-assisted ops thread.

Warehouse updates include Custom SQL Pools (preview) to isolate compute by allocating multiple pools as percentages and routing workloads via Application Name or regex, which helps when ad-hoc analysis competes with reporting and ETL. The SQL engine continues reducing ingestion-related slowdowns via proactive and incremental stats refresh, moving stats updates to background policies and updating histograms incrementally for large append-heavy tables to avoid compile-time stalls; Microsoft says most workspaces saw compilation-time stats updates cut in half by March 2026.

Finally, Fabric Extensibility self-service workload publishing is GA for ISVs: upload/validate packages and share with up to 20 customer tenants pre-certification, with name reservation and automated manifest/security validation to tighten pilot-to-Workload-Hub paths.
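To illustrate the DeltaFlow behavior described in this section (flattening Debezium-style CDC payloads into table-shaped rows enriched with CDC metadata), here is a minimal Python sketch of what that transform does conceptually. It is not Fabric's implementation or API. It assumes the standard Debezium change-event envelope (before, after, source, op, ts_ms, optionally wrapped in a payload field), and the function names and `_cdc_*` column names are hypothetical.

```python
from typing import Any, Optional

# Debezium operation codes mapped to readable labels.
OPS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def flatten_change_event(event: Optional[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Turn one Debezium-style change event into a flat, table-shaped row."""
    if not event:
        return None  # tombstone or empty message
    # With schemas enabled, the envelope is nested under "payload".
    payload = event.get("payload", event)
    op = payload.get("op")
    if op not in OPS:
        return None  # unknown or unsupported operation
    # Deletes carry the last known row in "before"; other ops use "after".
    row = payload.get("before") if op == "d" else payload.get("after")
    if row is None:
        return None
    source = payload.get("source") or {}
    return {
        **row,
        "_cdc_op": OPS[op],                       # hypothetical metadata columns
        "_cdc_ts_ms": payload.get("ts_ms"),
        "_cdc_source_table": source.get("table"),
    }


def evolve_columns(columns: list[str], row: dict[str, Any]) -> list[str]:
    """Naive schema evolution: append columns not seen before, keep the order."""
    return columns + [name for name in row if name not in columns]
```

A managed transform also has to handle type changes, dropped columns, and ordering guarantees, which this sketch ignores; the point is only to show why flattened rows plus operation metadata are easier to land in a destination table than raw envelopes.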

Other Azure News

Platform migrations showed up in a practical “here is the mapping and CLI” form, building on last week's Heroku exit guidance (App Service for rehost, Container Apps for container-native ops). This week's post is the Container Apps “show your work” version, including the kinds of failure modes (provider registration, secret ordering) that tend to show up during phased cutovers.