Weekly Azure Roundup: AKS Ops, DNS Reliability, and Fabric CI/CD

Apr 20, 2026 by TechHub

Azure updates this week leaned into operational work: new ingress, backups, and incident-response building blocks for Kubernetes; deeper looks at private DNS and packet visibility; and Fabric progress on migration gaps plus automation hooks. The theme was reducing toil through standard workflows (one-command setups, self-updating CLIs, policy remediation) and more evidence-based troubleshooting and cutovers. It continues last week's “day-two readiness” thread: fewer brittle secrets and manual steps, more controlled transitions (ingress migration clocks, log ingestion deprecations), and clearer acknowledgement that DNS and telemetry wiring often decide reliability.

This Week's Overview

AKS platform operations: ingress shifts, backups, and AI-assisted investigations

AKS ingress is moving toward Kubernetes Gateway API, with the AKS App Routing add-on as the entry point. Building on last week's AKS/Istio direction (mesh-aware tracing, evidence-based troubleshooting), this preview reuses an Istio-managed Envoy gateway stack in a gateway-only shape: “App Routing Gateway API” replaces the NGINX Ingress API path with an Istio-managed Envoy gateway, explicitly not a full mesh (no sidecars, no workload Istio CRDs). The platform model is simpler: creating a Gateway auto-provisions the Envoy deployment plus a LoadBalancer service, HPA (2-5 replicas, 80% CPU default), and PDB. The split between GatewayClass/Gateway and HTTPRoute also lets platform teams own gateway infrastructure while app teams own routes, which reduces shared-Ingress contention.

The preview is framed against deadlines: Ingress NGINX retirement is March 2026, with extended support for NGINX-based AKS App Routing until November 2026. Migration guidance includes parallel controllers, manifest conversion via kubernetes-sigs/ingress2gateway v1.0.0, and careful cutover steps (Gateway “programmed” condition, Host header tests to the new external IP before DNS, lowering DNS TTL early, keeping old ingress 24-48 hours for rollback). The preview is not feature-complete: current App Routing DNS/TLS automation (Azure DNS + Key Vault cert integration) is not available yet in Gateway API mode, so teams need manual TLS/DNS or alternatives like ExternalDNS for Gateway API. That gap matters given last week's “DNS is Tier-0” warning: moving ingress is often easier than moving DNS and TLS plumbing. There is also a strategic constraint: this mode is mutually exclusive with the AKS Istio service mesh add-on, so clusters choose “gateway-only” or full mesh.

AKS backup enablement also gained a more automation-friendly entry point, consistent with last week's emphasis on repeatable baselines. A single command, az dataprotection enable-backup trigger --datasource-type AzureKubernetesService --datasource-id <aks-arm-id>, orchestrates validation and setup: backup RG selection/creation, AKS Backup Extension install, storage account provision/reuse, vault/policy provision/reuse, Trusted Access config, and backup instance creation. Presets (Week/Month/DisasterRecovery/Custom) standardize retention defaults while still supporting enterprise wiring via JSON config (existing vault/policy IDs, tags, RG control).

For on-call work, AKS networking troubleshooting is getting more “agentic” but remains evidence-driven. Following last week's “observe first, automate safely” theme, the Container Network Insights Agent (public preview) correlates signals across CoreDNS, service routing, NetworkPolicy/CiliumNetworkPolicy, Cilium/Hubble flows, and host kernel telemetry (ring buffers, packet counters, SoftIRQ distribution, socket buffer utilization). It integrates through the AKS MCP server to run diagnostics via kubectl, Cilium, and Hubble within defined boundaries, producing an auditable report tied to pass/fail evidence. It is advisory-only (no changes), uses read-only/minimal RBAC, and may deploy a temporary debug DaemonSet for host visibility. Preview regions are limited (Central US, East US, East US 2, UK South, West US 2), and full capability requires Cilium plus Advanced Container Networking Services. Customers also bring their own Azure OpenAI resource for model configuration and residency control.

Finally, AKS migration guidance reiterated a reliability point consistent with last week's broader framing: “it deployed” is not the same as “ready for cutover.” Guidance focuses on what breaks under real traffic (memory limits, region configuration gaps, broken bindings to messaging/ingestion, stale endpoint mappings) and treats cutover as a coordinated dependency transition across compute, networking, storage, messaging, analytics connectivity, and background jobs. It also stresses rehearsing DR/failback mechanics (not only DNS reversal) and running smoke tests that exercise real integrations and background workloads under production-like constraints.
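
The ingress cutover steps mentioned earlier in this section (wait for the Gateway “programmed” condition, then send Host header tests to the new external IP before changing DNS) can be sketched as a small pre-flight check. This is a minimal sketch, assuming the Gateway object has been fetched as JSON (for example with kubectl get gateway <name> -o json) and follows the upstream Gateway API status shape (status.conditions with a Programmed type, status.addresses with a value). The function names, the sample object, and the hostname are illustrative and not part of the AKS preview.

```python
from typing import Optional


def gateway_is_programmed(gateway: dict) -> bool:
    """Return True when the Gateway reports a Programmed=True condition."""
    conditions = gateway.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Programmed" and c.get("status") == "True"
        for c in conditions
    )


def first_gateway_address(gateway: dict) -> Optional[str]:
    """Return the first address the Gateway reports, or None if none yet."""
    addresses = gateway.get("status", {}).get("addresses", [])
    return addresses[0].get("value") if addresses else None


def host_header_probe(gateway: dict, hostname: str) -> Optional[dict]:
    """Describe the pre-DNS smoke test: target the new IP, send the real Host.

    Returns None while the Gateway is not ready, so a caller does not start
    cutover testing too early.
    """
    address = first_gateway_address(gateway)
    if address is None or not gateway_is_programmed(gateway):
        return None
    return {"url": f"http://{address}/", "headers": {"Host": hostname}}


# Hypothetical Gateway status, trimmed to the fields used above.
sample_gateway = {
    "status": {
        "addresses": [{"type": "IPAddress", "value": "203.0.113.10"}],
        "conditions": [
            {"type": "Accepted", "status": "True"},
            {"type": "Programmed", "status": "True"},
        ],
    }
}

probe = host_header_probe(sample_gateway, "app.example.com")
```

The probe description can be handed to any HTTP client; keeping the readiness check separate from the request makes it easy to run the same gate in CI before lowering DNS TTLs.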

Observability and incident response automation: SRE Agent connectors and log ingestion migrations

Azure SRE Agent got a usability boost for investigations, extending last week's “Azure Monitor in Azure SRE Agent” story from alert ingestion and merging into faster evidence gathering. New first-party connectors for Log Analytics and Application Insights let the agent run KQL directly via MCP-backed tools instead of shelling out to az monitor. Setup also handles RBAC when saving the connector (Log Analytics Reader + Monitoring Reader at the target RG), and queries use native monitor-namespace MCP tools like monitor_workspace_log_query, monitor_resource_log_query, plus discovery helpers. The model stays read-only (no changes to retention/settings) and can use different managed identities per connector, continuing last week's move away from over-permissioned automation identities.

Azure Monitor log ingestion is also moving off the legacy HTTP Data Collector API, similar to this week's ingress retirement clock. With deprecation set for September 2026, the outlined migration path is moving Logic Apps from the Data Collector connector (workspace ID/key) to an HTTP action calling Logs Ingestion API, backed by DCEs and DCRs. Practical issues are already showing up: Logic Apps can “succeed” while data does not land in new custom tables, and new Data Collector API connections may fail with 403. The new pattern includes schema definition via sample upload (JSON array), optional TimeGenerated via DCR transformation, ingestion URL built from DCE base + DCR immutable ID + stream name, and assigning the Logic App managed identity Monitoring Metrics Publisher on the DCR. Success returns 204, which is useful for validating pipelines.
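
The ingestion URL pattern described above (DCE base + DCR immutable ID + stream name) can be sketched as follows. This is a minimal sketch under stated assumptions: the api-version value, the endpoint hostname, the DCR ID, and the stream name are placeholders chosen for illustration and should be checked against the actual DCE/DCR in use. It only builds the request pieces and does not call the API.

```python
import json

# Assumption: verify the current api-version against the Logs Ingestion docs.
API_VERSION = "2023-01-01"


def build_ingestion_url(dce_base: str, dcr_immutable_id: str, stream_name: str) -> str:
    """Compose the Logs Ingestion URL from DCE base, DCR immutable ID, stream."""
    return (
        f"{dce_base.rstrip('/')}/dataCollectionRules/{dcr_immutable_id}"
        f"/streams/{stream_name}?api-version={API_VERSION}"
    )


def build_payload(records: list) -> str:
    """Serialize records as a JSON array, even when there is a single record."""
    if not isinstance(records, list):
        raise TypeError("records must be a list so the body is a JSON array")
    return json.dumps(records)


def request_accepted(status_code: int) -> bool:
    """The roundup notes that a successful call returns 204."""
    return status_code == 204


# Hypothetical values for illustration only.
url = build_ingestion_url(
    "https://my-dce-abcd.eastus-1.ingest.monitor.azure.com/",
    "dcr-00000000000000000000000000000000",
    "Custom-AppEvents_CL",
)
body = build_payload([{"Message": "order created", "Level": "Info"}])
```

In a Logic App the same URL and body would go into the HTTP action, with the managed identity providing the token. Because a workflow run can report success while rows are missing, checking for 204 on the HTTP action and then querying the custom table is a more reliable validation than the run status alone.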

Azure networking and hybrid connectivity: Private DNS fallback, ExpressRoute to MVE, and packet mirroring

Hybrid and Private Link-heavy designs keep hitting the same DNS failure mode, continuing last week's hub-spoke postmortem: a linked private DNS zone returns authoritative NXDOMAIN for a Private Link name when the needed record is missing (DR failovers, partial replication, cross-boundary layouts, multi-region). The fix highlighted is enabling resolutionPolicy = NxDomainRedirect on the private DNS zone's VNet link (portal: “Enable fallback to internet”). Azure DNS then retries public recursive resolution only when the private zone returns NXDOMAIN, letting apps resolve the public endpoint again when it exists and is reachable. This is a scoped resolution change (not access), but it can prevent partial DNS inconsistency from turning into an outage, especially when public fallback is part of the intended DR posture.

A connectivity walkthrough covers wiring Azure ExpressRoute into Megaport Virtual Edge (MVE) with a Cisco 8000v NVA. It is configuration-focused: two VXCs (primary/secondary), distinct VLAN IDs per path, matching VLAN between Megaport and ExpressRoute Private Peering, /30s per path, and Cisco IOS subinterfaces with encapsulation dot1Q <vlan> plus eBGP neighbor configuration (example Azure ASN 12076). It also highlights validation steps (ICMP to Azure peer IPs) and common troubleshooting (ARP issues).

Azure Virtual Network TAP (VTAP) also got attention as an option when flow logs are not enough, complementing last week's observability guidance. In public preview in select regions, VTAP mirrors full traffic (including payload) for selected NICs and sends it to a collector using VXLAN over UDP 4789. The demo shows Wireshark decoding encapsulated flows and notes that the destination NIC can be in the same or a peered VNet, which can help centralize inspection tooling away from application subnets.
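
For readers building their own collector rather than relying on Wireshark, the VXLAN encapsulation that VTAP uses can be unwrapped with a few lines of code. The following is a minimal sketch based on the standard 8-byte VXLAN header layout (flags, reserved bytes, 24-bit VNI, reserved byte); the sample bytes and VNI value are made up for illustration, and nothing here is specific to how VTAP assigns VNIs.

```python
from typing import NamedTuple

VXLAN_UDP_PORT = 4789      # destination port the mirrored traffic arrives on
VXLAN_HEADER_LEN = 8       # flags(1) + reserved(3) + VNI(3) + reserved(1)
VNI_VALID_FLAG = 0x08      # "I" bit: the VNI field carries a valid value


class VxlanPacket(NamedTuple):
    vni: int
    inner_frame: bytes


def decode_vxlan(udp_payload: bytes) -> VxlanPacket:
    """Split a UDP payload into its VXLAN network identifier and inner frame."""
    if len(udp_payload) < VXLAN_HEADER_LEN:
        raise ValueError("payload shorter than the 8-byte VXLAN header")
    flags = udp_payload[0]
    if not flags & VNI_VALID_FLAG:
        raise ValueError("VNI-valid flag is not set")
    vni = int.from_bytes(udp_payload[4:7], "big")
    return VxlanPacket(vni=vni, inner_frame=udp_payload[VXLAN_HEADER_LEN:])


# Hypothetical payload: VNI 100 followed by placeholder inner-frame bytes.
sample_payload = (
    bytes([VNI_VALID_FLAG, 0, 0, 0])
    + (100).to_bytes(3, "big")
    + b"\x00"
    + b"inner-ethernet-frame"
)
packet = decode_vxlan(sample_payload)
```

The inner frame is a complete Ethernet frame from the mirrored NIC, so it can be passed to any packet parser once the outer header is removed.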

Data services: Fabric schema safety, migration parity, and the next wave of PostgreSQL

Fabric's data plane continues closing gaps that show up in CI/CD and migrations, building on last week's themes around modernizing without rewrites and reducing glue code. A practical GA update in Fabric Data Warehouse is that some ALTER TABLE operations now work inside explicit transactions (BEGIN TRAN ... COMMIT). Previously, ALTER TABLE under snapshot isolation failed, which forced non-atomic schema changes and increased partial-deploy risk. Supported operations include adding nullable columns, dropping columns, adding/dropping NOT ENFORCED constraints (PK/UNIQUE/FK), multiple ALTER TABLE statements in one transaction, and altering distributed temp tables. Exclusions include adding non-nullable columns and ALTER COLUMN.

Fabric SQL Database improved migration compatibility with a preview: full support for Azure SQL Database collation sets at database creation time. It is configured in the creation payload (NewSqlDatabaseCreationPayload) via the Fabric REST API (and wrappers like Fabric CLI/PowerShell). This reduces surprises for multilingual and collation-sensitive workloads (ORDER BY, LIKE, equality, case/accent sensitivity), though it does not change collation for replicated data in the SQL analytics endpoint.

Fabric Data Factory guidance focused on when to move from Azure Data Factory and what changes for developers. The stance is incremental: ADF remains supported, but new work lands in Fabric Data Factory's SaaS authoring and workspace model. Differentiators include Fabric-native Mirroring into OneLake for low-latency replication (continuous inserts/updates/deletes) and Copy Jobs for config-first bulk and incremental movement (watermarking, CDC, built-in SCD Type 2). For pro-dev flows, managed Airflow Jobs and dbt Jobs are first-class alongside pipelines and Dataflows Gen2, with an AI integration thread via MCP (Copy Jobs exposed as MCP endpoints and the open-source microsoft/DataFactory.MCP server). This mirrors what is happening in operations (AKS, SRE Agent): standardized tool interfaces with guardrails and clearer auditability.

Managed PostgreSQL messaging also hints at a split between “run Postgres well today” and “what's next.” One video covers practical Azure Database for PostgreSQL mechanics (HA/failover, read replicas, backup/restore, elastic clusters) plus cost/perf notes like AMD SKUs. Another “sneak peek” introduces Azure HorizonDB, a managed PostgreSQL option aimed at very large scale with decoupled compute and storage, replica scaling over shared storage, and multi-zone commit behavior. It is also positioned as “AI-native,” with vector indexing and SQL AI functions plus Azure AI Foundry integration and VS Code-centric provisioning/query adaptation.
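
For the Fabric Data Warehouse change covered earlier in this section, a deployment script can now group several schema changes into one atomic batch. The following is a rough sketch of how a CI/CD step might assemble that batch; the table, column, and constraint names are hypothetical, and the pre-check is a simple string heuristic for the exclusions listed above (non-nullable column adds and ALTER COLUMN), not a T-SQL parser.

```python
# Heuristic markers for the exclusions named in the roundup.
UNSUPPORTED_MARKERS = ("ALTER COLUMN", "NOT NULL")


def build_schema_transaction(alter_statements: list) -> str:
    """Wrap ALTER TABLE statements in one explicit transaction.

    Raises ValueError if a statement looks like one of the excluded
    operations, so the deploy fails before anything is sent to the warehouse.
    """
    for statement in alter_statements:
        normalized = " ".join(statement.upper().split())
        for marker in UNSUPPORTED_MARKERS:
            if marker in normalized:
                raise ValueError(f"not supported in a transaction: {statement}")
    body = "\n".join(f"    {s.strip().rstrip(';')};" for s in alter_statements)
    return f"BEGIN TRAN;\n{body}\nCOMMIT;"


# Hypothetical schema changes for illustration.
script = build_schema_transaction([
    "ALTER TABLE dbo.Orders ADD discount_code VARCHAR(20) NULL",
    "ALTER TABLE dbo.Orders DROP COLUMN legacy_flag",
    "ALTER TABLE dbo.Orders ADD CONSTRAINT uq_orders_ref "
    "UNIQUE NONCLUSTERED (order_ref) NOT ENFORCED",
])
```

Either every statement in the batch commits or none does, which is the partial-deploy risk the update is meant to remove.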

App and messaging services: Web PubSub wildcard roles and Service Bus request/reply scaling

Azure Web PubSub expanded auth in a way that matters for high-cardinality group patterns and matches last week's least-privilege and identity direction. Wildcard group roles let backends grant permissions like webpubsub.joinLeaveGroups.room/* and webpubsub.sendToGroups.room/* instead of issuing tokens with long lists of per-group roles (for example, repeating webpubsub.sendToGroup.room123). This reduces token size and simplifies issuance logic for bots and monitoring systems that need broad access across dynamic group namespaces. Guidance is practical: keep literal roles for strict end-user isolation, and use wildcard roles when broad access is intentional and audited.

A Service Bus architecture write-up addressed a scaling trap in sync-over-async gateways: using Service Bus Sessions for request/reply correlation can create instance affinity because one gateway instance holds the session lock and must receive the reply. An alternative keeps gateways stateless by routing replies through a topic with SQL Filter subscriptions on a custom property like CorrelationId. Each request creates a dynamic subscription matching that value; the worker replies to a shared topic with the property; the broker delivers to the right gateway instance without session locks. The trade-off is managing dynamic subscriptions, but it is packaged as a Spring Boot starter for Java gateway teams. It also fits last week's broader Service Bus evolution: safer defaults and boundary controls, plus application patterns that avoid hidden scaling ceilings.
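
The filtered-subscription pattern is easier to follow with a toy model. The sketch below is an in-memory simulation only: it does not use the Service Bus SDK or the Spring Boot starter mentioned above, and the class and method names are invented for illustration. It shows the routing idea, where each request registers a subscription keyed on its CorrelationId and the "broker" delivers a reply only to the matching subscription.

```python
import uuid


class InMemoryReplyTopic:
    """Toy stand-in for a shared reply topic with per-request filters."""

    def __init__(self) -> None:
        # correlation_id -> replies delivered to that subscription
        self._subscriptions = {}

    def create_subscription(self, correlation_id: str) -> None:
        # A real subscription would carry a SQL filter on the custom
        # property, along the lines of: CorrelationId = '<value>'
        self._subscriptions[correlation_id] = []

    def publish(self, body: str, properties: dict) -> None:
        # Broker-side filtering: deliver only where the property matches.
        target = self._subscriptions.get(properties.get("CorrelationId"))
        if target is not None:
            target.append(body)

    def receive_and_delete(self, correlation_id: str) -> list:
        # Removing the subscription models the cleanup cost of the pattern.
        return self._subscriptions.pop(correlation_id, [])


topic = InMemoryReplyTopic()

# Two gateway instances each send a request and subscribe for its reply.
request_a = str(uuid.uuid4())
request_b = str(uuid.uuid4())
topic.create_subscription(request_a)
topic.create_subscription(request_b)

# A worker replies in a different order; no session lock is involved.
topic.publish("reply for B", {"CorrelationId": request_b})
topic.publish("reply for A", {"CorrelationId": request_a})

replies_a = topic.receive_and_delete(request_a)
replies_b = topic.receive_and_delete(request_b)
```

Because any gateway instance can create and read its own subscription, no instance has to hold state for another, at the cost of creating and deleting one subscription per in-flight request.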

Other Azure News

Azure Virtual Desktop updates mixed new support with production lessons. App attach now supports Windows Server 2025 and Windows Server 2022 session hosts, extending dynamic app delivery (MSIX/AppX/App-V) to server-based pools and helping reduce golden-image sprawl as App-V Server components approach end of support (April 2026). A real-world AVD deployment in the Perth Azure Extended Zone showed the engineering behind private-only, GPU-backed personal desktops (NVadsA10 v5): IaC-driven Azure Image Builder, Compute Gallery replication where builds stay in the parent region, and a one-time REST API step to associate a user-assigned managed identity to the gallery for Extended Zone replication when portal support lags. That managed identity detail matches last week's push toward clearer scoping and auditing for identities. It also included cost control via IMDS + Azure Automation webhook (“Stop My VDI”) so users can deallocate without portal access, paired with “Start VM on Connect” RBAC.

Operationally, there was a reminder that Azure Run Command lets you run commands across VM Scale Set instances without RDP/SSH, aligning with last week's “standardize the baseline, reduce snowflake access” theme. Constraints still apply (VM Agent ready, outbound 443 for results, 4096-byte output limit, one run at a time per VM, 90-minute max). VMSS mode matters: Uniform supports az vmss run-command invoke by instance ID, while Flexible typically requires iterating VMs and calling az vm run-command invoke.

Hybrid SQL governance automation appeared via an Azure Policy (DeployIfNotExists) pattern enforcing Arc-enabled SQL Server extension LicenseType (“Paid”, “PAYG”, “LicenseOnly”), plus scripts for assignments and remediation across management groups/subscriptions. This aligns with last week's Arc least-privilege onboarding and the broader move from tickets and tribal knowledge to repeatable policy. PAYG has a caveat: policy sets ConsentToRecurringPAYG, and once set it cannot be removed even if you switch away, so consent is effectively one-way.

Azure Developer CLI got a small but useful update consistent with last week's azd reproducibility theme: azd update (azd 1.23.x) updates regardless of install method (winget/Chocolatey/Homebrew/script), and supports switching stable vs daily via --channel.

The broader update feed and cost content emphasized operational planning: mentions included StandardV2 NAT Gateway for AKS outbound, Azure Monitor OpenTelemetry for AKS, Bastion MI graphical session recording, ASR NVMe controller support, storage security/tiering changes, and retirements (Azure Batch HBv2/HC/NP; Azure Managed Grafana Basic). Cost guidance reiterated that AI-heavy cost optimization needs continuous visibility, guardrails, rightsizing, and recurring reviews, consistent with last week's FinOps tone.

Fabric Eventhouse added a preview Capacity Scheduler for hourly minimum capacity baselines across a 7-day grid while keeping autoscale, to align predictable ingestion/query windows with cost control.
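
The Uniform versus Flexible distinction for Run Command, noted earlier in this section, can be captured in a small helper that emits the right CLI invocation per target. This is a minimal sketch that only builds argument lists and does not execute anything; the resource names are hypothetical, RunShellScript assumes Linux instances, and for Flexible mode the list of VM names is assumed to have been collected separately.

```python
def run_command_invocations(
    resource_group: str,
    vmss_name: str,
    orchestration_mode: str,
    targets: list,
    script: str,
) -> list:
    """Build one az CLI argument list per target.

    targets holds instance IDs for Uniform scale sets and VM names for
    Flexible scale sets.
    """
    mode = orchestration_mode.lower()
    if mode == "uniform":
        return [
            ["az", "vmss", "run-command", "invoke",
             "--resource-group", resource_group,
             "--name", vmss_name,
             "--instance-id", str(instance_id),
             "--command-id", "RunShellScript",
             "--scripts", script]
            for instance_id in targets
        ]
    if mode == "flexible":
        return [
            ["az", "vm", "run-command", "invoke",
             "--resource-group", resource_group,
             "--name", vm_name,
             "--command-id", "RunShellScript",
             "--scripts", script]
            for vm_name in targets
        ]
    raise ValueError(f"unknown orchestration mode: {orchestration_mode}")


# Hypothetical resources for illustration.
uniform_cmds = run_command_invocations(
    "rg-demo", "vmss-web", "Uniform", ["0", "1"], "uptime"
)
flexible_cmds = run_command_invocations(
    "rg-demo", "vmss-web", "Flexible", ["vmss-web_a1b2c3"], "uptime"
)
```

Each list can be passed to a process runner one at a time, which also respects the one-run-per-VM constraint; since output is capped at 4096 bytes, scripts that produce more are better off writing to a file or log sink.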