Weekly Azure Roundup: Zonal VM moves, agent governance, IaC safety

This week's Azure updates center on making production changes less disruptive: in-place VM moves into Availability Zones and Availability Set migrations to VM Scale Sets (Flexible), alongside new Intel Xeon 6-based VM families and upcoming reservation retirements that affect cost planning. On the AI side, the focus shifts from models to operations, with the Azure Resource Manager MCP Server, multi-region agent landing zone guidance, and clearer paths from local prototypes to governed, observable deployments in Azure AI Foundry. Infrastructure and security themes tie it together with safer Terraform state migrations, earlier validation for Azure Functions deployments, more transparent HSMs, and better code-to-cloud risk context via Defender for Cloud and GitHub Advanced Security. Data and platform operations round out the week with Cosmos DB RU lessons, Databricks inventory and DR patterns, Logic Apps Standard migration tooling, and practical improvements for ACR, AKS resiliency testing, and near zero-downtime PaaS cutovers.

Reliability and cost shifts in Azure Compute

In-place moves to Availability Zones (and beyond) for existing VMs

Building on last week's focus on controlled transitions and production-ready resiliency patterns (including Extended Zones and private-first deployments), Azure introduced a public preview to migrate regional (non-zonal) Virtual Machines and VMSS Flexible instances into Availability Zones without the usual rebuild-and-recreate pain. The key promise is preservation of resource identifiers: names, resource IDs, disks, NICs, and IPs can remain intact, which reduces downstream breakage for monitoring, automation, and allowlists tied to those resources.

Operationally, the preview follows a controlled flow (deallocate, update to zonal, then start) and comes with limitations you will want to validate in a non-production environment first. If you have been blocked from zone adoption because your VM identity is referenced across scripts, IaC, or partner tooling, this feature is worth prototyping as a stepping stone to higher availability without a full migration project.

Availability Sets to VM Scale Sets (Flexible) migration preview

A second compute migration preview targets Availability Sets, letting you move VMs into Virtual Machine Scale Sets (Flexible) using portal-driven or API-driven options (CLI/PowerShell/REST). This is a practical modernization path if you want VMSS management benefits (more consistent operations and scaling primitives) without jumping straight to uniform orchestration or replatforming.

Because the migration can be performed in a controlled, per-VM manner, teams can phase the change and reduce blast radius. Expect to spend time validating how your existing extensions, load balancing, and update processes map onto VMSS Flexible semantics before committing.

New Dlsv7/Dsv7/Esv7 VM families on Intel Xeon 6 (GA)

Microsoft also announced general availability for the Dlsv7, Dsv7, and Esv7 VM series powered by Intel Xeon 6 processors. The release highlights performance improvements, larger scale options, and updated networking and storage capabilities via Azure Boost, with specific mention of Premium SSD v2 and Ultra Disk support for higher-performance storage configurations.

For teams refreshing capacity, this is a good moment to revisit sizing assumptions and benchmark core workloads (web tiers, app servers, and memory-heavy services) on the new families. If you standardize images and scale sets, a targeted canary pool using these SKUs can help you quantify cost/performance before broad rollout.

Reserved VM Instances retirements: plan your savings transition now

Azure published a transition guide for customers whose Reserved VM Instances cover select VM series that will no longer be available for new purchase or renewal starting July 1, 2026. The guidance focuses on identifying which reservations are impacted and deciding whether to renew before the cutoff, modernize to different SKUs, or move to an Azure savings plan for compute.

This is not just a finance change. If your capacity planning and release processes assume reservation-backed pricing, you will want to model the impact of savings plans vs. SKU changes, then feed those decisions back into deployment templates and scaling rules.

AI agents on Azure: tooling, governance, and paths to production

This week’s Azure AI story was less about new models and more about what it takes to run agents safely: giving them the right tools, constraining access, deploying reliably, and preventing “agent sprawl” across regions and teams.

Azure Resource Manager MCP Server (public preview)

Picking up on last week's AI Foundry/agents momentum (where client libraries stabilized and reference architectures showed end-to-end production assumptions), Azure announced a public preview of the Azure Resource Manager MCP Server, a remote Model Context Protocol (MCP) server that lets AI agents perform Azure Resource Manager operations using tools rather than ad-hoc text generation. It also supports natural-language-to-Azure Resource Graph queries and ARM template deployments from VS Code, positioning MCP as a practical bridge between Copilot-style chat and real, permissioned cloud actions.

For developers, the immediate takeaway is that “agents can operate on Azure” is becoming a first-class workflow: query resources, reason about what exists, then deploy changes through a tool interface. The governance angle matters too, because once an agent can act, you will want to pair it with Azure Policy, RBAC, and repeatable deployment patterns to keep those actions predictable.

Multi-region landing zone reference architecture for governing agents

A new reference architecture tackles the operational problem that shows up right after the first few successful pilots: too many agents, too many endpoints, and inconsistent controls. The design layers an Azure API Management AI Gateway with Azure AI Foundry Control Plane and Microsoft Agent 365, aiming to centralize policy, safety controls, evaluation, and oversight across regions.

If you are building agents for multiple business units, this is a useful blueprint for standardizing provisioning via pipelines and preventing every team from inventing its own gateway, logging, and identity patterns. The architecture also signals a shift toward treating agents as managed runtime workloads that need the same rigor as APIs (routing, authZ, telemetry, and change control).

Least-privilege agent template with azd, Entra ID, and OAuth token exchange

Following last week's repeated push to remove long-lived credentials (managed identities, federated credentials, and private-first patterns), Curity and Microsoft published an Azure Developer CLI (azd) template that deploys an AI agent app designed around least-privilege authorization. The pattern uses short-lived OAuth 2.0 JWTs and token exchange, with Microsoft Entra ID plus Curity Identity Server, and includes API gateways with audit logging and a layered Bicep deployment targeting Azure Container Apps.

The practical value is the opinionated security posture: instead of long-lived secrets or broad roles, you get a starting point that forces you to think in claims, scopes, and short token lifetimes. If you are building agents that call internal APIs, this template helps establish an “auth first” baseline that security teams can review without reverse-engineering a one-off implementation.
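
To make "thinking in claims, scopes, and short token lifetimes" concrete, here is a minimal Python sketch of a policy check over JWT claims. It is not code from the azd template: the function name, the 15-minute lifetime cap, and the example scope and audience values are all hypothetical, and it assumes the token's signature has already been verified by a proper JWT library before the claims reach this function.

```python
import time

def is_authorized(claims, required_scope, audience, now=None, max_lifetime=900):
    """Evaluate already-verified JWT claims against a least-privilege policy.

    max_lifetime (seconds) is a hypothetical cap that rejects long-lived tokens.
    """
    now = time.time() if now is None else now
    exp, iat = claims.get("exp"), claims.get("iat")
    if exp is None or iat is None:
        return False              # refuse tokens without explicit lifetimes
    if now >= exp:
        return False              # expired
    if exp - iat > max_lifetime:
        return False              # issued with too long a lifetime
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        return False              # token was minted for a different API
    # The "scope" claim is a space-delimited string of granted scopes.
    return required_scope in claims.get("scope", "").split()

# Hypothetical claims for an agent calling an internal orders API.
claims = {"iat": 1000, "exp": 1300, "aud": "orders-api", "scope": "orders:read"}
print(is_authorized(claims, "orders:read", "orders-api", now=1100))
print(is_authorized(claims, "orders:write", "orders-api", now=1100))
```

The first call is allowed and the second is denied, because the token carries only the read scope. Keeping the check this small is the point: each denial maps to one claim a reviewer can reason about.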

From local to production with Foundry Hosted Agents

Microsoft documented a production path for Microsoft Agent Framework agents using Foundry Hosted Agents inside Azure AI Foundry. The guide covers packaging (using azd and Azure Container Registry), choosing protocols (including guidance around /responses vs invocations), integrating identity with Entra ID, managing versioned rollouts, and using Application Insights for observability.

The main developer implication is that agent deployment is converging on familiar cloud patterns: containers, controlled rollouts, identity, and telemetry. If your agent prototype is still a notebook or a local service, this walkthrough provides a concrete checklist for what “production-ready” means in Foundry terms.

Durable workflows and orchestration in the Microsoft Agent Framework

On the framework side, Microsoft described how to add durability to Agent Framework workflows using the Durable Task runtime. The tutorial shows fan-out/fan-in patterns for parallel agents and hosting the workflow on Azure Functions, with an option to expose tools through MCP.

If you are orchestrating multi-step agent work (collect signals, call tools, wait on external approvals, then continue), durability is what keeps you from rebuilding state management yourself. The Azure Functions hosting option is also a practical way to integrate existing operational controls like deployment slots, scaling, and centralized logging.
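
The fan-out/fan-in shape mentioned above can be sketched in plain Python with asyncio. This is only an illustration of the pattern: it does not use the Agent Framework or Durable Task APIs and has no durability (state is lost if the process dies), and `run_agent` and the agent names are hypothetical stand-ins.

```python
import asyncio

async def run_agent(name, signal):
    # Stand-in for a real agent or tool call (hypothetical).
    await asyncio.sleep(0)
    return f"{name}:{signal.upper()}"

async def fan_out_fan_in(signal, agent_names):
    # Fan-out: create one coroutine per agent so they run concurrently.
    calls = [run_agent(name, signal) for name in agent_names]
    # Fan-in: gather waits for all of them and preserves input order.
    results = await asyncio.gather(*calls)
    return " | ".join(results)

summary = asyncio.run(fan_out_fan_in("alert", ["triage", "cost", "security"]))
print(summary)  # triage:ALERT | cost:ALERT | security:ALERT
```

A durable runtime adds what this sketch lacks: checkpointing each step so the workflow can resume after a restart or a long wait on an external approval.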

Building blocks for Agent Framework v1.0 in .NET

A separate guide positions Microsoft Agent Framework (v1.0) as part of the .NET AI building blocks story, focusing on tool calling, sessions, memory through context providers, and graph-based workflows for multi-agent orchestration. The emphasis on AgentSession and pluggable context providers points toward repeatable patterns for memory and retrieval that do not require every team to invent its own abstraction layer.

For .NET shops, this is a signal to start thinking about shared agent primitives the same way you think about shared web primitives: standard dependency injection, consistent telemetry, and reusable components for memory and tools. That becomes especially relevant as more of the Azure platform starts exposing MCP-compatible tools.

OneLake catalog access inside Azure AI Foundry (GA)

Building on last week's “AI in the field” architecture direction (where governed data stores and repeatable pipelines mattered as much as model calls), Azure AI Foundry now has native access to the OneLake catalog (GA), enabling in-context discovery of governed Fabric/OneLake data. The announcement also connects this to creating knowledge bases using Azure AI Search, which is a common pattern for retrieval-augmented generation (RAG) when you need access control and traceable sources.

For teams struggling with “where does the agent get its data,” this improves the developer experience around data discovery and governance alignment. Instead of wiring up ad-hoc connectors, you can start with a governed catalog and then decide what gets indexed into a knowledge base.

Infrastructure as Code: safer Terraform changes and AI-assisted platform engineering

This week’s IaC content centered on avoiding destructive changes (especially disks), shifting failure detection earlier in CI, and using Copilot-driven workflows without giving up repeatability.

Terraform stable keys for Azure managed disks (and deterministic state moves)

A practical Terraform guide highlighted a common failure mode: using index-based keys with for_each can cause Azure managed disks to churn when list order changes, leading to destructive replacement. The proposed fix is a stable-key migration strategy using terraform state mv, plus a reusable GitHub Copilot skill to generate deterministic move commands rather than hand-writing dozens (or hundreds) of state operations.

The key lesson is that the “design” of for_each keys is a long-term contract, not a convenience detail. If you have an Azure Landing Zone or shared module library, baking stable key practices into review guidelines can prevent costly churn later.
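
As a rough illustration of generating deterministic move commands, the Python sketch below maps old index-based for_each keys ("0", "1", ...) to stable keys and prints one `terraform state mv` command per disk. It is not the Copilot skill from the guide. The resource address `azurerm_managed_disk.data` and the disk names are hypothetical, and the sketch assumes you supply the stable keys in the same order the old list used, which you would confirm against `terraform state list` first.

```python
import shlex

def state_mv_commands(resource, stable_keys):
    """Build one `terraform state mv` command per instance.

    resource:    resource address without the instance key,
                 e.g. "azurerm_managed_disk.data" (hypothetical).
    stable_keys: new keys, ordered so position i matches old key "i".
    """
    if len(set(stable_keys)) != len(stable_keys):
        raise ValueError("stable keys must be unique")
    commands = []
    for index, key in enumerate(stable_keys):
        source = f'{resource}["{index}"]'
        destination = f'{resource}["{key}"]'
        # Quote each address so the shell passes brackets and quotes through.
        commands.append(
            f"terraform state mv {shlex.quote(source)} {shlex.quote(destination)}"
        )
    return commands

for line in state_mv_commands("azurerm_managed_disk.data", ["app01-data", "app01-logs"]):
    print(line)
```

Generating the commands from one ordered list keeps the migration reviewable: the output can be diffed in a pull request and run once, after which a plan should show no disk replacements if the mapping was correct.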

Validation-driven Terraform for predictable Azure Functions deployments

Another case study showed how a team reduced Azure Functions deployment failures by moving checks left, from terraform apply into pull requests and plan-time validation. The approach combines PR checks, Azure pre-flight checks, and Terraform-native validations (including input validation and preconditions) so common issues fail fast before reaching production pipelines.

For teams operating serverless at scale, this is a reminder that many “Azure failures” are actually contract mismatches (naming, configuration shape, missing permissions) that can be encoded as validations. Codifying those checks can make your deployments boring again, which is usually the goal.
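
One way to picture "contract mismatches encoded as validations" is a small pre-flight check that runs in a pull request before any plan or apply. The Python sketch below is an assumption-laden illustration, not the case study's implementation: the input keys and the runtime allowlist are hypothetical team conventions. The storage account rule (3 to 24 lowercase letters and digits) is the one Azure-defined constraint it checks.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits.
STORAGE_NAME = re.compile(r"[a-z0-9]{3,24}")

# Hypothetical team allowlist, not an Azure-defined list.
ALLOWED_RUNTIMES = {"dotnet", "node", "python"}

REQUIRED = ("resource_group", "location", "storage_account_name", "runtime")

def validate_inputs(cfg):
    """Return a list of problems; an empty list means the inputs pass."""
    problems = [f"missing required input: {key}" for key in REQUIRED if not cfg.get(key)]
    name = cfg.get("storage_account_name")
    if name and not STORAGE_NAME.fullmatch(name):
        problems.append(f"invalid storage account name: {name!r}")
    runtime = cfg.get("runtime")
    if runtime and runtime not in ALLOWED_RUNTIMES:
        problems.append(f"runtime not in team allowlist: {runtime!r}")
    return problems

candidate = {
    "resource_group": "rg-orders",
    "location": "westeurope",
    "storage_account_name": "My-Storage",   # fails: uppercase and hyphen
    "runtime": "python",
}
for problem in validate_inputs(candidate):
    print(problem)
```

The same rules can usually be expressed as Terraform variable validation blocks or preconditions; a script like this is useful mainly when the check needs to run before Terraform is even initialized.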

Drift detection and self-healing patterns for Azure AI infrastructure

Continuing last week's theme of day-two readiness through explicit, testable platform patterns (telemetry pipelines, identity defaults, and private networking), two posts approached the same operational pain from different angles: infrastructure drift across repos and environments. One described a multi-repo Terraform platform that reconciles deployment metadata daily and “self-heals” by ensuring the next terraform plan/apply corrects unintended changes, using OIDC and Key Vault plus private endpoints for a tighter security posture. Another proposed an infrastructure validation agent that compares an Excel “source of truth” with Terraform config and deployed Azure resources to detect drift and mismatches, optionally layering AI-based validation and Azure AI Search.

Both patterns highlight a real constraint in regulated environments: you often need a human-auditable source of truth and deterministic remediation steps, not just a chatbot. If you are building internal platforms, consider pairing automated drift detection with strict approval gates and clear ownership so remediation does not turn into surprise changes.
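
The three-way comparison described above (source of truth, Terraform config, deployed resources) can be sketched as a plain dictionary diff. This Python example is a simplified illustration rather than either post's implementation: it assumes each view has already been loaded into a `{resource_name: {attribute: value}}` map, and the resource name and attribute values are hypothetical.

```python
def detect_drift(source_of_truth, terraform_config, deployed):
    """Report attributes where the three views do not all agree.

    Returns a sorted list of
    (resource, attribute, expected, configured, actual) tuples.
    A missing resource or attribute shows up as None in that position.
    """
    findings = []
    names = set(source_of_truth) | set(terraform_config) | set(deployed)
    for name in sorted(names):
        views = [m.get(name, {}) for m in (source_of_truth, terraform_config, deployed)]
        for attr in sorted(set().union(*views)):
            expected, configured, actual = (view.get(attr) for view in views)
            if not (expected == configured == actual):
                findings.append((name, attr, expected, configured, actual))
    return findings

expected = {"vm-app01": {"sku": "Standard_D4s_v5", "zone": "1"}}
configured = {"vm-app01": {"sku": "Standard_D4s_v5", "zone": "1"}}
actual = {"vm-app01": {"sku": "Standard_D8s_v5", "zone": "1"}}

for finding in detect_drift(expected, configured, actual):
    print(finding)
```

Because the output is a deterministic list rather than free text, it can feed an approval gate directly: a reviewer sees which view disagrees before any remediation runs.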

Copilot-assisted Landing Zone engineering

A separate Landing Zone piece described a prompt-driven delivery workflow where GitHub Copilot generates initial drafts for Terraform modules, networking, OIDC setup, GitHub Actions workflows, and policy assignments. The engineering effort shifts toward reviewing, standardizing, and enforcing conventions rather than writing every file from scratch.

The risk to watch is consistency: generated drafts can diverge unless you have strong module boundaries, linting, and policy-as-code checks. If you try this approach, treat Copilot output like code from a junior contributor: useful acceleration, but only if your repo has guardrails.

Security and trust: HSM transparency, code-to-cloud visibility, and secure-by-design guidance

Azure Integrated HSM is being open-sourced

Microsoft announced it is open-sourcing Azure Integrated HSM through the Open Compute Project (OCP), including firmware and supporting software plus independent validation artifacts. The post frames server-integrated HSMs as complementary to Azure Key Vault and Managed HSM, offering hardware-enforced, verifiable key protection at scale, and cites alignment with standards like FIPS 140-3 Level 3.

For security teams, open artifacts matter because they enable deeper review and independent validation beyond vendor claims. For builders, the implication is that Azure’s key management story increasingly includes attestation and hardware verifiability as first-class concerns, especially for high-assurance environments.

Managed HSM + Bicep walkthroughs for customer-managed keys

A hands-on guide walked through deploying Azure Managed HSM with Bicep, creating an RSA-HSM key, assigning RBAC roles, and configuring an Azure resource (example: Storage) to use customer-managed keys (CMK). It also calls out where deployments commonly fail in practice: permissions and key rotation behavior.

If you are rolling CMK across services, this is a good reminder to test the full lifecycle (creation, assignment, rotation, and failure recovery), not just initial provisioning. The RBAC design is often the make-or-break detail, so plan to codify role assignments and validate them in CI.

Defender for Cloud + GitHub Advanced Security (GA): code-to-cloud correlation

GitHub announced general availability of Defender for Cloud integration with GitHub Advanced Security, enabling code-to-cloud correlation for deployed container artifacts. The update surfaces runtime risk context directly inside GitHub security views, with runtime-aware filters and improved campaign targeting across code scanning and Dependabot.

The practical impact is tighter feedback loops: security teams can see whether a vulnerable component is actually deployed and exposed, and developers can prioritize fixes with runtime context rather than CVE lists alone. If you already publish artifact attestations or use the Deployment Record API, this is a good time to verify how your deployment metadata flows into the new views.

Secure-by-design guidance for Azure IaaS and memory-safe hardware

Echoing last week's “make opportunistic attacks harder by design” posture (secure defaults, smaller blast radius, and fewer reusable secrets), Azure published a defense-in-depth overview for Azure IaaS that ties together hardware trust, VM protections (Trusted Launch and confidential computing), secure networking and encryption defaults, plus monitoring through Azure Monitor and Defender for Cloud. In parallel, Microsoft highlighted CHERIoT-Ibex, an open-source CHERIoT/CHERI-enabled RISC-V Ibex core (CHERI Alliance certified) that targets memory safety issues using hardware-enforced protections and fine-grained compartmentalization for embedded systems.

Together, these posts point to a spectrum of security work: hardening today’s VM fleets while also investing in architectures that reduce entire vulnerability classes long term. If you build regulated workloads, it is worth mapping which defenses you can enable immediately (Trusted Launch, confidential SKUs, encryption defaults) versus which require platform choices (hardware capabilities, memory-safety approaches).

Data, analytics, and integration: Cosmos DB lessons, Databricks operations, and Logic Apps modernization

Azure’s data and integration updates this week leaned into production operations: cost and partitioning realities for Cosmos DB, governance and resilience for Databricks, and practical migration tooling for Logic Apps Standard.

Cosmos DB Conf 2026: production lessons on RU economics and AI-ready patterns

Following last week's Cosmos DB cost/performance tuning thread, two Cosmos DB Conf recaps focused on the parts teams usually learn the hard way: partition key choice, query shape, and indexing decisions that directly drive RU cost, throttling, and latency. The devblogs recap also points to newer capabilities such as hierarchical partition keys, partition-level auto-failover, and five-nines availability, plus patterns for event-driven systems and using Cosmos DB as an “agent memory” store alongside vector search.

If you are building AI features on Cosmos DB, the recurring message is that “vector search exists” does not remove the need to model RU consumption and query behavior carefully. Treat RU/s as a first-class SLO input, benchmark hybrid search patterns, and validate indexing policies early.
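
Treating RU/s as an SLO input mostly comes down to simple arithmetic once you have measured the request charge of each query shape. The Python sketch below shows that arithmetic under stated assumptions: the operation rates and per-operation RU charges are made-up placeholders, and the 25% headroom is an arbitrary illustrative default, not a recommendation.

```python
import math

def required_rus(workload, headroom=0.25):
    """Estimate provisioned RU/s for a workload.

    workload: list of (ops_per_second, ru_per_op) pairs, one per query
              shape, using RU charges you measured yourself.
    headroom: fractional buffer above steady state (assumed value).
    """
    steady_state = sum(ops * ru for ops, ru in workload)
    return math.ceil(steady_state * (1 + headroom))

# Hypothetical workload: point reads, writes, and a vector-style query.
workload = [
    (200, 1.0),    # 200 point reads/s at 1 RU each
    (50, 10.0),    # 50 writes/s at 10 RU each
    (5, 40.0),     # 5 search queries/s at 40 RU each
]
print(required_rus(workload))  # 1125
```

Here the five search queries per second cost as much as all 200 point reads, which is the kind of imbalance that only shows up when each query shape is benchmarked separately.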

Azure Databricks: visibility and disaster recovery practices

On the platform operations side, a “single pane of glass” Discovery utility for Azure Databricks collects an inventory of workspace assets (clusters, jobs, warehouses, Delta Live Tables, Unity Catalog, security, billing, utilization) into Unity Catalog Delta tables and visualizes it in a Lakeview dashboard. A separate disaster recovery guide lays out customer-managed DR patterns, including RTO/RPO trade-offs, active-active vs warm standby, and cross-region replication of Unity Catalog metadata and Delta data using IaC and repeatable DR pipelines.

If your Databricks footprint has grown organically, consolidating inventory into governed tables is a practical step toward enforcing standards and controlling spend. Pairing that with explicit DR pipelines helps you avoid “we have backups” complacency and moves you toward tested recovery procedures.

Logic Apps Standard: migration tooling and Oracle connector preview

Building on last week's focus on messaging change management (Service Bus SBMP retirement and BizTalk 2020 migration steps), integration teams got two notable updates for Logic Apps Standard. First, Microsoft announced an open-source Logic Apps Migration Agent for modernizing BizTalk Server and other integration platforms with an AI-assisted, stage-gated workflow, including VS Code and GitHub Copilot integration plus human review checkpoints. Second, Logic Apps Standard added a built-in Oracle Database connector in public preview that runs in-process for single-tenant workflows and can work without a gateway when network connectivity is available, with documentation on supported actions and limitations.

For teams with legacy integrations, the Migration Agent is a concrete pathway to reduce manual rewrites while keeping control through gating. For Oracle-heavy estates, the built-in connector preview is worth a lab test, especially if you have avoided Logic Apps due to gateway requirements and want simpler deployment within a VNet-integrated footprint.

Developer operations and platform building on Azure: containers, resilience, and migration playbooks

ACR Artifact Cache now supports ACR-to-ACR upstream sources

Extending last week's push toward private networking and managed identity as the default building blocks, Azure Container Registry announced that ACR Artifact Cache can now use another ACR as an upstream source, enabling ACR-to-ACR pull-through caching. This helps with image promotion flows and registry hierarchy setups, where you want controlled replication and faster pulls across environments. The post includes an Azure CLI walkthrough using a user-assigned managed identity plus RBAC roles, and notes supported networking/auth combinations, including Private Link scenarios.

If you run multi-environment clusters (dev/test/prod) and want to reduce external pulls while keeping promotion explicit, this feature fits well with a “central registry plus environment registries” pattern. It is also a useful knob for improving reliability when upstream registries or networks get flaky during deploy windows.

AKS productivity and resiliency testing patterns

After last week's AKS App Routing move toward Gateway API and Envoy-based boundaries, Microsoft showcased Containerization Assist integration inside the AKS VS Code extension, aiming to streamline containerizing apps, generating Kubernetes deployment assets, and setting up an automated pipeline, with an open-source contribution path. Separately, an AKS high availability testing guide outlined how to validate single-region resiliency using Availability Zones, including simulations of pod, node, zone, network, and dependency failures and the metrics you can watch with Azure Monitor.

Put together, these are the two halves of day-two Kubernetes reality: getting workloads packaged and deployed consistently, then proving they survive failures under traffic. If you are adopting zones, make sure your test plan includes cross-zone dependencies (DNS, ingress, storage, external services) so you do not discover weak links during an incident.

Near zero-downtime migration cutovers for Azure PaaS

A detailed migration guide described a phased-parallel cutover strategy for Azure PaaS, focusing on traffic shifting via Azure Front Door, handling Azure Service Bus messaging risks, and validating HA/DR, observability, and rollback. The practical value is the emphasis on sequencing and verification, not just “deploy the new thing and switch DNS.”

If you operate customer-facing systems, this is a good checklist for rehearsing cutovers as engineering work rather than a one-night event. Pay special attention to data-plane and message-plane consistency, since those are common sources of partial failures during migrations.
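
The "sequencing and verification" idea can be sketched as a loop that raises the share of traffic on the new stack step by step and rolls back on the first failed check. This Python example is a hypothetical illustration, not the guide's procedure: it does not call Azure Front Door or any Azure API, and `check_health` stands in for whatever probes (error rate, queue depth, latency) you would actually run at each step.

```python
def run_cutover(steps, check_health):
    """Shift traffic through `steps` (percent on the new stack).

    check_health(percent) -> bool is a caller-supplied probe (hypothetical).
    Returns (final_percent, history). On the first failed check the
    function stops and reports 0, meaning all traffic returns to the
    old stack.
    """
    history = []
    for percent in steps:
        healthy = check_health(percent)
        history.append((percent, healthy))
        if not healthy:
            return 0, history
    return (steps[-1] if steps else 0), history

# Rehearsal with a fake probe that starts failing above 50% traffic.
final, history = run_cutover([5, 25, 50, 100], lambda percent: percent <= 50)
print(final)
print(history)
```

In this rehearsal the run ends at 0% with the failure recorded at the 100% step, so the history doubles as an audit trail showing exactly which phase triggered the rollback.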

Other Azure News

John Savill’s May 8, 2026 Azure update video covered a spread of platform changes across Azure Functions, AKS networking, storage, databases, and Azure AI Foundry, and it called out Document Intelligence v3 API retirement timing that teams should factor into backlog planning.

Microsoft also published a Europe-focused post about Azure region expansion and resilience expectations, connecting capacity growth to compliance, data residency, and sovereign cloud requirements. If you build for EU customers, it is a reminder to validate multi-region designs against the Azure Well-Architected Framework and the EU Data Boundary constraints you actually need to meet.