Weekly AI Roundup: Offline Agents, Orchestration, and Controls

This week’s AI updates were less about new model behavior and more about making agent systems workable: running locally, standardizing orchestration across languages, and tightening operational controls (tools, governance, cost) so systems hold up in production. It continues last week’s “run it like software” direction (repeatable workflows, inspectable grounding, and day-two controls), with more emphasis on building blocks you can ship: offline templates, stable multi-agent runtimes, and governable tool-integration patterns.

Offline, On-Device Agents with Foundry Local (RAG vs CAG)

Two local-first assistant blueprints built around Microsoft Foundry Local and foundry-local-sdk show how to run entirely on one machine with no API keys and no network after the initial model download. This builds on last week’s Foundry Local thread (OpenAI-compatible local endpoints, stable client code while endpoints swap, lightweight grounding) by showing two concrete app shapes that fit internal tools or offline/field use.

Both samples keep the app intentionally small: Node.js 20+ with Express, a single-page UI, and Server-Sent Events (SSE) used twice. First, SSE streams model download/load status until “Offline Ready.” Then it streams tokens into chat. That “operate the loop” approach (status streaming, predictable startup, explicit offline readiness) lines up with last week’s idea that local runtimes should be operable systems, not just demos.

The two samples differ mainly in how grounding works. The CAG version is startup-loaded and straightforward: preload Markdown docs from docs/, score documents with keyword scoring (no embeddings/chunking/vector DB), and inject the top docs into the prompt. The trade-offs are explicit: it is limited by the context window, best for “tens of documents,” and KB updates require a restart. It also includes practical model selection: filter the local catalog by capability (chat-completion) and a RAM budget policy (for example, “60% of system RAM”), then pick models like phi-4 on 32 GB or phi-3.5-mini on 8 GB, download if needed, load, and run completions in-process. This keeps last week’s “predictable endpoint swaps” idea but adds guidance for “what runs on this machine.”

The RAG version adds more components for scale and hot updates, which echoes last week’s point that grounding should be reusable and testable. It chunks Markdown into ~200-token segments with overlap, stores chunks and TF-IDF vectors in a single-file SQLite DB (better-sqlite3), and retrieves using TF-IDF + cosine similarity, explicitly avoiding embeddings to stay offline and lightweight. Retrieval is optimized with an inverted index, prepared statements, and caching, and the author reports sub-millisecond retrieval for the target workload.

The prompt contract is also strict: safety-first behavior, bans on guessing for procedures/tolerances, a required “This information is not available in the local knowledge base” response when grounding is insufficient, and a structured output format (summary, safety warnings, steps, references) with UI-visible citations and relevance scores. It also supports runtime doc upload (.md/.txt) with immediate chunk/vector/index updates without restart, which is where the extra RAG complexity pays off.

Both posts include setup details (winget install Microsoft.FoundryLocal, model sizes like ~2 GB for Phi-3.5 Mini, npm test via Node’s built-in test runner) and close with extension paths such as hybrid retrieval (TF-IDF + embeddings), persisted memory, multimodal input, and PWA packaging for offline install, which matches last week’s “start simple, stay inspectable” direction.
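
To make the RAG retrieval path described above more concrete, here is a minimal in-memory sketch in Python. It is not the author's implementation (the samples are Node.js with better-sqlite3); it only illustrates overlapping chunks, TF-IDF vectors, cosine ranking, and the fallback response. The overlap size, the smoothed IDF formula, the min_score threshold, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Fallback text quoted from the prompt contract described above.
NOT_AVAILABLE = "This information is not available in the local knowledge base"


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_text(text, size=200, overlap=25):
    """Split text into ~size-token chunks; the overlap value is an assumption."""
    tokens = tokenize(text)
    step = size - overlap
    stop = max(len(tokens) - overlap, 1)
    return [" ".join(tokens[i:i + size]) for i in range(0, stop, step)]


def to_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unknown to the index are dropped."""
    tf = Counter(tokens)
    return {t: (n / len(tokens)) * idf[t] for t, n in tf.items() if t in idf}


def build_index(chunks):
    """Compute smoothed IDF weights and one TF-IDF vector per chunk."""
    tokenized = [tokenize(c) for c in chunks]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(chunks)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return idf, [to_vector(toks, idf) for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, idf, vectors, k=3, min_score=0.1):
    """Return up to k (score, chunk) pairs above a hypothetical threshold."""
    q = to_vector(tokenize(query), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(score, chunks[i]) for score, i in scored[:k] if score >= min_score]


def ground(query, chunks, idf, vectors):
    """Build the grounding text, or the required refusal when nothing matches."""
    hits = retrieve(query, chunks, idf, vectors)
    if not hits:
        return NOT_AVAILABLE
    return "\n".join(f"[{score:.2f}] {chunk}" for score, chunk in hits)
```

Calling build_index once over the chunked documents and then ground for each question mirrors the "retrieve, then refuse if grounding is insufficient" contract. A persistent store and an inverted index, as in the original sample, would replace the linear scan used here.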

Agent Orchestration Goes Production-Ready (Microsoft Agent Framework 1.0 + Copilot Studio Multi-Agent GA)

Microsoft moved multi-agent development toward more stable foundations in two places: Agent Framework 1.0 for developers and Copilot Studio multi-agent GA for makers/developers. This follows last week’s platform-choice framing (Copilot Studio vs Azure AI Agents vs Foundry) by translating “production-ready” into stable APIs across languages, reviewable configs, and evaluation/moderation hooks that fit CI/CD.

Microsoft Agent Framework 1.0 is out for .NET and Python with stable APIs and an LTS/backward-compatibility commitment, positioned as a convergence of Semantic Kernel foundations and AutoGen orchestration patterns. The core value is standardization: build single- or multi-agent systems with the same abstractions in both runtimes, and swap providers via connectors (Foundry, Azure OpenAI, OpenAI, Anthropic, Bedrock, Gemini, Ollama). That matches last week’s theme of keeping app/orchestration contracts stable while endpoints evolve (similar to Foundry Local’s endpoint swap story).

The framework includes core building blocks teams need early: tools/functions, multi-turn session management, and streaming. For orchestration, it provides a graph workflow engine (branch/fan-out/converge), checkpointing/hydration for long-running flows, and patterns like sequential, concurrent, handoff, group chat, and Magentic-One, plus middleware hooks for policy, observability, and compliance logic. Memory is pluggable (history, persistent KV, vector retrieval) with backends like Foundry Agent Service memory, Mem0, Redis, Neo4j, and custom providers. It also introduces YAML-defined agents/workflows (instructions, tools, memory, topology) that can be version-controlled and promoted, which lines up with last week’s repo-first operating model.

Copilot Studio’s multi-agent orchestration is rolling into GA over the next few weeks (targeting full availability for eligible customers by April 2026). It extends last week’s “hybrid approach” framing (Copilot Studio for controlled experiences, programmable layers behind it) into multi-agent coordination. The GA scope emphasizes connected experiences: Fabric integration (Copilot Studio agents coordinate with Fabric agents), orchestration with the Microsoft 365 Agents SDK (reuse retrieval/actions across Microsoft 365 and Copilot Studio), and Agent-to-Agent (A2A) communication via an open protocol for delegating to other agents. Prompt Builder is now GA and integrated into the Tools tab for iterating instructions/models/inputs/knowledge in one place, and prompt-level content moderation controls are GA (supported regions) for managed models, which can help where default filters block legitimate regulated terms. Evaluation automation APIs are GA via Power Platform APIs/connectors for CI/CD gating against regression scores, and connectors like ServiceNow and Azure DevOps are called out as improved to better support operational grounding.

The shared direction is multi-agent work as engineering: stable runtimes (.NET/Python), checkpointed workflow graphs, versioned YAML orchestration, and platform features that make prompt iteration, moderation, and automated evaluation part of regular releases.
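
As a rough illustration of what "checkpointed workflow graphs" means in practice, the sketch below runs a small sequential flow with one fan-out/converge step and writes a JSON checkpoint after each step so a later run can hydrate and skip completed work. It deliberately does not use the Agent Framework API; run_workflow, fan_out, the step names, and the checkpoint file layout are all hypothetical, and the branches run one after another here rather than concurrently.

```python
import json
import os
from typing import Callable, Dict, List, Tuple

Step = Callable[[dict], dict]


def fan_out(branches: Dict[str, Step]) -> Step:
    """Build a step that runs every branch on a copy of the state, then converges.

    Branches run sequentially in this sketch; a real engine may run them in parallel.
    """
    def step(state: dict) -> dict:
        results = {name: fn(dict(state)) for name, fn in branches.items()}
        return {**state, "branches": results}
    return step


def run_workflow(steps: List[Tuple[str, Step]], state: dict, checkpoint_path: str) -> dict:
    """Run named steps in order, persisting state after each one."""
    done: List[str] = []
    # Hydration: reload state and the list of completed steps if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        state, done = saved["state"], saved["done"]
    for name, fn in steps:
        if name in done:
            continue  # already completed in an earlier run
        state = fn(state)
        done.append(name)
        # Checkpoint: the state must be JSON-serializable for this to work.
        with open(checkpoint_path, "w") as f:
            json.dump({"state": state, "done": done}, f)
    return state
```

If the process stops partway through, calling run_workflow again with the same checkpoint path resumes from the first step that was not recorded as done, which is the property long-running agent flows rely on.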

MCP as the Tooling Glue (VS Code, Azure Functions, and Governed Data/Metadata Access)

MCP kept showing up as the tool layer that makes agents reusable across products: custom tools in Foundry agents, local development in VS Code, and governed metadata access for data copilots. This continues last week’s MCP storyline (maturity, hosted endpoints, identity-aware access, deterministic tool surfaces), with more emphasis on hosting, auth, and integration with governed systems.

In Azure AI Foundry, the practical pattern is to host an MCP server remotely on Azure Functions, then register that endpoint as a tool in Foundry so agents can discover/invoke it from the Agent Builder Playground. Azure Functions is positioned as the default host because it fits tool workloads (serverless scaling, consumption billing, multiple auth models). The post lays out identity choices teams need to decide early, following last week’s “tool calls need boundaries” theme: key-based auth (simple for dev), Entra ID + managed identity (recommended for production service-to-service calls using the Foundry project managed identity), OAuth identity passthrough (tool calls under each end user identity), and unauthenticated access (dev/public tools only). It also gives a concrete endpoint format for MCP extension-based Functions: https://<FUNCTION_APP_NAME>.azurewebsites.net/runtime/webhooks/mcp. The reuse point is explicit: MCP servers built for VS Code/Visual Studio/Cursor can be reused in Foundry without rebuilding integrations.

For local development, a VS Code video walkthrough shows end-to-end MCP server development with Python and FastMCP, including client/server responsibilities, tool discovery and invocation, and STDIO transport (server as a local process over stdin/stdout). This reinforces the schema discipline point: MCP tool schemas enable cross-client discovery, and transport can be local for dev even if production moves to remote HTTPS for governance and networking.

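
To show what tool discovery and invocation over STDIO look like on the wire, here is a simplified, stdlib-only Python sketch of the server side. It handles only the JSON-RPC methods tools/list and tools/call and omits the initialize handshake, notifications, and schema validation that a real MCP server needs; in practice a library such as FastMCP handles all of that. The add tool and the handle and serve_stdio helpers are hypothetical names used for illustration.

```python
import json
import sys

# Hypothetical tool registry: name -> description, JSON Schema for inputs, handler.
TOOLS = {
    "add": {
        "description": "Add two integers.",
        "inputSchema": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
        "handler": lambda args: args["a"] + args["b"],
    }
}


def error(request_id, code, message):
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": request_id, "error": {"code": code, "message": message}}


def handle(request):
    """Dispatch one JSON-RPC request to discovery or invocation."""
    request_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        # Discovery: clients read names and input schemas from this list.
        tools = [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in TOOLS.items()
        ]
        result = {"tools": tools}
    elif method == "tools/call":
        params = request.get("params", {})
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return error(request_id, -32602, "Unknown tool")
        value = tool["handler"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": str(value)}], "isError": False}
    else:
        return error(request_id, -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


def serve_stdio():
    """Read newline-delimited JSON requests from stdin, write responses to stdout."""
    for line in sys.stdin:
        if line.strip():
            print(json.dumps(handle(json.loads(line))), flush=True)
```

Calling serve_stdio() from a script entry point turns this into a local process that a client can launch and talk to over stdin/stdout. Moving the same handle function behind a remote HTTPS endpoint changes the transport, not the tool schema, which is the reuse point made above.
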
MCP also appears in the Fabric/Purview governance story as a way to expose metadata and governance-aware capabilities to AI agents without bypassing permissions. This aligns with last week’s Fabric direction (semantics, permission-aware context, MCP endpoints on the roadmap): instead of copying catalog/lineage/classification into prompts, you expose controlled tools that enforce Fabric/Purview rules. It is paired with API governance updates (OneLake Catalog Search API GA and Bulk Import/Export of Item Definitions preview) so teams can automate metadata operations instead of relying on UI-heavy workflows.

Azure SRE Agent: Provider Choice, Prerequisites, and a Shift to Token-Based Billing

Azure SRE Agent added operational details that shape safe, sustainable on-call usage: prerequisites, integrations, model provider choice, and billing. This builds on last week’s “external system → managed identity bridge → SRE Agent trigger” patterns and cost guardrails by adding rollout constraints, network realities, and cost units that map directly to usage.

One post focuses on preview onboarding prerequisites and infrastructure scenarios, framing the agent as an AI reliability operator that observes Azure telemetry (Azure Monitor, Log Analytics, Application Insights) and Azure service APIs, then helps with incident investigation, correlation, RCA, and optional controlled remediation. Teams can run in recommendation/review mode or enable autonomous execution for pre-approved steps with guardrails, approvals, and specialized subagents (VMs/databases/networking).

The actionable content is the checklist: the preview control plane must be created in Sweden Central, Australia East, or East US 2 (monitored workloads can be elsewhere), subscriptions may need allow-listing, and identity/RBAC is the core dependency, often elevated for onboarding and then tightened to least privilege for the managed identity (read for investigation, scoped write for approved remediation). It also calls out integration edges: outbound HTTPS to Azure management endpoints and any third-party systems/MCP servers (custom MCP endpoints must be remote HTTPS, not local endpoints), no guaranteed static egress IPs for firewall allow lists, and allowing domains like *.azuresre.ai. Integrations include ServiceNow/PagerDuty, GitHub/Azure DevOps, Grafana, and Azure Data Explorer (Kusto). The “remote-only tool endpoints + identity boundaries” constraint matches the MCP hosting patterns showing up elsewhere.

Two updates moved the product toward more flexible operations and clearer cost planning. First, SRE Agent now supports multiple model providers, adding Anthropic with Claude Opus 4.6 as the baseline when selected, which fits this week’s provider-abstraction theme (Agent Framework). Second, active flow billing shifts from time-based to token-based metering effective April 15, 2026. The unit remains Azure Agent Units (AAUs): always-on flow stays 4 AAUs per agent-hour; active flow becomes “AAUs per million tokens” with rates varying by provider. This ties cost to investigation depth (conversation length, correlated telemetry breadth) and makes provider choice part of cost planning.

Monthly AAU allocation limits (Settings → Agent consumption) remain the key guardrail: when you hit the active flow limit, chat/autonomous actions pause until next month, while always-on continues, which matches last week’s cost-control approach.
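
The billing arithmetic above can be sketched as a small estimator. Only the always-on rate (4 AAUs per agent-hour) comes from the announcement; the per-million-token rate passed in below is a made-up placeholder, since actual rates vary by provider and are not given here, and the function names are hypothetical.

```python
ALWAYS_ON_AAU_PER_AGENT_HOUR = 4  # stated in the announcement


def always_on_aaus(agent_hours):
    """AAUs consumed by the always-on flow."""
    return ALWAYS_ON_AAU_PER_AGENT_HOUR * agent_hours


def active_flow_aaus(tokens, aau_per_million_tokens):
    """AAUs consumed by the active flow under token-based metering.

    aau_per_million_tokens is provider-specific; callers must supply the real rate.
    """
    return tokens / 1_000_000 * aau_per_million_tokens


def active_flow_paused(active_aaus_used, monthly_active_limit):
    """True once the monthly active flow allocation is reached.

    Always-on flow keeps running regardless, per the description above.
    """
    return active_aaus_used >= monthly_active_limit


# Example with a hypothetical rate of 100 AAUs per million tokens (not a real price).
hours_in_month = 24 * 30
baseline = always_on_aaus(hours_in_month)
investigations = active_flow_aaus(5_000_000, aau_per_million_tokens=100.0)
```

Because the active portion scales with tokens, long conversations and broad telemetry correlation raise cost directly, so the same estimate should be rerun per provider once real rates are known.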

Other AI News

Foundry’s model catalog keeps expanding beyond chat into modality-specific building blocks. This follows last week’s “voice as an operational modality” thread by adding first-party primitives in Foundry so teams can build voice and image features without immediately using third-party hosting. Microsoft announced MAI models in Azure AI Foundry: MAI-Transcribe-1 (speech-to-text, 25 languages), MAI-Voice-1 (text-to-speech), and MAI-Image-2 (image generation). The goal is first-party options in Foundry’s catalog, with details like parameters/pricing/regions expected in the linked build surfaces rather than the announcement.