Weekly Machine Learning Roundup: Ops guardrails and context control
This week in machine learning and analytics tooling was mostly about making day-to-day platform operations less fragile: Fabric pushed several previews that help teams scale Spark automation, find assets across workspaces, and centralize monitoring and cost controls, while Databricks guidance focused on disaster recovery and visibility across sprawling workspaces. Building on last week's Fabric-heavy focus on “operational plumbing” (MLOps boundaries with MLflow, real-time ingestion paths, and secure-by-default architecture choices), the throughline here is similar: once the platform grows beyond a single workspace or a single team, automation, discoverability, and guardrails matter as much as the model code. Alongside the platform work, model-behavior guidance reinforced a practical theme: better outcomes come from better context, not just bigger prompts.
Microsoft Fabric: Scaling Spark automation, discoverability, and operational guardrails
Fabric's Spark automation story moved forward with preview support for High Concurrency sessions in the Fabric Livy API, aimed at teams running many parallel jobs without paying the overhead of constantly spinning up new sessions. Instead of serializing work through a single session, you can run multiple isolated Spark workloads in parallel while still reusing sessions, tagging them with sessionTag so clients can reliably reconnect to the right session and track what is running where. In practice, this changes how you build orchestration: rather than treating a Livy session as a single-threaded bottleneck, you can design job runners that multiplex workloads, isolate failure domains, and get cleaner monitoring and cost attribution because sessions are no longer a black box shared by unrelated jobs. If last week's theme was promoting ML work safely across environments (Dev/Test/Prod boundaries with MLflow), this is the adjacent scaling problem: once you have repeatable pipelines, you need them to run concurrently without turning into a session-management and cost-debugging mess.
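To make the multiplexing idea concrete, here is a minimal in-memory sketch of a job runner that reuses one session per tag. It does not call the Fabric Livy API: `SessionPool`, `acquire`, `run_jobs`, and the tag names are all hypothetical. A real client would replace `_create_session` with the HTTP request that passes `sessionTag`.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class SessionPool:
    """Toy stand-in for a Livy client: one reusable session per tag."""
    _sessions: dict = field(default_factory=dict)  # tag -> session id
    _ids: count = field(default_factory=lambda: count(1))
    created: int = 0

    def _create_session(self, tag: str) -> str:
        # Hypothetical: a real client would create a remote session here.
        self.created += 1
        return f"session-{next(self._ids)}"

    def acquire(self, tag: str) -> str:
        """Return the existing session for this tag, or create one."""
        if tag not in self._sessions:
            self._sessions[tag] = self._create_session(tag)
        return self._sessions[tag]


def run_jobs(pool, jobs):
    """Map each (job_name, tag) pair to the session it was placed on."""
    return {name: pool.acquire(tag) for name, tag in jobs}


pool = SessionPool()
placement = run_jobs(pool, [
    ("ingest_orders", "team-sales"),
    ("ingest_returns", "team-sales"),
    ("train_churn", "team-ml"),
])
print(placement)
print("sessions created:", pool.created)
```

Keying reuse on the tag is what keeps unrelated jobs out of each other's sessions, which is also what makes per-team cost attribution tractable.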
Cross-workspace sprawl is another common pain point in Fabric, and a new preview OneLake Catalog Search REST API targets exactly that. It lets you discover items across workspaces programmatically, which matters once you have dozens of domains, duplicated datasets, and a growing list of notebooks, warehouses, and semantic models. This ties directly back to last week's cross-workspace MLOps story: separating workspaces is only helpful if teams can still find the right datasets, artifacts, and owners across those boundaries without manual inventories. Microsoft also wired this search into the Fabric Core MCP Server (so agent-driven workflows can query the catalog as a tool) and added a new fab find command in the Fabric CLI. If you are building internal tooling or CI checks, the details here are practical: results can be filtered and shaped using JMESPath, making it easier to build repeatable “what do we have and where is it” scripts without maintaining your own inventory tables.
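As a rough illustration of the "filter and shape" step, the sketch below applies the equivalent of a JMESPath projection to a search response using plain Python. The response shape and field names (`value`, `displayName`, `type`, `workspace`) are assumptions made for the example, not the documented schema of the Catalog Search API.

```python
# Hypothetical search response; the real API's field names may differ.
response = {
    "value": [
        {"displayName": "sales_lakehouse", "type": "Lakehouse", "workspace": "Sales"},
        {"displayName": "churn_notebook", "type": "Notebook", "workspace": "ML"},
        {"displayName": "features_lakehouse", "type": "Lakehouse", "workspace": "ML"},
    ]
}


def lakehouse_summary(resp):
    """Keep only Lakehouse items and reshape them to name/workspace pairs.

    This mirrors what a JMESPath expression such as
        value[?type=='Lakehouse'].{name: displayName, workspace: workspace}
    would produce against the assumed response shape.
    """
    return [
        {"name": item["displayName"], "workspace": item["workspace"]}
        for item in resp.get("value", [])
        if item.get("type") == "Lakehouse"
    ]


summary = lakehouse_summary(response)
for row in summary:
    print(f"{row['workspace']}: {row['name']}")
```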
Operationally, Fabric also added a preview feature in the monitoring hub that centralizes failure notification management for scheduled items. The new Schedule failures page consolidates the configuration and maintenance of email failure notifications across workspaces, reducing the common drift where some pipelines notify the right owners and others silently fail. This is not a new scheduler, but it is a step toward treating alerting as a platform setting instead of a per-item afterthought. That lands neatly after last week's architecture guidance on designing maintainable lakehouse pipelines (idempotency, retries, observability): you can make the pipeline logic robust, but you still need consistent operational ownership when something breaks at 2am.
On the cost and capacity side, Fabric moved a set of tools to general availability: updates to the capacity metrics app now include a Capacity health page plus timepoint summary and timepoint detail views, which are designed to make throttling and capacity pressure easier to spot and explain at a specific point in time. The Fabric Chargeback app is also now generally available, giving teams a supported path to allocate capacity costs across workspaces and workloads, which pairs naturally with the push toward more parallel Spark usage and better monitoring. In other words, as the platform encourages more automation and concurrency, it is also giving admins and platform teams better levers to answer the inevitable follow-up questions: “what caused the spike?” and “who pays for it?”
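The "who pays for it?" question often reduces to proportional allocation. The sketch below splits a capacity bill across workspaces by their share of usage. It is an illustrative calculation only, not how the Fabric Chargeback app computes its figures. The workspace names, usage units, and cost are made up.

```python
def allocate_cost(total_cost, usage_by_workspace):
    """Split total_cost across workspaces in proportion to their usage."""
    total_usage = sum(usage_by_workspace.values())
    if total_usage == 0:
        raise ValueError("no usage recorded; nothing to allocate against")
    return {
        workspace: round(total_cost * usage / total_usage, 2)
        for workspace, usage in usage_by_workspace.items()
    }


# Hypothetical monthly figures: usage in arbitrary consumption units.
usage = {"Sales": 600, "ML": 300, "Finance": 100}
shares = allocate_cost(1000.0, usage)
print(shares)
```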
Finally, Fabric Data Warehouse got a schema-evolution quality-of-life improvement in preview: T-SQL ALTER TABLE ... ALTER COLUMN support for metadata-only schema changes. For many teams, schema evolution is where pipelines become brittle because type changes trigger rebuilds or force complicated migration playbooks. The pitch here is fewer disruptive rebuilds and fewer downstream breaks when the change can be handled as a metadata update, aligning with Delta Lake patterns like type widening. That also echoes last week's medallion framework guidance on schema evolution and rerun safety: you can design layers well, but you still need the warehouse and lakehouse tooling to make everyday changes survivable for downstream transformations and ML feature tables.
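A pre-flight check like the one below is one way a migration script might decide whether a type change is a pure widening before issuing the statement. This is a sketch under assumptions: the text above does not say which changes Fabric Data Warehouse handles as metadata-only, so `INTEGER_WIDTH_ORDER` covers only the T-SQL integer family as an example, and the table and column names are hypothetical.

```python
# Assumed widening order for T-SQL integer types, narrowest first.
# Whether a given change is metadata-only in Fabric is NOT confirmed here.
INTEGER_WIDTH_ORDER = ["tinyint", "smallint", "int", "bigint"]


def is_integer_widening(old_type, new_type):
    """True if both types are integers and new_type is strictly wider."""
    old, new = old_type.lower(), new_type.lower()
    if old not in INTEGER_WIDTH_ORDER or new not in INTEGER_WIDTH_ORDER:
        return False
    return INTEGER_WIDTH_ORDER.index(new) > INTEGER_WIDTH_ORDER.index(old)


def alter_column_sql(table, column, new_type):
    """Build the ALTER TABLE ... ALTER COLUMN statement as a string."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} {new_type};"


if is_integer_widening("int", "bigint"):
    statement = alter_column_sql("dbo.Sales", "quantity", "bigint")
    print(statement)
```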
- High Concurrency Support for the Fabric Livy API - Scalable Spark Automation (Preview)
- Discover items across workspaces with the OneLake Catalog Search API, MCP and CLI tools (Preview)
- Manage failure notifications from the monitoring hub in Fabric (Preview)
- Providing more insights & tools: Capacity health, timepoint summary, timepoint detail, chargeback now generally available
- Simplify Schema Changes in Fabric Data Warehouse with ALTER COLUMN (Preview)
Microsoft Fabric Dataflow Gen2 and Direct Lake: Less rework in data prep, fewer surprises in semantic performance
Dataflow Gen2 picked up a preview feature that targets a very specific kind of waste: repeated Power Query work copied across pipelines. “My queries” lets you save Power Query (M) queries into a personal library, then import them into other dataflows when you need the same cleaning logic again. That changes the default from copy-paste reuse (which quietly forks logic) to an explicit reuse path, which should help teams standardize transformations like date handling, normalization, and data quality fixes without maintaining separate “template” PBIX files or shared snippets in wikis. It fits with last week's Fabric data engineering push (nested folder-aware lake transformations and dbt orchestration): the common goal is to make the “last mile” of dataset shaping more maintainable, whether that logic lives in lake transformations, dbt models, or Power Query steps.
On the consumption side, guidance for Direct Lake on SQL with Fabric Data Warehouse focused on what drives real performance when semantic models page Delta Parquet data into memory. The practical takeaway is that schema design and cardinality directly affect whether Direct Lake behaves like you expect, and the article calls out how and why models fall back to DirectQuery. That matters because fallback often shows up as “it was fast yesterday, why is it slow today” once a column's cardinality grows or a model change increases memory pressure. The best practices here center on designing for VertiPaq-friendly shapes, watching the common causes of fallback, and using those signals to decide whether to remodel, reduce cardinality, or accept DirectQuery for specific queries. Read next to last week's medallion decision guide and real-time ingestion pattern (SQL change events into Fabric), this is a useful reminder that upstream choices (schema evolution, incremental feeds, partitioning) quickly turn into downstream semantic performance issues if teams do not manage cardinality and model shape intentionally.
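Since cardinality is the signal to watch, a simple profile of distinct values per column can flag candidates for remodeling. The sketch below is a minimal version under assumptions: the sample rows, the `max_ratio` threshold, and the function names are hypothetical, and the threshold is not an actual Direct Lake fallback limit.

```python
def cardinality_report(rows):
    """Count distinct values per column for a list of dict rows."""
    if not rows:
        return {}
    return {col: len({row[col] for row in rows}) for col in rows[0]}


def high_cardinality(rows, max_ratio):
    """Columns whose distinct-to-row ratio exceeds max_ratio."""
    report = cardinality_report(rows)
    return sorted(
        col for col, distinct in report.items()
        if distinct / len(rows) > max_ratio
    )


# Hypothetical sample of a fact table.
rows = [
    {"order_id": 1, "region": "EU", "status": "open"},
    {"order_id": 2, "region": "EU", "status": "open"},
    {"order_id": 3, "region": "US", "status": "open"},
    {"order_id": 4, "region": "US", "status": "closed"},
]
report = cardinality_report(rows)
flagged = high_cardinality(rows, max_ratio=0.9)
print(report)
print("review these columns:", flagged)
```

Running a profile like this on a schedule makes cardinality growth visible before it shows up as a slow report.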
- From repetition to reuse: accelerate data prep with My queries in Dataflow Gen2
- Direct Lake on SQL with Fabric Data Warehouse
Azure Databricks: Disaster recovery planning and a workspace inventory you can query
Two Databricks posts landed on the operational end of machine learning platforms: keep the platform recoverable, and make it observable. A disaster recovery strategy write-up laid out a phased, customer-managed approach that forces the usual hard decisions into the open: what RTO/RPO targets are realistic, when active-active is worth the complexity, and when warm standby is the better trade. It also gets concrete about what must be replicated across regions, including Unity Catalog metadata and Delta data, and it frames DR as something you automate with infrastructure as code (IaC) and repeatable pipelines rather than a runbook you hope to never use. The inclusion of patterns like Delta Sharing and Deep Clone highlights a key detail: data and governance metadata need different replication tactics, and your DR plan needs to cover both. That parallels last week's Fabric guidance on planning secure streaming paths and maintainable lakehouse layers early: if you treat networking, governance metadata, and data layout as afterthoughts, they become the hardest things to retrofit when you need higher assurance (whether for private connectivity or for cross-region recovery).
The “single pane of glass” post tackled a different, but related, problem: once a Databricks workspace grows, it gets hard to answer basic questions about what exists, who owns it, what it costs, and how it is used. The proposed Discovery utility scans workspace assets (clusters, jobs, warehouses, Delta Live Tables, Unity Catalog objects, security configuration, billing, and utilization), writes the results into Unity Catalog Delta tables, and surfaces them through a Lakeview dashboard. That approach matters because it turns operational visibility into queryable data: you can audit configurations, track utilization patterns, and build internal controls without relying on screenshots or one-off admin scripts. In spirit, it is solving a similar problem to Fabric's new cross-workspace catalog search and centralized monitoring: at scale, governance and operations start with “can we reliably inventory what we run?”
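Once the inventory is data, audits become small queries. The sketch below runs two such checks over in-memory rows. It assumes a hypothetical row shape (`asset_type`, `name`, `owner`, `auto_terminate_min`) and is not the schema the Discovery utility actually writes. In practice the same filters would run as SQL against the Delta tables.

```python
# Hypothetical inventory rows; a real scan would read these from Delta tables.
inventory = [
    {"asset_type": "cluster", "name": "etl-nightly", "owner": "data-eng", "auto_terminate_min": 30},
    {"asset_type": "cluster", "name": "adhoc-big", "owner": None, "auto_terminate_min": 0},
    {"asset_type": "job", "name": "train-churn", "owner": "ml-team", "auto_terminate_min": None},
    {"asset_type": "job", "name": "legacy-export", "owner": None, "auto_terminate_min": None},
]


def unowned(rows):
    """Names of assets with no recorded owner."""
    return sorted(r["name"] for r in rows if not r["owner"])


def clusters_without_auto_termination(rows):
    """Clusters whose auto-termination is unset or zero (assumed to mean off)."""
    return sorted(
        r["name"] for r in rows
        if r["asset_type"] == "cluster" and not r["auto_terminate_min"]
    )


orphans = unowned(inventory)
always_on = clusters_without_auto_termination(inventory)
print("no owner:", orphans)
print("no auto-termination:", always_on)
```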
- Resilient by Design: Azure Databricks Disaster Recovery Strategy
- From Chaos to Clarity: Your Databricks Workspace on a Single Pane of Glass
Model behavior control: Context engineering, RAG, and when to fine-tune
A model-behavior primer this week pulled together the toolbox teams are actually using in production to make models respond the way they need. It starts with prompt-level controls (zero-shot, one-shot, few-shot examples, and system prompts) and then moves into context engineering, where you shape what the model sees and how it sees it, often by structuring inputs rather than just adding more text. Retrieval-augmented generation (RAG) and embeddings show up here as the practical bridge between static model behavior and dynamic, domain-specific answers, especially when you need responses grounded in internal documents.
When prompts and RAG are not enough, the guidance shifts to model adaptation: fine-tuning and LoRA (low-rank adaptation). The useful framing is that fine-tuning is not the first tool you reach for, but it can be the right one when you need consistent style, structured outputs, or domain behaviors that are hard to achieve through context alone. LoRA is called out as a lighter-weight alternative for adaptation, which is often relevant when you want targeted behavior changes without the cost and operational overhead of full fine-tunes. Read alongside last week's Fabric MLOps thread (MLflow tracking across workspaces and secure, repeatable promotion), the “behavior control” takeaway is that whichever lever you choose (RAG, fine-tune, LoRA), you still need the operational backbone to version inputs, track experiments, and promote changes safely, otherwise improvements in model behavior are hard to reproduce and govern.
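For readers who want to see the RAG step in miniature, the sketch below ranks passages by cosine similarity to a query vector and assembles a grounded prompt. The three-dimensional vectors are hand-written stand-ins for real embeddings, and the documents, function names, and prompt wording are all illustrative assumptions. No embedding model or LLM is called.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, docs, k=2):
    """Return the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Put retrieved passages ahead of the question as grounding context."""
    context = "\n".join(f"- {p['text']}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy corpus with made-up vectors standing in for real embeddings.
docs = [
    {"text": "Refunds are processed within 5 business days.", "vec": [0.9, 0.1, 0.0]},
    {"text": "The office is closed on public holidays.", "vec": [0.0, 0.2, 0.9]},
    {"text": "Refund requests need an order number.", "vec": [0.8, 0.3, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question

top = retrieve(query_vec, docs, k=2)
prompt = build_prompt("How long do refunds take?", top)
print(prompt)
```

The point of the structure is that the model only sees the two refund passages, which is the "better context, not bigger prompts" idea in its simplest form.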