Weekly Machine Learning Roundup: Lakehouse Pipelines and Governance

This week's ML-adjacent data engineering updates were less about model releases and more about tightening pipelines and developer surfaces. Fabric moved Spark and notebook capabilities closer to production usage, and Azure Databricks shared a concrete pattern for consolidating near-real-time ingestion, transformation, and governance into a single Lakeflow workflow.

Microsoft Fabric: Lakehouse pipelines, runtimes, and notebook automation

The general availability (GA) of Materialized Lake Views (MLVs) is Fabric's clearest move toward declarative lakehouse transforms that do not need hand-rolled Spark ETL plus separate orchestration. The GA release focuses on refresh behavior and manageability. Fabric expands incremental refresh support across common patterns (aggregations with GROUP BY, left outer joins, left semi joins, CTEs) and decides per run whether an incremental refresh or a full recompute is cheaper, based on change volume and estimated cost. With Change Data Feed enabled by default for new MLVs, incremental processing becomes the default rather than another setting to remember.

Operationally, multi-schedule support at the lakehouse level lets you define named schedules for subsets of views (for example, hourly for gold views and every six hours for lower-priority ones), with Fabric handling dependencies, parallelizing independent views, and centralizing errors. Overlapping triggers are skipped if a refresh is already running. “Replace” updates an MLV definition in place without a drop and recreate, preserving identity, metadata, and lineage and avoiding broken dependencies. Data quality constraints get fuller reporting across refresh history, including richer expression-based constraints for PySpark-authored MLVs (multi-column expressions, arithmetic and functions, session-scoped Python UDFs). PySpark authoring itself is still in preview, and PySpark-authored MLVs always run a full refresh for now.

Fabric Runtime 2.0 entered preview as a new baseline for Spark data engineering and data science workloads, and it is the kind of upgrade that usually requires retesting. It brings Spark 4.0, Delta 4.0, Azure Linux 3.0 (Mariner 3.0), Java 21, Scala 2.13, and Python 3.12, with Spark 4.1 / Delta 4.1 / Python 3.13 planned soon. Because you can enable the preview at the workspace or environment level, teams can stage the rollout: validate connector JARs against Scala 2.13, check Java 21 requirements, and confirm Python wheels and native dependencies before moving to production.

Fabric also added guidance to make Spark work more reproducible.
The Environments best-practices guidance splits workflows into Quick mode (fast publish, with installs at session start; good for iteration and testing overrides) and Full mode (a 3-6 minute publish for a validated snapshot, then a 1-3 minute session startup). A practical middle ground is Full mode plus a custom live pool, which gives reproducibility with ~5s session startup. The guidance also recommends using the Resources folder or inline installs only for early-stage or one-off work, then promoting validated dependencies into Full mode for scheduled jobs and shared production runs.

Notebook automation is also becoming a first-class integration surface now that the Fabric Notebook Public APIs are GA. Teams can manage notebooks via REST (create/update/get/list/delete) and execute them through the Fabric Job Scheduler API with parameters, session config, and explicit execution context (environment/lakehouse). Two details matter for CI/CD-style orchestration: service principal auth for unattended automation, and the Run Notebook API returning an exit value (set via notebook utilities) so external orchestrators can branch or gate on structured output, not just success/failure.

Together, the story is coherent: define transforms declaratively (MLVs), adopt a new Spark/Delta baseline when ready (Runtime 2.0), control dependencies (Environments), and orchestrate notebooks via APIs.
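
To make the exit-value point concrete, here is a minimal sketch of how an external orchestrator might gate a downstream step on structured notebook output. It does not call any Fabric API. It assumes the notebook sets its exit value to a JSON string, and the payload keys (status, rows_written), the GateDecision type, and the gate_on_exit_value function are all hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class GateDecision:
    proceed: bool  # True if downstream steps should run
    reason: str    # human-readable explanation for logs


def gate_on_exit_value(exit_value: str, min_rows: int = 1) -> GateDecision:
    """Decide whether to continue, based on a notebook's exit value.

    Assumes (hypothetically) a JSON object such as
    {"status": "ok", "rows_written": 120}.
    """
    try:
        payload = json.loads(exit_value)
    except (TypeError, json.JSONDecodeError):
        return GateDecision(False, "exit value is not valid JSON")
    if not isinstance(payload, dict):
        return GateDecision(False, "exit value is not a JSON object")
    if payload.get("status") != "ok":
        return GateDecision(False, f"notebook reported status {payload.get('status')!r}")
    rows = payload.get("rows_written", 0)
    if not isinstance(rows, int) or rows < min_rows:
        return GateDecision(False, f"only {rows} rows written, need {min_rows}")
    return GateDecision(True, "all checks passed")
```

An orchestrator could call gate_on_exit_value on the value returned by the run and skip a publish step whenever proceed is False, which gives finer control than branching on success/failure alone.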

Azure Databricks Lakeflow: near-real-time ingestion, SCD transformations, and governance in Delta Lake

Azure Databricks published a detailed walkthrough for collapsing “too many tools” into a Lakeflow-native pipeline, covering ingestion, transformation, orchestration, monitoring, lineage, and Unity Catalog access control. The architecture starts with two Bronze ingestion paths into Delta: application telemetry streamed via Lakeflow Connect “Zerobus Ingest” over gRPC, and SQL Server CDC ingested incrementally from an on-prem transaction log (the walkthrough assumes ExpressRoute connectivity).

For telemetry, it lists concrete prerequisites (Unity Catalog plus serverless), SQL for creating UC tables (for example, prod.bronze.telemetry_events), and service principal grants (GRANT USE CATALOG/SCHEMA plus GRANT MODIFY, SELECT). It shows how to derive the Zerobus endpoint from the workspace URL (<workspace-id>.zerobus.<region>.azuredatabricks.net) and Python code that uses databricks-zerobus-ingest-sdk to open a stream with client-credential auth, define JSON record types, ingest records, and close streams. Stated targets are sub-5s latency and up to ~100 MB/sec per connection, and records become queryable immediately via Unity Catalog.

For SQL Server CDC, the focus is correctness and incremental efficiency: TCP 1433 connectivity, enabling CDC with sys.sp_cdc_enable_db / sys.sp_cdc_enable_table, plus SQL permissions (CDC read) and Databricks privileges (CREATE CONNECTION at the metastore, plus USE CATALOG / CREATE TABLE on the destination). Setup then uses the Databricks UI (Data Ingestion → Add Data): configure an ingestion gateway and connection details, select tables, optionally enable SCD Type 2 history per table, map outputs to Bronze tables (orders_raw, customers_raw), and schedule runs (every 5 minutes in the example).
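
For the telemetry path, the endpoint derivation can be sketched in a few lines of Python. This is an illustration, not code from the walkthrough: it assumes the per-workspace URL has the shape adb-<workspace-id>.<n>.azuredatabricks.net and that the region is supplied separately, and the function names are hypothetical. Only the target pattern <workspace-id>.zerobus.<region>.azuredatabricks.net comes from the walkthrough.

```python
import re
from urllib.parse import urlparse

# Assumed per-workspace host shape: adb-<workspace-id>.<n>.azuredatabricks.net
_WORKSPACE_HOST = re.compile(r"^adb-(\d+)\.\d+\.azuredatabricks\.net$")


def workspace_id_from_url(workspace_url: str) -> str:
    """Extract the numeric workspace ID from a per-workspace URL."""
    host = urlparse(workspace_url).hostname or ""
    match = _WORKSPACE_HOST.match(host)
    if match is None:
        raise ValueError(f"unrecognized workspace URL: {workspace_url!r}")
    return match.group(1)


def zerobus_endpoint(workspace_url: str, region: str) -> str:
    """Build <workspace-id>.zerobus.<region>.azuredatabricks.net."""
    workspace_id = workspace_id_from_url(workspace_url)
    return f"{workspace_id}.zerobus.{region}.azuredatabricks.net"
```

Failing loudly on an unrecognized URL is deliberate here: a silently malformed endpoint would only surface later as a connection error.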
Transformations use Lakeflow Spark Declarative Pipelines and a medallion pattern with SQL-defined incremental processing. Silver tables are built with CREATE OR REFRESH STREAMING TABLE, and APPLY CHANGES INTO (with keys and SEQUENCE BY updated_at) produces both SCD Type 1 “latest state” tables and SCD Type 2 customer history. Telemetry gets data quality constraints with EXPECT ..., and violating rows are dropped. Gold uses CREATE OR REFRESH MATERIALIZED VIEW to join orders, customers, and telemetry and to aggregate metrics (including conditional sums such as purchase event counts). Continuous mode keeps the pipeline near real time, and because Unity Catalog registers everything, lineage flows from Gold back to Bronze automatically.

Governance details include granting analysts access only to Gold and applying PII masking via a UDF (for example, mask_email) that reveals full data only to privileged groups, enforced with ALTER TABLE ... ALTER COLUMN ... SET MASK.

Orchestration and monitoring use Lakeflow Jobs to chain dependencies (CDC ingestion, then transforms) with scheduling and notifications. For day-2 operations, the walkthrough shows how to query system tables such as system.lakeflow.job_run_timeline for runs, states, and durations.

Consumption examples stay inside Databricks (AI/BI Dashboards and Genie NL-to-SQL) and rely on Unity Catalog permissions, which keeps access control and lineage consistent for BI and ML feature preparation off the same Gold layer.
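
The difference between the SCD Type 1 and Type 2 outputs mentioned above is easier to see in a small model. The sketch below is plain Python, not Lakeflow code, and only illustrates the semantics of sequencing changes by updated_at. The Change type, the start_at/end_at field names, and the integer sequence values are assumptions made for the example, and deletes are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str         # business key, e.g. a customer ID
    updated_at: int  # sequencing value; an int stands in for a timestamp
    values: dict     # non-key columns carried by the change


def scd_type1(changes: list[Change]) -> dict:
    """Latest state per key: the change with the highest updated_at wins."""
    latest = {}
    for change in sorted(changes, key=lambda c: c.updated_at):
        latest[change.key] = change.values
    return latest


def scd_type2(changes: list[Change]) -> dict:
    """Full history per key: each version is valid from start_at until end_at."""
    history = {}
    for change in sorted(changes, key=lambda c: c.updated_at):
        versions = history.setdefault(change.key, [])
        if versions:
            # Close the previous version when a newer one arrives.
            versions[-1]["end_at"] = change.updated_at
        versions.append(
            {"values": change.values, "start_at": change.updated_at, "end_at": None}
        )
    return history
```

Sorting by the sequence value before applying changes is what makes late-arriving events land in the right order; in this model the current Type 2 row for a key is the one whose end_at is None.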

Other ML News

Fabric Eventhouse (Real-Time Intelligence) previewed a small but useful KQL workflow update: DB Explorer can browse stored functions, show definitions read-only, and run “Preview results” without manually writing the KQL call (including parameter formatting). Parameter prompts plus a 100-row preview cap make it a quick validation step when iterating on function libraries or reviewing inherited functions before using them in dashboards and reports.