Microsoft Fabric Blog authors detail how the Forecasting Service combines machine-learning forecasting with mathematical optimization to right-size Spark cluster provisioning, reducing startup latency and cloud cost for analytics workloads.

How Microsoft Fabric’s Forecasting Service Makes Spark Notebooks Instant

Overview

Microsoft Fabric’s unified analytics platform now delivers near-instant Spark startup times and improved cloud cost efficiency thanks to its Forecasting Service: an AI- and ML-backed system for proactive provisioning of compute resources. This service underpins scalable data science and engineering workloads across Microsoft Fabric’s global footprint.

The Challenge

Spark cluster startup delays (sometimes several minutes) hinder analytics workflows, delay business insights, and inflate cloud expenses through inefficient capacity management. Traditional static pools keep clusters idle (and expensive), while purely on-demand provisioning can break SLAs with slow spin-up.

The Fabric Forecasting Service Solution

To address these challenges, Microsoft Fabric introduced the Forecasting Service:

  • Hybrid ML + Optimization Pipeline: Predicts workload patterns and dynamically tunes starter pool size using time-series forecasting and linear programming.
  • Starter Pool Rehydration: Keeps a fleet of ready-to-use Spark clusters/sessions so most user requests get an instant start. When a pooled session is consumed, a replacement is provisioned immediately (see the sketch after this list).
  • Adaptive Scaling: Continuously adjusts pool size based on demand telemetry to balance fast access and cost. Algorithms explicitly trade off cluster idle time against user wait times using a cost-aware linear program.
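
The replacement behavior can be pictured as a simple control loop. The sketch below is illustrative only: list_idle_sessions and provision_session are hypothetical stand-ins for Fabric-internal APIs, and in the real service the target size comes from the Forecasting Service’s recommendations rather than a fixed argument.

```python
import time

def rehydrate_pool(target_size, list_idle_sessions, provision_session,
                   poll_interval_s=15):
    """Keep the warm-session pool at the recommended size (illustrative)."""
    while True:
        # How many warm sessions are missing relative to the recommendation?
        deficit = target_size - len(list_idle_sessions())
        for _ in range(max(deficit, 0)):
            # Replace each consumed session so the next request gets a warm start.
            provision_session()
        time.sleep(poll_interval_s)
```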

Technical Architecture

  • ML Predictor: Uses Azure Data Explorer telemetry to forecast short-term demand via Singular Spectrum Analysis (SSA) combined with a neural network (SSA+), outputting demand estimates (a simplified SSA forecast is sketched after this list).
  • SAA Optimizer: Determines optimal pool size in real time, minimizing total cost and latency.
  • Forecasting Worker: Runs inference pipelines, stores recommendations in Azure Cosmos DB.
  • Pool Worker: Executes the recommendations, creating or deleting Spark sessions to keep the pool at its target size by interfacing with Fabric’s Big Data Infrastructure Platform.
  • Telemetry Dashboard: Offers real-time metrics on pool hits/misses, COGS, and startup latencies.
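
The predictor’s implementation is not published, but classic SSA is simple enough to sketch. The NumPy function below shows a basic recurrent SSA forecast; it is a simplified stand-in, since the production SSA+ model also layers a neural network on top, and the window/rank parameters here are illustrative.

```python
import numpy as np

def ssa_forecast(series, window, rank, horizon):
    """Basic recurrent Singular Spectrum Analysis (SSA) forecast (sketch)."""
    x = np.asarray(series, dtype=float)
    n, L = len(x), window
    K = n - L + 1

    # 1. Embed the series into an L x K trajectory (Hankel) matrix.
    X = np.column_stack([x[i:i + L] for i in range(K)])

    # 2. Decompose and keep the leading `rank` components.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ur = U[:, :rank]
    Xr = Ur @ np.diag(s[:rank]) @ Vt[:rank]

    # 3. Reconstruct the smoothed series by anti-diagonal averaging.
    recon, counts = np.zeros(n), np.zeros(n)
    for j in range(K):
        recon[j:j + L] += Xr[:, j]
        counts[j:j + L] += 1
    recon /= counts

    # 4. Recurrent forecasting: the next value is a linear combination
    #    of the last L-1 reconstructed values.
    pi = Ur[-1, :]                        # last coordinates of leading eigenvectors
    R = (Ur[:-1, :] @ pi) / (1.0 - pi @ pi)

    out = list(recon)
    for _ in range(horizon):
        out.append(float(np.dot(R, out[-(L - 1):])))
    return np.array(out[n:])
```

For example, feeding a few days of 15-minute session counts with window=96 and rank=6 yields a short-horizon demand estimate that the optimizer can consume.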

Key Innovations

  • Hybrid Time-Series Forecasting (SSA+): Combines SSA with a neural network for accuracy and fast response to demand spikes.
  • Optimization Engine (SAA): Balances idle-capacity cost against user wait cost using linear programming (a schematic formulation is sketched after this list).
  • Adaptive Hyperparameter Tuning: Maintains SLAs by automatically adjusting provisioning strategies.
  • End-to-End Automation: Seamless integration with Microsoft Fabric’s big data orchestration services.
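
The exact cost model and solver behind the optimizer are internal to Fabric, but the idle-versus-wait trade-off can be written as a small sample-average linear program. The SciPy sketch below is a schematic formulation under assumed per-unit idle and wait costs; the function name, scenario inputs, and round-up rule are illustrative rather than the service’s actual logic.

```python
import numpy as np
from scipy.optimize import linprog

def recommend_pool_sizes(demand_samples, idle_cost, wait_cost, pool_max):
    """Choose a pool size per time slot from demand scenarios (sketch).

    demand_samples: array of shape (S, T) -- S forecast scenarios over T slots.
    Minimizes average idle cost plus wait (pool-miss) cost across scenarios.
    """
    S, T = demand_samples.shape
    n = T + 2 * S * T  # variables: pool sizes, idle slack, miss slack

    # Pay only for idle capacity and for demand that misses the warm pool.
    c = np.concatenate([np.zeros(T),
                        np.full(S * T, idle_cost / S),
                        np.full(S * T, wait_cost / S)])

    A, b = [], []
    for s in range(S):
        for t in range(T):
            idle_ix = T + s * T + t
            miss_ix = T + S * T + s * T + t
            # idle_{t,s} >= p_t - d_{t,s}
            row = np.zeros(n); row[t] = 1.0; row[idle_ix] = -1.0
            A.append(row); b.append(demand_samples[s, t])
            # miss_{t,s} >= d_{t,s} - p_t
            row = np.zeros(n); row[t] = -1.0; row[miss_ix] = -1.0
            A.append(row); b.append(-demand_samples[s, t])

    bounds = [(0, pool_max)] * T + [(0, None)] * (2 * S * T)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=bounds, method="highs")
    return np.ceil(res.x[:T]).astype(int)  # round up to whole warm sessions
```

Feeding scenarios sampled from the demand forecast into this program yields a per-slot pool-size recommendation; rounding up biases the result toward availability over cost.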

Impact & Results

Since its rollout in November 2023 across all Microsoft Fabric regions, the Forecasting Service has:

  • Significantly reduced idle compute resources compared to static pooling
  • Delivered consistent low-latency startup times (<10 seconds in most cases)
  • Lowered operational costs by minimizing waste while scaling to production data science workloads
  • Provided resilience during demand spikes (e.g., market opening/closing periods)

User Experience

  • Fast cluster startups: Default configurations launch notebooks in seconds thanks to warm clusters from the starter pool.
  • Cold starts: Custom libraries, network isolation, or an exhausted pool result in longer setup times (typically 2–5 minutes).
  • Usage monitoring: Telemetry continuously feeds the predictive engine.

Authors

  • Kunal Parekh, Senior Product Manager, Azure Data
  • Yiwen Zhu, Principal Researcher, Azure Data, Microsoft Research
  • Subru Krishnan, Principal Architect, Azure Data
  • Aditya Lakra, Software Engineer, Azure Data
  • Harsha Nagulapalli, Principal Engineering Manager, Azure Data
  • Sumeet Khushalani, Principal Engineering Manager, Azure Data
  • Arijit Tarafdar, Principal Group Engineering Manager, Azure Data

This post appeared first on “Microsoft Fabric Blog”. Read the entire article here