Microsoft Fabric Blog authors detail how the Forecasting Service combines machine-learning forecasting with mathematical optimization to right-size Spark cluster provisioning, reducing startup latency and cloud cost for analytics workloads.

How Microsoft Fabric’s Forecasting Service Makes Spark Notebooks Instant

Overview

Microsoft Fabric’s unified analytics platform now delivers near-instant Spark startup times and improved cloud cost efficiency thanks to its Forecasting Service: an AI- and ML-backed system for proactive provisioning of compute resources. This service underpins scalable data science and engineering workloads across Microsoft Fabric’s global footprint.

The Challenge

Spark cluster startup delays (sometimes several minutes) hinder analytics workflows, delay business insights, and inflate cloud expenses through inefficient capacity management. Traditional static pools keep clusters idle (and expensive), while purely on-demand provisioning can break SLAs with slow spin-up.

The Fabric Forecasting Service Solution

To address these challenges, Microsoft Fabric introduced the Forecasting Service:

  • Hybrid ML + Optimization Pipeline: Predicts workload patterns and dynamically tunes starter pool size using time-series forecasting and linear programming.
  • Starter Pool Rehydration: Keeps a fleet of ready-to-use Spark clusters/sessions so most user requests get an instant start. When a pooled session is consumed, a replacement is provisioned immediately (see the sketch after this list).
  • Adaptive Scaling: Continuously adjusts pool size based on demand telemetry to balance fast access and cost. Algorithms explicitly trade off cluster idle time against user wait times using a cost-aware linear program.
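
The replacement behavior can be pictured as a simple control loop. The sketch below is illustrative only: list_idle_sessions and provision_session are hypothetical stand-ins for Fabric-internal APIs, and in the real service the target size comes from the Forecasting Service’s recommendations rather than a fixed argument.

```python
import time

def rehydrate_pool(target_size, list_idle_sessions, provision_session,
                   poll_interval_s=15):
    """Keep the warm-session pool at the recommended size (illustrative)."""
    while True:
        # How many warm sessions are missing relative to the recommendation?
        deficit = target_size - len(list_idle_sessions())
        for _ in range(max(deficit, 0)):
            # Replace each consumed session so the next request gets a warm start.
            provision_session()
        time.sleep(poll_interval_s)
```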

Technical Architecture

  • ML Predictor: Uses Azure Data Explorer telemetry to forecast short-term demand via Singular Spectrum Analysis (SSA) combined with a neural network (SSA+), outputting demand estimates (a simplified SSA forecast is sketched after this list).
  • SAA Optimizer: Determines optimal pool size in real time, minimizing total cost and latency.
  • Forecasting Worker: Runs inference pipelines, stores recommendations in Azure Cosmos DB.
  • Pool Worker: Executes the recommendations, creating or deleting Spark sessions to keep the pool at its target size by interfacing with Fabric’s Big Data Infrastructure Platform.
  • Telemetry Dashboard: Offers real-time metrics on pool hits/misses, COGS, and startup latencies.
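
The predictor’s implementation is not published, but classic SSA is simple enough to sketch. The NumPy function below shows a basic recurrent SSA forecast; it is a simplified stand-in, since the production SSA+ model also layers a neural network on top, and the window/rank parameters here are illustrative.

```python
import numpy as np

def ssa_forecast(series, window, rank, horizon):
    """Basic recurrent Singular Spectrum Analysis (SSA) forecast (sketch)."""
    x = np.asarray(series, dtype=float)
    n, L = len(x), window
    K = n - L + 1

    # 1. Embed the series into an L x K trajectory (Hankel) matrix.
    X = np.column_stack([x[i:i + L] for i in range(K)])

    # 2. Decompose and keep the leading `rank` components.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ur = U[:, :rank]
    Xr = Ur @ np.diag(s[:rank]) @ Vt[:rank]

    # 3. Reconstruct the smoothed series by anti-diagonal averaging.
    recon, counts = np.zeros(n), np.zeros(n)
    for j in range(K):
        recon[j:j + L] += Xr[:, j]
        counts[j:j + L] += 1
    recon /= counts

    # 4. Recurrent forecasting: the next value is a linear combination
    #    of the last L-1 reconstructed values.
    pi = Ur[-1, :]                        # last coordinates of leading eigenvectors
    R = (Ur[:-1, :] @ pi) / (1.0 - pi @ pi)

    out = list(recon)
    for _ in range(horizon):
        out.append(float(np.dot(R, out[-(L - 1):])))
    return np.array(out[n:])
```

For example, feeding a few days of 15-minute session counts with window=96 and rank=6 yields a short-horizon demand estimate that the optimizer can consume.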

Key Innovations

  • Hybrid Time-Series Forecasting (SSA+): Combines SSA with a neural network for accuracy and fast response to demand spikes.
  • Optimization Engine (SAA): Balances idle-capacity cost against user wait cost using linear programming (a schematic formulation is sketched after this list).
  • Adaptive Hyperparameter Tuning: Maintains SLAs by automatically adjusting provisioning strategies.
  • End-to-End Automation: Seamless integration with Microsoft Fabric’s big data orchestration services.
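
The exact cost model and solver behind the optimizer are internal to Fabric, but the idle-versus-wait trade-off can be written as a small sample-average linear program. The SciPy sketch below is a schematic formulation under assumed per-unit idle and wait costs; the function name, scenario inputs, and round-up rule are illustrative rather than the service’s actual logic.

```python
import numpy as np
from scipy.optimize import linprog

def recommend_pool_sizes(demand_samples, idle_cost, wait_cost, pool_max):
    """Choose a pool size per time slot from demand scenarios (sketch).

    demand_samples: array of shape (S, T) -- S forecast scenarios over T slots.
    Minimizes average idle cost plus wait (pool-miss) cost across scenarios.
    """
    S, T = demand_samples.shape
    n = T + 2 * S * T  # variables: pool sizes, idle slack, miss slack

    # Pay only for idle capacity and for demand that misses the warm pool.
    c = np.concatenate([np.zeros(T),
                        np.full(S * T, idle_cost / S),
                        np.full(S * T, wait_cost / S)])

    A, b = [], []
    for s in range(S):
        for t in range(T):
            idle_ix = T + s * T + t
            miss_ix = T + S * T + s * T + t
            # idle_{t,s} >= p_t - d_{t,s}
            row = np.zeros(n); row[t] = 1.0; row[idle_ix] = -1.0
            A.append(row); b.append(demand_samples[s, t])
            # miss_{t,s} >= d_{t,s} - p_t
            row = np.zeros(n); row[t] = -1.0; row[miss_ix] = -1.0
            A.append(row); b.append(-demand_samples[s, t])

    bounds = [(0, pool_max)] * T + [(0, None)] * (2 * S * T)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=bounds, method="highs")
    return np.ceil(res.x[:T]).astype(int)  # round up to whole warm sessions
```

Feeding scenarios sampled from the demand forecast into this program yields a per-slot pool-size recommendation; rounding up biases the result toward availability over cost.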

Impact & Results

Since its rollout in November 2023 across all Microsoft Fabric regions, the Forecasting Service has:

  • Significantly reduced idle compute resources compared to static pooling
  • Delivered consistent low-latency startup times (<10 seconds in most cases)
  • Lowered operational costs by minimizing waste while scaling to production data science workloads
  • Provided resilience during demand spikes (e.g., market opening/closing periods)

User Experience

  • Fast cluster startups: Default configurations launch notebooks in seconds thanks to warm clusters from the starter pool.
  • Cold starts: Custom libraries, network isolation, or an exhausted pool result in longer setup times (typically 2–5 minutes).
  • Usage monitoring: Telemetry continuously feeds the predictive engine.

Authors

  • Kunal Parekh, Senior Product Manager, Azure Data
  • Yiwen Zhu, Principal Researcher, Azure Data, Microsoft Research
  • Subru Krishnan, Principal Architect, Azure Data
  • Aditya Lakra, Software Engineer, Azure Data
  • Harsha Nagulapalli, Principal Engineering Manager, Azure Data
  • Sumeet Khushalani, Principal Engineering Manager, Azure Data
  • Arijit Tarafdar, Principal Group Engineering Manager, Azure Data

This post appeared first on “Microsoft Fabric Blog”. Read the entire article here