Browse Machine Learning Community (41)

BhaktiRath95 walks through common failure modes when running AI/ML inference workloads on Azure Container Apps, including slow model startup, probe timeouts, OOM kills, and GPU initialization problems. The post provides concrete probe settings, Python/FastAPI patterns, and Log Analytics queries to diagnose and fix issues methodically.
Rafia Aqil explains how to diagnose and respond when Azure Databricks clusters can’t start or scale due to Azure regional VM capacity constraints, including what to send to Microsoft support, which VM families to switch to, and longer-term design choices like instance pools, serverless compute, and multi-region deployments.
Rafia_Aqil outlines a reference architecture for ingesting both streaming and batch data through Microsoft Fabric into Azure Databricks, using OneLake/ADLS and a medallion (Bronze/Silver/Gold) layout. The post breaks down five Fabric-to-Databricks integration paths and calls out security, governance, and monitoring considerations.
Sally Dabbah explains how to turn Synapse/ADF/Microsoft Fabric pipeline failures into structured, queryable telemetry by sending standardized failure events into Azure Monitor Log Analytics via the Logs Ingestion API and a Data Collection Rule, enabling KQL-based analysis, alerting, and reliability reporting across environments and datasets.
Ram Kakani explains how Oracle Managed Database MCP (Model Context Protocol) remote servers can be used from Microsoft Foundry to build enterprise AI agents that query Oracle AI Database@Azure, including local VS Code workflows, self-hosted Azure deployments, and a fully managed OCI option with identity, networking, and governance controls.
bobmital introduces Anyscale on Azure, an Azure-native way to run the Ray distributed runtime on AKS so teams can unify data prep, training, tuning, and serving in one system. The post focuses on architecture (split control/data plane), GPU utilization and scheduling features, and Azure-native identity, networking, and governance.
Yves-Pitsch announces the general availability of Microsoft Planetary Computer Pro, an Azure-native enterprise geospatial data platform designed to operationalize geospatial analytics and GeoAI workflows, with deeper integration into Microsoft Fabric and Microsoft AI Foundry and new developer capabilities like MCP server support.
coryskimming summarizes the Azure Kubernetes Service (AKS) announcements from Microsoft Build 2026, focusing on running AI training and inference at scale. It covers new options for cluster operations, bare-metal performance, fleet management across Arc-enabled clusters, and Kubernetes-native model serving with tools like KAITO and AI Runway.
Brendan Burns announces the public preview of Anyscale on Azure, a managed Ray platform that runs on Azure Kubernetes Service (AKS). The post focuses on scaling distributed AI training and inference across regions, simplifying operations via Azure-native provisioning and billing, and using Microsoft Entra workload identity for governance.
Jason Pereira introduces two Azure Databricks public preview capabilities that connect Microsoft Copilot Studio and GitHub Copilot to Databricks: a workspace-wide Genie MCP endpoint for building workspace-aware agents, and Lakebase branching for debugging agent issues against real data without touching production.
yuvmaz breaks down the MegaTrain paper’s approach to training 100B+ parameter LLMs on a single GPU by treating GPU memory as a cache and streaming layers from host memory/NVMe. The post connects the technique to Azure NC-series VM choices, storage throughput, PCIe constraints, and cost/performance trade-offs.
vinilv explains how to run a fast, user-space “preflight” on Azure HPC GPU clusters to catch common distributed training failures early. The post introduces ai-cluster-validator and walks through validating Slurm topology, PyTorch DDP initialization, GPU affinity, and NCCL collectives, with actionable logs and telemetry for ops teams.
himachauhan explains why RAG systems that work well at 1,000 documents often degrade at hundreds of thousands or millions, and outlines practical architecture shifts—like semantic chunking, hierarchical indexing, hybrid retrieval, and precomputed embeddings—to keep retrieval quality, latency, and cost predictable at scale.
kinfey explains how AI Runway turns LLM deployment into a Kubernetes-native platform capability, using a unified ModelDeployment CRD to run and operate inference reliably across multiple engines and providers. The post walks through local CPU setups and an AKS-based path to production, with practical guidance on cost, observability, and governance.

What's new in FinOps toolkit 14 – April 2026

Michael Flanakin summarizes FinOps toolkit 14, including a Copilot Studio agent template for querying FinOps hub data with KQL, a new recommendations pipeline that ingests Azure Advisor and Resource Graph results, a simplified hub deployment UI, and a preview dataset for commitment discount eligibility.
FaizaanMerchant explains a Zero Trust network design for Azure Databricks that avoids public workspace exposure by fronting external access with Azure Application Gateway WAF and routing traffic to the workspace through Private Endpoints, while keeping internal access on private connectivity (VPN/ExpressRoute).
robece announces General Availability of Stripe as a partner event source for Azure Event Grid, and outlines how to route Stripe events into Azure services (Functions, Logic Apps, Event Hubs, Service Bus) and Microsoft Fabric Eventstream for real-time processing and analytics.
mscagliola shows how to use GitHub Copilot skills for spec-driven development, turning a Medallion Architecture blog post into a repeatable repo that generates Terraform for Azure platform setup and Databricks bundle files for workloads, while enforcing strict placeholder/TODO rules to avoid invented environment values.
pauledwards explains how to cut “model weight pre-flight” time on multi-node Azure GPU clusters by sharding downloads from Azure storage and broadcasting the remaining data over InfiniBand using MPI, with practical launch patterns for both Slurm and AKS.
KonstantinaF outlines a practical, phased disaster recovery strategy for Azure Databricks, focused on cross-region resilience for lakehouse workloads. The post explains RTO/RPO trade-offs, compares active-active vs warm standby patterns, and details how to replicate Unity Catalog metadata and Delta data using IaC, CI/CD, and repeatable DR pipelines.
Amit Damle and RK Iyer describe a “Discovery” utility for Azure Databricks that inventories workspace assets into Unity Catalog-backed Delta tables and a Lakeview dashboard, helping platform teams quickly understand clusters, jobs, warehouses, pipelines, security settings, and DBU usage.
sameeraman explains how Microsoft Discovery can automate a scientific simulation workflow using a coordinated set of AI agents, reducing manual scripting and job monitoring while keeping scientific decision-making with researchers.
PrabalDeb lays out a practical reference architecture for running diffusion model workloads on Azure Kubernetes Service (AKS), focusing on GPU/CPU lane separation, dispatch and autoscaling options (Kubernetes-native vs Service Bus + KEDA), secure ingress and identity, durable storage for outputs and model caches, and end-to-end observability for both apps and GPU hardware.
Parvathy_R_Pillai compares traditional ML pipelines with Azure AI Foundry, focusing on the shift from model-centric delivery to operating end-to-end AI applications (including agents) with built-in governance, evaluation, and observability for production use.
kmalkov shares a real-world fintech lending ML decisioning workload evaluated using Microsoft’s Analog Optical Computer (AOC) digital twin on Azure, focusing on production-scale volumes, weighted ensemble models, and end-to-end explainability and auditability for credit, affordability, and risk decisions.
PeterTHLee shares a validated Azure reference architecture for drone-based industrial inspections that combines deterministic computer vision with Azure OpenAI reasoning. The post breaks down an event-driven pipeline (Blob Storage → Functions → Vision/AML → OpenAI → Foundry evaluation → Cosmos DB → Power BI) and calls out security controls needed for production use.
Subhajit1994 breaks down the real design choices behind a Bronze/Silver/Gold medallion framework, focusing on where responsibilities should live (staging, cleaning, modeling, marts), and how to make decisions around load patterns, orchestration, retries, observability, schema evolution, and replayability.
ankitasarkar explains why a pure RAG approach can produce inconsistent or logically wrong matches in enterprise document mapping, and how adding a knowledge-graph layer to constrain retrieval improves consistency, relevance, and explainability.
GalimahB shares a Microsoft Build //local host kit overview, listing breakout sessions and hands-on labs you can run in your city—covering GitHub Copilot agentic workflows, Microsoft Foundry (agents, models, evals), and Azure topics like Container Apps, AKS, databases, and Cobalt VMs.
Moaz_Mirza outlines a reference architecture for “agentic” data governance across hybrid/multi-cloud estates using Azure Arc, Microsoft Purview, and Microsoft Fabric, with a Copilot-style agent (via Power Platform/Teams) that reports on compliance and can enforce selected controls through Azure Functions and policy-driven actions.
NaufalPrawironegoro walks through setting up Microsoft Fabric Operations Agent end-to-end: capacity and Eventhouse prerequisites, enabling the preview in the Admin Portal, wiring a KQL database as a knowledge source, and triggering Power Automate actions via Teams when conditions (like failed pipeline runs) are detected.
In this community deep dive, junjieli walks through the GA release of Microsoft Foundry Toolkit for Visual Studio Code—covering model experimentation, agent development (no-code and code-first), evaluations, deployment to Microsoft Foundry Agent Service, and workflows for converting, profiling, and fine-tuning local models on Windows.
Gapandey lays out a practical, end-to-end MLOps template on Azure: train a scikit-learn model from data in Azure Blob Storage, package it as a self-contained pickle bundle, register it in an Azure ML Registry with auto-versioning, and deploy it to an Azure ML Managed Online Endpoint via an Azure DevOps multi-stage pipeline.
AnjaliSadhukhan argues that AI agents fail on enterprise questions mainly due to fragmented data and missing semantics, and outlines how Microsoft Fabric (OneLake, semantic models, Data Agents) and Azure AI Foundry can work together to provide governed, agent-ready access to business data.
ShivaniThadiyan explains how Azure SQL Managed Instance is evolving from a SQL Server-compatible PaaS into an AI-enabled platform, covering built-in operational intelligence, vector search, in-database Python/R machine learning, and Copilot-assisted diagnostics with security and governance considerations.
Vaibhav Pandey shares a production-oriented “Bring Your Own Model” (BYOM) pattern for Azure AI applications, showing how to package, register, and deploy a custom model on Azure Machine Learning with secure identity, networking, and scalable managed endpoints.
In this post, robece explains how to route Stripe events into Azure Event Grid to build scalable, real-time payment workflows, and how to extend those streams into Microsoft Fabric Real-Time Intelligence for live analytics.
ashish-chhabria argues that Azure Event Hubs is the practical default for Kafka-style streaming on Azure, focusing on its Kafka-compatible endpoint, managed scaling, tier capabilities (Standard/Premium/Dedicated), and integrations like Capture to Azure Data Lake Storage and streaming into Microsoft Fabric for real-time analytics.
Connected-Seth shares March 2026 updates for Azure Event Grid MQTT Broker, covering protocol support (MQTT v3.1.1/v5, HTTP publish), security options (Entra ID/OAuth JWT, X.509, webhook auth, TLS 1.2+), scaling characteristics, and native routing into Azure services like Fabric Eventstreams, Azure Data Explorer, Event Hubs, Functions, and Logic Apps.
AnaviNahar walks through a near-real-time ingestion and transformation setup on Azure Databricks using Lakeflow (Connect, Spark Declarative Pipelines, and Jobs), covering CDC from SQL Server, streaming telemetry ingestion, Bronze/Silver/Gold modeling, Unity Catalog governance, and monitoring via system tables.
