Browse Machine Learning Community (35)

Enabling the Compliance Security Profile (CSP) for HIPAA on Azure Databricks

Today by Rafia Aqil

Rafia Aqil outlines how to enable Azure Databricks’ Compliance Security Profile (CSP) for HIPAA workloads, including the September 1, 2026 deadline, required prerequisites like Azure VNet encryption and supported VM series, and a rollout approach to validate cluster startup and end-to-end connectivity before production.

Introducing Physical-World Intelligence: How GeoAI Is Helping Expand Enterprise AI

Yesterday by Nora Zhan

Nora Zhan introduces “Physical-World Intelligence” and explains how GeoAI combines geospatial data (weather, satellite imagery, sensors, and maps) with enterprise context and AI. The article outlines Microsoft Foundry’s geospatial model catalog, Planetary Computer Pro as a GeoAI data plane, and a GeoAI SDK for production-scale inference workflows.

Announcing the Open-Source Release of ML Video Codec (MLVC)

5 days ago by Naba Kumar

Naba Kumar and the MLVC team announce the open-source release of MLVC, a learned video codec that replaces traditional codec primitives with end-to-end neural compression. The post shares bitrate savings versus H.264/H.265, real-time performance targets on commodity NPUs, and what’s included in the GitHub release (models, weights, training scripts, and conversion tooling).

Meet the IQ's: How Microsoft is Creating Context-Aware AI

1 week ago by Rafia Aqil

Rafia Aqil explains Microsoft’s IQ Platform (Work IQ, Fabric IQ, and Foundry IQ) and how it adds business and organizational context to AI systems. The post breaks down Fabric’s OneLake-based data layer, ontology-driven meaning, and Foundry IQ’s managed services for RAG, memory, ranking, and citations.

From inception to Blueprint: Introducing the Oracle AI Database@Azure AI adoption playbook

Jun 25, 2026 by RajyaLaxmiYellajosyula

RajyaLaxmiYellajosyula announces the Oracle AI Database@Azure AI adoption playbook and outlines the main blueprint patterns for building AI experiences on Oracle data using Microsoft services, with a strong emphasis on security, governance, and regulated-industry requirements.

Join our free livestream series on using Microsoft IQ with Python

Jun 25, 2026 by Pamela Fox

Pamela Fox announces a free 3-part livestream series that teaches developers how to use Microsoft IQ (Foundry IQ, Work IQ, and Fabric IQ) from Python to ground AI agents in organizational knowledge, workplace context, and structured data, with runnable code shared in an open-source repo.

Inside Llama 3.1 405B MLPerf Training on Azure: System-Level Insights at 8K+ GPU Scale

Jun 24, 2026 by Shantanu Patankar and Azin Heidarshenas

Shantanu Patankar and Azin Heidarshenas break down Azure’s MLPerf Training v6.0 run for Llama 3.1 405B, sharing what they learned scaling pretraining to 8,192 NVIDIA GB200 GPUs on Fairwater—where the time goes per step, why topology-aware parallelism mapping matters, and what actually limits scaling efficiency at extreme scale.

Building an Azure architecture that’s ready for every signature

Jun 23, 2026 by Phil Vetter, Lee Jones

Phil Vetter and Lee Jones describe how Exclaimer evolved its global email-signature platform on Microsoft Azure, moving from VM-heavy deployments to AKS-based microservices and adopting purpose-fit data stores (Azure SQL, PostgreSQL, Cosmos DB, Data Explorer, Databricks) to improve scaling, reliability, and cost.

Azure Sets a New Performance Record for LLM Training Benchmark at Extreme Scale

Jun 16, 2026 by azinh17

azinh17 breaks down how Azure achieved a top MLPerf Training v6.0 result for Llama 3.1 405B, training at extreme scale across 8,192 GPUs. The post focuses on the cluster and network architecture choices—NVLink scale-up domains, Azure’s MRC fabric, and topology-aware parallelism mapping—that kept step time stable as the system scaled.

Azure Databricks at Databricks Data + AI Summit 2026: updates and new announcements

Jun 16, 2026 by Anavi Nahar

Anavi Nahar rounds up Azure Databricks announcements and sessions from Databricks Data + AI Summit 2026, focusing on tighter interoperability with Microsoft’s data stack (OneLake, ADLS) and governed access via Unity Catalog, plus new integrations like the Excel add-in, SharePoint ingestion, and OneLake catalog federation.

The Case for an Ontology Layer in Telecoms

Jun 16, 2026 by Alberto_Manuel

Alberto_Manuel explains why telecom operators need an ontology (semantic) layer to keep data meaning intact for GenAI and analytics, and outlines how Microsoft Fabric IQ (preview) uses ontology items, graph relationships, and data agents to enable cross-domain reasoning, governance, and scalable AI-driven data access.

From Enterprise File Storage to an AI-Ready Data Foundation using Azure NetApp Files and OneLake

Jun 15, 2026 by GeertVanTeylingen

GeertVanTeylingen outlines a zero-copy pattern for making enterprise file data usable by modern AI and analytics platforms, using Azure NetApp Files as the system of record and Microsoft OneLake shortcuts to expose that data without migration or duplication.

Troubleshooting ML Model Loading, GPU Issues, and Memory Pressure in Azure Container Apps

Jun 12, 2026 by BhaktiRath95

BhaktiRath95 walks through common failure modes when running AI/ML inference workloads on Azure Container Apps, including slow model startup, probe timeouts, OOM kills, and GPU initialization problems. The post provides concrete probe settings, Python/FastAPI patterns, and Log Analytics queries to diagnose and fix issues methodically.

What to Do When You Hit Capacity in Azure Databricks: Engage, Mitigate, Plan!

Jun 11, 2026 by Rafia Aqil

Rafia Aqil explains how to diagnose and respond when Azure Databricks clusters can’t start or scale due to Azure regional VM capacity constraints, including what to send to Microsoft support, which VM families to switch to, and longer-term design choices like instance pools, serverless compute, and multi-region deployments.

Streaming and Batch Data Architectures with Microsoft Fabric to Azure Databricks

Jun 9, 2026 by Rafia Aqil

Rafia Aqil outlines a reference architecture for ingesting both streaming and batch data through Microsoft Fabric into Azure Databricks, using OneLake/ADLS and a medallion (Bronze/Silver/Gold) layout. The post breaks down five Fabric-to-Databricks integration paths and calls out security, governance, and monitoring considerations.

Designing Reliable Data Platforms: Centralized Failure Logging Framework with Azure Monitor

Jun 8, 2026 by Sally Dabbah

Sally Dabbah explains how to turn Synapse/ADF/Microsoft Fabric pipeline failures into structured, queryable telemetry by sending standardized failure events into Azure Monitor Log Analytics via the Logs Ingestion API and a Data Collection Rule, enabling KQL-based analysis, alerting, and reliability reporting across environments and datasets.

Build Enterprise-Ready Agents with Microsoft IQ and Oracle AI Database@Azure — now with Oracle MCP

Jun 3, 2026 by Ram Kakani

Ram Kakani explains how Oracle Managed Database MCP (Model Context Protocol) remote servers can be used from Microsoft Foundry to build enterprise AI agents that query Oracle AI Database@Azure, including local VS Code workflows, self-hosted Azure deployments, and a fully managed OCI option with identity, networking, and governance controls.

Anyscale on Azure: Powering Enterprise AI at Massive Scale on Azure Kubernetes Service

Jun 2, 2026 by bobmital

bobmital introduces Anyscale on Azure, an Azure-native way to run the Ray distributed runtime on AKS so teams can unify data prep, training, tuning, and serving in one system. The post focuses on architecture (split control/data plane), GPU utilization and scheduling features, and Azure-native identity, networking, and governance.

What's new in Azure Kubernetes Service at Microsoft Build 2026

Jun 2, 2026 by coryskimming

coryskimming summarizes the Azure Kubernetes Service (AKS) announcements from Microsoft Build 2026, focusing on running AI training and inference at scale. The post covers new options for cluster operations, bare-metal performance, fleet management across Arc-enabled clusters, and Kubernetes-native model serving with tools like KAITO and AI Runway.

Announcing Anyscale on Azure public preview: Powered by Ray on AKS

Jun 2, 2026 by Brendan Burns

Brendan Burns announces the public preview of Anyscale on Azure, a managed Ray platform that runs on Azure Kubernetes Service (AKS). The post focuses on scaling distributed AI training and inference across regions, simplifying operations via Azure-native provisioning and billing, and using Microsoft Entra workload identity for governance.

Microsoft Planetary Computer Pro is Generally Available

Jun 2, 2026 by Yves-Pitsch

Yves-Pitsch announces the general availability of Microsoft Planetary Computer Pro, an Azure-native enterprise geospatial data platform designed to operationalize geospatial analytics and GeoAI workflows, with deeper integration into Microsoft Fabric and Microsoft AI Foundry and new developer capabilities like MCP server support.

Building AI apps and agents with Azure Databricks, Copilot Studio, and GitHub Copilot

Jun 2, 2026 by Jason Pereira

Jason Pereira introduces two Azure Databricks public preview capabilities that connect Microsoft Copilot Studio and GitHub Copilot to Databricks: a workspace-wide Genie MCP endpoint for building workspace-aware agents, and Lakebase branching for debugging agent issues against real data without touching production.

Training 100B+ Models on a Single GPU: What MegaTrain Changes - and What It Means for Azure

May 29, 2026 by yuvmaz

yuvmaz breaks down the MegaTrain paper’s approach to training 100B+ parameter LLMs on a single GPU by treating GPU memory as a cache and streaming layers from host memory/NVMe. The post connects the technique to Azure NC-series VM choices, storage throughput, PCIe constraints, and cost/performance trade-offs.

AI Infrastructure Preflight at User space: Validating Multi Node, Multi GPU Slurm Clusters

May 22, 2026 by vinilv

vinilv explains how to run a fast, user-space “preflight” on Azure HPC GPU clusters to catch common distributed training failures early. The post introduces ai-cluster-validator and walks through validating Slurm topology, PyTorch DDP initialization, GPU affinity, and NCCL collectives, with actionable logs and telemetry for ops teams.

When RAG Hits the Wall: Designing Systems That Scale from 1,000 to 1 million Documents

May 22, 2026 by himachauhan

himachauhan explains why RAG systems that work well at 1,000 documents often degrade at hundreds of thousands or millions, and outlines practical architecture shifts—like semantic chunking, hierarchical indexing, hybrid retrieval, and precomputed embeddings—to keep retrieval quality, latency, and cost predictable at scale.

Building a Controllable Inference Platform on Kubernetes with AI Runway

May 18, 2026 by kinfey

kinfey explains how AI Runway turns LLM deployment into a Kubernetes-native platform capability, using a unified ModelDeployment CRD to run and operate inference reliably across multiple engines and providers. The post walks through local CPU setups and an AKS-based path to production, with practical guidance on cost, observability, and governance.

What's new in FinOps toolkit 14 – April 2026

May 13, 2026 by Michael Flanakin

Michael Flanakin summarizes FinOps toolkit 14, including a Copilot Studio agent template for querying FinOps hub data with KQL, a new recommendations pipeline that ingests Azure Advisor and Resource Graph results, a simplified hub deployment UI, and a preview dataset for commitment discount eligibility.

How to Secure Azure Databricks without Public Exposure using WAF + Private Endpoints

May 11, 2026 by FaizaanMerchant

FaizaanMerchant explains a Zero Trust network design for Azure Databricks that avoids public workspace exposure by fronting external access with Azure Application Gateway WAF and routing traffic to the workspace through Private Endpoints, while keeping internal access on private connectivity (VPN/ExpressRoute).

Stripe Events + Azure Event Grid: Now Generally Available

May 11, 2026 by robece

robece announces General Availability of Stripe as a partner event source for Azure Event Grid, and outlines how to route Stripe events into Azure services (Functions, Logic Apps, Event Hubs, Service Bus) and Microsoft Fabric Eventstream for real-time processing and analytics.

Secure Medallion Architecture Pattern on Azure Databricks (Part II)

May 8, 2026 by mscagliola

mscagliola shows how to use GitHub Copilot skills for spec-driven development, turning a Medallion Architecture blog post into a repeatable repo that generates Terraform for Azure platform setup and Databricks bundle files for workloads, while enforcing strict placeholder/TODO rules to avoid invented environment values.

Distributing model weights to your AI cluster: a faster pre-flight on AKS and Slurm

May 6, 2026 by pauledwards

pauledwards explains how to cut “model weight pre-flight” time on multi-node Azure GPU clusters by sharding downloads from Azure storage and broadcasting the remaining data over InfiniBand using MPI, with practical launch patterns for both Slurm and AKS.

Resilient by Design: Azure Databricks Disaster Recovery Strategy

May 6, 2026 by KonstantinaF

KonstantinaF outlines a practical, phased disaster recovery strategy for Azure Databricks, focused on cross-region resilience for lakehouse workloads. The post explains RTO/RPO trade-offs, compares active-active vs warm standby patterns, and details how to replicate Unity Catalog metadata and Delta data using IaC, CI/CD, and repeatable DR pipelines.

From Chaos to Clarity: Your Databricks Workspace on a Single Pane of Glass

May 6, 2026 by Amit Damle, RK Iyer

Amit Damle and RK Iyer describe a “Discovery” utility for Azure Databricks that inventories workspace assets into Unity Catalog-backed Delta tables and a Lakeview dashboard, helping platform teams quickly understand clusters, jobs, warehouses, pipelines, security settings, and DBU usage.

How Microsoft Discovery Is Empowering Scientists to Do More

May 3, 2026 by sameeraman

sameeraman explains how Microsoft Discovery can automate a scientific simulation workflow using a coordinated set of AI agents, reducing manual scripting and job monitoring while keeping scientific decision-making with researchers.

Running Diffusion Models at Scale on AKS

Apr 30, 2026 by PrabalDeb

PrabalDeb lays out a practical reference architecture for running diffusion model workloads on Azure Kubernetes Service (AKS), focusing on GPU/CPU lane separation, dispatch and autoscaling options (Kubernetes-native vs Service Bus + KEDA), secure ingress and identity, durable storage for outputs and model caches, and end-to-end observability for both apps and GPU hardware.
