Shipping custom models at scale from fine-tuning to inference | BRK234

Uploaded: 2026-06-04T12:30:51+00:00
Description: Rob Ferguson leads a Microsoft Build 2026 panel on shipping custom AI models at scale, covering practical trade-offs in fine-tuning and serving, plus what...

Jun 4, 2026 by Rob Ferguson

Rob Ferguson leads a Microsoft Build 2026 panel with Daniel Han (Unsloth), Mark Saroufim (Stealth Startup), and other practitioners on how teams customize models, take them to production, and run them efficiently at scale.

Overview

The panel focuses on real-world considerations for moving from fine-tuning to production inference, including technique trade-offs, infrastructure choices, and performance/cost optimization.

Fine-tuning and production trade-offs

- Fine-tuning approaches and the practical trade-offs teams face when customizing models for real products.
- Considerations for taking fine-tuned models into production environments.

Reinforcement learning (RL): challenges and limits

- Overview of DeepSeek R1 and reinforcement learning infrastructure challenges.
- Historical context for reinforcement learning applied to optimization tasks (for example, video compression).
- Key RL challenge: defining correct reward functions for diverse systems.
- Debate on scaling reinforcement learning environments toward AGI and where the approach may hit limits.
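
To make the reward-definition challenge concrete, here is a minimal sketch of a rule-based reward for a task with a checkable answer. Nothing in it comes from the session: the `<answer>` tag convention, the function names, and the 0.1 format bonus are hypothetical choices made for illustration. The deliberately flawed `naive_reward` shows how a loosely specified reward can pay out for an answer that hedges across every option.

```python
import re

# Hypothetical output convention (an assumption, not from the session):
# the model puts its final answer inside <answer>...</answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def reward(completion: str, expected: str) -> float:
    """Return 1.0 if correct, 0.1 if well-formed but wrong, 0.0 otherwise."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # no parseable answer at all
    if match.group(1).strip() == expected.strip():
        return 1.0
    return 0.1  # small bonus for following the format (illustrative value)


def naive_reward(completion: str, expected: str) -> float:
    """Deliberately flawed: pays out if the expected string appears anywhere."""
    return 1.0 if expected in completion else 0.0


# A completion that lists every digit games the naive reward
# but earns nothing from the stricter one.
hedging = "It could be 1, 2, 3, 4, 5, 6, 7, 8 or 9."
print("naive:", naive_reward(hedging, "7"))
print("strict:", reward(hedging, "7"))
```

Even the stricter version only fits tasks with a single exact-match answer, which is one way to read why reward design is hard to generalize across diverse systems.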

Efficiency techniques: LoRA and collaboration

Discussion of efficiency and collaboration with LoRA (low-rank adaptation), including a reference to “Thinking Machines”.
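
For readers new to LoRA, the following is a minimal sketch of the idea in plain Python, not the panel's code or any library's API. It assumes the usual formulation, y = W x + (alpha / r) B (A x), with the pretrained matrix W frozen, A initialized with small random values, and B initialized to zero. The helper names, the dimensions, and the `alpha` value are illustrative assumptions.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(A)                          # rank = number of rows in A
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # low-rank path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Return (full fine-tuning, LoRA) trainable parameter counts for one matrix."""
    return d_out * d_in, r * (d_in + d_out)


rng = random.Random(0)
d_out, d_in, r = 6, 8, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                               # zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapter contributes nothing, so training starts from the base model.
assert lora_forward(W, A, B, x, alpha=16) == matvec(W, x)

full, lora = trainable_params(4096, 4096, 8)
print(f"full: {full:,} params, LoRA r=8: {lora:,} params")
```

For a 4096 x 4096 matrix at rank 8, the adapter holds 65,536 trainable values against 16,777,216 for the full matrix. That size gap is one plausible reason adapters are easy to store and share between teams.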

Inference performance: GPU and kernel-level optimization

Deep dive into GPU optimization topics:
- Fusion
- CUDA graphs
- Kernel writing
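
Of the topics above, fusion is the easiest to illustrate without a GPU. The sketch below is a CPU-side analogy in plain Python, not GPU code and not something shown in the session: each list comprehension stands in for one elementwise kernel, and the function names and the traffic model are illustrative assumptions. The point is that a fused kernel reads and writes each element once instead of once per operation.

```python
import math


def unfused(x, scale, bias):
    """Three separate elementwise 'kernels', each writing a full intermediate."""
    a = [v * scale for v in x]         # pass 1: read x, write a
    b = [v + bias for v in a]          # pass 2: read a, write b
    return [math.tanh(v) for v in b]   # pass 3: read b, write output


def fused(x, scale, bias):
    """One 'kernel': every element is read once and written once."""
    return [math.tanh(v * scale + bias) for v in x]


def element_traffic(n, num_ops, is_fused):
    """Elements read plus written for a chain of num_ops elementwise ops."""
    return 2 * n if is_fused else 2 * n * num_ops


x = [i / 10 for i in range(-20, 21)]
assert fused(x, 2.0, 0.5) == unfused(x, 2.0, 0.5)  # same math, fewer passes
print(element_traffic(len(x), 3, False), "vs", element_traffic(len(x), 3, True))
```

On real hardware the saving comes from avoiding round trips to device memory between kernels. CUDA graphs address a different cost, the per-launch overhead of many small kernels, and are not modeled here.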
Reflection on AI discovering optimization techniques such as gradient checkpointing.
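
For context on the technique named above: gradient checkpointing trades compute for memory by storing only some activations during the forward pass and recomputing the rest during the backward pass. The following is a minimal sketch on a chain of scalar tanh layers, written for illustration rather than taken from the session. The layer definition, the function names, and the fixed `segment` length are assumptions.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full(weights, x0):
    """Backprop that stores every layer input: O(n) stored activations."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    g = 1.0
    for w, xin in zip(reversed(weights), reversed(inputs)):
        g *= layer_grad(w, xin)
    return g, len(inputs)


def grad_checkpointed(weights, x0, segment):
    """Store only each segment's first input; recompute the rest in backward."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    g = 1.0
    peak = len(checkpoints)  # peak number of layer inputs held at once
    for start in sorted(checkpoints, reverse=True):
        seg = weights[start:start + segment]
        inputs, x = [], checkpoints[start]
        for w in seg:  # recompute this segment's inputs from its checkpoint
            inputs.append(x)
            x = layer(w, x)
        # inputs[0] is the checkpoint itself, so do not count it twice
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for w, xin in zip(reversed(seg), reversed(inputs)):
            g *= layer_grad(w, xin)
    return g, peak


weights = [0.5 + 0.1 * i for i in range(16)]
g_full, mem_full = grad_full(weights, 0.3)
g_ckpt, mem_ckpt = grad_checkpointed(weights, 0.3, segment=4)
assert math.isclose(g_full, g_ckpt)
print(f"stored inputs: full={mem_full}, checkpointed={mem_ckpt}")
```

Both functions return the same gradient. The checkpointed version holds fewer activations at its peak and pays for that with roughly one extra forward pass.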

Model math considerations for long contexts

Mathematical considerations discussed:
- Softmax limitations
- Compaction in long contexts
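
The summary does not say which softmax limitation the panel had in mind. One commonly cited issue in long contexts is dilution: if a single token's score beats n - 1 others by a fixed margin Δ, its weight is e^Δ / (e^Δ + n - 1), which shrinks toward zero as n grows. The sketch below illustrates only that arithmetic. The function names and the margin of 5.0 are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_weight(n, margin):
    """Weight on one token whose score beats n - 1 others by `margin`."""
    return softmax([margin] + [0.0] * (n - 1))[0]


# The same score margin buys less and less attention as the context grows.
for n in (1_000, 100_000, 1_000_000):
    print(n, top_weight(n, margin=5.0))
```

Keeping the weight constant as n grows requires the margin to grow roughly like log n, which is one way to frame the limitation.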