Shipping custom models at scale from fine-tuning to inference | BRK234
Rob Ferguson leads a Microsoft Build 2026 panel with Daniel Han (Unsloth), Mark Saroufim (Stealth Startup), and other practitioners on how teams customize models, take them to production, and run them efficiently at scale.
Overview
The panel focuses on real-world considerations for moving from fine-tuning to production inference, including trade-offs between techniques, infrastructure choices, and performance and cost optimization.
Fine-tuning and production trade-offs
- Discussion of fine-tuning approaches and the practical trade-offs teams face when customizing models for real products.
- Considerations for taking fine-tuned models into production environments.
Reinforcement learning (RL): challenges and limits
- Overview of DeepSeek R1 and reinforcement learning infrastructure challenges.
- Historical context for reinforcement learning applied to optimization tasks (for example, video compression).
- Key RL challenge: defining correct reward functions for diverse systems.
- Debate on scaling reinforcement learning environments toward AGI and where the approach may hit limits.
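To make the reward-function challenge concrete, here is a minimal Python sketch of how a misspecified reward can be exploited in a compression-style task. Everything in it is a hypothetical illustration rather than something shown in the session: the `Candidate` type, the candidate names, the bitrate and quality numbers, and the `rate_weight` parameter are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical encoder setting with made-up measurements."""
    name: str
    bitrate_kbps: float
    quality: float  # illustrative score in [0, 1], higher is better


# Invented numbers, chosen only to illustrate the failure mode.
CANDIDATES = [
    Candidate("drop_all_frames", 0.0, 0.0),
    Candidate("aggressive", 500.0, 0.70),
    Candidate("balanced", 1500.0, 0.92),
    Candidate("near_lossless", 8000.0, 0.99),
]


def reward_size_only(c: Candidate) -> float:
    """Misspecified reward: only penalizes output size."""
    return -c.bitrate_kbps


def reward_rate_quality(c: Candidate, rate_weight: float = 1e-4) -> float:
    """Reward that trades quality against bitrate (weight is an assumption)."""
    return c.quality - rate_weight * c.bitrate_kbps


best_naive = max(CANDIDATES, key=reward_size_only)
best_balanced = max(CANDIDATES, key=reward_rate_quality)
print("size-only reward picks:", best_naive.name)
print("rate-quality reward picks:", best_balanced.name)
```

With the size-only reward, the degenerate candidate that produces no output scores best. Adding a quality term changes which candidate wins, but the right weighting differs from system to system, which is one reading of why reward design is hard to generalize.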
Efficiency techniques: LoRA and collaboration
- Discussion of LoRA (low-rank adaptation) as an example of efficiency and collaboration, including a reference to “Thinking Machines”.
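As background on why LoRA is considered efficient, the following is a minimal pure-Python sketch of a LoRA-style forward pass: a frozen weight matrix plus a scaled low-rank update. It is not code from the session. The dimensions, the `alpha` scaling value, and the helper names (`matmul`, `lora_forward`, `down`, `up`) are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, frozen_w, down, up, alpha, rank):
    """Compute x @ W + (alpha / rank) * (x @ down) @ up for row vectors x."""
    base = matmul(x, frozen_w)
    delta = matmul(matmul(x, down), up)
    scale = alpha / rank
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


rng = random.Random(0)
d_in, d_out, rank, alpha = 16, 16, 2, 4.0

# Frozen base weights: not updated during adapter training.
frozen_w = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
# Trainable low-rank factors: `down` is d_in x rank, `up` is rank x d_out.
down = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
up = [[0.0] * d_out for _ in range(rank)]  # zero init: adapter starts as a no-op

x = [[rng.gauss(0, 1) for _ in range(d_in)]]
y = lora_forward(x, frozen_w, down, up, alpha, rank)

full_params = d_in * d_out               # parameters in the frozen matrix
adapter_params = rank * (d_in + d_out)   # parameters in the two factors
print(full_params, adapter_params)
```

Because only the two small factors are trained, an adapter is much smaller than the base model it modifies, which also makes it cheap to store and share.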
Inference performance: GPU and kernel-level optimization
- Deep dive into GPU optimization topics:
  - Fusion
  - CUDA graphs
  - Kernel writing
- Reflection on AI discovering optimization techniques such as gradient checkpointing.
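For readers unfamiliar with gradient checkpointing, the following is a minimal pure-Python sketch of the idea on a chain of scalar tanh layers: keep only the input to each segment during the forward pass, then recompute the activations inside a segment when backpropagating through it. It is an illustrative toy and not anything presented by the panel. The layer function, the `segment` size, and all function names are assumptions.

```python
import math


def forward_layer(w, h):
    return math.tanh(w * h)


def backward_layer(w, h_in, grad_out):
    """Return (grad wrt h_in, grad wrt w) for h_out = tanh(w * h_in)."""
    h_out = math.tanh(w * h_in)
    local = (1.0 - h_out * h_out) * grad_out
    return local * w, local * h_in


def grads_store_all(ws, x):
    """Baseline: keep every activation from the forward pass."""
    acts = [x]
    for w in ws:
        acts.append(forward_layer(w, acts[-1]))
    grad, gws = 1.0, [0.0] * len(ws)
    for i in reversed(range(len(ws))):
        grad, gws[i] = backward_layer(ws[i], acts[i], grad)
    return gws, len(acts)


def grads_checkpointed(ws, x, segment):
    """Keep only each segment's input; recompute the rest during backward."""
    checkpoints, h = {}, x
    for i, w in enumerate(ws):
        if i % segment == 0:
            checkpoints[i] = h
        h = forward_layer(w, h)
    grad, gws = 1.0, [0.0] * len(ws)
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(ws))
        acts = [checkpoints[start]]          # recompute inputs of this segment
        for i in range(start, end - 1):
            acts.append(forward_layer(ws[i], acts[-1]))
        # acts[0] is already counted among the checkpoints
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        for i in reversed(range(start, end)):
            grad, gws[i] = backward_layer(ws[i], acts[i - start], grad)
    return gws, peak


weights = [0.5 + 0.1 * i for i in range(12)]
g_all, stored_all = grads_store_all(weights, 0.3)
g_ckpt, stored_ckpt = grads_checkpointed(weights, 0.3, segment=4)
print(stored_all, stored_ckpt)
```

Both functions return the same gradients. The checkpointed version holds fewer activations at once and pays for that with extra forward computation during the backward pass.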
Model math considerations for long contexts
- Mathematical considerations discussed:
  - Softmax limitations
  - Compaction in long contexts
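The summary does not say which softmax limitation the panel had in mind. One commonly cited effect is that, for a fixed logit margin, the weight softmax assigns to a single relevant position shrinks as the number of competing positions grows. The sketch below illustrates that effect only. The margin of 5.0, the context lengths, and the `peak_weight` helper are assumptions picked for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def peak_weight(context_len, margin):
    """Weight on one position whose logit exceeds all others by `margin`."""
    logits = [margin] + [0.0] * (context_len - 1)
    return softmax(logits)[0]


for n in (16, 1024, 65536):
    print(n, peak_weight(n, 5.0))
```

In closed form the weight is e^margin / (e^margin + n - 1), so keeping it constant as n grows requires the margin to grow roughly like log n.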