Smaller, faster, smarter: Distilling models with fine‑tuning | DEM322
William Liang demonstrates how teams use Azure AI Foundry to distill large models into smaller, task-focused language models using supervised fine-tuning, with an emphasis on reducing production latency and cost while maintaining accuracy through structured evaluation.
Overview
Large language models can be powerful but expensive to run in production. This Build 2026 demo focuses on practical techniques for creating smaller models that are cheaper and faster at inference time while still performing well on specific tasks.
Key ideas covered
Why distillation now
- The session frames a shift in AI goals from raw speed to cost-effective scalability.
- Agentic workloads can consume large token volumes, which increases inference cost and can add latency.
Distillation + fine-tuning workflow
- The demo discusses using distillation alongside supervised fine-tuning (SFT) to train a “student” model for task-specific accuracy.
- It covers cleaning and fine-tuning traces used to train the student model.
- It positions distillation as complementary to fine-tuning and reinforcement learning (RL), with guidance on when distillation makes sense.
Evaluation approach
- The session describes setting up evaluation across 100 tasks.
- It uses a training/holdout split to measure generalization and avoid overfitting.
- It highlights that fine-tuning can improve student model performance and reasoning.
Azure AI Foundry demonstration
- The session transitions into a platform walkthrough showing how Foundry supports distillation and supervised fine-tuning.
- It emphasizes production considerations: reducing latency and cost while keeping task accuracy acceptable.
Example scenario discussed
Refund and cancellation behavior comparison
- A case study compares behavior between a base model and a fine-tuned model in a refund-related scenario.
- The scenario includes cancelling an unshipped order and verifying refund policy logic, illustrating how task-specific tuning can improve policy adherence.
Takeaways
- Distillation can be a practical path to cost-efficient operations when production workloads are dominated by token-heavy interactions.
- Pairing distillation with SFT and a disciplined evaluation setup (task suite + holdout split) helps teams validate that smaller models remain reliable for the target use case.
Resources
- Foundry Discord: https://aka.ms/build/foundrydiscord