Stop routing docstrings to 70B models with on-device AI on Snapdragon | BRKSP90

Uploaded: 2026-06-04T10:51:30+00:00
Description: Alberto Martinez explains how to reduce cost and latency in AI coding assistants by routing simple tasks (like docstrings) to smaller on-device models on...

Jun 4, 2026 by Alberto Martinez

Alberto Martinez presents a Microsoft Build 2026 breakout on designing an inference-routing strategy for AI coding assistants so that simple coding tasks don’t automatically hit expensive, high-latency 70B+ cloud models.

Overview

The session focuses on a three-tier inference routing architecture that prioritizes keeping code local and reducing cloud spend:

On-device tier (≤13B models) for low-complexity tasks (example given: generating docstrings)
On-prem tier (14B–34B models) for medium-complexity tasks
Cloud tier (70B+ models) for the hardest tasks
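The three tiers above can be written down as a small lookup table. This is a minimal sketch, not the session's implementation: the `Complexity` enum, the `Tier` dataclass, and `tier_for` are hypothetical names introduced for illustration, and only the tier names and parameter ranges come from the summary.

```python
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    """Hypothetical complexity labels; the summary only says low, medium, hardest."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Tier:
    name: str
    size_range: str  # parameter-count range as stated in the summary


# Mapping from task complexity to tier, mirroring the list above.
TIERS = {
    Complexity.LOW: Tier("on-device", "<=13B"),
    Complexity.MEDIUM: Tier("on-prem", "14B-34B"),
    Complexity.HIGH: Tier("cloud", "70B+"),
}


def tier_for(complexity: Complexity) -> Tier:
    """Return the tier that should serve a task of the given complexity."""
    return TIERS[complexity]


if __name__ == "__main__":
    # Docstring generation is the low-complexity example given in the session.
    print(tier_for(Complexity.LOW))
```

Keeping the mapping in data rather than in branching code makes it easier to adjust tier boundaries as measurements come in.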

The stated goals and outcomes include:

Cutting cloud tokens by 67%
Reducing latency by 70%
Keeping most code and requests local by default

Architecture: three-tier inference routing

The core idea is to route requests based on task complexity rather than sending everything to the largest model.

Key elements called out in the session description:

A routing logic layer that decides which tier/model to use
A classifier that determines task complexity and selects the appropriate tier
A fallback mechanism (referred to in the chapter list as a “fifth ‘fallback’ classifier mechanism”)
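The three elements listed above might fit together as follows. The summary does not describe how the classifier or the fallback actually work, so everything here is an assumption: `classify` is a hypothetical keyword heuristic, `run_model` is a placeholder callable standing in for real inference, and the fallback is assumed to mean escalating to the next larger tier when a tier returns no usable result.

```python
from typing import Callable, Optional, Tuple

# Tiers ordered from smallest to largest model.
TIER_ORDER = ["on-device", "on-prem", "cloud"]


def classify(task: str) -> str:
    """Hypothetical keyword classifier that picks a starting tier."""
    text = task.lower()
    if "docstring" in text or "comment" in text:
        return "on-device"
    if "refactor" in text or "unit test" in text:
        return "on-prem"
    return "cloud"


def route(
    task: str,
    run_model: Callable[[str, str], Optional[str]],
) -> Tuple[str, str]:
    """Try the classified tier first, then escalate (assumed fallback behavior).

    run_model(tier, task) returns the model output, or None if that tier
    could not produce a usable answer.
    """
    start = TIER_ORDER.index(classify(task))
    for tier in TIER_ORDER[start:]:
        result = run_model(tier, task)
        if result is not None:
            return tier, result
    raise RuntimeError("no tier produced a result")


if __name__ == "__main__":
    # Stub backend: pretend the on-device model fails, forcing a fallback.
    def stub(tier: str, task: str) -> Optional[str]:
        return None if tier == "on-device" else f"{tier} handled: {task}"

    print(route("write a docstring for parse()", stub))
```

In a real system the classifier would more likely be a small model or a learned scorer than a keyword match; the control flow is the part this sketch is meant to show.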

Quantization and model trade-offs

The talk covers quantization trade-offs and frames quantization as part of what makes smaller models viable for on-device inference, especially when targeting an NPU.
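One reason quantization matters for the on-device tier is weight storage, which scales with parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: `weight_memory_gb` is a hypothetical helper, and it ignores activations, the KV cache, and per-block quantization metadata. The bit widths shown are common choices, not figures from the session.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes).

    Counts weights only: parameters * bits / 8 bytes.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # A 13B model sits at the top of the on-device tier described above.
    for bits in (16, 8, 4):
        print(f"13B at {bits}-bit: {weight_memory_gb(13, bits):.1f} GB")
```

Under these assumptions a 13B model needs about 26 GB for weights at 16 bits and about 6.5 GB at 4 bits. The trade-off is that lower bit widths can cost output quality.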

Measurement and optimization loop

The session highlights an optimization framework built around three steps:

Measure token cost
Measure latency
Iterate

This positions routing as an ongoing tuning exercise rather than a one-time configuration.
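A measure-and-iterate loop of this kind needs per-request logs and a way to compare a routed configuration against a baseline. The sketch below assumes a hypothetical `RequestLog` record and helper functions. The numbers in the usage example are invented for illustration and are not the session's data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RequestLog:
    tier: str          # "on-device", "on-prem", or "cloud"
    tokens: int        # tokens consumed by the request
    latency_ms: float  # end-to-end latency


def cloud_tokens(logs: List[RequestLog]) -> int:
    """Total tokens sent to the cloud tier."""
    return sum(r.tokens for r in logs if r.tier == "cloud")


def mean_latency(logs: List[RequestLog]) -> float:
    """Average latency across all requests."""
    return mean(r.latency_ms for r in logs)


def percent_reduction(baseline: float, current: float) -> float:
    """Percentage by which `current` is lower than `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - current) / baseline


if __name__ == "__main__":
    # Illustrative data: the same two tasks, before and after routing.
    baseline = [RequestLog("cloud", 500, 800.0), RequestLog("cloud", 500, 800.0)]
    routed = [RequestLog("on-device", 500, 200.0), RequestLog("cloud", 400, 600.0)]

    print(percent_reduction(cloud_tokens(baseline), cloud_tokens(routed)))
    print(percent_reduction(mean_latency(baseline), mean_latency(routed)))
```

Re-running this comparison after each change to the classifier or tier boundaries is one way to turn the three steps into a repeatable tuning cycle.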

Session structure (from chapters)

Quantitative breakdown of token usage in coding tasks
Workload complexity distribution analysis (includes a reference to Claude Sonnet 4.6)
Cost-savings discussion (including a claim of up to $24K daily savings potential)
Tiered architecture introduction and model complexity framing
Optimization framework and Q&A on the fallback classifier mechanism