Stop routing docstrings to 70B models with on-device AI on Snapdragon | BRKSP90

Alberto Martinez explains how to reduce cost and latency by routing coding-assistant requests across on-device, on-prem, and cloud LLMs, using Snapdragon X2 Elite’s NPU for smaller local models and reserving 70B+ models for harder tasks.

Overview

The session argues that many coding-assistant tasks (for example, generating docstrings) do not need a call to a large 70B+ cloud model. Instead, it proposes a three-tier inference routing architecture, spanning on-device, on-prem, and cloud models, that selects the cheapest and fastest tier able to handle each request.
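The idea of picking the cheapest tier that can handle a request can be sketched as follows. This is a minimal illustration, not the session's implementation: the `Tier` class, the `relative_cost` and `max_difficulty` fields, the difficulty scale, and all numeric values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    relative_cost: float  # hypothetical unitless cost; lower is cheaper
    max_difficulty: int   # hypothetical ceiling on the task difficulty this tier handles


# Illustrative values only; the session's actual figures are not reproduced here.
TIERS = [
    Tier("on-device", relative_cost=1.0, max_difficulty=3),
    Tier("on-prem", relative_cost=5.0, max_difficulty=6),
    Tier("cloud", relative_cost=25.0, max_difficulty=10),
]


def pick_tier(difficulty: int, tiers=TIERS) -> Tier:
    """Return the cheapest tier whose capability ceiling covers the request."""
    capable = [t for t in tiers if t.max_difficulty >= difficulty]
    if not capable:
        raise ValueError(f"no tier can handle difficulty {difficulty}")
    return min(capable, key=lambda t: t.relative_cost)


# A docstring-style request (low difficulty) stays local; a hard one goes to the cloud.
print(pick_tier(2).name)
print(pick_tier(9).name)
```

Filtering by capability first and minimizing cost second keeps the policy easy to audit: a request only moves to a more expensive tier when every cheaper tier has been ruled out.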

The description highlights measured outcomes from this approach:

Hardware and performance context

The talk frames the approach around Snapdragon X2 Elite hardware capabilities, emphasizing the impact of an 80 TOPS NPU for running smaller models locally.

Routing logic and decision signals

The session includes routing logic for deciding where a request should run, including signals such as:

Model optimization trade-offs

The session discusses quantization and precision trade-offs, focusing on how model size/precision choices affect:
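One effect of precision that can be illustrated without the session's own numbers is the memory needed to hold model weights, which scales with parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic only: the function name is hypothetical, the 3B parameter count is an assumed example of a small local model, and the estimate ignores activations, KV cache, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare a 70B model with an assumed 3B local model at common precisions.
for label, params in (("70B", 70e9), ("3B (assumed)", 3e9)):
    for bits in (16, 8, 4):
        print(f"{label} at {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions, halving the bits per weight halves the weight footprint, which is why lower precision makes smaller models more practical to keep resident on a device, at the possible expense of output quality.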

Deployable components

The architecture includes a deployable classifier used to categorize requests and drive the orchestrator’s routing decisions, along with guidance on integrating the classifier into an orchestration layer.
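A minimal sketch of how a classifier might plug into an orchestration layer is shown below. Everything here is an assumption for illustration: the keyword rules stand in for whatever trained classifier the session ships, and the category names, the `ROUTES` mapping, the `Orchestrator` class, and the stub backends are hypothetical.

```python
from typing import Callable, Dict, Tuple


def classify(prompt: str) -> str:
    """Toy keyword classifier; a real deployment would use a trained model."""
    text = prompt.lower()
    if "docstring" in text or "comment" in text:
        return "simple"
    if "refactor" in text or "architecture" in text:
        return "complex"
    return "moderate"


# Hypothetical mapping from request category to inference tier.
ROUTES: Dict[str, str] = {
    "simple": "on-device",
    "moderate": "on-prem",
    "complex": "cloud",
}


class Orchestrator:
    """Routes each prompt to the backend chosen by the classifier."""

    def __init__(
        self,
        classifier: Callable[[str], str],
        routes: Dict[str, str],
        backends: Dict[str, Callable[[str], str]],
    ) -> None:
        self.classifier = classifier
        self.routes = routes
        self.backends = backends

    def handle(self, prompt: str) -> Tuple[str, str]:
        category = self.classifier(prompt)
        tier = self.routes[category]
        return tier, self.backends[tier](prompt)


# Stub backends that only echo the prompt; real ones would call a model runtime.
backends = {tier: (lambda p, t=tier: f"[{t}] {p}") for tier in ROUTES.values()}
orchestrator = Orchestrator(classify, ROUTES, backends)

tier, reply = orchestrator.handle("Write a docstring for parse_config()")
print(tier)  # on-device
```

Passing the classifier in as a plain callable keeps the orchestrator independent of how classification is done, so the keyword rules can be swapped for a deployed model without touching the routing code.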

Session chapters

Event context

This breakout session is part of Microsoft Build 2026. More sessions are available at https://build.microsoft.com