Stop routing docstrings to 70B models with on-device AI on Snapdragon | BRKSP90
Alberto Martinez explains how to reduce cost and latency by routing coding-assistant requests across on-device, on-prem, and cloud LLMs, using Snapdragon X2 Elite’s NPU for smaller local models and reserving 70B+ models for harder tasks.
Overview
The session argues that many coding-assistant tasks (for example, generating docstrings) do not need a call to a 70B+ cloud model. Instead, it proposes a three-tier inference routing architecture that selects the cheapest, fastest tier that can handle the request:
- On-device inference (≤13B models) for low-complexity requests where keeping code local is a priority.
- On-prem inference (14B–34B models) for medium-complexity requests that still benefit from tighter control over data locality.
- Cloud inference (70B+ models) reserved for the hardest requests.
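The "cheapest tier that can handle the request" rule can be sketched in a few lines of Python. This is only an illustration: the model-size ranges come from the list above, while the `Tier` type, the `select_tier` function, the 0-to-1 complexity score, and the cutoff values are hypothetical and not taken from the session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    # Hypothetical: highest complexity score (0.0-1.0) this tier is trusted with.
    max_complexity: float


# Ordered cheapest first. Model-size ranges are from the session description;
# the complexity cutoffs are illustrative assumptions.
TIERS = [
    Tier("on-device (<=13B)", 0.3),
    Tier("on-prem (14B-34B)", 0.7),
    Tier("cloud (70B+)", 1.0),
]


def select_tier(complexity: float) -> Tier:
    """Return the first (cheapest) tier whose cutoff covers the score."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be between 0.0 and 1.0")
    for tier in TIERS:
        if complexity <= tier.max_complexity:
            return tier
    return TIERS[-1]  # unreachable with the cutoffs above; kept as a safe default
```

Because the tiers are scanned in cost order, a request only reaches the cloud tier when both cheaper tiers have been ruled out.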
The session description reports these measured outcomes from this approach:
- Cloud token reduction: 67% fewer cloud tokens
- Latency reduction: 70% lower latency
- Locality: keeping most code local by default
Hardware and performance context
The talk frames the approach around Snapdragon X2 Elite hardware capabilities, emphasizing the impact of an 80 TOPS NPU for running smaller models locally.
Routing logic and decision signals
The session presents routing logic for deciding where a request should run, based on signals such as:
- Entropy-based measures to estimate uncertainty/complexity
- Logic depth as a proxy for how much reasoning is required
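As a rough illustration of what such signals might look like, the sketch below computes Shannon entropy over a probability distribution and uses control-flow nesting depth as a stand-in for logic depth. Both interpretations are assumptions: the description does not say how the session defines or computes either signal, and the function names `shannon_entropy` and `logic_depth` are hypothetical.

```python
import ast
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a probability distribution.

    Assumption: the distribution could be, for example, a small model's
    next-token probabilities; higher entropy suggests more uncertainty.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Node types counted as one level of nesting (an illustrative choice).
_NESTING = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.FunctionDef)


def logic_depth(source: str) -> int:
    """Maximum nesting depth of control-flow constructs in Python source.

    Assumption: nesting depth is used here as a simple proxy for how much
    reasoning a request about this code might need.
    """
    def depth(node, current):
        if isinstance(node, _NESTING):
            current += 1
        children = [depth(child, current) for child in ast.iter_child_nodes(node)]
        return max([current] + children)

    return depth(ast.parse(source), 0)
```

A router could combine the two values into a single complexity score, but how to weight them is a design decision this sketch leaves open.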
Model optimization trade-offs
The session discusses quantization and precision trade-offs, focusing on how model size/precision choices affect:
- Quality vs speed
- Local feasibility on an NPU
- Overall cost efficiency when combined with routing
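One reason precision matters for local feasibility is weight memory, which scales roughly with parameter count times bits per weight. The back-of-the-envelope helper below is a sketch, not a figure from the session: `weight_memory_gb` is a hypothetical name, and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    bytes = parameters * bits_per_weight / 8, so for N billion parameters
    the result in GB is N * bits_per_weight / 8.
    """
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billions and bits_per_weight must be positive")
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Compare a few model sizes at common precisions (weights only).
    for size in (13, 34, 70):
        for bits in (16, 8, 4):
            print(f"{size}B @ {bits}-bit: {weight_memory_gb(size, bits):.1f} GB")
```

Under this estimate, a 13B model needs about 26 GB for weights at 16-bit precision and about 6.5 GB at 4-bit, which is why quantization is tied so closely to what can run on a device.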
Deployable components
The architecture includes a deployable classifier used to categorize requests and drive the orchestrator’s routing decisions, along with guidance on integrating the classifier into an orchestration layer.
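A minimal sketch of how a classifier might plug into an orchestration layer follows. Everything in it is hypothetical: the `Orchestrator` class, the tier labels, the keyword-based stand-in classifier, and the stub backends are illustrative and do not reflect the session's actual classifier or any real API.

```python
from typing import Callable, Dict

Classifier = Callable[[str], str]  # request text -> tier label
Backend = Callable[[str], str]     # request text -> model response


class Orchestrator:
    """Routes each request to the backend named by the classifier."""

    def __init__(self, classifier: Classifier, backends: Dict[str, Backend],
                 fallback: str = "cloud") -> None:
        if fallback not in backends:
            raise ValueError(f"fallback tier {fallback!r} has no backend")
        self.classifier = classifier
        self.backends = backends
        self.fallback = fallback

    def route(self, request: str) -> str:
        # Unknown labels fall back to the most capable tier.
        label = self.classifier(request)
        return label if label in self.backends else self.fallback

    def handle(self, request: str) -> str:
        return self.backends[self.route(request)](request)


def keyword_classifier(request: str) -> str:
    """Stand-in for a trained classifier; uses naive keyword rules."""
    text = request.lower()
    if "docstring" in text:
        return "on-device"
    if "refactor" in text:
        return "on-prem"
    return "cloud"


# Stub backends that only report which tier served the request.
orchestrator = Orchestrator(
    keyword_classifier,
    {
        "on-device": lambda r: f"[on-device] {r}",
        "on-prem": lambda r: f"[on-prem] {r}",
        "cloud": lambda r: f"[cloud] {r}",
    },
)
```

Keeping the classifier behind a plain callable means a trained model can replace the keyword rules without changing the orchestrator.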
Session chapters
- 00:00:00 - Introduction of Alberto Martinez and his role at Qualcomm
- 00:02:05 - Exploring agentic AI, orchestration, and their future
- 00:05:10 - Provocative message: Stop routing docstrings to large language models
- 00:11:18 - Overview of X2 Elite Hardware Capabilities
- 00:12:10 - Discussion on Quantization and Precision Trade-offs
- 00:22:00 - Demonstrating 4x cost savings and efficiency
- 00:28:01 - Using entropy and logic depth to determine routing and compute distribution
- 00:31:17 - Summary of evidence showing 73% efficiency benchmark and capacity discussion
- 00:35:00 - Explanation of building custom classifiers and orchestrator integration
Event context
This breakout session is part of Microsoft Build 2026. More sessions are available at https://build.microsoft.com