Multi-Model LLM Router
Cost Engineering · Production Routing
Production routing layer across 6+ providers — open-source (Llama, Mistral, DeepSeek) and commercial APIs (GPT, Claude, Gemini) — with intelligent task-complexity routing, automatic fallback, and per-call cost telemetry.
The Brief
Problem
Single-provider lock-in cost £800/mo, and every workflow broke whenever the upstream provider had an outage. Every workflow used a premium model regardless of task complexity.
The Architecture
Decision
Built a router that classifies task complexity, routes simple tasks to self-hosted OSS models (Llama 3 / Mistral on Hetzner), medium tasks to Gemini 2.5 Flash on Vertex AI, and reserves Claude Opus / GPT-5 for high-stakes reasoning. OpenRouter and Hermes absorb spillover. Every call is instrumented for cost.
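A minimal sketch of the tiered routing decision described above. The classifier heuristic, model identifiers, and tier lists are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of complexity classification and tier routing.
# Model identifiers and thresholds are illustrative assumptions.

def classify_complexity(prompt: str) -> str:
    """Cheap heuristic pass; a small classifier model could replace this."""
    if len(prompt) < 200 and "reason" not in prompt.lower():
        return "simple"
    if len(prompt) < 2000:
        return "medium"
    return "hard"

# Candidates ordered cheapest-first within each tier; spillover targets last.
TIERS = {
    "simple": ["llama-3-hetzner", "mistral-hetzner", "deepseek-openrouter"],
    "medium": ["gemini-2.5-flash-vertex", "deepseek-openrouter"],
    "hard":   ["claude-opus-bedrock", "gpt-5"],
}

def route(prompt: str) -> list[str]:
    """Return the fallback-ordered candidate list for this prompt."""
    return TIERS[classify_complexity(prompt)]
```

In practice the classifier is the cheapest stage in the pipeline, so it runs on every request before any paid API is touched.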
The Outcome
Result
API spend collapsed from £800/mo to £30/mo (a 96% reduction). Latency improved on 70% of workflows. Provider outages are now invisible to users.
How it actually works in production.
Request
Incoming prompt
from any chain
Complexity classifier
simple / medium / hard
Cost budget gate
per workflow
Provider Pool — 6+ models
Llama 3
Hetzner OSS
Mistral
Hetzner OSS
DeepSeek
OpenRouter
Gemini Flash
GCP Vertex
Claude Sonnet
AWS Bedrock
GPT-5 / Opus
reserved · hard tasks
Decide & Execute
Route to cheapest valid
simple→OSS · hard→premium
Fallback chain
auto-retry on outage
Cost meter
£ logged per call
Response
returned to caller
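The execute-and-fallback stage above can be sketched as follows. The provider call, cost logger, and per-token prices are hypothetical stand-ins for the real SDK clients and billing figures:

```python
import time

# Illustrative sketch of the fallback chain and per-call cost meter.
# Prices and the call_provider/log_cost interfaces are assumptions.

PRICE_PER_1K_TOKENS = {"llama-3-hetzner": 0.0, "gemini-2.5-flash-vertex": 0.0001}

class AllProvidersDown(Exception):
    pass

def execute_with_fallback(candidates, call_provider, log_cost):
    """Try each candidate in cost order; auto-retry on the next one if a
    provider errors, so upstream outages never reach the caller."""
    for model in candidates:
        try:
            start = time.monotonic()
            reply, tokens = call_provider(model)        # provider SDK call
        except Exception:
            continue                                    # outage: fall through
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS.get(model, 0.0)
        log_cost(model, cost, time.monotonic() - start) # £ logged per call
        return reply
    raise AllProvidersDown(candidates)
```

Because the candidate list arrives already sorted cheapest-first, the first model to answer is by construction the cheapest valid one.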
Net Result
£800/mo
before routing
£30/mo
after routing
96% cut
0 user-visible outages
See Multi-Model LLM Router in action.
Demo video
Live request → routed → cheapest valid model wins
~ 0:30 · coming soon
Demo video
Cost dashboard scrub — £800/mo → £30/mo over 6 weeks
~ 0:45 · coming soon
Image slot
Grafana cost-per-provider dashboard
coming soon
Image slot
Routing decision tree (simple → OSS, hard → premium)
coming soon
Image slot
P50/P95/P99 latency per model
coming soon
Stack