Multi-Model LLM Router
Cost Engineering · Production Routing
Production routing layer across 6+ providers — open-source (Llama, Mistral, DeepSeek) and commercial APIs (GPT, Claude, Gemini) — with intelligent task-complexity routing, automatic fallback, and per-call cost telemetry.
The Brief
Problem
Single-provider lock-in cost £800/mo, and every workflow broke whenever the upstream provider had an outage. Every workflow used a premium model regardless of task complexity.
The Architecture
Decision
Built a router that classifies task complexity, routes simple tasks to self-hosted OSS models (Llama 3 / Mistral on Hetzner), medium tasks to Gemini 2.5 Flash on Vertex AI, and reserves Claude Opus / GPT-5 for high-stakes reasoning. OpenRouter and Hermes absorb spillover. Every call is instrumented for cost.
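A minimal sketch of the tiered routing decision described above. The classifier heuristic, model identifiers, and tier lists are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of complexity classification and tier routing.
# Model identifiers and thresholds are illustrative assumptions.

def classify_complexity(prompt: str) -> str:
    """Cheap heuristic pass; a small classifier model could replace this."""
    if len(prompt) < 200 and "reason" not in prompt.lower():
        return "simple"
    if len(prompt) < 2000:
        return "medium"
    return "hard"

# Candidates ordered cheapest-first within each tier; spillover targets last.
TIERS = {
    "simple": ["llama-3-hetzner", "mistral-hetzner", "deepseek-openrouter"],
    "medium": ["gemini-2.5-flash-vertex", "deepseek-openrouter"],
    "hard":   ["claude-opus-bedrock", "gpt-5"],
}

def route(prompt: str) -> list[str]:
    """Return the fallback-ordered candidate list for this prompt."""
    return TIERS[classify_complexity(prompt)]
```

In practice the classifier is the cheapest stage in the pipeline, so it runs on every request before any paid API is touched.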
The Outcome
Result
API spend collapsed from £800/mo to £30/mo (a 96% reduction). Latency improved on 70% of workflows. Provider outages are now invisible to users.
How it actually works in production.
Request
Incoming prompt
from any chain
Complexity classifier
simple / medium / hard
Cost budget gate
per workflow
Provider Pool — 6+ models
Llama 3
Hetzner OSS
Mistral
Hetzner OSS
DeepSeek
OpenRouter
Gemini Flash
GCP Vertex
Claude Sonnet
AWS Bedrock
GPT-5 / Opus
reserved · hard tasks
Decide & Execute
Route to cheapest valid
simple→OSS · hard→premium
Fallback chain
auto-retry on outage
Cost meter
£ logged per call
Response
returned to caller
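The execute-and-fallback stage above can be sketched as follows. The provider call, cost logger, and per-token prices are hypothetical stand-ins for the real SDK clients and billing figures:

```python
import time

# Illustrative sketch of the fallback chain and per-call cost meter.
# Prices and the call_provider/log_cost interfaces are assumptions.

PRICE_PER_1K_TOKENS = {"llama-3-hetzner": 0.0, "gemini-2.5-flash-vertex": 0.0001}

class AllProvidersDown(Exception):
    pass

def execute_with_fallback(candidates, call_provider, log_cost):
    """Try each candidate in cost order; auto-retry on the next one if a
    provider errors, so upstream outages never reach the caller."""
    for model in candidates:
        try:
            start = time.monotonic()
            reply, tokens = call_provider(model)        # provider SDK call
        except Exception:
            continue                                    # outage: fall through
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS.get(model, 0.0)
        log_cost(model, cost, time.monotonic() - start) # £ logged per call
        return reply
    raise AllProvidersDown(candidates)
```

Because the candidate list arrives already sorted cheapest-first, the first model to answer is by construction the cheapest valid one.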
Net Result
£800/mo
before routing
£30/mo
after routing
96% cut
0 user-visible outages
See Multi-Model LLM Router in action.
Demo video
Live request → routed → cheapest valid model wins
~ 0:30 · coming soon
Demo video
Cost dashboard scrub — £800/mo → £30/mo over 6 weeks
~ 0:45 · coming soon
Image slot
Grafana cost-per-provider dashboard
coming soon
Image slot
Routing decision tree (simple → OSS, hard → premium)
coming soon
Image slot
P50/P95/P99 latency per model
coming soon
Stack