Multi-model router with intelligent routing, cost tracking, quality monitoring, and fallback chains. Route every request to the optimal model based on complexity, cost, and latency requirements.
Production AI systems rarely use a single model. The compound AI pattern -- routing requests to different models based on complexity, cost, and latency -- is how companies actually deploy AI at scale. A simple greeting should not consume the same resources as a complex analysis. This system implements the full routing stack: a classifier that estimates query complexity, a router that selects the optimal model, fallback chains for reliability, cost tracking for budgeting, and quality monitoring to detect degradation.
The key insight: you can serve 80% of requests with a model that costs 10x less, without measurable quality loss.
```
User Query
    |
    v
[Complexity Classifier] --> simple | moderate | complex | expert
    |
    v
[Routing Engine] --> strategy (quality | cost | balanced | latency)
    |
    v
[Fallback Chain] --> primary -> secondary -> tertiary
    |
    v
[Provider] --> mock-haiku | mock-sonnet | mock-opus | mock-gpt4o | ...
    |
    v
[Cost Tracker] + [Quality Monitor] + [Request Tracer]
```
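The classifier at the top of the pipeline can be as simple as a keyword-and-length heuristic. The sketch below is illustrative only (the cue lists and thresholds are hypothetical, not the repo's actual rules), but it shows the shape of a heuristic complexity estimator:

```python
# Hypothetical heuristic complexity classifier: scores a query by
# length and keyword cues. Cue lists and thresholds are illustrative.
COMPLEX_CUES = ("compare", "analyze", "prove", "design", "architecture")
EXPERT_CUES = ("formal proof", "distributed consensus")

def classify(query: str) -> str:
    q = query.lower()
    words = len(q.split())
    if any(cue in q for cue in EXPERT_CUES) or words > 120:
        return "expert"
    if any(cue in q for cue in COMPLEX_CUES) or words > 40:
        return "complex"
    if words > 12:
        return "moderate"
    return "simple"

print(classify("Hi there"))                      # simple
print(classify("Compare transformers and RNNs")) # complex
```

A heuristic like this runs in microseconds, so classification adds no meaningful latency before the routing decision.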
| Feature | Description |
|---|---|
| Query classification | Heuristic complexity estimation (simple/moderate/complex/expert) |
| 4 routing strategies | Quality, cost, balanced, latency optimization |
| 5 mock providers | Simulating Haiku, Sonnet, Opus, GPT-4o, GPT-4o-mini |
| Fallback chains | Automatic retry with next provider on failure |
| Circuit breaker | Skip unhealthy providers (3 failures, 60s recovery) |
| Cost tracking | Per-request cost, daily budgets, provider breakdown |
| Quality monitoring | Coherence, completeness, conciseness scoring |
| Request tracing | Full audit trail for every routing decision |
| REST API | FastAPI endpoints: /route, /health, /stats, /traces |
| CLI interface | Route, classify, providers, demo commands |
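The circuit breaker above trips after 3 consecutive failures and retries the provider after 60 seconds. A minimal sketch of that behavior (illustrative; the repo's implementation may differ in its half-open handling):

```python
import time

class CircuitBreaker:
    """Skip a provider after `threshold` consecutive failures;
    allow a probe request again after `recovery_s` seconds."""

    def __init__(self, threshold: int = 3, recovery_s: float = 60.0):
        self.threshold = threshold
        self.recovery_s = recovery_s
        self.failures = 0
        self.opened_at = None  # set when the circuit opens

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.recovery_s:
            # Half-open: permit one probe; one more failure re-opens.
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

cb = CircuitBreaker()
for _ in range(3):
    cb.record(ok=False)
print(cb.allow())  # False: circuit is open after 3 failures
```

When a provider's breaker is open, the fallback chain simply moves on to the next provider instead of waiting out a timeout.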
```shell
# Install
pip install -e ".[dev]"

# Run the demo
python -m src.cli demo

# Classify a query
python -m src.cli classify "Compare transformers and RNNs"

# List available providers
python -m src.cli providers

# Start API server
make serve
```

| Strategy | Behavior | Best For |
|---|---|---|
| quality | Route to tier matching complexity | When accuracy matters most |
| cost | Downgrade tiers to minimize cost | High-volume, budget-constrained |
| balanced | Weighted cost + latency optimization | Default production use |
| latency | Route to fastest available model | Real-time applications |
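The strategies reduce to a small mapping from complexity tier to model tier. The sketch below is a simplified, hypothetical version of that mapping (the model assignments and the balanced rule are assumptions, not the repo's exact logic):

```python
# Hypothetical strategy -> model-tier mapping. Model names mirror the
# mock providers; the tier assignments themselves are illustrative.
TIERS = ["simple", "moderate", "complex", "expert"]
MODEL_FOR_TIER = {
    "simple": "mock-haiku",
    "moderate": "mock-gpt4o-mini",
    "complex": "mock-sonnet",
    "expert": "mock-opus",
}

def select_model(complexity: str, strategy: str) -> str:
    idx = TIERS.index(complexity)
    if strategy == "cost":
        idx = max(0, idx - 1)            # downgrade one tier to save cost
    elif strategy == "latency":
        idx = 0                          # always the fastest model
    elif strategy == "balanced":
        idx = min(idx, len(TIERS) - 2)   # cap at the second-highest tier
    # "quality": route to the tier matching complexity
    return MODEL_FOR_TIER[TIERS[idx]]

print(select_model("expert", "cost"))     # mock-sonnet
print(select_model("complex", "quality")) # mock-sonnet
```

The point is that each strategy is a one-line transformation of the classifier's output, which keeps routing decisions cheap and auditable.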
For a mixed workload (25% simple, 25% moderate, 25% complex, 25% expert):
| Strategy | Est. Cost per 1M Requests |
|---|---|
| Quality | ~$7,500 |
| Balanced | ~$4,200 |
| Cost | ~$1,800 |
| Latency | ~$1,500 |
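The table's totals are just a workload-weighted sum of per-request costs. The per-tier prices below are hypothetical, chosen only so the arithmetic reproduces the table's ballpark figures; real prices depend on the providers you configure:

```python
# Blended cost per 1M requests for the mixed workload above.
# Per-request prices ($) are hypothetical, for illustration only.
mix = {"simple": 0.25, "moderate": 0.25, "complex": 0.25, "expert": 0.25}

price = {
    "quality": {"simple": 0.0004, "moderate": 0.0030,
                "complex": 0.0080, "expert": 0.0186},
    "cost":    {"simple": 0.0004, "moderate": 0.0004,
                "complex": 0.0032, "expert": 0.0032},
}

def cost_per_million(strategy: str) -> float:
    # Expected cost per request, scaled to one million requests.
    return sum(mix[t] * price[strategy][t] for t in mix) * 1_000_000

print(f"${cost_per_million('quality'):,.0f}")  # $7,500
print(f"${cost_per_million('cost'):,.0f}")     # $1,800
```

This is also why the cost strategy's savings grow with the share of simple and moderate queries in the mix: the weighted sum is dominated by the cheapest tiers.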
| Method | Path | Description |
|---|---|---|
| GET | /health | System health and provider status |
| POST | /route | Route a query to the optimal provider |
| GET | /stats | Aggregate system statistics |
| GET | /traces | Recent request traces |
```shell
curl -X POST http://localhost:8000/route \
  -H "Content-Type: application/json" \
  -d '{"query": "What is machine learning?", "strategy": "balanced"}'
```

See docs/architecture.md for detailed Mermaid diagrams.
```shell
pip install -e ".[dev]"

make test   # Run tests with coverage
make lint   # Lint with ruff
make demo   # Run interactive demo
make serve  # Start API server
```

MIT
- agent-orchestrator — Multi-agent orchestration with LangGraph
- mcp-toolkit-server — MCP server toolkit for Claude AI integration
- llm-eval-framework — LLM evaluation and testing framework