Agentic systems fail silently, expensively, then catastrophically. This manual gives you the patterns, checklists, and code to prevent that, from a team that learned them the hard way while operating systems with 1.5M+ MAU.
- Principal Engineers inheriting or building agentic systems
- CTOs/VPEs who need to understand AI operational risk
- AI/ML Engineers shipping to production (not just prototyping)
- Platform Teams building shared AI infrastructure
This manual is for production systems where failure has consequences.
Add this to your inference calls today:
```python
import uuid


def log_inference(request, response, model_version, prompt_version):
    """Build one structured log record per inference call."""
    return {
        "trace_id": str(uuid.uuid4()),
        "trigger_type": request.get("trigger", "user_explicit"),
        "model_version": model_version,
        "prompt_version": prompt_version,
        # calculate_cost is your own helper that prices response.usage.
        "cost_usd": calculate_cost(response.usage),
        "state": "speculative",  # → "committed" when user accepts
    }
```

Then run weekly:
```sql
SELECT SUM(cost_usd) / NULLIF(COUNT(DISTINCT CASE
         WHEN state = 'committed' THEN trace_id END), 0) AS cost_per_outcome
FROM inference_logs
WHERE created_at > NOW() - INTERVAL '7 days';
```

If cost per outcome is rising, you have a problem → Cost Investigation
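The same metric can be computed in application code, for example in a notebook or a unit test. This is a minimal sketch, assuming records shaped like the dictionary returned by `log_inference`; the `cost_per_outcome` function and the sample values are hypothetical and not part of this manual.

```python
from typing import Iterable, Mapping, Optional


def cost_per_outcome(records: Iterable[Mapping]) -> Optional[float]:
    """Total spend divided by the number of distinct committed traces.

    Returns None when nothing was committed, mirroring NULLIF(..., 0) in the SQL.
    """
    total_cost = 0.0
    committed = set()
    for record in records:
        total_cost += record["cost_usd"]  # speculative calls still cost money
        if record["state"] == "committed":
            committed.add(record["trace_id"])
    if not committed:
        return None
    return total_cost / len(committed)


# Hypothetical sample: three inferences, only one accepted by the user.
sample_logs = [
    {"trace_id": "a", "cost_usd": 0.25, "state": "committed"},
    {"trace_id": "b", "cost_usd": 0.25, "state": "speculative"},
    {"trace_id": "c", "cost_usd": 0.5, "state": "speculative"},
]
print(cost_per_outcome(sample_logs))
```

Speculative calls count toward the numerator but not the denominator, so the metric rises whenever users trigger compute they never accept.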
| Your Situation | Go To |
|---|---|
| Just inherited a system | System Assessment → Failure Modes |
| Building something new | State Model → Interaction Contract |
| In crisis mode | Crisis Playbook |
| Costs are spiking | Cost Investigation |
| Presenting to leadership | Board Explainer |
| Setting up observability | Metrics Reference |
| Preparing for an audit | Audit Preparation |
Tip: First time here? Run the 10-minute assessment to identify your gaps and get personalized recommendations.
Agentic systems break in four ways:
| Failure Mode | Signal | Fix |
|---|---|---|
| Legibility Loss | "Why did it do that?" takes hours | Decision envelopes |
| Control Surface Drift | Costs rise, traffic flat | Interaction contracts |
| Auditability Gap | Can show outputs, not rationale | Provenance logging |
| Margin Fragility | Success destroys margin | Cost-per-outcome tracking |
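The Control Surface Drift signal in the table (costs rise while traffic stays flat) can be checked mechanically. This is a minimal sketch, assuming you already have weekly totals for cost and request count; the function name, the input shape, and the 10% and 5% thresholds are illustrative assumptions, not values from this manual.

```python
def shows_control_surface_drift(
    prev_cost: float,
    curr_cost: float,
    prev_requests: int,
    curr_requests: int,
    cost_growth_threshold: float = 0.10,   # assumed: flag cost growth above 10%
    traffic_flat_threshold: float = 0.05,  # assumed: traffic within +/-5% is "flat"
) -> bool:
    """Return True when cost grew noticeably while request volume stayed flat."""
    if prev_cost <= 0 or prev_requests <= 0:
        return False  # not enough history to compare
    cost_growth = (curr_cost - prev_cost) / prev_cost
    traffic_growth = (curr_requests - prev_requests) / prev_requests
    return (
        cost_growth > cost_growth_threshold
        and abs(traffic_growth) <= traffic_flat_threshold
    )


# Cost up 30% with traffic up 1%: drift.
print(shows_control_surface_drift(1000.0, 1300.0, 50_000, 50_500))
# Cost and traffic both up 30%: growth, not drift.
print(shows_control_surface_drift(1000.0, 1300.0, 50_000, 65_000))
```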
These decisions harden quickly—get them right early:
| Decision | Controls | Why It Hardens |
|---|---|---|
| State Model | What you persist | Downstream systems depend on schema |
| Interaction Contract | What triggers compute | Users form habits around UX |
| Control Plane Ownership | What you own vs rent | Contracts and migrations lock in |
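Since the Interaction Contract decides what triggers compute, one way to make it explicit is a lookup from trigger type to policy. This is a sketch under stated assumptions: `TriggerPolicy`, `INTERACTION_CONTRACT`, `policy_for`, and every trigger name except `user_explicit` are hypothetical, and the manual's own Interaction Contract document may define this differently.

```python
from enum import Enum


class TriggerPolicy(Enum):
    ALLOW = "allow"        # may call the model immediately
    DEBOUNCE = "debounce"  # may call the model, but only after input settles
    DENY = "deny"          # must never call the model


# Hypothetical contract. Only "user_explicit" appears in the logging snippet;
# "keystroke" and "undo" are illustrative trigger names.
INTERACTION_CONTRACT = {
    "user_explicit": TriggerPolicy.ALLOW,
    "keystroke": TriggerPolicy.DEBOUNCE,
    "undo": TriggerPolicy.DENY,
}


def policy_for(trigger_type: str) -> TriggerPolicy:
    """Look up the policy for a trigger; unknown triggers are denied by default."""
    return INTERACTION_CONTRACT.get(trigger_type, TriggerPolicy.DENY)


print(policy_for("user_explicit"))
print(policy_for("page_scroll"))
```

Denying unknown triggers by default means a new UI action cannot start invoking inference until someone adds it to the contract deliberately.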
| Story | Lesson |
|---|---|
| The Undo Button That Killed Our Margin | Hidden recompute from "free" UI actions |
| Why We Rebuilt State Twice | State model closes reversibility window |
| The Compliance Question We Couldn't Answer | Auditability gaps kill enterprise deals |
| From API to Owned in 90 Days | When to transition inference ownership |
| Resource | Purpose |
|---|---|
| Quick Reference Card | One-page summary - print this |
| System Assessment | 10-minute self-assessment with scoring |
| Adoption Guide | How to implement these patterns incrementally |
| Anti-Patterns | Common mistakes to avoid |
| Glossary | All terms with technical and executive definitions |
| Metrics Reference | Formulas and queries for every metric |
| Examples | Production code and schemas |
| Templates | Documents you copy and fill in |
Complete navigation by topic:
| Topic | Document |
|---|---|
| State persistence | State Model |
| User action triggers | Interaction Contract |
| Build vs buy | API vs Owned |
| Control plane | Control Plane Ownership |
| Multi-agent orchestration | Orchestration |
| Tool failure handling | Tool Reliability |
| Topic | Document |
|---|---|
| Cost investigation | Cost Investigation |
| Cost per outcome | Cost Model |
| Hidden recompute | Hidden Recompute |
| Capacity planning | Capacity Planning |
| Margin at scale | Margin Fragility |
| Topic | Document |
|---|---|
| Evals and regression | Eval and Regression |
| Latency and SLOs | Latency and SLOs |
| Rollout and rollback | Rollout and Rollback |
| Safety and guardrails | Safety Surface |
| Human oversight | Human in the Loop |
| Topic | Document |
|---|---|
| Audit preparation | Audit Preparation |
| Auditability requirements | Auditability |
| Data sovereignty | Sovereignty |
| Operational independence | Operational Independence |
| Data privacy | Data Privacy |
| Situation | Template |
|---|---|
| System failing | Crisis Playbook |
| Costs spiked | Cost Spike Runbook |
| Weekly review | Weekly Ops Checklist |
| After incident | Incident Post-Mortem |
| Before shipping | Pre-Ship Checklist |
| Architecture decision | Decision Record |
Found a bug? Have a war story to share? See CONTRIBUTING.md.
Rade Joksimovic - Principal engineer focused on AI systems at scale.
- 15+ years building SaaS systems
- Recent focus: LLM-driven products and agentic infrastructure
- Scale: 1.5M+ MAU, 30M+ monthly API calls, 50K+ orchestrated agents
Thanks to the teams at Filekit, ShortlistIQ, Olovka, and Rumora for battle-testing these patterns across diverse agentic architectures—from document generation to autonomous interviewing to social media orchestration.
Special thanks to everyone who shared war stories, reported issues, and contributed patterns from their own production systems.
If you can explain the output, you have control.
If you cannot, you are negotiating with your own product.