Production operations framework for AI-powered SaaS. The architectural patterns, failure modes, and operational playbooks that determine whether your AI systems scale profitably or fail expensively.

whoisrade/agentic-field-manual

Agentic Field Manual

The missing operations manual for production AI systems.


License: MIT

Quick Start · Assess Your System · Crisis Playbook · Quick Reference


Agentic systems fail silently, expensively, then catastrophically. This manual gives you the patterns, checklists, and code to prevent that, from a team that learned them the hard way while operating systems with 1.5M+ MAU.


Who This Is For

  • Principal Engineers inheriting or building agentic systems
  • CTOs/VPEs who need to understand AI operational risk
  • AI/ML Engineers shipping to production (not just prototyping)
  • Platform Teams building shared AI infrastructure

This manual is for production systems where failure has consequences.


Quick Start

Add this to your inference calls today:

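import uuid
from datetime import datetime, timezone

def log_inference(request, response, model_version, prompt_version):
    # calculate_cost() is assumed to be defined elsewhere in your codebase.
    return {
        "trace_id": str(uuid.uuid4()),
        "created_at": datetime.now(timezone.utc).isoformat(),  # assumed column; the weekly query filters on it
        "trigger_type": request.get("trigger", "user_explicit"),
        "model_version": model_version,
        "prompt_version": prompt_version,
        "cost_usd": calculate_cost(response.usage),
        "state": "speculative",  # → "committed" when user accepts
    }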
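
The snippet above calls `calculate_cost`, which is not defined in this README. A minimal sketch of one way to write it, assuming `response.usage` exposes `prompt_tokens` and `completion_tokens` (an assumption about your provider's response shape) and using placeholder prices that you would replace with your real rates:

```python
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD; replace with your provider's actual rates.
PRICE_PER_1K_INPUT_USD = 0.0005
PRICE_PER_1K_OUTPUT_USD = 0.0015


@dataclass
class Usage:
    """Assumed shape of response.usage: token counts for a single call."""
    prompt_tokens: int
    completion_tokens: int


def calculate_cost(usage: Usage) -> float:
    """Estimate the cost of one inference call in USD from its token counts."""
    input_cost = usage.prompt_tokens / 1000 * PRICE_PER_1K_INPUT_USD
    output_cost = usage.completion_tokens / 1000 * PRICE_PER_1K_OUTPUT_USD
    return round(input_cost + output_cost, 6)
```

If you serve more than one model, the prices would likely need to be looked up per `model_version` rather than held in module-level constants.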

Then run weekly:

SELECT SUM(cost_usd) / NULLIF(COUNT(DISTINCT CASE 
  WHEN state = 'committed' THEN trace_id END), 0) as cost_per_outcome
FROM inference_logs WHERE created_at > NOW() - INTERVAL '7 days';

If cost per outcome is rising, you have a problem → Cost Investigation
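
If your logs are not in SQL yet, the same metric can be computed in application code. The sketch below mirrors the weekly query under two assumptions: the records are dictionaries with the fields logged in Quick Start, and the caller has already filtered them to the last 7 days.

```python
from typing import Iterable, Mapping, Optional


def cost_per_outcome(records: Iterable[Mapping]) -> Optional[float]:
    """Total spend divided by the number of distinct committed traces.

    Returns None when nothing was committed, which mirrors the
    NULLIF(..., 0) guard in the SQL version.
    """
    total_cost = 0.0
    committed_traces = set()
    for record in records:
        # Speculative calls still cost money, so every record counts toward spend.
        total_cost += record["cost_usd"]
        if record["state"] == "committed":
            committed_traces.add(record["trace_id"])
    if not committed_traces:
        return None
    return total_cost / len(committed_traces)


logs = [
    {"trace_id": "a", "state": "committed", "cost_usd": 0.5},
    {"trace_id": "b", "state": "speculative", "cost_usd": 0.25},
    {"trace_id": "c", "state": "speculative", "cost_usd": 0.25},
]
print(cost_per_outcome(logs))  # 1.0
```

In this example only one of three calls was committed, so the single outcome carries the cost of all three calls.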


Start Here

| Your Situation | Go To |
|---|---|
| Just inherited a system | System Assessment, Failure Modes |
| Building something new | State Model, Interaction Contract |
| In crisis mode | Crisis Playbook |
| Costs are spiking | Cost Investigation |
| Presenting to leadership | Board Explainer |
| Setting up observability | Metrics Reference |
| Preparing for an audit | Audit Preparation |

Tip

First time here? Run the 10-minute assessment to identify your gaps and get personalized recommendations.


The 4 Failure Modes

These are how agentic systems break:

| Failure Mode | Signal | Fix |
|---|---|---|
| Legibility Loss | "Why did it do that?" takes hours | Decision envelopes |
| Control Surface Drift | Costs rise, traffic flat | Interaction contracts |
| Auditability Gap | Can show outputs, not rationale | Provenance logging |
| Margin Fragility | Success destroys margin | Cost-per-outcome tracking |
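
The Control Surface Drift signal ("costs rise, traffic flat") can be checked mechanically. The sketch below is one possible check, not the manual's own implementation: it compares week-over-week growth in spend against growth in request volume, and the function names and the 10% tolerance are hypothetical.

```python
def growth(previous: float, current: float) -> float:
    """Fractional change from previous to current (0.10 means +10%)."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous


def looks_like_drift(
    prev_cost: float,
    cur_cost: float,
    prev_requests: float,
    cur_requests: float,
    tolerance: float = 0.10,  # hypothetical threshold; tune to your own noise level
) -> bool:
    """Flag when cost grows faster than traffic by more than the tolerance."""
    return growth(prev_cost, cur_cost) - growth(prev_requests, cur_requests) > tolerance


# Spend up 30% while requests are up 2%: worth investigating.
print(looks_like_drift(1000.0, 1300.0, 50_000, 51_000))  # True
```

When cost and traffic grow at the same rate the check stays quiet, so ordinary growth does not raise an alert.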

The 3 Irreversible Decisions

These decisions harden quickly—get them right early:

| Decision | Controls | Why It Hardens |
|---|---|---|
| State Model | What you persist | Downstream systems depend on schema |
| Interaction Contract | What triggers compute | Users form habits around UX |
| Control Plane Ownership | What you own vs rent | Contracts and migrations lock in |
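
To make "what triggers compute" concrete, here is a small sketch of how an interaction contract might be written down in code. The linked Interaction Contract document is the authoritative description; the action names, the `UI_IMPLICIT` trigger type, and the deny-by-default rule below are all illustrative assumptions. Only `user_explicit` comes from the Quick Start snippet.

```python
from enum import Enum


class Trigger(Enum):
    USER_EXPLICIT = "user_explicit"  # default trigger value used in Quick Start
    UI_IMPLICIT = "ui_implicit"      # hypothetical: side effects such as undo


# Hypothetical contract: every UI action is mapped to a trigger type up front.
INTERACTION_CONTRACT = {
    "submit_prompt": Trigger.USER_EXPLICIT,
    "regenerate": Trigger.USER_EXPLICIT,
    "undo": Trigger.UI_IMPLICIT,
}


def may_trigger_inference(action: str) -> bool:
    """Allow compute only for explicit user requests; unknown actions are denied."""
    return INTERACTION_CONTRACT.get(action) is Trigger.USER_EXPLICIT


print(may_trigger_inference("submit_prompt"))  # True
print(may_trigger_inference("undo"))           # False
```

Denying unlisted actions by default means a new UI feature has to be added to the contract before it can spend compute.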

Case Studies

| Story | Lesson |
|---|---|
| The Undo Button That Killed Our Margin | Hidden recompute from "free" UI actions |
| Why We Rebuilt State Twice | State model closes reversibility window |
| The Compliance Question We Couldn't Answer | Auditability gaps kill enterprise deals |
| From API to Owned in 90 Days | When to transition inference ownership |

Reference

| Resource | Purpose |
|---|---|
| Quick Reference Card | One-page summary - print this |
| System Assessment | 10-minute self-assessment with scoring |
| Adoption Guide | How to implement these patterns incrementally |
| Anti-Patterns | Common mistakes to avoid |
| Glossary | All terms with technical and executive definitions |
| Metrics Reference | Formulas and queries for every metric |
| Examples | Production code and schemas |
| Templates | Documents you copy and fill in |

Full Topic Index

Architecture and Design

| Topic | Document |
|---|---|
| State persistence | State Model |
| User action triggers | Interaction Contract |
| Build vs buy | API vs Owned |
| Control plane | Control Plane Ownership |
| Multi-agent orchestration | Orchestration |
| Tool failure handling | Tool Reliability |

Cost and Economics

| Topic | Document |
|---|---|
| Cost investigation | Cost Investigation |
| Cost per outcome | Cost Model |
| Hidden recompute | Hidden Recompute |
| Capacity planning | Capacity Planning |
| Margin at scale | Margin Fragility |

Quality and Reliability

| Topic | Document |
|---|---|
| Evals and regression | Eval and Regression |
| Latency and SLOs | Latency and SLOs |
| Rollout and rollback | Rollout and Rollback |
| Safety and guardrails | Safety Surface |
| Human oversight | Human in the Loop |

Compliance

| Topic | Document |
|---|---|
| Audit preparation | Audit Preparation |
| Auditability requirements | Auditability |
| Data sovereignty | Sovereignty |
| Operational independence | Operational Independence |
| Data privacy | Data Privacy |

Templates

| Situation | Template |
|---|---|
| System failing | Crisis Playbook |
| Costs spiked | Cost Spike Runbook |
| Weekly review | Weekly Ops Checklist |
| After incident | Incident Post-Mortem |
| Before shipping | Pre-Ship Checklist |
| Architecture decision | Decision Record |

Contributing

Found a bug? Have a war story to share? See CONTRIBUTING.md.


About the Author

Rade Joksimovic - Principal engineer focused on AI systems at scale.

  • 15+ years building SaaS systems
  • Recent focus: LLM-driven products and agentic infrastructure
  • Scale: 1.5M+ MAU, 30M+ monthly API calls, 50K+ orchestrated agents

Twitter · LinkedIn · Email


Acknowledgments

Thanks to the teams at Filekit, ShortlistIQ, Olovka, and Rumora for battle-testing these patterns across diverse agentic architectures—from document generation to autonomous interviewing to social media orchestration.

Special thanks to everyone who shared war stories, reported issues, and contributed patterns from their own production systems.


If you can explain the output, you have control.
If you cannot, you are negotiating with your own product.

