Interactive environments for AI agent evaluation and RL training on replicas of third-party APIs like Linear or Slack.
Run it locally (or deploy it). Agents call sandboxed replicas of APIs that behave like the real ones, and you get deterministic diffs of every state change — no external services, no side effects, no rate limits.
Website • Docs • Paper • Feedback
| Example | Description |
|---|---|
| LangChain Agent | Run the AgentDiff Benchmark (LangChain Agents) |
| ReAct Agent (Paper) | Run the AgentDiff Benchmark (ReAct) |
| Custom Evaluations Demo | Write your own assertions & evaluate agents |
| Prime Intellect | Run evals or RL training |
Python: Python SDK docs

```bash
uv add agent-diff
```

TypeScript: TS SDK docs

```bash
npm install agent-diff
```

**Hosted**
- Sign up at agentdiff.dev and get your API key
- Set environment variables:
```bash
export AGENT_DIFF_API_KEY="ad_live_sk_..."
export AGENT_DIFF_BASE_URL="https://api.agentdiff.dev"
```
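With these set, the SDK client needs no constructor arguments. A minimal sketch, assuming `AgentDiff()` picks the key and base URL up from the environment (the no-argument constructor in the quickstart below suggests it does):

```python
from agent_diff import AgentDiff

# Assumes AGENT_DIFF_API_KEY and AGENT_DIFF_BASE_URL are exported as above.
client = AgentDiff()
```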
**Self-Hosted**

```bash
git clone https://github.com/agent-diff-bench/agent-diff.git
cd agent-diff/ops
docker-compose up --build
# Backend runs on http://localhost:8000
```

**Quickstart**

```python
from agent_diff import AgentDiff

client = AgentDiff()
# Create an isolated environment from a template
env = client.init_env(
templateService="slack",
templateName="slack_default",
impersonateUserId="U01AGENBOT9",
)
# Snapshot before agent runs
run = client.start_run(envId=env.environmentId)
# --- Your agent interacts with the API here ---
# SDK provides code execution proxies (Python/Bash) for OpenAI Agents, LangChain, etc.
# Agent writes normal code (e.g. requests.post('https://slack.com/api/chat.postMessage', ...))
# which is automatically intercepted and routed to the sandboxed environment.
from agent_diff import BashExecutorProxy, create_openai_tool
bash = BashExecutorProxy(env.environmentId)
tool = create_openai_tool(bash) # also: create_langchain_tool, create_smolagents_tool
# Compute state diff and inspect changes
diff = client.diff_run(runId=run.runId)
print(diff.diff['inserts']) # new records created by agent
print(diff.diff['updates']) # modified records
print(diff.diff['deletes']) # deleted records
# Clean up
client.delete_env(envId=env.environmentId)
```

See the Python SDK and TS SDK for full reference.
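To make the "your agent interacts here" step concrete, here is a sketch of wiring the proxy tool into an agent loop. It assumes the object returned by `create_openai_tool` is directly compatible with the OpenAI Agents SDK (`agents` package); the agent instructions and task string are illustrative. Swap in `create_langchain_tool` or `create_smolagents_tool` for other frameworks.

```python
from agents import Agent, Runner  # OpenAI Agents SDK

from agent_diff import AgentDiff, BashExecutorProxy, create_openai_tool

client = AgentDiff()
env = client.init_env(
    templateService="slack",
    templateName="slack_default",
    impersonateUserId="U01AGENBOT9",
)
run = client.start_run(envId=env.environmentId)

# The proxy intercepts the agent's HTTP calls and routes them to the sandbox.
bash = BashExecutorProxy(env.environmentId)
agent = Agent(
    name="slack-agent",
    instructions="Complete the user's task against the Slack API.",
    tools=[create_openai_tool(bash)],
)
Runner.run_sync(agent, "Post 'hello' to #general.")

# Every state change the agent made shows up in the diff.
diff = client.diff_run(runId=run.runId)
print(diff.diff["inserts"])
client.delete_env(envId=env.environmentId)
```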
| Service | Type | Endpoints | Coverage |
|---|---|---|---|
| Box | REST | 27 | Files, folders, search, comments, tags, shared links, hubs, versioning |
| Google Calendar | REST | 37 | Calendars, events, recurring series, free/busy, ACL, push notifications |
| Linear | GraphQL | 19 | Issues, teams, workflow states, labels, comments, relations, memberships |
| Slack | Web API | 25 | Conversations, messaging, reactions, threading, users, channels |
108 unique endpoints across all 4 services.
Templates are pre-configured database schemas that serve as the starting point for test environments. Think of them as snapshots of a service's state:
- **Location**: Templates live in PostgreSQL schemas (e.g., `slack_default`, `box_default`, `linear_expanded`, `calendar_base`)
- **Content**: Seeded with realistic data — users, channels, messages, files, folders, issues, calendar events, etc.
- **Seeds**: box | calendar | linear | slack
Environments are isolated, temporary copies of a template schema:
- **URL**: Each environment has a unique service URL (e.g., `http://localhost:8000/api/env/{env_id}/services/slack`)
- **Creation**: `client.init_env(templateService="slack", templateName="slack_default", impersonateUserId="U01AGENBOT9")`
- **Cleanup**: `client.delete_env(envId)` or auto-expires after TTL
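Because an environment speaks the same protocol as the real service, you can also hit it directly over HTTP. A minimal sketch against a self-hosted backend using the Slack replica's `chat.postMessage`; the path suffix and the bearer-token auth header are assumptions, so see the API Reference for the exact scheme:

```python
import requests

env_id = "env_123"  # hypothetical ID returned by client.init_env(...)
base = f"http://localhost:8000/api/env/{env_id}/services/slack"

# Same request shape as the real Slack Web API, but no real workspace is touched.
resp = requests.post(
    f"{base}/api/chat.postMessage",  # path mirroring slack.com/api/... is an assumption
    headers={"Authorization": "Bearer <token>"},  # auth scheme is an assumption
    json={"channel": "#general", "text": "hello from the sandbox"},
)
print(resp.json())
```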
- Prime Intellect — Run evals or RL training with no setup required
- Colab Notebooks — Run locally with the example notebooks above
- Dataset — 224 tasks across all 4 services (80/20 train/test split)
The Agent-Diff benchmark comprises 224 tasks across four enterprise services, each evaluated via deterministic state-diff contracts. Tasks range from single-step CRUD operations to long-horizon, multi-entity workflows requiring search, conditional logic, and coordinated state changes.
| Metric | Box | Calendar | Linear | Slack | Total |
|---|---|---|---|---|---|
| Tasks | 48 | 60 | 57 | 59 | 224 |
| Task horizon n* (range) | 1–13 | 1–24 | 1–13 | 1–14 | 1–24 |
| Task horizon n* (mean) | 4.6 | 5.9 | 5.2 | 5.6 | 5.3 |
| Operation profile (% of tasks, non-exclusive) | | | | | |
| Search | 92 | 77 | 89 | 64 | 80 |
| Create | 58 | 78 | 63 | 88 | 73 |
| Read | 54 | 82 | 14 | 68 | 55 |
| Update | 62 | 93 | 70 | 37 | 66 |
| Delete | 19 | 53 | 7 | 24 | 26 |
Tasks are characterized along five dimensions: task horizon (minimum API calls under an optimal policy), operation profile (which CRUD primitives are required), entity scope (single vs. multi-entity state changes), information availability (whether identifiers are given explicitly or must be discovered), and prompt ambiguity (how underspecified the target is).
| Model | Box | Calendar | Linear | Slack | Overall | Pass % | Cost/test | Score/$ |
|---|---|---|---|---|---|---|---|---|
| deepseek-v3.2 | 76.6 | 87.5 | 94.8 | 86.1 | 88.1 | 76 | $0.03 | 2,938 |
| devstral-2512 | 79.0 | 80.0 | 91.5 | 85.7 | 86.0 | 74 | $0.08 | 1,075 |
| qwen3-vl-235b | 68.4 | 71.0 | 82.0 | 75.8 | 79.2 | 65 | $0.02 | 3,959 |
| kimi-k2-0905 | 66.5 | 72.3 | 88.2 | 82.2 | 75.4 | 64 | $0.04 | 1,885 |
| grok-4.1-fast | 58.5 | 75.7 | 66.0 | 77.1 | 74.9 | 52 | $0.01 | 7,489 |
| gemini-3-flash | 80.3 | 62.2 | 84.0 | 77.5 | 73.8 | 67 | $0.05 | 1,477 |
| gpt-oss-120b | 70.1 | 68.4 | 79.5 | 69.1 | 68.5 | 60 | $0.02 | 3,428 |
| claude-haiku-4.5 | 45.1 | 57.8 | 35.6 | 57.3 | 49.3 | 50 | $0.22 | 224 |
| llama-4-scout | 33.7 | 41.4 | 20.9 | 42.9 | 38.0 | 29 | $0.02 | 1,900 |
Per-service assertion-weighted scores (95% Bayesian CrI); Score/$ is the overall score divided by the per-test cost. No-docs baseline: agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation ablation results in the paper.
| Service | Test Suite | Tests | Coverage |
|---|---|---|---|
| Box | box_bench.json | 48 | File/folder ops, search, tags, comments, hubs, versioning |
| Calendar | calendar_bench.json | 60 | Event CRUD, recurring events, free/busy, ACL, lifecycle |
| Linear | linear_bench.json | 57 | Issues, labels, comments, workflow states, teams |
| Slack | slack_bench.json | 59 | Messages, channels, reactions, threading |
Each test defines expected state changes via declarative assertions. See the assertions docs for how they work.
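Conceptually, an assertion checks the same `inserts` / `updates` / `deletes` buckets that `diff_run` returns. The sketch below is purely illustrative; every field name in it is hypothetical, and the real DSL is specified in the assertions docs:

```python
# Hypothetical assertion shape, mirroring the diff structure from the quickstart.
# Field names ("table", "where", "contains") are invented for illustration.
expected = {
    "inserts": [
        # The agent must have created one message in #general containing "hello".
        {"table": "messages", "where": {"channel": "#general"}, "contains": {"text": "hello"}},
    ],
    "updates": [],
    "deletes": [],
}
```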
- Python SDK — Full Python SDK reference
- TypeScript SDK — Full TypeScript SDK reference
- Assertions & Evaluation DSL — Write test assertions
- API Reference — REST API documentation
- Self-Hosting — Docker setup & configuration
If you use Agent-Diff in your research, please cite:
@article{pysklo2025agentdiff,
title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
journal={arXiv preprint arXiv:2602.11224},
year={2025}
}