
Agent Diff

Interactive environments for AI agent evaluation and RL training on replicas of third-party APIs like Linear or Slack.

Run it locally (or deploy it). Agents call sandboxed replicas of APIs that behave like the real ones, and you get deterministic diffs of every state change — no external services, no side effects, no rate limits.

arXiv · HuggingFace

Website · Docs · Paper · Feedback

Try it now

Notebook | Description | Link
LangChain Agent | Run AgentDiff Benchmark (LangChain Agents) | Open In Colab
ReAct Agent (Paper) | Run AgentDiff Benchmark (ReAct) | Open In Colab
Custom Evaluations Demo | Write your own assertions & evaluate agents | Open In Colab
Prime Intellect | Run evals or RL training | Prime Intellect

Quick Start

1. Install SDK

Python: Python SDK docs

uv add agent-diff

TypeScript: TS SDK docs

npm install agent-diff

2. Configure

Hosted
  1. Sign up at agentdiff.dev and get your API key
  2. Set environment variables:
export AGENT_DIFF_API_KEY="ad_live_sk_..."
export AGENT_DIFF_BASE_URL="https://api.agentdiff.dev"
Self-Hosted
git clone https://github.com/agent-diff-bench/agent-diff.git
cd agent-diff/ops
docker-compose up --build
# Backend runs on http://localhost:8000
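
When self-hosting, point the SDK at your local backend. Assuming the SDK reads the same environment variables as in the hosted setup:

export AGENT_DIFF_BASE_URL="http://localhost:8000"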

3. Use

from agent_diff import AgentDiff

client = AgentDiff()

# Create an isolated environment from a template
env = client.init_env(
    templateService="slack",
    templateName="slack_default",
    impersonateUserId="U01AGENBOT9",
)

# Snapshot before agent runs
run = client.start_run(envId=env.environmentId)

# --- Your agent interacts with the API here ---
# SDK provides code execution proxies (Python/Bash) for OpenAI Agents, LangChain, etc.
# Agent writes normal code (e.g. requests.post('https://slack.com/api/chat.postMessage', ...))
# which is automatically intercepted and routed to the sandboxed environment.

from agent_diff import BashExecutorProxy, create_openai_tool
bash = BashExecutorProxy(env.environmentId)
tool = create_openai_tool(bash)  # also: create_langchain_tool, create_smolagents_tool

# Compute state diff and inspect changes
diff = client.diff_run(runId=run.runId)
print(diff.diff['inserts'])   # new records created by agent
print(diff.diff['updates'])   # modified records
print(diff.diff['deletes'])   # deleted records

# Clean up
client.delete_env(envId=env.environmentId)
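
The diff object also doubles as a lightweight evaluation hook: you can assert on it directly before cleanup. A minimal sketch using only the keys shown above:

# Sketch: fail the run unless the agent created records and deleted nothing.
assert diff.diff['inserts'], "expected the agent to create new records"
assert not diff.diff['deletes'], "the agent should not have deleted anything"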

See the Python SDK and TS SDK for full reference.

Supported APIs

Service | Type | Endpoints | Coverage
Box | REST | 27 | Files, folders, search, comments, tags, shared links, hubs, versioning
Google Calendar | REST | 37 | Calendars, events, recurring series, free/busy, ACL, push notifications
Linear | GraphQL | 19 | Issues, teams, workflow states, labels, comments, relations, memberships
Slack | Web API | 25 | Conversations, messaging, reactions, threading, users, channels

108 unique endpoints across all 4 services.

Templates, Seeds & Environments

Templates are pre-configured database schemas that serve as the starting point for test environments. Think of them as snapshots of a service's state:

  • Location: Templates live in PostgreSQL schemas (e.g., slack_default, box_default, linear_expanded, calendar_base)
  • Content: Seeded with realistic data — users, channels, messages, files, folders, issues, calendar events, etc.
  • Seeds: box | calendar | linear | slack

Environments are isolated, temporary copies of a template schema:

  • URL: Each environment has a unique service URL (e.g., http://localhost:8000/api/env/{env_id}/services/slack)
  • Creation: client.init_env(templateService="slack", templateName="slack_default", impersonateUserId="U01AGENBOT9")
  • Cleanup: client.delete_env(envId) or auto-expires after TTL
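
Since each environment is just an HTTP endpoint, you can also call the replica directly, without the executor proxies. A sketch against a self-hosted backend, assuming the replica mirrors Slack's real Web API paths under the environment URL (the exact route layout and auth requirements are assumptions; check the docs):

import requests

# Per-environment base URL for the Slack replica (self-hosted backend).
base = f"http://localhost:8000/api/env/{env.environmentId}/services/slack"

# Call the replica the same way an agent would call Slack's real Web API.
resp = requests.post(f"{base}/api/chat.postMessage",
                     json={"channel": "#general", "text": "hello from the sandbox"})
print(resp.json())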

Run Evaluations

  • Prime Intellect — Run evals or RL training with no setup required
  • Colab Notebooks — Run locally with the example notebooks above
  • Dataset — 224 tasks across all 4 services (80/20 train/test split)

Benchmark

The Agent-Diff benchmark comprises 224 tasks across four enterprise services, each evaluated via deterministic state-diff contracts. Tasks range from single-step CRUD operations to long-horizon, multi-entity workflows requiring search, conditional logic, and coordinated state changes.

Task Distribution

Metric | Box | Calendar | Linear | Slack | Total
Tasks | 48 | 60 | 57 | 59 | 224
Task horizon n* (range) | 1–13 | 1–24 | 1–13 | 1–14 | 1–24
Task horizon n* (mean) | 4.6 | 5.9 | 5.2 | 5.6 | 5.3
Operation profile (% of tasks, non-exclusive):
  Search | 92 | 77 | 89 | 64 | 80
  Create | 58 | 78 | 63 | 88 | 73
  Read | 54 | 82 | 14 | 68 | 55
  Update | 62 | 93 | 70 | 37 | 66
  Delete | 19 | 53 | 7 | 24 | 26

Tasks are characterized along five dimensions: task horizon (minimum API calls under an optimal policy), operation profile (which CRUD primitives are required), entity scope (single vs. multi-entity state changes), information availability (whether identifiers are given explicitly or must be discovered), and prompt ambiguity (how underspecified the target is).

Results (No-Docs Baseline)

Model | Box | Calendar | Linear | Slack | Overall | Pass % | Cost/test | Score/$
deepseek-v3.2 | 76.6 | 87.5 | 94.8 | 86.1 | 88.1 | 76 | $0.03 | 2,938
devstral-2512 | 79.0 | 80.0 | 91.5 | 85.7 | 86.0 | 74 | $0.08 | 1,075
qwen3-vl-235b | 68.4 | 71.0 | 82.0 | 75.8 | 79.2 | 65 | $0.02 | 3,959
kimi-k2-0905 | 66.5 | 72.3 | 88.2 | 82.2 | 75.4 | 64 | $0.04 | 1,885
grok-4.1-fast | 58.5 | 75.7 | 66.0 | 77.1 | 74.9 | 52 | $0.01 | 7,489
gemini-3-flash | 80.3 | 62.2 | 84.0 | 77.5 | 73.8 | 67 | $0.05 | 1,477
gpt-oss-120b | 70.1 | 68.4 | 79.5 | 69.1 | 68.5 | 60 | $0.02 | 3,428
claude-haiku-4.5 | 45.1 | 57.8 | 35.6 | 57.3 | 49.3 | 50 | $0.22 | 224
llama-4-scout | 33.7 | 41.4 | 20.9 | 42.9 | 38.0 | 29 | $0.02 | 1,900

Per-service scores are assertion-weighted, with 95% Bayesian credible intervals (CrI); Score/$ is the overall score divided by the per-test cost. In the no-docs baseline, agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation-ablation results are in the paper.

Test Suites

Service | Test Suite | Tests | Coverage
Box | box_bench.json | 48 | File/folder ops, search, tags, comments, hubs, versioning
Calendar | calendar_bench.json | 60 | Event CRUD, recurring events, free/busy, ACL, lifecycle
Linear | linear_bench.json | 57 | Issues, labels, comments, workflow states, teams
Slack | slack_bench.json | 59 | Messages, channels, reactions, threading

Each test defines expected state changes via declarative assertions. See the assertions docs for how they work.
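
For orientation, an assertion might look something like the sketch below, written as a Python dict. The field names here are purely illustrative, not the real schema; see the assertions docs and the bench JSON files for the actual format:

# Hypothetical assertion shape (illustrative only, not the real schema):
# "after the run, a new message containing 'hello' exists in #general".
assertion = {
    "operation": "insert",
    "table": "messages",
    "match": {"channel": "#general"},
    "expect": {"text_contains": "hello"},
}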


Documentation

Citation

If you use Agent-Diff in your research, please cite:

@article{pysklo2025agentdiff,
  title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
  author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
  journal={arXiv preprint arXiv:2602.11224},
  year={2025}
}
