16 changes: 13 additions & 3 deletions README.md
@@ -5,11 +5,16 @@
Run it locally (or deploy it). Agents call sandboxed replicas of APIs that behave like the real ones, and you get deterministic diffs of every state change — no external services, no side effects, no rate limits.

<p align="center">
<a href="https://arxiv.org/abs/2602.11224">Paper (arXiv)</a> •
<a href="https://arxiv.org/abs/2602.11224"><img src="https://img.shields.io/badge/arXiv-2602.11224-b31b1b.svg" alt="arXiv"></a>
<a href="https://huggingface.co/datasets/hubertmarek/agent-diff-bench"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Dataset-yellow.svg" alt="HuggingFace"></a>
<a href="https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench"><img src="https://img.shields.io/badge/Prime%20Intellect-Run%20Evals-blue.svg" alt="Prime Intellect"></a>
<a href="https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/react_agent_benchmark.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</p>

<p align="center">
<a href="https://agentdiff.dev">Website</a> •
<a href="https://agentdiff.mintlify.app/introduction">Docs</a> •
<a href="https://huggingface.co/datasets/hubertmarek/agent-diff-bench">Dataset</a> •
<a href="https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench">Prime Intellect</a> •
<a href="https://arxiv.org/abs/2602.11224">Paper</a> •
<a href="mailto:hubert@uni.minerva.edu">Feedback</a>
</p>

@@ -168,6 +173,11 @@ The fastest way to run Agent-Diff evaluations is via **[Prime Intellect](https:/

Alternatively, run locally or self-hosted using the SDK (see [To run evaluations](#to-run-evaluations) below).
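For orientation, here is a minimal sketch of that local SDK flow, assembled from the calls used in the example notebooks added in this PR (`init_env`, `start_run`, `BashExecutorProxy`, `evaluate_run`, `get_results_for_run`, `delete_env`); treat it as an illustrative outline rather than the canonical API, and see the notebooks below for a complete, working agent loop.

```python
from agent_diff import AgentDiff, BashExecutorProxy, create_langchain_tool

client = AgentDiff()  # reads AGENT_DIFF_API_KEY / AGENT_DIFF_BASE_URL from the environment

# Pick a test from a hosted suite (suite names as used in the example notebooks).
suites = client.list_test_suites(name="Slack Bench v2")
suite = client.get_test_suite(suites.testSuites[0].id, expand=True)
test = suite.tests[0]

# Spin up a sandboxed replica of the service and start a run against it.
env = client.init_env(testId=test.id)
run = client.start_run(envId=env.environmentId, testId=test.id)
bash = BashExecutorProxy(env.environmentId, base_url=client.base_url, api_key=client.api_key)

# ... hand `bash` to your agent (e.g. wrapped via create_langchain_tool) and let it
# issue curl commands against the sandboxed API to complete test.prompt ...

# Evaluate the resulting state diff, fetch the verdict, and clean up.
client.evaluate_run(runId=run.runId)
result = client.get_results_for_run(runId=run.runId)
print(result.passed, result.score)
client.delete_env(envId=env.environmentId)
```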

### Example Notebooks

- **[ReAct Agent (Paper)](examples/react_agent_benchmark.ipynb)** — Custom ReAct loop matching the paper methodology [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/react_agent_benchmark.ipynb)
- **[LangChain Agent](examples/langchain_agent_benchmark.ipynb)** — LangChain agent with tool calling [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/langchain_agent_benchmark.ipynb)

**Resources:**
- **Dataset**: [hubertmarek/agent-diff-bench](https://huggingface.co/datasets/hubertmarek/agent-diff-bench) — 224 tasks across all 4 services (80/20 train/test split)
- **Prime Intellect**: [agent-diff-bench on Prime Lab](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench) — hosted evaluations & RL training
298 changes: 298 additions & 0 deletions examples/langchain_agent_benchmark.ipynb
@@ -0,0 +1,298 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Agent-Diff Benchmark: LangChain Agent\n",
"\n",
"Run the [Agent-Diff benchmark](https://arxiv.org/abs/2602.11224) using LangChain's built-in agent with tool calling.\n",
"\n",
"Unlike the [ReAct notebook](react_agent_benchmark.ipynb) which uses a custom XML-tag loop, this notebook lets LangChain handle the agent loop via the model's native function-calling protocol. The `BashExecutorProxy` from the `agent-diff` SDK is wrapped as a LangChain tool.\n",
"\n",
"All 4 services (Box, Calendar, Linear, Slack) are evaluated across 224 tasks.\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/langchain_agent_benchmark.ipynb)\n",
"\n",
"**Links:** [Paper](https://arxiv.org/abs/2602.11224) | [Dataset](https://huggingface.co/datasets/hubertmarek/agent-diff-bench) | [GitHub](https://github.com/agent-diff-bench/agent-diff)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install agent-diff langchain langchain-openai tqdm pandas -q"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"if not os.environ.get(\"AGENT_DIFF_API_KEY\"):\n",
" os.environ[\"AGENT_DIFF_API_KEY\"] = getpass(\"Agent-Diff API key: \")\n",
"\n",
"if not os.environ.get(\"AGENT_DIFF_BASE_URL\"):\n",
" os.environ[\"AGENT_DIFF_BASE_URL\"] = \"https://api.agentdiff.dev\"\n",
"\n",
"OPENROUTER_API_KEY = os.environ.get(\"OPENROUTER_API_KEY\") or getpass(\"OpenRouter API key: \")\n",
"\n",
"# --- Settings ---\n",
"MODEL = \"deepseek/deepseek-chat-v3-0324\" # change to any OpenRouter model\n",
"MAX_ITERATIONS = 40 # max agent loop turns per task\n",
"MAX_TESTS = None # None = run all tests; set to e.g. 5 for a quick trial\n",
"TIMEOUT_SECONDS = 480 # per-test timeout"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SERVICE_CONFIG = {\n",
" \"slack\": {\n",
" \"name\": \"Slack\",\n",
" \"base_url\": \"https://slack.com/api\",\n",
" \"description\": \"Slack workspace messaging and collaboration API\",\n",
" \"extra_context\": \"\",\n",
" \"test_suite_name\": \"Slack Bench v2\",\n",
" },\n",
" \"box\": {\n",
" \"name\": \"Box\",\n",
" \"base_url\": \"https://api.box.com/2.0\",\n",
" \"description\": \"Box cloud storage and file management API\",\n",
" \"extra_context\": \"\",\n",
" \"test_suite_name\": \"Box Bench v2\",\n",
" },\n",
" \"calendar\": {\n",
" \"name\": \"Google Calendar\",\n",
" \"base_url\": \"https://www.googleapis.com/calendar/v3\",\n",
" \"description\": \"Google Calendar scheduling and events API\",\n",
" \"extra_context\": \"Current Date/Time: Sunday, June 17, 2018 at 00:01 (midnight), timezone America/Los_Angeles. Use this as the reference point for all relative date/time expressions like 'today', 'tomorrow', 'this Saturday', etc.\",\n",
" \"test_suite_name\": \"Calendar Bench\",\n",
" },\n",
" \"linear\": {\n",
" \"name\": \"Linear\",\n",
" \"base_url\": \"https://api.linear.app/graphql\",\n",
" \"description\": \"Linear project management and issue tracking API\",\n",
" \"extra_context\": \"\",\n",
" \"test_suite_name\": \"Linear Bench\",\n",
" },\n",
"}\n",
"\n",
"SYSTEM_PROMPT_TEMPLATE = \"\"\"You are an AI assistant that completes tasks by interacting with APIs via bash commands.\n",
"\n",
"Current Session:\n",
"- Service: {service_name}\n",
"- Base URL: {base_url}\n",
"- Description: {service_description}\n",
"{extra_context}\n",
"\n",
"Environment:\n",
"- You are authenticated as a user in the {service_name} workspace/account.\n",
"- Authentication is handled automatically via proxy. Use placeholder tokens like <TOKEN> where credentials would go.\n",
"- Use the execute_bash tool to run bash commands (primarily curl) to interact with the {service_name} API.\n",
"- If you are not sure how to use the {service_name} API, explore the endpoint, parameters, and learn how it works.\n",
"- Parse API responses carefully - extract IDs and data needed for subsequent calls.\n",
"- If a command fails, analyze the error and try a different approach.\n",
"- Only declare completion when the task is fully completed (not just when you've gathered information).\n",
"\"\"\"\n",
"\n",
"\n",
"def build_system_prompt(service: str) -> str:\n",
" config = SERVICE_CONFIG[service]\n",
" return SYSTEM_PROMPT_TEMPLATE.format(\n",
" service_name=config[\"name\"],\n",
" base_url=config[\"base_url\"],\n",
" service_description=config[\"description\"],\n",
" extra_context=config[\"extra_context\"],\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"from langchain_openai import ChatOpenAI\n",
"from langchain.agents import AgentExecutor, create_tool_calling_agent\n",
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"from agent_diff import AgentDiff, BashExecutorProxy, create_langchain_tool\n",
"\n",
"\n",
"def create_agent(service: str, bash_executor: BashExecutorProxy, model: str) -> AgentExecutor:\n",
" \"\"\"Create a LangChain agent with the bash tool for a given service.\"\"\"\n",
" llm = ChatOpenAI(\n",
" base_url=\"https://openrouter.ai/api/v1\",\n",
" api_key=OPENROUTER_API_KEY,\n",
" model=model,\n",
" temperature=0,\n",
" )\n",
" tool = create_langchain_tool(bash_executor)\n",
" system_prompt = build_system_prompt(service)\n",
"\n",
" prompt = ChatPromptTemplate.from_messages([\n",
" (\"system\", system_prompt),\n",
" (\"human\", \"{input}\"),\n",
" MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n",
" ])\n",
"\n",
" agent = create_tool_calling_agent(llm, [tool], prompt)\n",
" return AgentExecutor(\n",
" agent=agent,\n",
" tools=[tool],\n",
" max_iterations=MAX_ITERATIONS,\n",
" handle_parsing_errors=True,\n",
" verbose=False,\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tqdm.auto import tqdm\n",
"\n",
"\n",
"def run_single_test(client: AgentDiff, model: str, test, service: str) -> dict:\n",
" \"\"\"Run one test: init env -> LangChain agent -> evaluate -> cleanup.\"\"\"\n",
" env = None\n",
" try:\n",
" env = client.init_env(testId=test.id)\n",
" run = client.start_run(envId=env.environmentId, testId=test.id)\n",
" bash_executor = BashExecutorProxy(env.environmentId, base_url=client.base_url, api_key=client.api_key)\n",
"\n",
" agent_executor = create_agent(service, bash_executor, model)\n",
"\n",
" start = time.perf_counter()\n",
" agent_output = agent_executor.invoke({\"input\": test.prompt})\n",
" elapsed = time.perf_counter() - start\n",
"\n",
" client.evaluate_run(runId=run.runId)\n",
" result = client.get_results_for_run(runId=run.runId)\n",
" client.delete_env(envId=env.environmentId)\n",
"\n",
" return {\n",
" \"test_id\": str(test.id),\n",
" \"test_name\": getattr(test, \"name\", \"\"),\n",
" \"passed\": result.passed,\n",
" \"score\": result.score.get(\"percent\", 0) if isinstance(result.score, dict) else 0,\n",
" \"failures\": result.failures,\n",
" \"time\": round(elapsed, 2),\n",
" \"agent_output\": agent_output.get(\"output\", \"\"),\n",
" }\n",
" except Exception as e:\n",
" if env:\n",
" try:\n",
" client.delete_env(envId=env.environmentId)\n",
" except Exception:\n",
" pass\n",
" return {\"test_id\": str(test.id), \"test_name\": getattr(test, \"name\", \"\"), \"passed\": False, \"score\": 0, \"error\": str(e)}\n",
"\n",
"\n",
"def run_benchmark(model: str, services: list[str] | None = None, max_tests: int | None = None) -> list[dict]:\n",
" \"\"\"Run the full benchmark across services using LangChain agent.\"\"\"\n",
" services = services or list(SERVICE_CONFIG.keys())\n",
" client = AgentDiff()\n",
" all_results = []\n",
"\n",
" for service in services:\n",
" config = SERVICE_CONFIG[service]\n",
"\n",
" suite_list = client.list_test_suites(name=config[\"test_suite_name\"])\n",
" if not suite_list.testSuites:\n",
" print(f\"[SKIP] Test suite '{config['test_suite_name']}' not found.\")\n",
" continue\n",
" suite = client.get_test_suite(suite_list.testSuites[0].id, expand=True)\n",
" tests = suite.tests[:max_tests] if max_tests else suite.tests\n",
"\n",
" print(f\"\\n{'='*60}\")\n",
" print(f\" {config['name']} — {len(tests)} tests | model: {model}\")\n",
" print(f\"{'='*60}\")\n",
"\n",
" for test in tqdm(tests, desc=config[\"name\"]):\n",
" result = run_single_test(client, model, test, service)\n",
" result[\"service\"] = service\n",
" result[\"model\"] = model\n",
" all_results.append(result)\n",
"\n",
" status = \"PASS\" if result.get(\"passed\") else \"FAIL\"\n",
" score = result.get(\"score\", 0)\n",
" tqdm.write(f\" [{status}] {result.get('test_name', result['test_id'])[:60]} score={score}\")\n",
"\n",
" return all_results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = run_benchmark(\n",
" model=MODEL,\n",
" services=None, # all 4 services; or e.g. [\"slack\", \"box\"]\n",
" max_tests=MAX_TESTS,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(results)\n",
"\n",
"print(\"\\n\" + \"=\" * 60)\n",
"print(f\" Results: {MODEL} (LangChain Agent)\")\n",
"print(\"=\" * 60)\n",
"\n",
"if \"service\" in df.columns and \"score\" in df.columns:\n",
" summary = df.groupby(\"service\").agg(\n",
" tests=(\"score\", \"count\"),\n",
" passed=(\"passed\", \"sum\"),\n",
" mean_score=(\"score\", \"mean\"),\n",
" pass_rate=(\"passed\", \"mean\"),\n",
" ).round(2)\n",
" summary[\"pass_rate\"] = (summary[\"pass_rate\"] * 100).round(1)\n",
" print(\"\\nPer-service summary:\")\n",
" print(summary.to_string())\n",
"\n",
" overall_score = df[\"score\"].mean()\n",
" overall_pass = df[\"passed\"].mean() * 100\n",
" print(f\"\\nOverall: score={overall_score:.1f} pass_rate={overall_pass:.1f}%\")\n",
"\n",
" summary[\"mean_score\"].plot.bar(title=f\"Agent-Diff Score by Service ({MODEL}, LangChain)\", ylabel=\"Score\", xlabel=\"Service\", rot=0)\n",
"else:\n",
" print(df)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}