This repository accompanies the paper “Protocol Agent: What If Agents Could Use Cryptography In Everyday Life?” (see paper_preprint.pdf).
- `arena/`: benchmark runner (self-play), providers, scoring, and tool-call protocol
  - `arena/run.py`: main CLI entrypoint for running the arena
  - `arena/runner.py`: core match loop + output writing
  - `arena/schema.py`: JSON schema loader for the benchmark challenges file
  - `arena/tool_protocol.py`: minimal JSON “tool call” protocol used in transcripts
- `src/crypto-challenges/`: the benchmark challenge set
  - `benchmark_challenges_diverse_v2.json`: main benchmark JSON (v2+ includes `split`)
- `src/crypto-knowledge/`: cryptography reference notes used for data generation / grounding
- `calculator/`: Rust “cryptomath” tool (CLI + AWS Lambda handler)
- `scripts/`: dataset generation/validation, SFT helpers, and evaluation wrappers
- `config/`: defaults for “standard” evaluation runs (`config/marco_defaults.yaml`)
- `state/`: example outputs and pipeline artifacts (datasets, jobs, reports)
- Python 3.10+ (some scripts use `X | Y` union types)
- For the cryptomath tool:
- Rust toolchain (for local builds), and/or
- Docker + AWS CLI (for Lambda packaging/deploy via the provided script)
The Python arena and most scripts are intentionally stdlib-only (no requirements.txt).
This runs plumbing end-to-end with deterministic mock models (useful for sanity checks and inspecting outputs).
```sh
python3 arena/run.py \
  --challenges src/crypto-challenges/benchmark_challenges_diverse_v2.json \
  --out state/runs/mock_smoke \
  --participant-provider mock \
  --judge-provider mock \
  --repetitions 1 \
  --max-turns 12 \
  --progress
```

Outputs are written under `state/runs/mock_smoke/` (see “Arena outputs” below).
The arena runner auto-loads .env from the repo root if present (arena/env.py).
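For intuition, the auto-loading behavior can be approximated by a minimal stdlib loader like the sketch below. This is illustrative only, not the actual `arena/env.py` implementation; it assumes simple `KEY=VALUE` lines and that already-set environment variables win over `.env` values.

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader sketch: KEY=VALUE lines, '#' comments and blanks
    ignored. Existing environment variables are NOT overwritten."""
    try:
        with open(path) as f:
            lines = f.readlines()
    except FileNotFoundError:
        return  # silently skip when no .env is present
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # strip optional surrounding quotes from the value
        os.environ.setdefault(key.strip(), value.strip().strip("'").strip('"'))
```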
Minimum env vars for “real” runs:
- `FIREWORKS_API_KEY`
- `OPENAI_API_KEY`
- `OPENAI_MODEL_JUDGE` (model id sent to the OpenAI Responses API)
- `CRYPTOMATH_LAMBDA_URL` (Lambda Function URL or API Gateway URL for the cryptomath tool)
- `FIREWORKS_MODEL_PARTICIPANT` (Fireworks model id used for participants; wrapper scripts set this for you)
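For illustration, a `.env` covering these variables might look like the following; every value here is a placeholder, and your key formats, model ids, and URL will differ:

```sh
# .env — placeholder values, substitute your own
FIREWORKS_API_KEY=<your-fireworks-key>
OPENAI_API_KEY=<your-openai-key>
OPENAI_MODEL_JUDGE=<judge-model-id>
CRYPTOMATH_LAMBDA_URL=<lambda-function-url-or-api-gateway-url>
FIREWORKS_MODEL_PARTICIPANT=accounts/<acct>/models/<model-id>
```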
Optional knobs:
- `ARENA_SSL_VERIFY=0`: disable SSL verification (avoid if possible)
- `ARENA_PRINT_HTTP_ERRORS=1`: print structured HTTP error payloads
- `ARENA_JUDGE_MAX_CHARS_PER_MESSAGE`, `ARENA_JUDGE_MAX_CHARS_TOTAL`: transcript truncation limits for judge input
- `OPENAI_BASE_URL`: OpenAI Responses endpoint override (default: `https://api.openai.com/v1/responses`)
- `FIREWORKS_BASE_URL`: Fireworks chat-completions endpoint override (default: `https://api.fireworks.ai/inference/v1/chat/completions`)
The recommended path is to deploy cryptomath as a Lambda with a Function URL, then point the arena at it.
The deploy script expects AWS credentials and some Lambda configuration values in .env:
```sh
bash scripts/deploy_cryptomath_lambda.sh
```

Then contract-test it:

```sh
python3 scripts/cryptomath_lambda_contract_test.py
```

The easiest “batteries included” runner is:

```sh
python3 scripts/run_default_marco_base_arena.py --out-dir state/runs/default_base_run
```

This wrapper:
- Creates a Fireworks deployment for throughput (parallel workers)
- Runs `arena/run.py` with OpenAI judging and the cryptomath tool enabled
- Tears down the deployment
If you prefer calling the arena directly (no deployment orchestration), run:
```sh
python3 arena/run.py \
  --challenges src/crypto-challenges/benchmark_challenges_diverse_v2.json \
  --challenge-split heldout_eval \
  --out state/runs/manual_real_run \
  --participant-provider fireworks \
  --judge-provider openai \
  --calculator-provider lambda \
  --calculator-lambda-url "$CRYPTOMATH_LAMBDA_URL" \
  --repetitions 2 \
  --workers 1 \
  --max-turns 15 \
  --max-tool-calls-per-turn 10 \
  --progress \
  --continue-on-error
```

Notes:
- Use `--resume` to resume a partially completed run directory (skips completed matches; retries failed/error matches).
- `--challenge-split` is only available for v2+ benchmark files (those with per-challenge `split` labels).
To run two arena evaluations (base vs tuned) and generate a text report:
```sh
python3 scripts/run_base_vs_tuned_arena_comparison.py \
  --tuned-model accounts/<acct>/models/<tuned_id> \
  --out-dir state/runs/base_vs_tuned_001 \
  --report-path state/reports/base_vs_tuned_report.txt
```

The report includes bootstrap CIs and permutation tests (implemented without numpy/scipy).
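As a rough, stdlib-only illustration of these two techniques (not the repository's exact implementation), a percentile-bootstrap CI for a mean and a two-sided permutation test on a difference of means can be written without numpy/scipy like this:

```python
import random
import statistics

def bootstrap_ci(xs, n_boot=10_000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for the mean."""
    rng = rng or random.Random(0)
    # resample with replacement, record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(xs, k=len(xs))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def permutation_test(a, b, n_perm=10_000, rng=None):
    """Two-sided permutation test on |mean(a) - mean(b)|."""
    rng = rng or random.Random(0)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = abs(statistics.fmean(pooled[: len(a)]) - statistics.fmean(pooled[len(a):]))
        if d >= observed:
            hits += 1
    # add-one smoothing keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)
```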
Each arena run writes a structured directory:
- `manifest.json`: run config + model ids (best-effort)
- `api_calls.jsonl`: HTTP call logs (providers + tool), for audit/debug
- `matches.jsonl`: one JSON record per match, including the judge verdict
- `summary.json`: aggregate scores/outcomes across matches
- `matches_scores.csv`, `challenge_summary.csv`: tabular exports
- `role_views/`: per-role point-of-view transcripts (`<RoleId>.jsonl`)
- `matches/<challenge_id>/rep_<r>/`: per-match bundle: `match.json`, `transcript.json`, `role_views/<RoleId>.json`
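Because `matches.jsonl` is one JSON object per line, ad-hoc analysis needs no dependencies. A small helper like the following (hypothetical, not part of the repo; it assumes only the one-record-per-line format) streams the match records:

```python
import json
from pathlib import Path

def load_matches(run_dir: str):
    """Yield one parsed record per non-empty line of <run_dir>/matches.jsonl."""
    path = Path(run_dir) / "matches.jsonl"
    with path.open() as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```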
calculator/ is a Rust crate that exposes cryptographic/math operations over a small JSON protocol.
- CLI binary reads a JSON request from stdin:

  ```sh
  cd calculator
  echo '{"op":"ping","args":{}}' | cargo run --quiet --bin crypto_calculator_cli
  ```

- Lambda binary is built with the `lambda` feature and is packaged as a custom runtime `bootstrap` by `scripts/deploy_cryptomath_lambda.sh`.
The arena talks to cryptomath over HTTP via arena/calculator.py (Lambda Function URL / API Gateway).
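A hypothetical client for that HTTP path might look like the sketch below. It is not `arena/calculator.py` itself; the only assumption carried over from the CLI example above is the `{"op": ..., "args": ...}` request shape.

```python
import json
import urllib.request

def build_request(op: str, args: dict) -> bytes:
    """Encode one cryptomath operation as the JSON body used in the CLI example."""
    return json.dumps({"op": op, "args": args}).encode("utf-8")

def call_cryptomath(url: str, op: str, args: dict) -> dict:
    """POST a single operation to a Lambda Function URL / API Gateway endpoint."""
    req = urllib.request.Request(
        url,
        data=build_request(op, args),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires a deployed endpoint):
# call_cryptomath(os.environ["CRYPTOMATH_LAMBDA_URL"], "ping", {})
```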
The scripts/ folder contains an end-to-end pipeline for producing tool-grounded training data and
fine-tuning models. The current checklist is documented in:
state/v3_dataset_pipeline.md
At a high level:
- Generate batch requests (`scripts/build_dataset.py`)
- Run OpenAI batch and download results (`scripts/submit_openai_batch.py`, `scripts/openai_batch_status.py`, `scripts/openai_download_file.py`)
- Extract + postprocess tool calls by executing cryptomath and enforcing “artifact-real” constraints (`scripts/postprocess_tool_transcripts.py`)
- Validate transcript structure (`scripts/validate_tool_transcripts.py`)
- Sanitize + split into train/val for SFT (`scripts/fireworks_sanitize_sft_dataset.py`, `scripts/finalize_dataset.py`)
- Run SFT on Fireworks (`scripts/fireworks_sft.py`) and track jobs (`state/sft_jobs/`)
config/marco_defaults.yaml is the single source of truth for “standard run” defaults used by wrapper
scripts under scripts/. It’s parsed without PyYAML (see scripts/marco_defaults.py).
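Parsing a flat key/value YAML file without PyYAML can be done in a few lines. The sketch below is hypothetical and may differ from the real `scripts/marco_defaults.py`; it assumes the config is a flat `key: value` subset with comments, quoted strings, bools, ints, and floats.

```python
def parse_flat_yaml(text: str) -> dict:
    """Parse a flat `key: value` YAML subset without PyYAML (illustrative sketch)."""
    out = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
            out[key] = value[1:-1]            # quoted string
        elif value.lower() in ("true", "false"):
            out[key] = value.lower() == "true"
        else:
            for cast in (int, float):         # try numeric types in order
                try:
                    out[key] = cast(value)
                    break
                except ValueError:
                    continue
            else:
                out[key] = value              # bare string fallback
    return out
```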
- Paper PDF: `paper_preprint.pdf`
- Licensing: the paper is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, while the source code in this repository is released under the MIT license.