This repository accompanies the paper “Protocol Agent: What If Agents Could Use Cryptography In Everyday Life?” (see paper_preprint.pdf).
- `arena/`: benchmark runner (self-play), providers, scoring, and tool-call protocol
  - `arena/run.py`: main CLI entrypoint for running the arena
  - `arena/runner.py`: core match loop + output writing
  - `arena/schema.py`: JSON schema loader for the benchmark challenges file
  - `arena/tool_protocol.py`: minimal JSON “tool call” protocol used in transcripts
- `src/crypto-challenges/`: the benchmark challenge set
  - `benchmark_challenges_diverse_v2.json`: main benchmark JSON (v2+ includes `split`)
- `src/crypto-knowledge/`: cryptography reference notes used for data generation / grounding
- `calculator/`: Rust “cryptomath” tool (CLI + AWS Lambda handler)
- `scripts/`: dataset generation/validation, SFT helpers, and evaluation wrappers
- `config/`: defaults for “standard” evaluation runs (`config/marco_defaults.yaml`)
- `state/`: example outputs and pipeline artifacts (datasets, jobs, reports)
- Python 3.10+ (some scripts use `X | Y` union types)
- For the cryptomath tool:
- Rust toolchain (for local builds), and/or
- Docker + AWS CLI (for Lambda packaging/deploy via the provided script)
The Python arena and most scripts are intentionally stdlib-only (no requirements.txt).
This runs plumbing end-to-end with deterministic mock models (useful for sanity checks and inspecting outputs).
```sh
python3 arena/run.py \
  --challenges src/crypto-challenges/benchmark_challenges_diverse_v2.json \
  --out state/runs/mock_smoke \
  --participant-provider mock \
  --judge-provider mock \
  --repetitions 1 \
  --max-turns 12 \
  --progress
```

Outputs are written under `state/runs/mock_smoke/` (see “Arena outputs” below).
The arena runner auto-loads .env from the repo root if present (arena/env.py).
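For intuition, the auto-loading behavior can be approximated by a minimal stdlib loader like the sketch below. This is illustrative only, not the actual `arena/env.py` implementation; it assumes simple `KEY=VALUE` lines and that already-set environment variables win over `.env` values.

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader sketch: KEY=VALUE lines, '#' comments and blanks
    ignored. Existing environment variables are NOT overwritten."""
    try:
        with open(path) as f:
            lines = f.readlines()
    except FileNotFoundError:
        return  # silently skip when no .env is present
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # strip optional surrounding quotes from the value
        os.environ.setdefault(key.strip(), value.strip().strip("'").strip('"'))
```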
Minimum env vars for “real” runs:
- `FIREWORKS_API_KEY`
- `OPENAI_API_KEY`
- `OPENAI_MODEL_JUDGE` (model id sent to the OpenAI Responses API)
- `CRYPTOMATH_LAMBDA_URL` (Lambda Function URL or API Gateway URL for the cryptomath tool)
- `FIREWORKS_MODEL_PARTICIPANT` (Fireworks model id used for participants; wrapper scripts set this for you)
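For illustration, a `.env` covering these variables might look like the following; every value here is a placeholder, and your key formats, model ids, and URL will differ:

```sh
# .env — placeholder values, substitute your own
FIREWORKS_API_KEY=<your-fireworks-key>
OPENAI_API_KEY=<your-openai-key>
OPENAI_MODEL_JUDGE=<judge-model-id>
CRYPTOMATH_LAMBDA_URL=<lambda-function-url-or-api-gateway-url>
FIREWORKS_MODEL_PARTICIPANT=accounts/<acct>/models/<model-id>
```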
Optional knobs:
- `ARENA_SSL_VERIFY=0`: disable SSL verification (avoid if possible)
- `ARENA_PRINT_HTTP_ERRORS=1`: print structured HTTP error payloads
- `ARENA_JUDGE_MAX_CHARS_PER_MESSAGE`, `ARENA_JUDGE_MAX_CHARS_TOTAL`: transcript truncation limits for judge input
- `OPENAI_BASE_URL`: OpenAI Responses endpoint override (default: `https://api.openai.com/v1/responses`)
- `FIREWORKS_BASE_URL`: Fireworks chat-completions endpoint override (default: `https://api.fireworks.ai/inference/v1/chat/completions`)
The recommended path is to deploy cryptomath as a Lambda with a Function URL, then point the arena at it.
The deploy script expects AWS credentials and some Lambda configuration values in .env:
```sh
bash scripts/deploy_cryptomath_lambda.sh
```

Then contract-test it:

```sh
python3 scripts/cryptomath_lambda_contract_test.py
```

The easiest “batteries included” runner is:

```sh
python3 scripts/run_default_marco_base_arena.py --out-dir state/runs/default_base_run
```

This wrapper:
- Creates a Fireworks deployment for throughput (parallel workers)
- Runs `arena/run.py` with OpenAI judging and the cryptomath tool enabled
- Tears down the deployment
If you prefer calling the arena directly (no deployment orchestration), run:
```sh
python3 arena/run.py \
  --challenges src/crypto-challenges/benchmark_challenges_diverse_v2.json \
  --challenge-split heldout_eval \
  --out state/runs/manual_real_run \
  --participant-provider fireworks \
  --judge-provider openai \
  --calculator-provider lambda \
  --calculator-lambda-url "$CRYPTOMATH_LAMBDA_URL" \
  --repetitions 2 \
  --workers 1 \
  --max-turns 15 \
  --max-tool-calls-per-turn 10 \
  --progress \
  --continue-on-error
```

Notes:
- Use `--resume` to resume a partially completed run directory (skips completed matches; retries failed/error matches).
- `--challenge-split` is only available for v2+ benchmark files (those with per-challenge `split` labels).
To run two arena evaluations (base vs tuned) and generate a text report:
```sh
python3 scripts/run_base_vs_tuned_arena_comparison.py \
  --tuned-model accounts/<acct>/models/<tuned_id> \
  --out-dir state/runs/base_vs_tuned_001 \
  --report-path state/reports/base_vs_tuned_report.txt
```

The report includes bootstrap CIs and permutation tests (implemented without numpy/scipy).
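As a rough, stdlib-only illustration of these two techniques (not the repository's exact implementation), a percentile-bootstrap CI for a mean and a two-sided permutation test on a difference of means can be written without numpy/scipy like this:

```python
import random
import statistics

def bootstrap_ci(xs, n_boot=10_000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for the mean."""
    rng = rng or random.Random(0)
    # resample with replacement, record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(xs, k=len(xs))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def permutation_test(a, b, n_perm=10_000, rng=None):
    """Two-sided permutation test on |mean(a) - mean(b)|."""
    rng = rng or random.Random(0)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = abs(statistics.fmean(pooled[: len(a)]) - statistics.fmean(pooled[len(a):]))
        if d >= observed:
            hits += 1
    # add-one smoothing keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)
```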
Each arena run writes a structured directory:
- `manifest.json`: run config + model ids (best-effort)
- `api_calls.jsonl`: HTTP call logs (providers + tool), for audit/debug
- `matches.jsonl`: one JSON record per match, including the judge verdict
- `summary.json`: aggregate scores/outcomes across matches
- `matches_scores.csv`, `challenge_summary.csv`: tabular exports
- `role_views/`: per-role point-of-view transcripts (`<RoleId>.jsonl`)
- `matches/<challenge_id>/rep_<r>/`: per-match bundle: `match.json`, `transcript.json`, `role_views/<RoleId>.json`
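Because `matches.jsonl` is one JSON object per line, ad-hoc analysis needs no dependencies. A small helper like the following (hypothetical, not part of the repo; it assumes only the one-record-per-line format) streams the match records:

```python
import json
from pathlib import Path

def load_matches(run_dir: str):
    """Yield one parsed record per non-empty line of <run_dir>/matches.jsonl."""
    path = Path(run_dir) / "matches.jsonl"
    with path.open() as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```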
calculator/ is a Rust crate that exposes cryptographic/math operations over a small JSON protocol.
- CLI binary reads a JSON request from stdin:

  ```sh
  cd calculator
  echo '{"op":"ping","args":{}}' | cargo run --quiet --bin crypto_calculator_cli
  ```

- Lambda binary is built with the `lambda` feature and is packaged as a custom runtime `bootstrap` by `scripts/deploy_cryptomath_lambda.sh`.
The arena talks to cryptomath over HTTP via arena/calculator.py (Lambda Function URL / API Gateway).
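A hypothetical client for that HTTP path might look like the sketch below. It is not `arena/calculator.py` itself; the only assumption carried over from the CLI example above is the `{"op": ..., "args": ...}` request shape.

```python
import json
import urllib.request

def build_request(op: str, args: dict) -> bytes:
    """Encode one cryptomath operation as the JSON body used in the CLI example."""
    return json.dumps({"op": op, "args": args}).encode("utf-8")

def call_cryptomath(url: str, op: str, args: dict) -> dict:
    """POST a single operation to a Lambda Function URL / API Gateway endpoint."""
    req = urllib.request.Request(
        url,
        data=build_request(op, args),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires a deployed endpoint):
# call_cryptomath(os.environ["CRYPTOMATH_LAMBDA_URL"], "ping", {})
```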
The scripts/ folder contains an end-to-end pipeline for producing tool-grounded training data and
fine-tuning models. The current checklist is documented in:
state/v3_dataset_pipeline.md
At a high level:
- Generate batch requests (`scripts/build_dataset.py`)
- Run OpenAI batch and download results (`scripts/submit_openai_batch.py`, `scripts/openai_batch_status.py`, `scripts/openai_download_file.py`)
- Extract + postprocess tool calls by executing cryptomath and enforcing “artifact-real” constraints (`scripts/postprocess_tool_transcripts.py`)
- Validate transcript structure (`scripts/validate_tool_transcripts.py`)
- Sanitize + split into train/val for SFT (`scripts/fireworks_sanitize_sft_dataset.py`, `scripts/finalize_dataset.py`)
- Run SFT on Fireworks (`scripts/fireworks_sft.py`) and track jobs (`state/sft_jobs/`)
config/marco_defaults.yaml is the single source of truth for “standard run” defaults used by wrapper
scripts under scripts/. It’s parsed without PyYAML (see scripts/marco_defaults.py).
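Parsing a flat key/value YAML file without PyYAML can be done in a few lines. The sketch below is hypothetical and may differ from the real `scripts/marco_defaults.py`; it assumes the config is a flat `key: value` subset with comments, quoted strings, bools, ints, and floats.

```python
def parse_flat_yaml(text: str) -> dict:
    """Parse a flat `key: value` YAML subset without PyYAML (illustrative sketch)."""
    out = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
            out[key] = value[1:-1]            # quoted string
        elif value.lower() in ("true", "false"):
            out[key] = value.lower() == "true"
        else:
            for cast in (int, float):         # try numeric types in order
                try:
                    out[key] = cast(value)
                    break
                except ValueError:
                    continue
            else:
                out[key] = value              # bare string fallback
    return out
```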
- Paper PDF: `paper_preprint.pdf`
- Licensing: the paper is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, while the source code in this repository is released under the MIT license.