agent0lab/protocol-agent

Protocol Agent

This repository accompanies the paper “Protocol Agent: What If Agents Could Use Cryptography In Everyday Life?” (see paper_preprint.pdf).


Repo layout

  • arena/: benchmark runner (self-play), providers, scoring, and tool-call protocol
    • arena/run.py: main CLI entrypoint for running the arena
    • arena/runner.py: core match loop + output writing
    • arena/schema.py: JSON schema loader for the benchmark challenges file
    • arena/tool_protocol.py: minimal JSON “tool call” protocol used in transcripts
  • src/crypto-challenges/: the benchmark challenge set
    • benchmark_challenges_diverse_v2.json: main benchmark JSON (v2+ includes split)
  • src/crypto-knowledge/: cryptography reference notes used for data generation / grounding
  • calculator/: Rust “cryptomath” tool (CLI + AWS Lambda handler)
  • scripts/: dataset generation/validation, SFT helpers, and evaluation wrappers
  • config/: defaults for “standard” evaluation runs (config/marco_defaults.yaml)
  • state/: example outputs and pipeline artifacts (datasets, jobs, reports)

Requirements

  • Python 3.10+ (some scripts use X | Y union types)
  • For the cryptomath tool:
    • Rust toolchain (for local builds), and/or
    • Docker + AWS CLI (for Lambda packaging/deploy via the provided script)

The Python arena and most scripts are intentionally stdlib-only (no requirements.txt).


Quickstart: run the arena (mock mode, no API keys)

This runs the plumbing end-to-end with deterministic mock models (useful for sanity checks and for inspecting the output format).

python3 arena/run.py \
  --challenges src/crypto-challenges/benchmark_challenges_diverse_v2.json \
  --out state/runs/mock_smoke \
  --participant-provider mock \
  --judge-provider mock \
  --repetitions 1 \
  --max-turns 12 \
  --progress

Outputs are written under state/runs/mock_smoke/ (see “Arena outputs” below).


Real evaluation run (Fireworks participant + OpenAI judge + cryptomath tool)

1) Set environment variables (recommended: repo-root .env)

The arena runner auto-loads .env from the repo root if present (arena/env.py).

Minimum env vars for “real” runs:

  • FIREWORKS_API_KEY
  • OPENAI_API_KEY
  • OPENAI_MODEL_JUDGE (model id sent to the OpenAI Responses API)
  • CRYPTOMATH_LAMBDA_URL (Lambda Function URL or API Gateway URL for the cryptomath tool)
  • FIREWORKS_MODEL_PARTICIPANT (Fireworks model id used for participants; wrapper scripts set this for you)

Optional knobs:

  • ARENA_SSL_VERIFY=0: disable SSL verification (avoid if possible)
  • ARENA_PRINT_HTTP_ERRORS=1: print structured HTTP error payloads
  • ARENA_JUDGE_MAX_CHARS_PER_MESSAGE, ARENA_JUDGE_MAX_CHARS_TOTAL: transcript truncation limits for judge input
  • OPENAI_BASE_URL: OpenAI Responses endpoint override (default: https://api.openai.com/v1/responses)
  • FIREWORKS_BASE_URL: Fireworks chat-completions endpoint override (default: https://api.fireworks.ai/inference/v1/chat/completions)
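
Taken together, a minimal repo-root .env might look like the following (every value is a placeholder; substitute your own keys, model ids, and deployed URL):

```
FIREWORKS_API_KEY=fw-...
OPENAI_API_KEY=sk-...
OPENAI_MODEL_JUDGE=<judge-model-id>
FIREWORKS_MODEL_PARTICIPANT=accounts/<acct>/models/<model-id>
CRYPTOMATH_LAMBDA_URL=https://<function-id>.lambda-url.<region>.on.aws/
```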

2) Deploy cryptomath (AWS Lambda)

The recommended path is to deploy cryptomath as a Lambda with a Function URL, then point the arena at it.

The deploy script expects AWS credentials and some Lambda configuration values in .env:

bash scripts/deploy_cryptomath_lambda.sh

Then contract-test it:

python3 scripts/cryptomath_lambda_contract_test.py
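
If you want to poke the deployed endpoint by hand, a stdlib-only sketch like this can send the same `{"op": ..., "args": ...}` envelope the CLI accepts (an assumption here is that the Lambda takes that JSON body via POST and replies with JSON; the exact response fields are not guaranteed):

```python
import json
import urllib.request

def build_request(op: str, args: dict) -> bytes:
    """Encode a cryptomath request in the {"op": ..., "args": ...} envelope."""
    return json.dumps({"op": op, "args": args}).encode("utf-8")

def ping_cryptomath(url: str, timeout: float = 10.0) -> dict:
    """POST a ping request to the deployed endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=build_request("ping", {}),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `ping_cryptomath(os.environ["CRYPTOMATH_LAMBDA_URL"])` after loading your .env.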

3) Run the arena

The easiest “batteries included” runner is:

python3 scripts/run_default_marco_base_arena.py --out-dir state/runs/default_base_run

This wrapper:

  • Creates a Fireworks deployment for throughput (parallel workers)
  • Runs arena/run.py with OpenAI judging and the cryptomath tool enabled
  • Tears down the deployment

If you prefer calling the arena directly (no deployment orchestration), run:

python3 arena/run.py \
  --challenges src/crypto-challenges/benchmark_challenges_diverse_v2.json \
  --challenge-split heldout_eval \
  --out state/runs/manual_real_run \
  --participant-provider fireworks \
  --judge-provider openai \
  --calculator-provider lambda \
  --calculator-lambda-url "$CRYPTOMATH_LAMBDA_URL" \
  --repetitions 2 \
  --workers 1 \
  --max-turns 15 \
  --max-tool-calls-per-turn 10 \
  --progress \
  --continue-on-error

Notes:

  • Use --resume to resume a partially completed run directory (skips completed matches; retries failed/error matches).
  • --challenge-split is only available for v2+ benchmark files (those with per-challenge split labels).

Comparing base vs tuned models

To run two arena evaluations (base vs tuned) and generate a text report:

python3 scripts/run_base_vs_tuned_arena_comparison.py \
  --tuned-model accounts/<acct>/models/<tuned_id> \
  --out-dir state/runs/base_vs_tuned_001 \
  --report-path state/reports/base_vs_tuned_report.txt

The report includes bootstrap CIs and permutation tests (implemented without numpy/scipy).
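
Both techniques fit comfortably in the standard library. The following is an illustrative sketch of the two ideas, not the report script's actual implementation (resample counts and the percentile-bootstrap variant are choices made here):

```python
import random
import statistics

def bootstrap_ci(xs, n_resamples=2000, alpha=0.05, rng=None):
    """Percentile-bootstrap confidence interval for the mean of xs."""
    rng = rng or random.Random(0)
    means = sorted(
        statistics.fmean(rng.choices(xs, k=len(xs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def permutation_test(xs, ys, n_permutations=2000, rng=None):
    """Two-sided permutation test for a difference in means; returns a p-value."""
    rng = rng or random.Random(0)
    observed = abs(statistics.fmean(xs) - statistics.fmean(ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        a, b = pooled[: len(xs)], pooled[len(xs):]
        if abs(statistics.fmean(a) - statistics.fmean(b)) >= observed:
            hits += 1
    return hits / n_permutations
```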


Arena outputs

Each arena run writes a structured directory:

  • manifest.json: run config + model ids (best-effort)
  • api_calls.jsonl: HTTP call logs (providers + tool), for audit/debug
  • matches.jsonl: one JSON record per match, including the judge verdict
  • summary.json: aggregate scores/outcomes across matches
  • matches_scores.csv, challenge_summary.csv: tabular exports
  • role_views/: per-role point-of-view transcripts (<RoleId>.jsonl)
  • matches/<challenge_id>/rep_<r>/: per-match bundle:
    • match.json, transcript.json, role_views/<RoleId>.json
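
Because matches.jsonl is one JSON object per line, it can be tallied with a few lines of stdlib Python. The `outcome` field name below is an assumption for illustration; check a real run's records for the actual schema:

```python
import collections
import json

def tally_outcomes(jsonl_text: str) -> collections.Counter:
    """Count a hypothetical 'outcome' field across match records in JSONL text."""
    counts = collections.Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[json.loads(line).get("outcome", "unknown")] += 1
    return counts
```

For example, `tally_outcomes(Path("state/runs/mock_smoke/matches.jsonl").read_text())`.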

The cryptomath tool (calculator/)

calculator/ is a Rust crate that exposes cryptographic/math operations over a small JSON protocol.

  • CLI binary: reads a JSON request from stdin:

    cd calculator
    echo '{"op":"ping","args":{}}' | cargo run --quiet --bin crypto_calculator_cli

  • Lambda binary: built with the lambda feature and packaged as a custom runtime bootstrap by scripts/deploy_cryptomath_lambda.sh.

The arena talks to cryptomath over HTTP via arena/calculator.py (Lambda Function URL / API Gateway).


Dataset generation + SFT pipeline

The scripts/ folder contains an end-to-end pipeline for producing tool-grounded training data and fine-tuning models. The current checklist is documented in:

  • state/v3_dataset_pipeline.md

At a high level:

  • Generate batch requests (scripts/build_dataset.py)
  • Run OpenAI batch and download results (scripts/submit_openai_batch.py, scripts/openai_batch_status.py, scripts/openai_download_file.py)
  • Extract + postprocess tool calls by executing cryptomath and enforcing “artifact-real” constraints (scripts/postprocess_tool_transcripts.py)
  • Validate transcript structure (scripts/validate_tool_transcripts.py)
  • Sanitize + split into train/val for SFT (scripts/fireworks_sanitize_sft_dataset.py, scripts/finalize_dataset.py)
  • Run SFT on Fireworks (scripts/fireworks_sft.py) and track jobs (state/sft_jobs/)
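
The split step at the end of that list boils down to a seeded shuffle and a cut. A generic sketch of the idea (not the actual scripts, which also sanitize records into Fireworks' SFT format):

```python
import json
import random

def split_train_val(records, val_fraction=0.1, seed=0):
    """Deterministically shuffle records and split off a validation slice."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def to_jsonl(records) -> str:
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in records)
```

Seeding the shuffle keeps the train/val assignment reproducible across pipeline reruns.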

Configuration defaults

config/marco_defaults.yaml is the single source of truth for “standard run” defaults used by wrapper scripts under scripts/. It’s parsed without PyYAML (see scripts/marco_defaults.py).
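
For flat `key: value` defaults, such a parser only takes a few lines. This is an illustrative sketch of the idea, not the code in scripts/marco_defaults.py (which may handle more syntax):

```python
def parse_flat_yaml(text: str) -> dict:
    """Parse a flat 'key: value' YAML subset: no nesting, lists, or anchors.

    Comments and blank lines are skipped; values stay strings except for
    booleans, ints, and floats, which are coerced for convenience.
    """
    out = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip().strip("'\"")
        if value.lower() in ("true", "false"):
            out[key] = value.lower() == "true"
        else:
            try:
                out[key] = int(value)
            except ValueError:
                try:
                    out[key] = float(value)
                except ValueError:
                    out[key] = value
    return out
```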

