
Releases: OpenAdaptAI/openadapt-evals

v0.3.0

06 Feb 22:03


v0.3.0 (2026-02-06)

Chores

  • Remove synthetic demos that don't match WAA tasks (#26, 272edcb)

The synthetic_demos/ directory contained 154 generic template demos (e.g., "Open Notepad", "Navigate to example.com") that don't match actual WAA task IDs (UUIDs like 366de66e-cbae-4d72-b042-26390db2b145-WOS).

These were misleading - they suggested we had demo coverage when we didn't. Actual WAA tasks have specific instructions like "create draft.txt, type 'This is a draft.', save to Documents" which the generic demos don't cover.

Also removes stale index/embedding files that referenced the deleted demos.

Keeps demo_library/demos/ (16 example demos) as format reference.

Adds WAA literature review documenting:

  • No GPT-5.x results published on WAA yet
  • WAA-V2 exists (141 tasks, stricter eval) but has only 3 GitHub stars
  • Current SOTA: PC Agent-E at 36% on WAA-V2
  • Cost estimates for running evaluations

Co-authored-by: Claude Opus 4.5 noreply@anthropic.com

Features

  • Add WAA demo recording script with guided workflow (36ef2fe)

  • Add record_waa_demos.py for easy demo recording on Windows

  • Rename demo_library/demos → synthetic_demos_legacy (to avoid confusion)

  • Update beads issues (close stale Azure ML issues, add demo generation task)

The script provides a fully guided experience:

  • Shows task instructions and tips
  • Handles recording with openadapt-capture
  • Allows redo if mistakes are made
  • Sends via Magic Wormhole for easy transfer

Usage on Windows:
python scripts/record_waa_demos.py

Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com


Detailed Changes: v0.2.0...v0.3.0

v0.2.0

06 Feb 19:45


v0.2.0 (2026-02-06)

Features

  • azure: Implement Azure ML parallelization for WAA evaluation (#24, 077f339)
  • docs: replace aspirational claims with honest placeholders
  • Remove unvalidated badges (95%+ success rate, 67% cost savings)
  • Add "First open-source WAA reproduction" as headline
  • Move WAA to top as main feature with status indicator
  • Change "Recent Improvements" to "Roadmap (In Progress)"
  • Remove v0.2.0 version references (current is v0.1.1)
  • Add Azure quota requirements note for parallelization
  • Mark features as [IN PROGRESS] where appropriate

Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com

  • feat(azure): implement Azure ML parallelization for WAA evaluation

Complete the Azure ML parallelization implementation:

  1. Agent config serialization (_serialize_agent_config):
     • Extracts provider, model, and API keys from agent
     • Passes OPENAI_API_KEY/ANTHROPIC_API_KEY via env vars
     • Supports OpenAI and Anthropic agents

  2. Worker command building (_build_worker_command):
     • Uses vanilla WAA run.py with --worker_id and --num_workers
     • Matches Microsoft's official Azure deployment pattern
     • Task distribution handled by WAA internally

  3. Result fetching (_fetch_worker_results, _parse_waa_results):
     • Downloads job outputs via Azure ML SDK
     • Parses WAA result.txt files (0.0 or 1.0 score)
     • Handles partial results for failed jobs

  4. Job status tracking:
     • Added job_name field to WorkerState
     • Updated _wait_and_collect_results to poll job status
     • Fixed: was checking compute status instead of job status

  5. Log fetching (get_job_logs in AzureMLClient):
     • Downloads logs via az ml job download
     • Supports tail parameter for last N lines
     • Updated health_checker to use new method

Uses vanilla windowsarena/winarena:latest with VERSION=11e.
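
To make steps 2 and 3 above concrete, here is a minimal sketch of how a per-worker command and result parsing might look. It is not the code from this release: the function names, the "python run.py" invocation, and the assumption that each result.txt sits in a directory named after its task are all hypothetical. Only the --worker_id and --num_workers flags and the 0.0/1.0 result.txt format come from the notes above.

```python
from pathlib import Path
from typing import Dict, List, Optional


def build_worker_command(worker_id: int, num_workers: int) -> List[str]:
    """Build an argv list for one worker (hypothetical helper).

    WAA is described as splitting tasks internally, so each worker only
    needs to know its own ID and the total worker count.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [
        "python", "run.py",  # assumed entry point invocation
        "--worker_id", str(worker_id),
        "--num_workers", str(num_workers),
    ]


def parse_result_file(path: Path) -> Optional[float]:
    """Read a result.txt containing a single score such as 0.0 or 1.0.

    Returns None when the file is missing or unparsable, so a failed job
    yields a partial result set rather than an exception.
    """
    try:
        return float(path.read_text().strip())
    except (OSError, ValueError):
        return None


def collect_scores(output_dir: Path) -> Dict[str, float]:
    """Collect scores from every result.txt under output_dir.

    Assumption: the parent directory name identifies the task.
    """
    scores: Dict[str, float] = {}
    for result_file in sorted(output_dir.rglob("result.txt")):
        score = parse_result_file(result_file)
        if score is not None:
            scores[result_file.parent.name] = score
    return scores
```

Returning None for unreadable files keeps the aggregation step tolerant of jobs that failed partway through, which is one way to read "handles partial results" above.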

  • docs: fix inaccurate "first reproduction" claim

WAA is already open-source from Microsoft. Changed to accurate claim: "Simplified CLI toolkit for Windows Agent Arena"

Updated value proposition to reflect what we actually provide:

  • Azure VM setup and SSH tunnel management
  • Agent adapters for Claude/GPT/custom agents
  • Results viewer
  • Parallelization support

  • docs: fix VM size to match code (D4s_v5 not D8ds_v5)

The code uses Standard_D4s_v5 (4 vCPUs) by default, not D8ds_v5. Updated all references to be accurate.

  • feat(cli): add azure-setup command for easy Azure configuration

New command that:

  • Checks Azure CLI installation and login status
  • Creates resource group (default: openadapt-agents)
  • Creates ML workspace (default: openadapt-ml)
  • Writes config to .env file

Usage: uv run python -m openadapt_evals.benchmarks.cli azure-setup

Also improved azure command error message to guide users to run setup.

  • feat(cli): add waa-image command for building custom Docker image

The vanilla windowsarena/winarena:latest image does NOT work for unattended WAA installation. This adds:

  • waa-image build - Build custom waa-auto image locally
  • waa-image push - Push to Docker Hub or ACR
  • waa-image build-push - Build and push in one command
  • waa-image check - Check if image exists in registry

Also updates azure.py to use openadaptai/waa-auto:latest as default image.

The custom Dockerfile (in waa_deploy/) includes:

  • Modern dockurr/windows base (auto-downloads Windows 11)
  • FirstLogonCommands patches for unattended installation
  • Python 3.9 with transformers 4.46.2 (navi agent compatibility)
  • api_agent.py for Claude/GPT support

  • feat(cli): add AWS ECR Public support for waa-image command
  • Add ECR as the default registry (ecr, dockerhub, acr options)
  • Auto-create ECR repository if it doesn't exist
  • Auto-login to ECR Public using AWS CLI
  • Update azure.py to use public.ecr.aws/g3w3k7s5/waa-auto:latest as default
  • Update docs with new default image

ECR Public is preferred because:

  • No Docker Hub login required
  • Uses existing AWS credentials
  • Public access for Azure ML to pull without cross-cloud auth

  • fix(cli): add --platform linux/amd64 flag for Docker build

The windowsarena/winarena base image is only available for linux/amd64. This fixes builds on macOS (arm64) by explicitly specifying the target platform.
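
As an illustration of the fix above, a build command pinned to linux/amd64 could be assembled like this. The helper name and its defaults are hypothetical and not taken from the CLI in this release. The --platform and -t flags are standard docker build options.

```python
from typing import List


def docker_build_command(
    tag: str,
    context_dir: str = ".",
    platform: str = "linux/amd64",
) -> List[str]:
    """Return an argv list for `docker build` with an explicit target platform.

    Pinning the platform keeps an arm64 host (for example Apple Silicon)
    from trying to resolve an arm64 variant of an amd64-only base image.
    """
    return ["docker", "build", "--platform", platform, "-t", tag, context_dir]
```

The list can be passed to subprocess.run(cmd, check=True). On an arm64 host the resulting image then runs under emulation, which is typically slower than native but is enough to build and push.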

  • feat(cli): add aws-costs command and waa-image delete action
  • Add aws-costs command to show AWS cost breakdown using Cost Explorer API
      • Shows current month costs (total and by service)
      • Shows historical monthly costs
      • Shows ECR storage costs specifically

  • Add waa-image delete action to clean up registry resources
      • ECR: Deletes repository with --force
      • Docker Hub: Shows manual instructions (free tier)
      • ACR: Deletes repository

  • Change default registry from ECR to Docker Hub
      • Docker Hub is free (no storage charges)
      • Use ECR when rate limiting becomes an issue
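
The per-service breakdown mentioned for aws-costs can be sketched as a pure function over a Cost Explorer response. This is an assumption about the approach, not the command's actual code: the function name is hypothetical, and the choice of the UnblendedCost metric is a guess. The input follows the response shape of Cost Explorer's GetCostAndUsage call when grouped by the SERVICE dimension.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict


def costs_by_service(response: dict, metric: str = "UnblendedCost") -> Dict[str, Decimal]:
    """Sum costs per service across all time periods in a response.

    Amounts arrive as strings, so Decimal is used to avoid float rounding.
    """
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]  # first key is the service name
            totals[service] += Decimal(group["Metrics"][metric]["Amount"])
    return dict(totals)


# Sample data: service names and amounts are made up for illustration.
sample_response = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["Service A"], "Metrics": {"UnblendedCost": {"Amount": "1.25", "Unit": "USD"}}},
            {"Keys": ["Service B"], "Metrics": {"UnblendedCost": {"Amount": "0.50", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["Service A"], "Metrics": {"UnblendedCost": {"Amount": "0.75", "Unit": "USD"}}},
        ]},
    ]
}
```

Keeping the aggregation separate from the API call makes it testable without AWS credentials.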

  • ci: add auto-release workflow

Automatically bumps version and creates tags on PR merge:

  • feat: minor version bump
  • fix/perf: patch version bump
  • docs/style/refactor/test/chore/ci/build: patch version bump

Triggers publish.yml which deploys to PyPI.

  • fix(azure): use SDK V1 DockerConfiguration for WAA container execution

Root cause: Azure ML compute instances don't have Docker installed. Our code used SDK V2 command jobs which run in bare Python environment, never calling /entry_setup.sh to start QEMU/Windows.

Fix follows Microsoft's official WAA Azure pattern:

  • Add azureml-core dependency (SDK V1)
  • Use DockerConfiguration with NET_ADMIN capability for QEMU networking
  • Create run_entry.py that calls /entry_setup.sh before running client
  • Create compute-instance-startup.sh to stop conflicting services (DNS, nginx)
  • Use ScriptRunConfig instead of raw command jobs

  • fix(cli): replace synthetic task IDs with real WAA UUID format
  • Updated CLI help text and examples to use valid WAA task IDs
  • Fixed smoke-live default task ID (critical: was causing immediate failure)
  • Updated README examples with real notepad/chrome task IDs
  • Fixed azure.py comment about WAA task ID format
  • Fixed retrieval_agent.py docstring example

Real task IDs used from test_all.json:

  • notepad: 366de66e-cbae-4d72-b042-26390db2b145-WOS
  • chrome: 2ae9ba84-3a0d-4d4c-8338-3a1478dc5fe3-wos

  • fix(cli): add domain prefix to WAA task IDs

WAA adapter creates task_ids as {domain}_{uuid}-WOS, not just {uuid}-WOS. Updated all examples to use correct format: notepad_366de66e... instead of just 366de66e....
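
The {domain}_{uuid}-WOS format can be sketched with a pair of helpers. These are hypothetical and not part of the adapter. The pattern assumes domains consist of letters and underscores, and it accepts the suffix in either case because the IDs listed from test_all.json show both -WOS and -wos.

```python
import re
import uuid
from typing import Tuple

# Assumed shape: <domain>_<36-char UUID>-WOS (suffix matched case-insensitively).
TASK_ID_RE = re.compile(
    r"^(?P<domain>[a-z_]+)_(?P<uuid>[0-9a-f-]{36})-WOS$",
    re.IGNORECASE,
)


def make_task_id(domain: str, task_uuid: str) -> str:
    """Join a domain and a UUID into the prefixed task ID form."""
    uuid.UUID(task_uuid)  # raises ValueError if the UUID is malformed
    return f"{domain}_{task_uuid}-WOS"


def split_task_id(task_id: str) -> Tuple[str, str]:
    """Split a prefixed task ID into (domain, uuid)."""
    match = TASK_ID_RE.match(task_id)
    if match is None:
        raise ValueError(f"not a domain-prefixed WAA task ID: {task_id!r}")
    return match["domain"], match["uuid"]
```

A bare ID such as 366de66e-cbae-4d72-b042-26390db2b145-WOS is rejected by split_task_id, which is the mismatch this fix addresses.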

  • fix(azure): enable SSH and fix SSH info detection for Azure ML compute instances
  • Add ssh_public_access_enabled=True when creating compute instances
  • Fix get_compute_ssh_info() to check network_settings.public_ip_address
  • Fix type check for compute instance type (lowercase comparison)

This enables VNC access to Azure ML compute instances for debugging WAA evaluation.


Co-authored-by: Claude Opus 4.5 noreply@anthropic.com


Detailed Changes: v0.1.2...v0.2.0

v0.1.2

29 Jan 23:02


v0.1.2 (2026-01-29)

Bug Fixes

  • ci: Remove build_command from semantic-release config (ed933f6)

The python-semantic-release action runs in a Docker container where uv is not available. Let the workflow handle building instead.

Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com

Continuous Integration

  • Add auto-release workflow (d221c19)

Automatically bumps version and creates tags on PR merge:

  • feat: minor version bump
  • fix/perf: patch version bump
  • docs/style/refactor/test/chore/ci/build: patch version bump

Triggers publish.yml which deploys to PyPI.
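
The bump rules listed above can be expressed as a small mapping. This is a sketch of the rule, not the workflow's actual implementation: the function names and the commit-subject regex are assumptions, and major bumps are left out because the notes do not describe them.

```python
import re
from typing import Optional

MINOR_TYPES = {"feat"}
PATCH_TYPES = {"fix", "perf", "docs", "style", "refactor", "test", "chore", "ci", "build"}

# Matches a conventional-commit prefix such as "feat:", "fix(cli):" or "docs!:".
SUBJECT_RE = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?:")


def bump_kind(commit_subject: str) -> Optional[str]:
    """Return "minor", "patch", or None for a commit subject line."""
    match = SUBJECT_RE.match(commit_subject)
    if match is None:
        return None
    commit_type = match["type"]
    if commit_type in MINOR_TYPES:
        return "minor"
    if commit_type in PATCH_TYPES:
        return "patch"
    return None


def next_version(version: str, kind: str) -> str:
    """Apply a minor or patch bump to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unsupported bump kind: {kind!r}")
```

Under these rules a feat commit on top of 0.2.0 gives 0.3.0, and a fix commit on top of 0.1.1 gives 0.1.2.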

Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com

  • Switch to python-semantic-release for automated versioning (7ed3de2)

Replaces manual commit parsing with python-semantic-release:

  • Automatic version bumping based on conventional commits
  • feat: -> minor, fix:/perf: -> patch
  • Creates GitHub releases automatically
  • Publishes to PyPI on release

Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com


Detailed Changes: v0.1.1...v0.1.2

v0.1.1

29 Jan 05:43


What's Changed

  • feat: P0 fixes - API parsing + evaluate endpoint by @abrichr in #1
  • Add standardized README badges by @abrichr in #2
  • fix: Use filename-based GitHub Actions badge URL by @abrichr in #3
  • Fix deprecated import paths in test fixtures by @abrichr in #4
  • Fix Azure ML compute instance cleanup to prevent quota exhaustion by @abrichr in #5
  • feat: Add benchmark viewer screenshots and auto-screenshot tool (P1 features) by @abrichr in #6
  • [P3] Azure cost optimization (Issue #9) by @abrichr in #13
  • [P1] Fix Azure nested virtualization (Issue #8) by @abrichr in #11
  • Visual Demos: Cost Tracking, Health Checker, and Animated Viewer by @abrichr in #14
  • docs: Add comprehensive screenshot generation documentation by @abrichr in #15
  • docs: reorganize markdown files into docs subdirectories by @abrichr in #18
  • feat(screenshots): add simple screenshot validation by @abrichr in #19
  • feat(dashboard): add Azure monitoring dashboard with real-time costs by @abrichr in #20
  • feat(wandb): add Weights & Biases integration with fixtures and reports by @abrichr in #21
  • feat: consolidate benchmark infrastructure and add results section by @abrichr in #22

Full Changelog: v0.1.0...v0.1.1

v0.1.0

17 Jan 00:04
