# Releases: OpenAdaptAI/openadapt-evals

## v0.3.0 (2026-02-06)

### Chores
The synthetic_demos/ directory contained 154 generic template demos (e.g., "Open Notepad", "Navigate to example.com") that don't match actual WAA task IDs (UUIDs like 366de66e-cbae-4d72-b042-26390db2b145-WOS).
These were misleading: they suggested we had demo coverage when we didn't. Actual WAA tasks have specific instructions, like "create draft.txt, type 'This is a draft.', save to Documents", which the generic demos don't cover.
Also removes stale index/embedding files that referenced the deleted demos.
Keeps demo_library/demos/ (16 example demos) as format reference.
Adds WAA literature review documenting:
- No GPT-5.x results published on WAA yet
- WAA-V2 exists (141 tasks, stricter eval) but has only 3 GitHub stars
- Current SOTA: PC Agent-E at 36% on WAA-V2
- Cost estimates for running evaluations
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
### Features
- Add WAA demo recording script with guided workflow (36ef2fe)
- Add record_waa_demos.py for easy demo recording on Windows
- Rename demo_library/demos → synthetic_demos_legacy (to avoid confusion)
- Update beads issues (close stale Azure ML issues, add demo generation task)
The script provides a fully guided experience:
- Shows task instructions and tips
- Handles recording with openadapt-capture
- Allows redo if mistakes are made
- Sends via Magic Wormhole for easy transfer
Usage on Windows: `python scripts/record_waa_demos.py`
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
Detailed Changes: v0.2.0...v0.3.0
## v0.2.0 (2026-02-06)

### Features
- docs: replace aspirational claims with honest placeholders
  - Remove unvalidated badges (95%+ success rate, 67% cost savings)
  - Add "First open-source WAA reproduction" as headline
  - Move WAA to top as main feature with status indicator
  - Change "Recent Improvements" to "Roadmap (In Progress)"
  - Remove v0.2.0 version references (current is v0.1.1)
  - Add Azure quota requirements note for parallelization
  - Mark features as [IN PROGRESS] where appropriate
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- feat(azure): implement Azure ML parallelization for WAA evaluation
Complete the Azure ML parallelization implementation:
- Agent config serialization (`_serialize_agent_config`):
  - Extracts provider, model, and API keys from agent
  - Passes OPENAI_API_KEY/ANTHROPIC_API_KEY via env vars
  - Supports OpenAI and Anthropic agents
- Worker command building (`_build_worker_command`):
  - Uses vanilla WAA run.py with --worker_id and --num_workers
  - Matches Microsoft's official Azure deployment pattern
  - Task distribution handled by WAA internally
- Result fetching (`_fetch_worker_results`, `_parse_waa_results`):
  - Downloads job outputs via Azure ML SDK
  - Parses WAA result.txt files (0.0 or 1.0 score)
  - Handles partial results for failed jobs
- Job status tracking:
  - Added `job_name` field to `WorkerState`
  - Updated `_wait_and_collect_results` to poll job status
  - Fixed: was checking compute status instead of job status
- Log fetching (`get_job_logs` in `AzureMLClient`):
  - Downloads logs via `az ml job download`
  - Supports tail parameter for last N lines
  - Updated health_checker to use new method
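Taken together, the pieces above can be sketched roughly as follows. All names, the agent dict shape, and the directory layout are illustrative assumptions for this sketch, not the repository's actual code:

```python
import os
from pathlib import Path


def serialize_agent_config(agent: dict) -> dict:
    """Extract provider/model and the matching API key for the worker env."""
    key_vars = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}
    provider = agent["provider"]
    if provider not in key_vars:
        raise ValueError(f"Unsupported provider: {provider}")
    return {
        "provider": provider,
        "model": agent["model"],
        # API keys travel as environment variables, never on the command line.
        "env": {key_vars[provider]: os.environ.get(key_vars[provider], "")},
    }


def build_worker_command(worker_id: int, num_workers: int) -> list:
    """Invoke vanilla WAA run.py; WAA splits the task set across workers."""
    return ["python", "run.py",
            "--worker_id", str(worker_id),
            "--num_workers", str(num_workers)]


def parse_waa_results(results_dir: str) -> dict:
    """Collect per-task scores from WAA result.txt files (0.0 or 1.0).

    Unparsable files are skipped, so partial results from failed jobs
    still aggregate into a score map.
    """
    scores = {}
    for path in Path(results_dir).glob("**/result.txt"):
        try:
            scores[path.parent.name] = float(path.read_text().strip())
        except ValueError:
            continue  # corrupt/partial output from a failed job
    return scores


def tail_lines(text: str, tail=None) -> str:
    """Mirror the log-fetching tail parameter: keep only the last N lines."""
    return text if tail is None else "\n".join(text.splitlines()[-tail:])
```

This is only the pure-Python shape of the workflow; the actual implementation submits the command via the Azure ML SDK and downloads job outputs before parsing.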
Uses vanilla windowsarena/winarena:latest with VERSION=11e.
- docs: fix inaccurate "first reproduction" claim
WAA is already open-source from Microsoft. Changed to accurate claim: "Simplified CLI toolkit for Windows Agent Arena"
Updated value proposition to reflect what we actually provide:
- Azure VM setup and SSH tunnel management
- Agent adapters for Claude/GPT/custom agents
- Results viewer
- Parallelization support
- docs: fix VM size to match code (D4s_v5 not D8ds_v5)
The code uses Standard_D4s_v5 (4 vCPUs) by default, not D8ds_v5. Updated all references to be accurate.
- feat(cli): add azure-setup command for easy Azure configuration
New command that:
- Checks Azure CLI installation and login status
- Creates resource group (default: openadapt-agents)
- Creates ML workspace (default: openadapt-ml)
- Writes config to .env file

Usage: `uv run python -m openadapt_evals.benchmarks.cli azure-setup`

Also improved the azure command error message to guide users to run setup.
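The last step, writing config to a .env file, might look roughly like this. The variable names and defaults here are assumptions for illustration; azure-setup may write different keys:

```python
from pathlib import Path


def write_env_config(path, resource_group="openadapt-agents",
                     workspace="openadapt-ml"):
    """Write Azure settings to a .env file (sketch; key names assumed)."""
    lines = [
        f"AZURE_RESOURCE_GROUP={resource_group}",
        f"AZURE_ML_WORKSPACE={workspace}",
    ]
    Path(path).write_text("\n".join(lines) + "\n")
```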
- feat(cli): add waa-image command for building custom Docker image
The vanilla windowsarena/winarena:latest image does NOT work for unattended WAA installation. This adds:
- `waa-image build` - Build custom waa-auto image locally
- `waa-image push` - Push to Docker Hub or ACR
- `waa-image build-push` - Build and push in one command
- `waa-image check` - Check if image exists in registry
Also updates azure.py to use openadaptai/waa-auto:latest as default image.
The custom Dockerfile (in waa_deploy/) includes:
- Modern dockurr/windows base (auto-downloads Windows 11)
- FirstLogonCommands patches for unattended installation
- Python 3.9 with transformers 4.46.2 (navi agent compatibility)
- api_agent.py for Claude/GPT support
- feat(cli): add AWS ECR Public support for waa-image command
- Add ECR as the default registry (ecr, dockerhub, acr options)
- Auto-create ECR repository if it doesn't exist
- Auto-login to ECR Public using AWS CLI
- Update azure.py to use public.ecr.aws/g3w3k7s5/waa-auto:latest as default
- Update docs with new default image

ECR Public is preferred because:
- No Docker Hub login required
- Uses existing AWS credentials
- Public access lets Azure ML pull without cross-cloud auth
- fix(cli): add --platform linux/amd64 flag for Docker build
The windowsarena/winarena base image is only available for linux/amd64. This fixes builds on macOS (arm64) by explicitly specifying the target platform.
- feat(cli): add aws-costs command and waa-image delete action
- Add `aws-costs` command to show AWS cost breakdown using Cost Explorer API:
  - Shows current month costs (total and by service)
  - Shows historical monthly costs
  - Shows ECR storage costs specifically
- Add `waa-image delete` action to clean up registry resources:
  - ECR: Deletes repository with --force
  - Docker Hub: Shows manual instructions (free tier)
  - ACR: Deletes repository
- Change default registry from ECR to Docker Hub:
  - Docker Hub is free (no storage charges)
  - Use ECR when rate limiting becomes an issue
- ci: add auto-release workflow
Automatically bumps version and creates tags on PR merge:
- feat: minor version bump
- fix/perf: patch version bump
- docs/style/refactor/test/chore/ci/build: patch version bump
Triggers publish.yml which deploys to PyPI.
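The bump rules amount to a tiny mapping from conventional-commit type to semver component. A sketch (not the actual workflow or python-semantic-release configuration):

```python
def bump_for(commit_type: str) -> str:
    """Map a conventional-commit type to a semver bump (sketch of the rules)."""
    if commit_type == "feat":
        return "minor"
    # fix/perf and all maintenance types (docs, style, refactor, test,
    # chore, ci, build) produce a patch release in this workflow.
    return "patch"
```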
- fix(azure): use SDK V1 DockerConfiguration for WAA container execution
Root cause: Azure ML compute instances don't have Docker installed. Our code used SDK V2 command jobs which run in bare Python environment, never calling /entry_setup.sh to start QEMU/Windows.
Fix follows Microsoft's official WAA Azure pattern:
- Add azureml-core dependency (SDK V1)
- Use DockerConfiguration with NET_ADMIN capability for QEMU networking
- Create run_entry.py that calls /entry_setup.sh before running client
- Create compute-instance-startup.sh to stop conflicting services (DNS, nginx)
- Use ScriptRunConfig instead of raw command jobs
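The SDK V1 pattern described above can be sketched roughly as follows. The compute target name, experiment name, and environment setup are illustrative assumptions, not the repository's exact code, and this fragment requires an Azure ML workspace to run:

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import DockerConfiguration

ws = Workspace.from_config()

# Run inside the WAA container with the capability QEMU networking needs.
docker_config = DockerConfiguration(
    use_docker=True,
    arguments=["--cap-add", "NET_ADMIN"],
)

env = Environment("waa")
env.docker.base_image = "windowsarena/winarena:latest"
env.python.user_managed_dependencies = True  # use the image's own Python

src = ScriptRunConfig(
    source_directory=".",
    script="run_entry.py",  # calls /entry_setup.sh before the WAA client
    compute_target="waa-compute",       # assumed compute target name
    environment=env,
    docker_runtime_config=docker_config,
)
Experiment(ws, "waa-eval").submit(src)
```

Unlike SDK V2 command jobs, `ScriptRunConfig` with a `DockerConfiguration` runs the script inside the specified container, so the image's entrypoint tooling is available.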
- fix(cli): replace synthetic task IDs with real WAA UUID format
- Updated CLI help text and examples to use valid WAA task IDs
- Fixed smoke-live default task ID (critical: was causing immediate failure)
- Updated README examples with real notepad/chrome task IDs
- Fixed azure.py comment about WAA task ID format
- Fixed retrieval_agent.py docstring example

Real task IDs used from test_all.json:
- notepad: 366de66e-cbae-4d72-b042-26390db2b145-WOS
- chrome: 2ae9ba84-3a0d-4d4c-8338-3a1478dc5fe3-wos
- fix(cli): add domain prefix to WAA task IDs
WAA adapter creates task_ids as `{domain}_{uuid}-WOS`, not just `{uuid}-WOS`. Updated all examples to use correct format: `notepad_366de66e...` instead of just `366de66e...`.
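For illustration, composing an adapter task ID from its parts might look like this (the helper name is hypothetical; only the `{domain}_{uuid}-WOS` format comes from the fix above):

```python
def waa_task_id(domain: str, uuid: str) -> str:
    """Compose a WAA adapter task ID as {domain}_{uuid}-WOS (sketch)."""
    return f"{domain}_{uuid}-WOS"
```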
- fix(azure): enable SSH and fix SSH info detection for Azure ML compute instances
- Add ssh_public_access_enabled=True when creating compute instances
- Fix get_compute_ssh_info() to check network_settings.public_ip_address
- Fix type check for compute instance type (lowercase comparison)
This enables VNC access to Azure ML compute instances for debugging WAA evaluation.
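A sketch of enabling SSH at creation time with the Azure ML SDK v2. The instance name, size, and placeholder subscription ID are illustrative, and this fragment requires Azure credentials to run:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ComputeInstance
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="openadapt-agents",
    workspace_name="openadapt-ml",
)

ci = ComputeInstance(
    name="waa-debug-ci",             # assumed instance name
    size="Standard_D4s_v5",
    ssh_public_access_enabled=True,  # the flag this fix adds
)
ml_client.compute.begin_create_or_update(ci).result()
```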
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
Detailed Changes: v0.1.2...v0.2.0
## v0.1.2 (2026-01-29)

### Bug Fixes
- ci: Remove build_command from semantic-release config (ed933f6)
The python-semantic-release action runs in a Docker container where uv is not available. Let the workflow handle building instead.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
### Continuous Integration
- Add auto-release workflow (d221c19)
Automatically bumps version and creates tags on PR merge:
- feat: minor version bump
- fix/perf: patch version bump
- docs/style/refactor/test/chore/ci/build: patch version bump
Triggers publish.yml which deploys to PyPI.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Switch to python-semantic-release for automated versioning (7ed3de2)
Replaces manual commit parsing with python-semantic-release:
- Automatic version bumping based on conventional commits
- feat: -> minor, fix:/perf: -> patch
- Creates GitHub releases automatically
- Publishes to PyPI on release
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
Detailed Changes: v0.1.1...v0.1.2
## v0.1.1

### What's Changed
- feat: P0 fixes - API parsing + evaluate endpoint by @abrichr in #1
- Add standardized README badges by @abrichr in #2
- fix: Use filename-based GitHub Actions badge URL by @abrichr in #3
- Fix deprecated import paths in test fixtures by @abrichr in #4
- Fix Azure ML compute instance cleanup to prevent quota exhaustion by @abrichr in #5
- feat: Add benchmark viewer screenshots and auto-screenshot tool (P1 features) by @abrichr in #6
- [P3] Azure cost optimization (Issue #9) by @abrichr in #13
- [P1] Fix Azure nested virtualization (Issue #8) by @abrichr in #11
- Visual Demos: Cost Tracking, Health Checker, and Animated Viewer by @abrichr in #14
- docs: Add comprehensive screenshot generation documentation by @abrichr in #15
- docs: reorganize markdown files into docs subdirectories by @abrichr in #18
- feat(screenshots): add simple screenshot validation by @abrichr in #19
- feat(dashboard): add Azure monitoring dashboard with real-time costs by @abrichr in #20
- feat(wandb): add Weights & Biases integration with fixtures and reports by @abrichr in #21
- feat: consolidate benchmark infrastructure and add results section by @abrichr in #22
Full Changelog: v0.1.0...v0.1.1
## v0.1.0
Full Changelog: https://github.com/OpenAdaptAI/openadapt-evals/commits/v0.1.0