
refactor(cli): minimal WAA CLI with vanilla image support#14

Merged
abrichr merged 1 commit into main from feature/vanilla-waa-cli-clean
Jan 27, 2026

Conversation


@abrichr abrichr commented Jan 26, 2026

Summary

Complete CLI refactor for WAA benchmark automation. Replaces the 6800-line CLI with a minimal 1300-line implementation that uses the vanilla Microsoft WAA image.

Key Changes

CLI Refactor (-5500 lines)

  • Replaced complex vm subcommand structure with flat commands: create, run, probe, analyze, etc.
  • Removed unused code paths and simplified command routing
  • Added analyze command to parse and summarize benchmark results
  • Added --num-tasks to limit the number of tasks to run
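A minimal sketch of what the flat command structure could look like with `argparse` (command and flag names come from this PR; the parser wiring itself is a hypothetical illustration, not the actual implementation):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Flat subcommands (create, run, probe, analyze) instead of a nested
    'vm' command group. Handler dispatch is omitted for brevity."""
    parser = argparse.ArgumentParser(prog="openadapt_ml.benchmarks.cli")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("create", help="Create VM with Docker and WAA image")

    run = sub.add_parser("run", help="Run the benchmark")
    run.add_argument("--num-tasks", type=int, default=None,
                     help="Limit the number of tasks to run")
    run.add_argument("--model", default="gpt-4o")

    probe = sub.add_parser("probe", help="Check WAA server status")
    probe.add_argument("--wait", action="store_true",
                       help="Block until the server responds")

    sub.add_parser("analyze", help="Summarize downloaded benchmark results")
    return parser
```

Flat routing keeps dispatch to a single `dest="command"` lookup, which is what makes the large reduction in routing code possible.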

Vanilla WAA Image Support

  • Uses official windowsarena/winarena:latest Docker image
  • Custom Dockerfile copies Python 3.9 from vanilla image (fixes transformers compatibility)
  • IP patches for dockurr/windows compatibility (172.30.0.2)

Python 3.9 Compatibility Fix

  • GroundingDINO requires transformers 4.46.2 (not 5.x)
  • Fixed by copying Python 3.9 and all packages from vanilla WAA image
  • This resolves: AttributeError: 'BertModel' has no attribute 'get_head_mask'

Results Analysis

  • New analyze command parses downloaded benchmark logs
  • Shows success rate by domain
  • Handles ANSI color codes in logs
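A minimal sketch of the two pieces such an `analyze` command needs: stripping ANSI escape sequences before parsing, and aggregating success rates per domain. The helper names and the result format are hypothetical, not taken from the PR:

```python
import re
from collections import defaultdict

# Matches SGR color/style escape sequences like "\x1b[32m" that benchmark
# logs often embed around status strings.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text: str) -> str:
    """Remove ANSI color codes so log lines can be parsed as plain text."""
    return ANSI_RE.sub("", text)


def success_rate_by_domain(results):
    """results: iterable of (domain, succeeded) pairs.
    Returns {domain: fraction of successful tasks}."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [attempted, succeeded]
    for domain, ok in results:
        totals[domain][0] += 1
        totals[domain][1] += int(bool(ok))
    return {d: done / total for d, (total, done) in totals.items()}
```

Stripping the escapes first matters because a colorized `SUCCESS` marker would otherwise fail a plain substring match.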

Commands

# Create VM with Docker and WAA image
uv run python -m openadapt_ml.benchmarks.cli create

# Run benchmark (auto-downloads results)
uv run python -m openadapt_ml.benchmarks.cli run --num-tasks 5 --model gpt-4o

# Check WAA server status
uv run python -m openadapt_ml.benchmarks.cli probe --wait

# Analyze results
uv run python -m openadapt_ml.benchmarks.cli analyze

# View logs
uv run python -m openadapt_ml.benchmarks.cli logs --lines 100

# Clean up
uv run python -m openadapt_ml.benchmarks.cli deallocate -y

Files Changed

  • openadapt_ml/benchmarks/cli.py - Complete refactor (6800 → 1300 lines)
  • openadapt_ml/benchmarks/waa_deploy/Dockerfile - Python 3.9 compatibility
  • docs/CLI_V2_DESIGN.md - Design documentation
  • .gitignore - Coverage and analysis artifacts

Test Plan

  • probe - Correctly detects WAA server status
  • run --num-tasks 2 - Limits tasks correctly
  • analyze - Parses benchmark logs and shows results by domain
  • logs - Shows container logs
  • navi agent runs successfully (Python 3.9 fix working)

🤖 Generated with Claude Code

@abrichr abrichr force-pushed the feature/vanilla-waa-cli-clean branch from fcee717 to 47a4d85 Compare January 26, 2026 20:39
@abrichr abrichr changed the title from "fix: update default emulator IP to 20.20.20.21 for official WAA" to "fix: CLI improvements for vanilla WAA automation" Jan 26, 2026
@abrichr abrichr changed the title from "fix: CLI improvements for vanilla WAA automation" to "refactor(cli): minimal WAA CLI with vanilla image support" Jan 27, 2026
- Refactor CLI from 6800 to ~1300 lines with flat command structure
- Add analyze command to parse and summarize benchmark results
- Add --num-tasks flag to limit number of tasks to run
- Fix Python 3.9 compatibility by copying Python from vanilla WAA image
  (fixes transformers 4.46.2 compatibility with GroundingDINO)
- Add coverage and analysis artifacts to .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@abrichr abrichr force-pushed the feature/vanilla-waa-cli-clean branch 2 times, most recently from 5c51626 to 070225b Compare January 27, 2026 22:07
@abrichr abrichr merged commit 5557130 into main Jan 27, 2026
0 of 8 checks passed
abrichr added a commit that referenced this pull request Feb 5, 2026
- Fix broken build badge (publish.yml → release.yml)
- Add prominent "Parallel WAA Benchmark Evaluation" section near top
- Add detailed "WAA Benchmark Workflow" section (#14) with:
  - Single VM and parallel pool workflows
  - VNC access instructions
  - Architecture diagram
  - Cost estimates
- Update section numbering (Limitations → 15, Roadmap → 16)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
abrichr added a commit that referenced this pull request Feb 5, 2026
* docs(readme): add parallel WAA evaluation section, fix build badge

- Fix broken build badge (publish.yml → release.yml)
- Add prominent "Parallel WAA Benchmark Evaluation" section near top
- Add detailed "WAA Benchmark Workflow" section (#14) with:
  - Single VM and parallel pool workflows
  - VNC access instructions
  - Architecture diagram
  - Cost estimates
- Update section numbering (Limitations → 15, Roadmap → 16)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(readme): address self-review feedback

- Fix anchor placement (move before heading for proper navigation)
- Correct pool-delete → pool-cleanup (actual command name)
- Add pool-status example for getting worker IPs
- Add "prices vary by region" caveat

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>