Conversation
Implements a two-tier nightly GitHub Actions workflow that verifies git-ai hooks fire correctly with real agent CLI binaries (Claude Code, Codex, Gemini CLI, Droid, OpenCode) on both stable and latest releases.

Tier 1 (no API keys): installs each agent CLI, runs `git-ai install`, verifies hook config files contain the correct checkpoint commands, then exercises the full attribution pipeline with synthetic checkpoint data via the agent-v1 preset.

Tier 2 (live, requires API keys): runs each agent with a deterministic prompt in a test repo and verifies authorship notes and blame output.

New files:
- .github/workflows/nightly-agent-integration.yml
- scripts/nightly/verify-hook-wiring.sh
- scripts/nightly/test-synthetic-checkpoint.sh
- scripts/nightly/test-live-agent.sh
- scripts/nightly/verify-attribution.sh

Hook config paths verified against src/mdm/agents/*.rs:
- claude: ~/.claude/settings.json
- codex: ~/.codex/config.toml
- gemini: ~/.gemini/settings.json
- droid: ~/.factory/settings.json
- opencode: ~/.config/opencode/plugin/git-ai.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
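The Tier 1 hook-wiring check can be sketched in Python (the real check lives in scripts/nightly/verify-hook-wiring.sh as shell; the `hook_configured` helper and the `base` parameter are illustrative, while the paths mirror the list above):

```python
from pathlib import Path

# Agent -> hook config file, mirroring the paths verified against
# src/mdm/agents/*.rs in the commit above.
HOOK_CONFIGS = {
    "claude": "~/.claude/settings.json",
    "codex": "~/.codex/config.toml",
    "gemini": "~/.gemini/settings.json",
    "droid": "~/.factory/settings.json",
    "opencode": "~/.config/opencode/plugin/git-ai.ts",
}

def hook_configured(agent: str, base: Path = Path.home()) -> bool:
    """True if the agent's config file exists and mentions the checkpoint command."""
    rel = HOOK_CONFIGS[agent].replace("~/", "")
    path = base / rel
    return path.is_file() and "checkpoint" in path.read_text()
```

The `base` parameter exists only so the check is testable against a temporary directory instead of the real home directory.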
Neither file belongs in the repo: .mcp.json is local tooling config and the plan document was a design scratch pad, not a deliverable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. scripts/nightly/test-synthetic-checkpoint.sh: fix the transcript message schema in the synthetic checkpoint JSON payload. The Rust Message enum uses `#[serde(tag = "type", rename_all = "snake_case")]`, so messages require `"type"` and `"text"` fields — not `"role"` and `"content"`. The old schema caused deserialization to fail on every Tier 1 run.

2. .github/workflows/nightly-agent-integration.yml: fix the notify-on-failure condition. With `if: failure()`, GitHub Actions skips the job entirely when tier2-live-integration is skipped (e.g. on tier1-only runs), silently swallowing Tier 1 failures. Replace it with an explicit always() guard that checks each dependency's result individually.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
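The schema fix can be illustrated with a small sketch (the variant names and the payload envelope here are assumptions for illustration; only the `"type"`/`"text"` field requirement comes from the serde attributes quoted above):

```python
import json

# serde's #[serde(tag = "type", ...)] produces an internally tagged enum:
# each message carries a "type" discriminant plus the variant's own fields
# ("text" here) — not chat-style "role"/"content" pairs.
broken = {"role": "user", "content": "add a fibonacci function"}  # fails to deserialize
fixed = [
    {"type": "user", "text": "add a fibonacci function"},
    {"type": "assistant", "text": "Added fibonacci() to utils/math_utils.py"},
]
payload = json.dumps({"transcript": {"messages": fixed}})
print(payload)
```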
Add a pull_request `labeled` event trigger so the full nightly suite runs whenever someone applies the 'Integration' label to any PR — in addition to the existing nightly schedule and workflow_dispatch paths. The gate condition on the resolve-versions job ensures the downstream matrix jobs only run for the correct trigger, not for every label event. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The label is 'integration', not 'Integration'. GitHub label names are case-sensitive in Actions expressions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the placeholder hello.txt smoke test with real end-to-end tests that verify git-ai's entire attribution pipeline.

test-live-agent.sh:
- Seeds the test repo with a real Python module (utils/math_utils.py) containing add, subtract, and is_prime functions
- Runs the real agent CLI with a substantive prompt: add a fibonacci function using an iterative approach and commit it
- Falls back to a manual commit if the agent wrote code but didn't commit (the post-commit hook still fires and writes the authorship note as long as working log data was captured during the agent run)
- Idempotent across retry attempts

verify-attribution.sh:
- Checks the fibonacci function was actually added to the Python file
- Verifies ≥3 commits exist (initial + seed + agent)
- Fetches and parses the authorship note from refs/notes/ai
- Asserts schema_version = "authorship/3.0.0"
- Asserts at least one prompt session was recorded (hard fail)
- Fuzzy-matches agent_id.tool against the agent name
- Checks transcript messages were captured
- Verifies utils/math_utils.py appears in the attestation section
- Runs git-ai blame and checks AI attribution on fibonacci lines
- Saves all artefacts (raw note, parsed metadata, blame output) to RESULTS_DIR for upload

Workflow: increase the Tier 2 job timeout from 25→45 min and the retry timeout from 12→20 min to accommodate seeding plus real agent API calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
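The note checks can be sketched as a pure function over the raw note JSON (the schema_version value comes from the check described above; the "prompt_sessions" and "agent_id" field names are assumptions about the note's shape):

```python
import json

def check_authorship_note(raw: str) -> list:
    """Return a list of failure messages for a raw authorship note (assumed shape)."""
    failures = []
    meta = json.loads(raw)
    if meta.get("schema_version") != "authorship/3.0.0":
        failures.append("wrong schema_version")
    if not meta.get("prompt_sessions"):
        failures.append("no prompt session recorded")  # hard fail in the real script
    return failures

sample = json.dumps({
    "schema_version": "authorship/3.0.0",
    "prompt_sessions": [{"agent_id": {"tool": "claude"}}],
})
print(check_authorship_note(sample))  # []
```

Returning a failure list rather than exiting early lets a runner log every problem in one pass before failing the job.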
The install-scripts-local workflow does more than validate install scripts — it verifies full end-to-end hook wiring between git-ai and Claude Code. Rename the workflow and job names to reflect that. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the fake claude binary stub with real npm-installed agent CLIs and add a matrix covering all four supported agents. This makes the End-to-End tests meaningful: install.sh now runs git-ai install-hooks against actual agent binaries, which auto-detect the installed tool and write real hook configuration to each agent's config directory. Verification uses the existing verify-hook-wiring.sh script (Unix) and equivalent inline PowerShell checks (Windows) to confirm hooks were written to the correct agent-specific location. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs in the E2E test setup:

1. opencode npm package: the package is "opencode-ai", not "opencode". The bare "opencode" name returns a 404 from the npm registry. Fixed in both the E2E install workflow and the nightly agent integration workflow.

2. codex hook verification: the grep pattern "checkpoint codex" expects a JSON-style command string, but Codex config uses a TOML array whose elements are comma-separated: notify = ["<bin>", "checkpoint", "codex", ...]. Changed to grep for just "checkpoint", which appears in the array and is sufficient to confirm the hook is configured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
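The grep mismatch is easy to reproduce with substring checks (the config snippets below are illustrative; the binary path is hypothetical):

```python
# In TOML the command is an array of separate quoted strings, so the words
# "checkpoint" and "codex" are never adjacent in the file — the old
# two-word pattern cannot match.
codex_toml = 'notify = ["/opt/git-ai/git-ai", "checkpoint", "codex"]'
claude_json = '{"command": "/opt/git-ai/git-ai checkpoint claude"}'

assert "checkpoint codex" not in codex_toml    # old pattern: never matches
assert "checkpoint" in codex_toml              # relaxed pattern: matches
assert "checkpoint claude" in claude_json      # JSON-style configs keep words adjacent
print("patterns behave as described")
```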
The same TOML array format issue that was fixed in verify-hook-wiring.sh for Unix also affects the Windows inline PowerShell check. Codex stores its hook as a TOML array (notify = ["<bin>", "checkpoint", "codex", ...]) so Select-String for "checkpoint codex" never matches. Changed to match just "checkpoint". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n verify-attribution.sh The `[ $? -eq 0 ] || fail "..."` guard was dead code under `set -euo pipefail`: if the python3 heredoc exits with code 1, `set -e` terminates the script immediately before the guard is reached, producing a silent exit with no diagnostic logged to $LOG. Replace with `if ! python3 ... <<'PYEOF' ... then fail "..." fi`, which is exempt from `set -e` and ensures the descriptive failure message is written to $LOG before exiting. Resolves Devin review comment BUG_pr-review-job-8b70596b_0002. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
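The dead-guard behaviour can be demonstrated by driving bash from Python (assuming bash is on PATH; the echo strings are illustrative):

```python
import subprocess

# Under `set -e`, a failing command aborts the script before a trailing
# `[ $? -eq 0 ] || ...` guard ever runs; an `if ! cmd; then ... fi`
# wrapper is exempt from `set -e`, so its branch does run.
dead_guard = 'set -euo pipefail\nfalse\n[ $? -eq 0 ] || echo "guard reached"\n'
live_guard = 'set -euo pipefail\nif ! false; then echo "guard reached"; fi\n'

dead = subprocess.run(["bash", "-c", dead_guard], capture_output=True, text=True)
live = subprocess.run(["bash", "-c", live_guard], capture_output=True, text=True)
print(repr(dead.stdout))  # '' — the script died on `false`, guard skipped
print(repr(live.stdout))  # 'guard reached\n'
```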
The Tier 1 and Tier 2 nightly jobs were calling `git-ai install` to set up agent hooks, but never creating the `git` → `git-ai` symlink in the release directory. When test scripts called `git commit`, the system git ran instead of the git-ai proxy, so the post-commit hook never fired and no authorship note was written to refs/notes/ai. Add `ln -sf .../git-ai .../git` in both the Tier 1 and Tier 2 "Install git-ai hooks in test repo" steps so that all `git` invocations inside test scripts (which prepend the release dir to PATH) route through git-ai and trigger the expected hook behaviour. Resolves Devin review comment BUG_pr-review-job-bf54cac596f44273b5f8565f81a63daf_0001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
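The symlink routing can be sketched as follows (paths and the stub script are hypothetical stand-ins for the real release dir and git-ai binary; POSIX only):

```python
import os
import subprocess
import tempfile

# A `git` symlink pointing at the git-ai binary inside the release dir
# means any `git` invocation from a script that prepends that dir to PATH
# routes through git-ai instead of the system git.
release = tempfile.mkdtemp()
proxy = os.path.join(release, "git-ai")
with open(proxy, "w") as f:
    f.write("#!/bin/sh\necho git-ai proxy\n")  # stand-in for the real binary
os.chmod(proxy, 0o755)
os.symlink(proxy, os.path.join(release, "git"))  # ln -sf .../git-ai .../git

env = dict(os.environ, PATH=release + os.pathsep + os.environ.get("PATH", ""))
out = subprocess.run(["git", "status"], env=env, capture_output=True, text=True)
print(out.stdout.strip())  # the proxy answered, not system git
```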
The previous Lint (ubuntu-latest) check failed on `go-task/setup-task@v1` (not on any code change) — the same action passed on the identical commit via e2e-tests. No code changes; forcing a clean CI run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. verify-attribution.sh: guard the empty-string fuzzy match. `"" in "claude"` is True in Python, so a missing agent_id.tool would always report PASS. Added `if tool and (...)` to require a non-empty tool string before the fuzzy match runs. Resolves Devin BUG_pr-review-job-032b242ab75044ebac035a42020d7fe3_0001.

2. test-live-agent.sh: add `sudo` to the ripgrep fallback install. `apt-get install` on GitHub Actions ubuntu-latest requires root; without `sudo` the install failed silently (2>/dev/null || true), leaving `rg` absent and potentially causing the Gemini CLI to hang. Resolves Devin BUG_pr-review-job-6b947f0c5f1e475bb3ffbeba9e6056de_0001.

3. nightly-agent-integration.yml: deduplicate stable/latest matrix entries. `npm view <pkg> version` and `npm view <pkg> dist-tags.latest` return the same value, so the stable and latest channels always tested the same version, doubling CI cost for zero extra coverage. Now queries `dist-tags.next` for the latest channel (pre-release/canary), falling back to stable_ver if no `next` tag exists, and skips the latest entry entirely when it would duplicate stable. Resolves Devin BUG_pr-review-job-6b947f0c5f1e475bb3ffbeba9e6056de_0002.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
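The empty-string footgun and its guard can be shown in a few lines (the `tool_matches` helper is illustrative, not the script's actual function name):

```python
# Python treats "" as a substring of every string, so the original fuzzy
# match reported PASS even when agent_id.tool was missing entirely.
def tool_matches(tool: str, agent: str) -> bool:
    # Guarded version: require a non-empty tool before fuzzy matching.
    return bool(tool) and (tool in agent or agent in tool)

assert ("" in "claude") is True           # the footgun
assert tool_matches("", "claude") is False
assert tool_matches("claude-code", "claude") is True
print("guard behaves as described")
```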
The previous fix queried dist-tags.next for latest_ver but still used @latest in the npm install command, which resolves to the stable release — identical to the stable channel and defeating the entire purpose of the latest matrix entry. Change the npm_pkg construction for the latest channel to use @next so the pre-release/canary version is actually installed when it exists. Resolves Devin BUG_pr-review-job-070479ba6d7041699555d4dfa9779fa3_0001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
npm view <pkg> dist-tags.next exits with code 0 and returns an empty
string (or "undefined") when the tag does not exist in npm 10+, rather
than raising a non-zero exit. This meant CalledProcessError was never
raised, latest_ver was set to "" or "undefined", the dedup check
("" != stable_ver) didn't fire, and a matrix entry was emitted with
npm_pkg="<pkg>@next" — causing npm install to fail with ETARGET.
Add an explicit check after .strip(): if the result is empty or equals
the string "undefined", fall back to stable_ver, triggering the same
deduplication skip as the CalledProcessError path.
Resolves Devin BUG_pr-review-job-874dec7614a64a5e952cf18579ebc182_0001.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```shell
python3 - <<'PY'
import json, subprocess, os

agents = {
    "claude": {"pkg": "@anthropic-ai/claude-code", "key": "ANTHROPIC_API_KEY"},
    "codex": {"pkg": "@openai/codex", "key": "OPENAI_API_KEY"},
    "gemini": {"pkg": "@google/gemini-cli", "key": "GEMINI_API_KEY"},
    "opencode": {"pkg": "opencode-ai", "key": "ANTHROPIC_API_KEY"},
}

headless_cmds = {
    "claude": "claude -p --dangerously-skip-permissions --max-turns 3",
    "codex": "codex exec --full-auto",
    "gemini": "gemini --approval-mode=yolo",
    "opencode": "opencode run --command",
}

matrix = {"include": []}
for agent, info in agents.items():
    try:
        stable_ver = subprocess.check_output(
            ["npm", "view", info["pkg"], "version"],
            text=True, stderr=subprocess.DEVNULL
        ).strip()
        # Try the "next" dist-tag for a pre-release; fall back to stable
        # to avoid doubling CI cost when no canary exists
        try:
            latest_ver = subprocess.check_output(
                ["npm", "view", info["pkg"], "dist-tags.next"],
                text=True, stderr=subprocess.DEVNULL
            ).strip()
            # npm 10+ exits 0 with empty output or "undefined" when the
            # dist-tag doesn't exist, so CalledProcessError is not raised
            if not latest_ver or latest_ver == "undefined":
                latest_ver = stable_ver
        except subprocess.CalledProcessError:
            latest_ver = stable_ver  # No pre-release; skip duplicate
    except subprocess.CalledProcessError:
        print(f"Warning: Could not resolve versions for {info['pkg']}", flush=True)
        stable_ver = "latest"
        latest_ver = "latest"

    for channel in ["stable", "latest"]:
        ver = stable_ver if channel == "stable" else latest_ver
        # Skip the latest channel when it resolves to the same version as
        # stable — no additional coverage, just wastes CI resources
        if channel == "latest" and latest_ver == stable_ver:
            continue
        npm_pkg = f"{info['pkg']}@{ver}" if channel == "stable" else f"{info['pkg']}@next"
        matrix["include"].append({
            "agent": agent,
            "channel": channel,
            "npm_pkg": npm_pkg,
            "version": ver,
            "api_key_var": info["key"],
            "headless_cmd": headless_cmds[agent],
        })

# Droid uses curl installer (latest only, no npm version pinning)
matrix["include"].append({
    "agent": "droid",
    "channel": "latest",
    "npm_pkg": "",
    "version": "latest",
    "api_key_var": "FACTORY_API_KEY",
    "headless_cmd": "droid exec --auto high",
})

with open(os.environ["GITHUB_OUTPUT"], "a") as f:
    f.write(f"matrix={json.dumps(matrix)}\n")

print(f"Matrix built: {len(matrix['include'])} entries", flush=True)
PY
```
🟡 agents workflow_dispatch input is defined but never consumed by the matrix builder
The agents input allows users to specify "claude" or "claude,codex" when manually triggering the workflow, with the documented intent of filtering which agents to test. However, the Python matrix-builder script at lines 44–116 never reads github.event.inputs.agents — it unconditionally builds entries for all four npm agents plus Droid.
Root cause and impact
A user who triggers workflow_dispatch with agents: "claude" expecting to test only Claude will instead run the full matrix (all agents × all channels), wasting CI time and potentially burning API credits in Tier 2. The input parameter at nightly-agent-integration.yml:10-12 has no effect on the matrix output at nightly-agent-integration.yml:44-116.
Actual behavior: All agents are always included in the matrix regardless of the agents input value.
Expected behavior: When agents is not "all", only the specified agents should appear in the matrix.
Prompt for agents
In .github/workflows/nightly-agent-integration.yml, the Python matrix-builder script (lines 44-116) needs to read the agents workflow_dispatch input and filter accordingly. At the top of the Python script (around line 45), read the input:
```python
requested = os.environ.get("INPUT_AGENTS", "all").strip()
requested_set = None if requested == "all" else set(a.strip() for a in requested.split(","))
```
Then, inside the for-loop over agents (around line 62), skip agents not in the requested set:
```python
if requested_set is not None and agent not in requested_set:
    continue
```
Similarly, conditionally include the Droid entry (around line 102) only when requested_set is None or "droid" in requested_set.
You also need to pass the input as an env var to the step. Add to the step at line 41:
```yaml
env:
  INPUT_AGENTS: ${{ github.event.inputs.agents || 'all' }}
```
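The suggested filter can be sketched as a small pure function (the `select_agents` helper and the env-dict parameter are illustrative; the real script would read os.environ directly):

```python
import os

# All agents the matrix builder knows about, in emission order.
all_agents = ["claude", "codex", "gemini", "opencode", "droid"]

def select_agents(env: dict) -> list:
    """Filter agents by a comma-separated INPUT_AGENTS value, 'all' meaning no filter."""
    requested = env.get("INPUT_AGENTS", "all").strip()
    if requested == "all":
        return all_agents
    requested_set = {a.strip() for a in requested.split(",")}
    return [a for a in all_agents if a in requested_set]

print(select_agents({}))                                 # all five agents
print(select_agents({"INPUT_AGENTS": "claude,codex"}))   # ['claude', 'codex']
```

Taking the environment as a parameter keeps the droid special case uniform: it is filtered like any other agent rather than needing a separate conditional.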
Summary
- .github/workflows/nightly-agent-integration.yml — a two-tier nightly workflow that installs real agent CLI binaries and verifies git-ai hook wiring and attribution end-to-end
- scripts/nightly/ — four helper scripts implementing the test logic
- NIGHTLY_INTEGRATION_PLAN.md — documenting the full design rationale and open questions

Test Architecture

Tier 1 — Hook Wiring (no API keys, free)

Builds git-ai from source, installs each agent CLI (Claude Code, Codex, Gemini, Droid, OpenCode) at both stable and latest versions via a dynamic matrix, then:
- runs git-ai install and verifies the correct checkpoint commands appear in each agent's config file
- exercises the full attribution pipeline with synthetic checkpoint data (agent-v1 preset)

Tier 2 — Live Integration (requires API key secrets)

Runs each agent with a minimal deterministic prompt ("create hello.txt, commit it"), then verifies the file was created, a commit landed, and authorship notes are present in refs/notes/ai. Pre-release failures are non-blocking (continue-on-error: true).

Hook config paths (verified against src/mdm/agents/*.rs)

- claude: ~/.claude/settings.json
- codex: ~/.codex/config.toml
- gemini: ~/.gemini/settings.json
- droid: ~/.factory/settings.json
- opencode: ~/.config/opencode/plugin/git-ai.ts

Secrets required (Tier 2 only)

ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, FACTORY_API_KEY, SLACK_BOT_TOKEN, SLACK_CHANNEL_ID. Tier 1 runs without any secrets.

Cost estimate

~$0.05–0.25/night (weekdays only). See NIGHTLY_INTEGRATION_PLAN.md §6 for cost management strategies.

Test plan

- Trigger workflow_dispatch with tier: tier1 to validate hook-wiring jobs (no API keys needed)
- Add the ANTHROPIC_API_KEY secret and trigger tier: both to validate Claude Code Tier 2 end-to-end
- Review NIGHTLY_INTEGRATION_PLAN.md §13 before enabling the nightly schedule

🤖 Generated with Claude Code
workflow_dispatchwithtier: tier1to validate hook-wiring jobs (no API keys needed)ANTHROPIC_API_KEYsecret and triggertier: bothto validate Claude Code Tier 2 end-to-endNIGHTLY_INTEGRATION_PLAN.md§13 before enabling the nightly schedule🤖 Generated with Claude Code