Conversation
Implements a two-tier nightly GitHub Actions workflow that verifies git-ai hooks fire correctly with real agent CLI binaries (Claude Code, Codex, Gemini CLI, Droid, OpenCode) on both stable and latest releases.

Tier 1 (no API keys): installs each agent CLI, runs `git-ai install`, verifies hook config files contain the correct checkpoint commands, then exercises the full attribution pipeline with synthetic checkpoint data via the agent-v1 preset.

Tier 2 (live, requires API keys): runs each agent with a deterministic prompt in a test repo and verifies authorship notes and blame output.

New files:
- .github/workflows/nightly-agent-integration.yml
- scripts/nightly/verify-hook-wiring.sh
- scripts/nightly/test-synthetic-checkpoint.sh
- scripts/nightly/test-live-agent.sh
- scripts/nightly/verify-attribution.sh

Hook config paths verified against src/mdm/agents/*.rs:
- claude: ~/.claude/settings.json
- codex: ~/.codex/config.toml
- gemini: ~/.gemini/settings.json
- droid: ~/.factory/settings.json
- opencode: ~/.config/opencode/plugin/git-ai.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
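The Tier 1 hook-wiring check can be sketched in Python (the real check lives in scripts/nightly/verify-hook-wiring.sh as shell; the `hook_configured` helper and the `base` parameter are illustrative, while the paths mirror the list above):

```python
from pathlib import Path

# Agent -> hook config file, mirroring the paths verified against
# src/mdm/agents/*.rs in the commit above.
HOOK_CONFIGS = {
    "claude": "~/.claude/settings.json",
    "codex": "~/.codex/config.toml",
    "gemini": "~/.gemini/settings.json",
    "droid": "~/.factory/settings.json",
    "opencode": "~/.config/opencode/plugin/git-ai.ts",
}

def hook_configured(agent: str, base: Path = Path.home()) -> bool:
    """True if the agent's config file exists and mentions the checkpoint command."""
    rel = HOOK_CONFIGS[agent].replace("~/", "")
    path = base / rel
    return path.is_file() and "checkpoint" in path.read_text()
```

The `base` parameter exists only so the check is testable against a temporary directory instead of the real home directory.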
Neither file belongs in the repo: .mcp.json is local tooling config and the plan document was a design scratch pad, not a deliverable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. scripts/nightly/test-synthetic-checkpoint.sh: fix the transcript message schema in the synthetic checkpoint JSON payload. The Rust Message enum uses `#[serde(tag = "type", rename_all = "snake_case")]`, so messages require `"type"` and `"text"` fields — not `"role"` and `"content"`. The old schema caused deserialization to fail on every Tier 1 run.

2. .github/workflows/nightly-agent-integration.yml: fix the notify-on-failure condition. With `if: failure()`, GitHub Actions skips the job entirely when tier2-live-integration is skipped (e.g. on tier1-only runs), silently swallowing Tier 1 failures. Replace it with an explicit always() guard that checks each dependency's result individually.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
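The schema fix can be illustrated with a small sketch (the variant names and the payload envelope here are assumptions for illustration; only the `"type"`/`"text"` field requirement comes from the serde attributes quoted above):

```python
import json

# serde's #[serde(tag = "type", ...)] produces an internally tagged enum:
# each message carries a "type" discriminant plus the variant's own fields
# ("text" here) — not chat-style "role"/"content" pairs.
broken = {"role": "user", "content": "add a fibonacci function"}  # fails to deserialize
fixed = [
    {"type": "user", "text": "add a fibonacci function"},
    {"type": "assistant", "text": "Added fibonacci() to utils/math_utils.py"},
]
payload = json.dumps({"transcript": {"messages": fixed}})
print(payload)
```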
Add a pull_request `labeled` event trigger so the full nightly suite runs whenever someone applies the 'Integration' label to any PR — in addition to the existing nightly schedule and workflow_dispatch paths. The gate condition on the resolve-versions job ensures the downstream matrix jobs only run for the correct trigger, not for every label event. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The label is 'integration', not 'Integration'. GitHub label names are case-sensitive in Actions expressions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the placeholder hello.txt smoke test with real end-to-end tests that verify git-ai's entire attribution pipeline.

test-live-agent.sh:
- Seeds the test repo with a real Python module (utils/math_utils.py) containing add, subtract, and is_prime functions
- Runs the real agent CLI with a substantive prompt: add a fibonacci function using an iterative approach and commit it
- Falls back to a manual commit if the agent wrote code but didn't commit (the post-commit hook still fires and writes the authorship note as long as working log data was captured during the agent run)
- Idempotent across retry attempts

verify-attribution.sh:
- Checks the fibonacci function was actually added to the Python file
- Verifies ≥3 commits exist (initial + seed + agent)
- Fetches and parses the authorship note from refs/notes/ai
- Asserts schema_version = "authorship/3.0.0"
- Asserts at least one prompt session was recorded (hard fail)
- Fuzzy-matches agent_id.tool against the agent name
- Checks transcript messages were captured
- Verifies utils/math_utils.py appears in the attestation section
- Runs git-ai blame and checks AI attribution on fibonacci lines
- Saves all artefacts (raw note, parsed metadata, blame output) to RESULTS_DIR for upload

Workflow: increase the Tier 2 job timeout from 25→45 min and the retry timeout from 12→20 min to accommodate seeding plus real agent API calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
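The note checks can be sketched as a pure function over the raw note JSON (the schema_version value comes from the check described above; the "prompt_sessions" and "agent_id" field names are assumptions about the note's shape):

```python
import json

def check_authorship_note(raw: str) -> list:
    """Return a list of failure messages for a raw authorship note (assumed shape)."""
    failures = []
    meta = json.loads(raw)
    if meta.get("schema_version") != "authorship/3.0.0":
        failures.append("wrong schema_version")
    if not meta.get("prompt_sessions"):
        failures.append("no prompt session recorded")  # hard fail in the real script
    return failures

sample = json.dumps({
    "schema_version": "authorship/3.0.0",
    "prompt_sessions": [{"agent_id": {"tool": "claude"}}],
})
print(check_authorship_note(sample))  # []
```

Returning a failure list rather than exiting early lets a runner log every problem in one pass before failing the job.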
The install-scripts-local workflow does more than validate install scripts — it verifies full end-to-end hook wiring between git-ai and Claude Code. Rename the workflow and job names to reflect that. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the fake claude binary stub with real npm-installed agent CLIs and add a matrix covering all four supported agents. This makes the End-to-End tests meaningful: install.sh now runs git-ai install-hooks against actual agent binaries, which auto-detect the installed tool and write real hook configuration to each agent's config directory. Verification uses the existing verify-hook-wiring.sh script (Unix) and equivalent inline PowerShell checks (Windows) to confirm hooks were written to the correct agent-specific location. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs in the E2E test setup:

1. opencode npm package: the package is "opencode-ai", not "opencode". The bare "opencode" name returns a 404 from the npm registry. Fixed in both the E2E install workflow and the nightly agent integration workflow.

2. codex hook verification: the grep pattern "checkpoint codex" expects a JSON-style command string, but Codex config uses a TOML array whose elements are comma-separated: notify = ["<bin>", "checkpoint", "codex", ...]. Changed to grep for just "checkpoint", which appears in the array and is sufficient to confirm the hook is configured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
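The grep mismatch is easy to reproduce with substring checks (the config snippets below are illustrative; the binary path is hypothetical):

```python
# In TOML the command is an array of separate quoted strings, so the words
# "checkpoint" and "codex" are never adjacent in the file — the old
# two-word pattern cannot match.
codex_toml = 'notify = ["/opt/git-ai/git-ai", "checkpoint", "codex"]'
claude_json = '{"command": "/opt/git-ai/git-ai checkpoint claude"}'

assert "checkpoint codex" not in codex_toml    # old pattern: never matches
assert "checkpoint" in codex_toml              # relaxed pattern: matches
assert "checkpoint claude" in claude_json      # JSON-style configs keep words adjacent
print("patterns behave as described")
```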
The same TOML array format issue that was fixed in verify-hook-wiring.sh for Unix also affects the Windows inline PowerShell check. Codex stores its hook as a TOML array (notify = ["<bin>", "checkpoint", "codex", ...]) so Select-String for "checkpoint codex" never matches. Changed to match just "checkpoint". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n verify-attribution.sh The `[ $? -eq 0 ] || fail "..."` guard was dead code under `set -euo pipefail`: if the python3 heredoc exits with code 1, `set -e` terminates the script immediately before the guard is reached, producing a silent exit with no diagnostic logged to $LOG. Replace with `if ! python3 ... <<'PYEOF' ... then fail "..." fi`, which is exempt from `set -e` and ensures the descriptive failure message is written to $LOG before exiting. Resolves Devin review comment BUG_pr-review-job-8b70596b_0002. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
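The dead-guard behaviour can be demonstrated by driving bash from Python (assuming bash is on PATH; the echo strings are illustrative):

```python
import subprocess

# Under `set -e`, a failing command aborts the script before a trailing
# `[ $? -eq 0 ] || ...` guard ever runs; an `if ! cmd; then ... fi`
# wrapper is exempt from `set -e`, so its branch does run.
dead_guard = 'set -euo pipefail\nfalse\n[ $? -eq 0 ] || echo "guard reached"\n'
live_guard = 'set -euo pipefail\nif ! false; then echo "guard reached"; fi\n'

dead = subprocess.run(["bash", "-c", dead_guard], capture_output=True, text=True)
live = subprocess.run(["bash", "-c", live_guard], capture_output=True, text=True)
print(repr(dead.stdout))  # '' — the script died on `false`, guard skipped
print(repr(live.stdout))  # 'guard reached\n'
```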
The Tier 1 and Tier 2 nightly jobs were calling `git-ai install` to set up agent hooks, but never creating the `git` → `git-ai` symlink in the release directory. When test scripts called `git commit`, the system git ran instead of the git-ai proxy, so the post-commit hook never fired and no authorship note was written to refs/notes/ai. Add `ln -sf .../git-ai .../git` in both the Tier 1 and Tier 2 "Install git-ai hooks in test repo" steps so that all `git` invocations inside test scripts (which prepend the release dir to PATH) route through git-ai and trigger the expected hook behaviour. Resolves Devin review comment BUG_pr-review-job-bf54cac596f44273b5f8565f81a63daf_0001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
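The symlink routing can be sketched as follows (paths and the stub script are hypothetical stand-ins for the real release dir and git-ai binary; POSIX only):

```python
import os
import subprocess
import tempfile

# A `git` symlink pointing at the git-ai binary inside the release dir
# means any `git` invocation from a script that prepends that dir to PATH
# routes through git-ai instead of the system git.
release = tempfile.mkdtemp()
proxy = os.path.join(release, "git-ai")
with open(proxy, "w") as f:
    f.write("#!/bin/sh\necho git-ai proxy\n")  # stand-in for the real binary
os.chmod(proxy, 0o755)
os.symlink(proxy, os.path.join(release, "git"))  # ln -sf .../git-ai .../git

env = dict(os.environ, PATH=release + os.pathsep + os.environ.get("PATH", ""))
out = subprocess.run(["git", "status"], env=env, capture_output=True, text=True)
print(out.stdout.strip())  # the proxy answered, not system git
```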
The previous Lint (ubuntu-latest) check failed on `go-task/setup-task@v1` (not on any code change) — the same action passed on the identical commit via e2e-tests. No code changes; forcing a clean CI run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. verify-attribution.sh: guard the empty-string fuzzy match. `"" in "claude"` is True in Python, so a missing agent_id.tool would always report PASS. Added `if tool and (...)` to require a non-empty tool string before the fuzzy match runs. Resolves Devin BUG_pr-review-job-032b242ab75044ebac035a42020d7fe3_0001.

2. test-live-agent.sh: add `sudo` to the ripgrep fallback install. `apt-get install` on GitHub Actions ubuntu-latest requires root; without `sudo` the install failed silently (2>/dev/null || true), leaving `rg` absent and potentially causing the Gemini CLI to hang. Resolves Devin BUG_pr-review-job-6b947f0c5f1e475bb3ffbeba9e6056de_0001.

3. nightly-agent-integration.yml: deduplicate stable/latest matrix entries. `npm view <pkg> version` and `npm view <pkg> dist-tags.latest` return the same value, so the stable and latest channels always tested the same version, doubling CI cost for zero extra coverage. Now queries `dist-tags.next` for the latest channel (pre-release/canary), falling back to stable_ver if no `next` tag exists, and skips the latest entry entirely when it would duplicate stable. Resolves Devin BUG_pr-review-job-6b947f0c5f1e475bb3ffbeba9e6056de_0002.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
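The empty-string footgun and its guard can be shown in a few lines (the `tool_matches` helper is illustrative, not the script's actual function name):

```python
# Python treats "" as a substring of every string, so the original fuzzy
# match reported PASS even when agent_id.tool was missing entirely.
def tool_matches(tool: str, agent: str) -> bool:
    # Guarded version: require a non-empty tool before fuzzy matching.
    return bool(tool) and (tool in agent or agent in tool)

assert ("" in "claude") is True           # the footgun
assert tool_matches("", "claude") is False
assert tool_matches("claude-code", "claude") is True
print("guard behaves as described")
```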
The previous fix queried dist-tags.next for latest_ver but still used @latest in the npm install command, which resolves to the stable release — identical to the stable channel and defeating the entire purpose of the latest matrix entry. Change the npm_pkg construction for the latest channel to use @next so the pre-release/canary version is actually installed when it exists. Resolves Devin BUG_pr-review-job-070479ba6d7041699555d4dfa9779fa3_0001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
npm view <pkg> dist-tags.next exits with code 0 and returns an empty
string (or "undefined") when the tag does not exist in npm 10+, rather
than raising a non-zero exit. This meant CalledProcessError was never
raised, latest_ver was set to "" or "undefined", the dedup check
("" != stable_ver) didn't fire, and a matrix entry was emitted with
npm_pkg="<pkg>@next" — causing npm install to fail with ETARGET.
Add an explicit check after .strip(): if the result is empty or equals
the string "undefined", fall back to stable_ver, triggering the same
deduplication skip as the CalledProcessError path.
Resolves Devin BUG_pr-review-job-874dec7614a64a5e952cf18579ebc182_0001.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```shell
python3 - <<'PY'
import json, subprocess, os

agents = {
    "claude": {"pkg": "@anthropic-ai/claude-code", "key": "ANTHROPIC_API_KEY"},
    "codex": {"pkg": "@openai/codex", "key": "OPENAI_API_KEY"},
    "gemini": {"pkg": "@google/gemini-cli", "key": "GEMINI_API_KEY"},
    "opencode": {"pkg": "opencode-ai", "key": "ANTHROPIC_API_KEY"},
}

headless_cmds = {
    "claude": "claude -p --dangerously-skip-permissions --max-turns 3",
    "codex": "codex exec --full-auto",
    "gemini": "gemini --approval-mode=yolo",
    "opencode": "opencode run --command",
}

matrix = {"include": []}
for agent, info in agents.items():
    try:
        stable_ver = subprocess.check_output(
            ["npm", "view", info["pkg"], "version"],
            text=True, stderr=subprocess.DEVNULL
        ).strip()
        # Try the "next" dist-tag for a pre-release; fall back to stable
        # to avoid doubling CI cost when no canary exists
        try:
            latest_ver = subprocess.check_output(
                ["npm", "view", info["pkg"], "dist-tags.next"],
                text=True, stderr=subprocess.DEVNULL
            ).strip()
            # npm 10+ exits 0 with empty output or "undefined" when the
            # dist-tag doesn't exist, so CalledProcessError is not raised
            if not latest_ver or latest_ver == "undefined":
                latest_ver = stable_ver
        except subprocess.CalledProcessError:
            latest_ver = stable_ver  # No pre-release; skip duplicate
    except subprocess.CalledProcessError:
        print(f"Warning: Could not resolve versions for {info['pkg']}", flush=True)
        stable_ver = "latest"
        latest_ver = "latest"

    for channel in ["stable", "latest"]:
        ver = stable_ver if channel == "stable" else latest_ver
        # Skip the latest channel when it resolves to the same version as
        # stable — no additional coverage, just wastes CI resources
        if channel == "latest" and latest_ver == stable_ver:
            continue
        npm_pkg = f"{info['pkg']}@{ver}" if channel == "stable" else f"{info['pkg']}@next"
        matrix["include"].append({
            "agent": agent,
            "channel": channel,
            "npm_pkg": npm_pkg,
            "version": ver,
            "api_key_var": info["key"],
            "headless_cmd": headless_cmds[agent],
        })

# Droid uses curl installer (latest only, no npm version pinning)
matrix["include"].append({
    "agent": "droid",
    "channel": "latest",
    "npm_pkg": "",
    "version": "latest",
    "api_key_var": "FACTORY_API_KEY",
    "headless_cmd": "droid exec --auto high",
})

with open(os.environ["GITHUB_OUTPUT"], "a") as f:
    f.write(f"matrix={json.dumps(matrix)}\n")

print(f"Matrix built: {len(matrix['include'])} entries", flush=True)
PY
```
🟡 agents workflow_dispatch input is defined but never consumed by the matrix builder
The agents input allows users to specify "claude" or "claude,codex" when manually triggering the workflow, with the documented intent of filtering which agents to test. However, the Python matrix-builder script at lines 44–116 never reads github.event.inputs.agents — it unconditionally builds entries for all four npm agents plus Droid.
Root cause and impact
A user who triggers workflow_dispatch with agents: "claude" expecting to test only Claude will instead run the full matrix (all agents × all channels), wasting CI time and potentially burning API credits in Tier 2. The input parameter at nightly-agent-integration.yml:10-12 has no effect on the matrix output at nightly-agent-integration.yml:44-116.
Actual behavior: All agents are always included in the matrix regardless of the agents input value.
Expected behavior: When agents is not "all", only the specified agents should appear in the matrix.
Prompt for agents
In .github/workflows/nightly-agent-integration.yml, the Python matrix-builder script (lines 44-116) needs to read the agents workflow_dispatch input and filter accordingly. At the top of the Python script (around line 45), read the input:
```python
requested = os.environ.get("INPUT_AGENTS", "all").strip()
requested_set = None if requested == "all" else set(a.strip() for a in requested.split(","))
```
Then, inside the for-loop over agents (around line 62), skip agents not in the requested set:
```python
if requested_set is not None and agent not in requested_set:
    continue
```
Similarly, conditionally include the Droid entry (around line 102) only when requested_set is None or "droid" in requested_set.
You also need to pass the input as an env var to the step. Add to the step at line 41:
```yaml
env:
  INPUT_AGENTS: ${{ github.event.inputs.agents || 'all' }}
```
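The suggested filter can be sketched as a small pure function (the `select_agents` helper and the env-dict parameter are illustrative; the real script would read os.environ directly):

```python
import os

# All agents the matrix builder knows about, in emission order.
all_agents = ["claude", "codex", "gemini", "opencode", "droid"]

def select_agents(env: dict) -> list:
    """Filter agents by a comma-separated INPUT_AGENTS value, 'all' meaning no filter."""
    requested = env.get("INPUT_AGENTS", "all").strip()
    if requested == "all":
        return all_agents
    requested_set = {a.strip() for a in requested.split(",")}
    return [a for a in all_agents if a in requested_set]

print(select_agents({}))                                 # all five agents
print(select_agents({"INPUT_AGENTS": "claude,codex"}))   # ['claude', 'codex']
```

Taking the environment as a parameter keeps the droid special case uniform: it is filtered like any other agent rather than needing a separate conditional.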
Summary
- .github/workflows/nightly-agent-integration.yml — a two-tier nightly workflow that installs real agent CLI binaries and verifies git-ai hook wiring and attribution end-to-end
- scripts/nightly/ — four helper scripts implementing the test logic
- NIGHTLY_INTEGRATION_PLAN.md — documenting the full design rationale and open questions

Test Architecture

Tier 1 — Hook Wiring (no API keys, free)

Builds git-ai from source, installs each agent CLI (Claude Code, Codex, Gemini, Droid, OpenCode) at both stable and latest versions via a dynamic matrix, then:
- runs git-ai install and verifies the correct checkpoint commands appear in each agent's config file
- exercises the full attribution pipeline with synthetic checkpoint data (agent-v1 preset)

Tier 2 — Live Integration (requires API key secrets)

Runs each agent with a minimal deterministic prompt ("create hello.txt, commit it"), then verifies the file was created, a commit landed, and authorship notes are present in refs/notes/ai. Pre-release failures are non-blocking (continue-on-error: true).

Hook config paths (verified against src/mdm/agents/*.rs)

- claude: ~/.claude/settings.json
- codex: ~/.codex/config.toml
- gemini: ~/.gemini/settings.json
- droid: ~/.factory/settings.json
- opencode: ~/.config/opencode/plugin/git-ai.ts

Secrets required (Tier 2 only)

ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, FACTORY_API_KEY, SLACK_BOT_TOKEN, SLACK_CHANNEL_ID. Tier 1 runs without any secrets.

Cost estimate

~$0.05–0.25/night (weekdays only). See NIGHTLY_INTEGRATION_PLAN.md §6 for cost management strategies.

Test plan

- Trigger workflow_dispatch with tier: tier1 to validate hook-wiring jobs (no API keys needed)
- Add the ANTHROPIC_API_KEY secret and trigger tier: both to validate Claude Code Tier 2 end-to-end
- Review NIGHTLY_INTEGRATION_PLAN.md §13 before enabling the nightly schedule

🤖 Generated with Claude Code
workflow_dispatchwithtier: tier1to validate hook-wiring jobs (no API keys needed)ANTHROPIC_API_KEYsecret and triggertier: bothto validate Claude Code Tier 2 end-to-endNIGHTLY_INTEGRATION_PLAN.md§13 before enabling the nightly schedule🤖 Generated with Claude Code