Auto-generate agent-optimized CLI docs from --help output — verified, compressed, ready for AGENTS.md
AI agents guess at CLI flags from training data instead of reading accurate docs. Hand-written tool docs go stale as CLIs change. A typical --help page is 48KB — that's ~12K tokens per context load.
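The ~12K figure follows from the rough heuristic of ~4 bytes per token (an approximation, not an exact tokenizer count):

```shell
bytes=49152           # ~48KB of --help text
echo $((bytes / 4))   # ≈ 12288 tokens at ~4 bytes per token
```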
A three-stage pipeline turns raw --help output into a verified, compressed skill doc:
```
CLI --help
    ↓
generate → raw docs + structured JSON (~48KB)
    ↓
distill  → agent-optimized SKILL.md (~2KB)
    ↓
validate → multi-model score 9/10+
    ↓
SKILL.md → drop into AGENTS.md, CLAUDE.md, OpenClaw skills
```
The generate step (extracting --help output) works without an LLM. The distill and validate steps require one.
If any of these are installed and logged in, it just works — no config needed:
- Claude Code — `npm install -g @anthropic-ai/claude-code` → `claude /login`
- Codex CLI — `npm install -g @openai/codex` → `codex login`
- Gemini CLI — `npm install -g @google/gemini-cli` → `gemini` (auth on first run)
Verify you're logged in:
```shell
echo 'say ok' | claude -p --output-format text   # should print "ok"
echo 'say ok' | codex exec                       # should print "ok"
gemini -p 'say ok'                               # should print "ok"
```

The tool tries them in this order: Claude Code → Codex → Gemini. The first one found on your PATH is used.
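The auto-detection amounts to a PATH probe in that priority order. A minimal sketch (the real implementation may differ):

```shell
# Probe for the first installed provider CLI, in priority order
for cli in claude codex gemini; do
  if command -v "$cli" >/dev/null 2>&1; then
    echo "using $cli"
    break
  fi
done
```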
Set any of these environment variables and the tool will call the API directly:
```shell
export ANTHROPIC_API_KEY=sk-ant-...   # → uses Anthropic API
export OPENAI_API_KEY=sk-...          # → uses OpenAI API
export GEMINI_API_KEY=...             # → uses Google Gemini API
export OPENROUTER_API_KEY=sk-or-...   # → uses OpenRouter API
```

API keys are checked only if no CLI is found. Each provider uses a sensible default model (e.g., claude-opus-4-6 for Anthropic, gpt-5.2 for OpenAI).
To pin a specific provider and model, create `~/.agent-tool-docs/config.yaml`:

```yaml
provider: claude-cli     # claude-cli | codex-cli | gemini-cli | anthropic | openai | gemini | openrouter
model: claude-opus-4-6   # optional — overrides the provider's default model
apiKey: sk-ant-...       # optional — overrides the env var for this provider
```

The config file takes priority over auto-detection. You can also override per-run with `--model <model>`.
For validation, `--models <m1,m2>` accepts a comma-separated list to test across multiple models.
```shell
# npm
npx agent-tool-docs run railway

# pnpm
pnpx agent-tool-docs run railway

# bun
bunx agent-tool-docs run railway

# Homebrew (macOS / Linux)
brew tap BrennerSpear/tap
brew install agent-tool-docs
tool-docs run railway
```

```shell
# Full pipeline in one shot: generate → distill → validate
tool-docs run railway

# Your agent-optimized skill is at ~/.agents/skills/railway/SKILL.md
```

Drop `~/.agents/skills/railway/SKILL.md` into your AGENTS.md, CLAUDE.md, or OpenClaw skills directory. Your agent now has verified docs instead of guessing from training data.
You can also run each step individually:
```shell
tool-docs generate railway   # extract raw docs from --help
tool-docs distill railway    # compress into agent-optimized SKILL.md
tool-docs validate railway   # score quality with multi-model evaluation
```

Railway v4 overhauled its CLI — models trained on v3 still hallucinate `railway run` for deployments and miss the new `variable set` subcommand syntax. Here's the generated SKILL.md (~1.5KB, distilled from 52KB of --help):
# Railway CLI
Deploy and manage cloud applications with projects, services, environments, and databases.
## Critical Distinctions
- `up` uploads and deploys your code from the current directory
- `deploy` provisions a *template* (e.g., Postgres, Redis) — NOT for deploying your code
- `run` executes a local command with Railway env vars injected — it does NOT deploy anything
## Quick Reference
```shell
railway up                     # Deploy current directory
railway up -s my-api           # Deploy to specific service
railway logs -s my-api         # View deploy logs
railway variable set KEY=VAL   # Set env var
railway connect                # Open database shell (psql, mongosh, etc.)
```
## Key Commands
| Command | Purpose |
|---------|---------|
| `up [-s service] [-d]` | Deploy from current dir; `-d` to detach from log stream |
| `variable set KEY=VAL` | Set env var; add `--skip-deploys` to skip redeployment |
| `variable list [-s svc]` | List variables; `--json` for JSON output |
| `link [-p project] [-s svc]` | Link current directory to a project/service |
| `service status` | Show deployment status across services |
| `logs [-s service]` | View build/deploy logs |
| `connect` | Open database shell (auto-detects Postgres, MongoDB, Redis) |
| `domain` | Add custom domain or generate a Railway-provided domain |
## Common Patterns
Deploy with message: `railway up -m "fix auth bug"`
Set var without redeploying: `railway variable set API_KEY=sk-123 --skip-deploys`
Stream build logs then exit: `railway up --ci`
Run local dev with Railway env: `railway run npm start`

See `examples/` for real generated output for railway, jq, gh, curl, ffmpeg, and rg.
Runs each tool's `--help` (and subcommand help) with `LANG=C NO_COLOR=1 PAGER=cat` for stable, deterministic output. Parses usage lines, flags, subcommands, examples, and env vars into structured JSON + Markdown. Stores a SHA-256 hash for change detection.
```
~/.agents/docs/tool-docs/<tool-id>/
  tool.json     # structured parse
  tool.md       # rendered markdown
  commands/     # per-subcommand docs
    <command>/
      command.json
      command.md
```
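The change-detection idea boils down to hashing the normalized help text and comparing against a stored hash. A sketch, where the stored-hash filename is an assumption, not the tool's actual layout:

```shell
# Stand-in for `LANG=C tool --help` output (kept deterministic for the demo)
help_text="usage: demo [--flag]"
new_hash=$(printf '%s\n' "$help_text" | sha256sum | cut -d' ' -f1)

# Hypothetical stored-hash location
stored=~/.agents/docs/tool-docs/demo/help.sha256
if [ "$(cat "$stored" 2>/dev/null)" != "$new_hash" ]; then
  echo "help output changed: regenerate"
fi
```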
Passes raw docs to an LLM with a task-focused prompt. Output is a SKILL.md optimized for agents: quick reference, key flags, common patterns. Target size ~2KB. Skips re-distillation if help output is unchanged.
Requires one of the LLM providers from Prerequisites (CLI or API key). Default model: claude-opus-4-6; override with `--model`.
Runs scenario-based evaluation across multiple LLM models. Each model attempts realistic tasks using only the SKILL.md, then scores itself 1–10 on accuracy, completeness, and absence of hallucinations. Threshold: 9/10.
```shell
tool-docs validate railway --models claude-sonnet-4-6,claude-opus-4-6 --threshold 9
```
Re-runs generate + distill only for tools whose `--help` output has changed (by hash). Use `--diff` to see what changed in the SKILL.md.
```shell
tool-docs refresh --diff
```

| Tool | Binary | Category |
|---|---|---|
| ✅ Git | `git` | Version control |
| ✅ GitHub CLI | `gh` | Code review / CI |
| ✅ ripgrep | `rg` | Search |
| ✅ jq | `jq` | JSON processing |
| ✅ curl | `curl` | HTTP requests |
| ✅ uv | `uv` | Package management |
| ✅ uvx | `uvx` | Ephemeral tool runner |
| ✅ Railway CLI | `railway` | Cloud deployment |
| ✅ Vercel CLI | `vercel` | Frontend deployment |
| ✅ Supabase CLI | `supabase` | Database / backend |
| ✅ FFmpeg | `ffmpeg` | Video / audio processing |
| ✅ Claude Code | `claude` | AI coding agent |
| ✅ agent-browser | `agent-browser` | Browser automation |
| ✅ Ralphy | `ralphy` | AI coding loop runner |
Works with any CLI that has `--help` output. Add custom tools via a registry entry (see Configuration).
Skills are evaluated by asking an LLM to complete realistic tasks using only the generated SKILL.md. Each scenario is graded 1–10 for correctness and absence of hallucinations.
Example report for railway:
```
validate railway (claude-sonnet-4-6, claude-opus-4-6)

claude-sonnet-4-6 average: 9.3/10
  Scenario 1: "deploy the current directory to a specific service" → 10/10
  Scenario 2: "set an env var without triggering a redeploy" → 9/10
  Scenario 3: "connect to the project's Postgres database" → 9/10

claude-opus-4-6 average: 9.7/10
  Scenario 1: "deploy the current directory to a specific service" → 10/10
  Scenario 2: "set an env var without triggering a redeploy" → 10/10
  Scenario 3: "connect to the project's Postgres database" → 9/10

overall: 9.5/10 — PASSED (threshold: 9)
```
If validation fails, `--auto-redist` re-runs distillation with the evaluation feedback so you can re-validate.
```
~/.agents/skills/<tool-id>/
  SKILL.md             # compressed, agent-optimized (drop into AGENTS.md)
  docs/
    advanced.md        # extended reference
    recipes.md         # common patterns
    troubleshooting.md
```
SKILL.md is the primary file — small enough to include inline in any agent system prompt. The docs/ subfolder holds overflow content for tools with complex help text.
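One way to include it inline is a plain append. A minimal sketch, assuming the layout above and an AGENTS.md in the current directory (the target file is up to you):

```shell
# Append the generated skill to AGENTS.md, if it exists
skill="$HOME/.agents/skills/railway/SKILL.md"
if [ -f "$skill" ]; then
  printf '\n' >> AGENTS.md   # blank-line separator
  cat "$skill" >> AGENTS.md
fi
```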
For batch operations across many tools, use a registry file at `~/.agents/tool-docs/registry.yaml`:
```shell
tool-docs init   # create a starter registry with common tools
tool-docs run    # full pipeline for all registry tools
```

You can also run individual steps across the registry:

```shell
tool-docs generate   # extract docs for all registry tools
tool-docs distill    # distill all into agent-optimized skills
```

Registry format:
```yaml
version: 1
tools:
  - id: jq
    binary: jq
    displayName: jq (JSON processor)
    category: cli
    homepage: https://jqlang.github.io/jq
    useCases:
      - filter and transform JSON data
      - extract fields from API responses
  - id: git
    binary: git
    displayName: Git
    helpArgs: ["-h"]
    commandHelpArgs: ["help", "{command}"]
    useCases:
      - version control and branching
```

Fields:
| Field | Required | Description |
|---|---|---|
| `id` | yes | Unique identifier, used as directory name |
| `binary` | yes | Executable name on PATH |
| `helpArgs` | no | Args to invoke help (default: `["--help"]`) |
| `commandHelpArgs` | no | Args for subcommand help; `{command}` is replaced |
| `useCases` | no | Hints for the distillation prompt |
| `enabled` | no | Set `false` to skip a tool without removing it |
Run `tool-docs generate --only jq` to process a single tool.
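The `{command}` placeholder in `commandHelpArgs` is plain string substitution. A sketch of the idea (not the tool's actual code):

```shell
# For git, commandHelpArgs ["help", "{command}"] expands per subcommand:
template='help {command}'
subcommand='commit'
args=$(printf '%s' "$template" | sed "s/{command}/$subcommand/")
echo "$args"   # → help commit
```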
```shell
tool-docs run <binary>   # full pipeline, score must be ≥ 9/10
```

Or add an entry to `~/.agents/tool-docs/registry.yaml` for batch operations with custom `helpArgs`.
```shell
bun test
bun run build   # outputs bin/tool-docs.js
```

MIT