Merged
137 changes: 137 additions & 0 deletions .claude/skills/validate-docs/SKILL.md
@@ -0,0 +1,137 @@
---
name: validate-docs
description: Validate README features are documented in specs and covered by e2e tests. Use when checking documentation coverage or before merging docs changes.
---

# Validate Documentation

Validate that ALL features claimed in README are documented in specs and tested.

## Philosophy

The README is the product's promise to users. Every feature advertised must be:

1. Specified in `docs/specs/` (source of truth for behavior)
2. Tested in `frontend/tests/` (proof it works)

## Workflow

### Step 1: Extract ALL features from README

Read README.md and identify every feature claim. Look in:

- **How It Works** section - core functionality
- **UI** section - user-facing features
- **Agent Workflow** section - automation features
- Any diagrams, bullet points, or descriptions that promise functionality

For each feature, note:

- What it claims to do
- Whether it has a linked spec (some won't)
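Extraction is ultimately a judgment call, but a rough first pass can be scripted. A minimal sketch (the sample README content below is made up for illustration):

```bash
# Sketch: list "## " section headings as candidate feature areas to review.
# A real run would point at the repo's README.md instead of this sample.
readme=$(mktemp)
cat > "$readme" <<'EOF'
## How It Works
- **Sessions**: background AI work
## UI
Layout notes here.
EOF
grep -E '^## ' "$readme"
```

This only surfaces headings; bullet points and diagram labels still need a manual read.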

### Step 2: Map features to specs

Check if each README feature has a corresponding spec:

| README Feature | Expected Spec |
| ------------------------------------------ | ------------------------------------------------ |
| Main thread conversation | `docs/specs/chat.md` |
| Sessions/background work | `docs/specs/sessions.md` |
| Mobile/desktop layout | `docs/specs/layout.md` |
| Agent workflow (spawn → work → PR → close) | `docs/specs/agent-workflow.md` |
| Notifications | `docs/specs/sessions.md` (notifications section) |

**Flag any README feature without a spec as ERROR.**
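The mapping check itself is mechanical; a self-contained sketch (spec paths and feature names are illustrative, and a throwaway directory stands in for `docs/specs/`):

```bash
# Sketch: flag any README feature whose expected spec file is missing.
specs=$(mktemp -d)
touch "$specs/chat.md" "$specs/sessions.md"   # pretend only these specs exist

check() {
  # $1 = README feature name, $2 = expected spec path
  if [ -f "$2" ]; then
    echo "OK: $1"
  else
    echo "ERROR: $1 has no spec"
  fi
}

check "Main thread conversation" "$specs/chat.md"
check "Agent workflow" "$specs/agent-workflow.md"
```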

### Step 3: Verify specs have testable assertions

For each spec file:

- Check it contains specific, testable statements
- Assertions should mention exact UI text, behaviors, or states
- Vague specs like "works well" are not testable
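For contrast, a testable assertion names observable behavior; the wording below is illustrative, not taken from the actual specs:

```markdown
<!-- Testable: exact UI text tied to a concrete state -->
- The session list shows "No sessions yet" when no sessions exist

<!-- Not testable: nothing observable to check -->
- Sessions work well on mobile
```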

### Step 4: Map spec assertions to tests

For each bullet point/assertion in a spec, search for a test that verifies it:

```bash
# Example: search for test covering "No sessions yet" message
grep -r "No sessions yet" frontend/tests/
```

Track coverage for each spec assertion.
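The per-assertion search loops naturally. A self-contained sketch (the phrases and file contents are stand-ins, and a throwaway directory replaces `frontend/tests/`):

```bash
# Sketch: mark each spec phrase COVERED or MISSING via a grep over the tests.
tests=$(mktemp -d)
echo 'await expect(page.getByText("No sessions yet")).toBeVisible()' \
  > "$tests/sessions.spec.ts"

for phrase in "No sessions yet" "Messages persist across reloads"; do
  if grep -rq "$phrase" "$tests"; then
    echo "COVERED: $phrase"
  else
    echo "MISSING: $phrase"
  fi
done
```

Literal grep is a conservative proxy: a test can cover an assertion without quoting it verbatim, so MISSING results deserve a manual look before being reported as errors.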

### Step 5: Output report

```markdown
## Documentation Validation Report

### README Features → Specs

- [x] Main thread conversation → docs/specs/chat.md
- [x] Sessions → docs/specs/sessions.md
- [x] Layout → docs/specs/layout.md
- [ ] **Agent Workflow → NO SPEC** ← ERROR

### Spec Assertions → Tests

#### docs/specs/chat.md (7/9 = 78%)

- [x] "Input field with placeholder" → e2e/user-journey.spec.ts
- [ ] "Messages persist across reloads" → NO TEST

#### docs/specs/sessions.md (23/23 = 100%)

- [x] "No sessions yet" → sessions/01-session-list-empty.spec.ts
...

### Summary

| Category | Coverage |
| -------------------------- | ----------- |
| README features with specs | 3/4 (75%) |
| Spec assertions with tests | 38/40 (95%) |

### Errors

1. README "Agent Workflow" section has no spec
2. chat.md "Messages persist across reloads" has no test
```

### Step 6: Exit status

- **ERROR** if any README feature lacks a spec
- **ERROR** if any spec assertion lacks test coverage
- **SUCCESS** only if fully covered
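Wired together, the exit logic might look like this (the counts are placeholders for the numbers gathered in Steps 2 and 4):

```bash
# Sketch: non-zero status when any coverage gap was found.
features_without_specs=1    # from Step 2
assertions_without_tests=2  # from Step 4

if [ "$features_without_specs" -gt 0 ] || [ "$assertions_without_tests" -gt 0 ]; then
  status=1
  echo "ERROR: documentation coverage gaps found"
else
  status=0
  echo "SUCCESS: fully covered"
fi
echo "exit status: $status"
```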

## Common Gaps to Watch For

1. **Advertised but unspecified** - README promises feature, no spec exists
2. **Specified but untested** - Spec describes behavior, no test verifies it
3. **Workflow features** - Multi-step flows (like agent lifecycle) often lack e2e coverage

## Tests to Flag for Removal

Not all tests are good tests. Flag these as problems:

1. **Tests without specs** - If a test exists but no spec describes the behavior, either:
- The spec is missing (add it), or
- The test is testing implementation details (remove it)

2. **Too technical / low-level** - Tests should verify user-facing behavior, not internals:
- ❌ "store updates when API returns data"
- ❌ "component re-renders on state change"
- ✅ "user sees updated session status after refresh"

3. **Testing implementation, not behavior** - Tests coupled to code structure:
- ❌ "calls fetchSessions with correct params"
- ❌ "dispatches ACTION_TYPE to store"
- ✅ "sessions list shows new session after spawning"

4. **Duplicate coverage** - Multiple tests verifying the same user behavior

5. **Tests for removed features** - Spec was removed but test remains

The test suite should read like a user manual, not a code audit.
10 changes: 6 additions & 4 deletions CLAUDE.md
@@ -46,9 +46,10 @@ make test-reset # Clear DB + namespaces between runs

**Playwright agents for test maintenance:**

- Specs in `docs/specs/` are the source of truth for app behavior
- Use `playwright-test-planner` to generate test plans from specs
- Use `playwright-test-generator` to create tests from plans
- Use `playwright-test-healer` to auto-fix failing tests

### Test Architecture (Flakiness Prevention)

@@ -115,6 +116,7 @@ Tests are organized into projects by execution mode:
- **Svelte 5 runes**: `$state`, `$derived`, `$effect`, `$props`
- **API calls**: Use `$lib/api.ts`, never hardcode URLs
- **DBOS workflows**: Bump `WORKFLOW_VERSION` in `dbos_config.py` when changing workflow logic
- **External docs**: Use context7 MCP to fetch up-to-date documentation for any library (DBOS, Svelte, Playwright, etc.)
- **HTML**: Be explicit, don't rely on browser defaults (`type="button"`, `rel="noopener"`, etc.)
- **Responsive layouts**: Use `isMobile` store to conditionally render, not CSS hide (avoids duplicate DOM elements)
- **K8s scripts**: Always use explicit `--context kind-${KIND_CLUSTER_NAME:-mainloop-test}` in kubectl commands to avoid targeting wrong cluster
@@ -129,10 +131,10 @@ make setup-claude-creds # Extract Claude credentials from Keychain
## Documentation Philosophy

```text
docs/specs/*.md → playwright-test-planner → tests/*.spec.ts
```

Specs are the source of truth. They describe user-visible behavior in human-readable form, detailed enough for `playwright-test-planner` to generate tests. If tests fail: spec is wrong, code is wrong, or feature is in active development.

## Important

48 changes: 19 additions & 29 deletions README.md
@@ -36,7 +36,7 @@ You (phone/laptop)
- **Main thread**: One continuous conversation — sessions spawn inline and surface results back
- **Sessions**: Background AI work with their own conversations; appear as colored threads in your timeline
- **Notifications**: Slack-style thread replies notify you when sessions need attention or complete
- **Persistence**: Conversations and sessions survive restarts via compaction + [DBOS](https://docs.dbos.dev/)

## Quick Start

@@ -100,48 +100,38 @@ mainloop/

## Agent Workflow

Agents are sessions spawned for development tasks. Each agent gets its own K8s namespace for isolated iteration.

```text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Spawn    │────►│    Work     │────►│     PR      │────►│    Close    │
│   (main)    │     │  (k8s ns)   │     │  (GitHub)   │     │  (summary)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       ▲                                                            │
       └────────────────────────────────────────────────────────────┘
                          check in / spawn more
```

### Phases

1. **Spawn** - Main thread creates agent for a task
2. **Work** - Agent iterates in its own K8s namespace (build, test, debug)
3. **PR** - Agent creates and merges GitHub PR when ready
4. **Close** - Agent posts summary back to main thread

You stay in the main thread, checking in on agents and spawning new ones as needed.

## Documentation

**Specs** (source of truth for app behavior):

- [Chat](docs/specs/chat.md) - Main thread conversation
- [Sessions](docs/specs/sessions.md) - Background work and status
- [Layout](docs/specs/layout.md) - Mobile and desktop views

**Guides**:

- [Architecture](docs/architecture.md) - System design and data flow
- [Development](docs/development.md) - Local setup and commands
- [DBOS Workflows](docs/DBOS.md) - Durable task orchestration
- [Contributing](CONTRIBUTING.md) - How to contribute

## License

30 changes: 30 additions & 0 deletions ROADMAP.md
@@ -0,0 +1,30 @@
# Roadmap

Future ideas and features under consideration.

## CI Loop Automation

Agent automatically iterates on CI failures:

- Poll GitHub Actions after each push
- On failure: analyze logs, fix, commit
- Continue until green checkmark

## GitHub Issue Planning

Agent creates/updates issues before coding:

- Problem analysis and proposed approach
- Implementation plan as "thinking out loud" space
- Links PR back to issue for context

## Project Template

Standardized repo setup for mainloop agents:

| Component | Purpose |
| -------------- | ------------------------------------ |
| GitHub Actions | CI pipeline (lint, type-check, test) |
| K8s/Helm | Preview environments per PR |
| CNPG operator | Dynamic test databases |
| trunk.yaml | Unified linter config |