Commit d365c8f

Author: StackMemory Bot (CLI)
chore(gepa): update GEPA evals, fixtures, and optimization state
Add eval task fixtures, update generation variants and results, enhance optimize script and reflect hook.
1 parent 7ca7b3a commit d365c8f

27 files changed (+2138, -542 lines changed)

CLAUDE.md

Lines changed: 32 additions & 0 deletions
@@ -130,6 +130,38 @@ railway up
 # Pre-publish checks require clean git status — stash GEPA files first
 ```
 
+## Task Delegation Model
+
+Route effort by task complexity — not all code changes deserve equal scrutiny:
+
+**AUTOMATE** — Execute immediately, lint+test is sufficient:
+- CRUD operations, boilerplate, formatting, simple transforms
+- Adding a tool handler following existing switch/case pattern
+- Config additions (new env var, feature flag)
+
+**STANDARD** — Normal workflow, lint+test+build:
+- Feature implementation, bug fixes, refactoring
+- New test coverage, documentation updates
+- Integration wiring (adding handler to server.ts dispatch)
+
+**CAREFUL** — Review approach before implementation:
+- API/schema changes, database migrations, auth flows
+- New integration patterns (MCP tools, webhook handlers)
+- Changes to frame-manager, sqlite-adapter, or daemon lifecycle
+- Anything touching error handling chains
+
+**ARCHITECT** — Plan mode required, explore existing patterns first:
+- New service boundaries, system integrations
+- Performance-critical paths (FTS5 queries, search scoring)
+- Breaking changes to MCP protocol or CLI interface
+
+**HUMAN** — Explicit user approval before any changes:
+- Security-critical decisions, secret handling
+- Irreversible operations (data migrations, schema drops)
+- Publishing (npm publish, Railway deploy)
+
+Quality gates scale with tier — don't over-engineer AUTOMATE tasks, don't under-review CAREFUL ones.
+
 ## Workflow
 
 - Check .env for API keys before asking
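
The Task Delegation Model added to CLAUDE.md in the hunk above maps each tier to a set of checks and a precondition. A minimal sketch of that mapping as a lookup table might look like the following. The names `Tier`, `Gate`, `GATES`, `gatesFor`, and `needsApproval` are hypothetical and do not come from the repository, and the checks listed for CAREFUL, ARCHITECT, and HUMAN are an assumption, since the source only names checks for AUTOMATE and STANDARD.

```typescript
// Hypothetical illustration of the tier table; none of these names exist in the repo.
type Tier = "AUTOMATE" | "STANDARD" | "CAREFUL" | "ARCHITECT" | "HUMAN";

interface Gate {
  checks: string[];      // npm scripts to run after the change
  before: string | null; // what must happen before any edit is made
}

const FULL = ["lint", "test", "build"];

const GATES: Record<Tier, Gate> = {
  AUTOMATE: { checks: ["lint", "test"], before: null },
  STANDARD: { checks: FULL, before: null },
  // The source names no checks for the three tiers below; FULL is an assumption.
  CAREFUL: { checks: FULL, before: "review approach" },
  ARCHITECT: { checks: FULL, before: "plan mode, explore existing patterns" },
  HUMAN: { checks: FULL, before: "explicit user approval" },
};

// Look up the gate for a tier.
function gatesFor(tier: Tier): Gate {
  return GATES[tier];
}

// Only the HUMAN tier requires explicit user approval before any change.
function needsApproval(tier: Tier): boolean {
  return tier === "HUMAN";
}
```

Keeping the table as data rather than branching logic means a new tier or a changed check is a one-line edit, which fits the stated goal of scaling quality gates with the tier.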

scripts/gepa/.before-optimize.md

Lines changed: 140 additions & 164 deletions
@@ -1,164 +1,140 @@
-AGENTS.md
-
-Purpose
-- A minimal, agent-friendly reference so code-generation agents (Codex, Claude Code, etc.) can work effectively in this repository.
-- Explains key docs, the /designs/ folder, agent responsibilities, and quick operational notes (how to run tests, what to update, and commit expectations).
-
-Repo doc descriptions
-- docs/PROMPT_PLAN.md
-- The agent-driven plan that sequences work into small, testable prompts and steps.
-- Contains per-step prompts, expected artifacts, tests, rollback/idempotency notes, and a TODO checklist using Markdown checkboxes.
-- This is the canonical agent workflow driver — update it as you make progress (see Agent responsibility rules below).
-
-- docs/DEV_SPEC.md
-- The minimal functional & technical specification that defines APIs, data models, and acceptance criteria.
-- Includes the concise Definition of Done that must be satisfied for each plan step before marking it complete.
-
-- idea.md
-- Free-form brainstorming, assumptions, notes, research links, and open questions.
-- Useful for context but not authoritative — always follow docs/DEV_SPEC.md and docs/PROMPT_PLAN.md for implementation decisions.
-
-- idea_one_pager.md
-- A short summary / one‑pager capturing Problem, Audience, Platform, Core Flow, and MVP Features (and optional Non‑Goals).
-- Good for quick alignment and to confirm that work stays within scope.
-
-What lives in /designs/
-- UI/UX artifacts and visual assets that inform implementation:
-- wireframes (PNG/SVG), Figma exports (.fig, .pdf), sequence diagrams, architecture diagrams (PNG/PDF/SVG), and annotated screenshots.
-- Naming conventions: keep filenames short, include version/date and owner, e.g., dashboard_v1_2025-11-01.png or seq_query_flow_v2.pdf.
-- Large source Figma files may live externally; include an export + a small README describing where the canonical design is stored and any viewing permissions required.
-
-How agents should interact (summary)
-- Treat docs/PROMPT_PLAN.md as the authoritative workflow: follow the listed prompts in order and mark checklist items as you finish them.
-- Always follow TDD: write tests first, make the minimal change to pass tests, then refactor while keeping tests green.
-- After any code/test change, update the matching TODO checkbox in docs/PROMPT_PLAN.md using the same Markdown checkbox format ('- [x]') and commit the change alongside code and tests.
-- Make the smallest change that passes tests and improves code. Do not introduce new public APIs without updating docs/DEV_SPEC.md and tests.
-- Don't duplicate templates/files to work around errors — fix the original.
-- Suggest a clear manual test path for every change (even when tests cover it).
-- If you cannot open a file or content is missing, say so explicitly and stop. Do not guess.
-
-Quick operational commands (expect these to exist; if not, ask)
-- npm run dev — start local dev server
-- npm test — run unit + integration test suite
-- npm run lint — run linting
-- npm run build — build TypeScript
-- npm run migrate:up / migrate:down — database migrations
-
-Commit & PR expectations
-- Each prompt/plan step should result in a single, focused commit/PR with:
-- Code + tests + docs/PROMPT_PLAN.md checklist update.
-- A short, copy-pasteable commit summary in the docs/PROMPT_PLAN.md step completion entry.
-- Clear CHANGELOG or Release notes entry if user-facing behavior changed (or explicitly state "No user-facing changes").
-- Use atomic commits. Include test run results in PR description.
-
-Include this governance / workflow block verbatim (do not modify)
-## Repository docs
-- 'ONE_PAGER.md' - Captures Problem, Audience, Platform, Core Flow, MVP Features; Non-Goals optional.
-- 'docs/DEV_SPEC.md' - Minimal functional and technical specification consistent with prior docs, including a concise **Definition of Done**.
-- 'docs/PROMPT_PLAN.md' - Agent-Ready Planner with per-step prompts, expected artifacts, tests, rollback notes, idempotency notes, and a TODO checklist using Markdown checkboxes. This file drives the agent workflow.
-- 'docs/STYLE.md' - Unified design system reference. Typography, layout, color tokens, component patterns. Inspired by Hatchet (structural layout, inset panels) and Outliner (clean hierarchy, whitespace). **All dashboard UI changes must follow this guide.**
-- 'AGENTS.md' - This file.
-
-### Agent responsibility
-- After completing any coding, refactor, or test step, **immediately update the corresponding TODO checklist item in 'docs/PROMPT_PLAN.md'**.
-- Use the same Markdown checkbox format ('- [x]') to mark completion.
-- When creating new tasks or subtasks, add them directly under the appropriate section anchor in 'docs/PROMPT_PLAN.md'.
-- Always commit changes to 'docs/PROMPT_PLAN.md' alongside the code and tests that fulfill them.
-- Do not consider work "done" until the matching checklist item is checked and all related tests are green.
-- When a stage (plan step) is complete with green tests, update the README "Release notes" section with any user-facing impact (or explicitly state "No user-facing changes" if applicable).
-- Even when automated coverage exists, always suggest a feasible manual test path so the human can exercise the feature end-to-end.
-- After a plan step is finished, document its completion state with a short checklist. Include: step name & number, test results, 'docs/PROMPT_PLAN.md' status, manual checks performed (mark as complete only after the human confirms they ran to their satisfaction), release notes status, and an inline commit summary string the human can copy & paste.
-
-#### Guardrails for agents
-- Make the smallest change that passes tests and improves the code.
-- Do not introduce new public APIs without updating 'docs/DEV_SPEC.md' and relevant tests.
-- Do not duplicate templates or files to work around issues. Fix the original.
-- If a file cannot be opened or content is missing, say so explicitly and stop. Do not guess.
-- Respect privacy and logging policy: do not log secrets, prompts, completions, or PII.
-
-#### Deferred-work notation
-- When a task is intentionally paused, keep its checkbox unchecked and prepend '(Deferred)' to the TODO label in 'docs/PROMPT_PLAN.md', followed by a short reason.
-- Apply the same '(Deferred)' tag to every downstream checklist item that depends on the paused work.
-- Remove the tag only after the work resumes; this keeps the outstanding scope visible without implying completion.
-
-
-
-#### When the prompt plan is fully satisfied
-- Once every Definition of Done task in 'docs/PROMPT_PLAN.md' is either checked off or explicitly marked '(Deferred)', the plan is considered **complete**.
-- After that point, you no longer need to update prompt-plan TODOs or reference 'docs/PROMPT_PLAN.md', 'docs/DEV_SPEC.md', 'idea_one_pager.md', or other upstream docs to justify changes.
-- All other guardrails, testing requirements, and agent responsibilities in this file continue to apply unchanged.
-
-#### On task completion — always suggest next actions
-- When the current task (or set of tasks) is finished, **always** suggest 2-4 concrete next actions the human could take.
-- Pull suggestions from: memory files, branch/git state, plan docs, deploy status, or known blockers.
-- Prioritize by impact: ship-blocking items first, then quick wins, then nice-to-haves.
-- If nothing obvious remains, suggest: commit/push, deploy, test manually, or review related areas.
-
----
-
-## Testing policy (non-negotiable)
-- Tests **MUST** cover the functionality being implemented.
-- **NEVER** ignore the output of the system or the tests - logs and messages often contain **CRITICAL** information.
-- **TEST OUTPUT MUST BE PRISTINE TO PASS.**
-- If logs are **supposed** to contain errors, capture and test it.
-- **NO EXCEPTIONS POLICY:** Under no circumstances should you mark any test type as "not applicable". Every project, regardless of size or complexity, **MUST** have unit tests, integration tests, **AND** end-to-end tests. If you believe a test type doesn't apply, you need the human to say exactly **"I AUTHORIZE YOU TO SKIP WRITING TESTS THIS TIME"**.
-
-### TDD (how we work)
-- Write tests **before** implementation.
-- Only write enough code to make the failing test pass.
-- Refactor continuously while keeping tests green.
-
-**TDD cycle**
-1. Write a failing test that defines a desired function or improvement.
-2. Run the test to confirm it fails as expected.
-3. Write minimal code to make the test pass.
-4. Run the test to confirm success.
-5. Refactor while keeping tests green.
-6. Repeat for each new feature or bugfix.
-
----
-
-## Important checks
-- **NEVER** disable functionality to hide a failure. Fix root cause.
-- **NEVER** create duplicate templates or files. Fix the original.
-- **NEVER** claim something is "working" when any functionality is disabled or broken.
-- If you can't open a file or access something requested, say so. Do not assume contents.
-- **ALWAYS** identify and fix the root cause of template or compilation errors.
-- If git is initialized, ensure a '.gitignore' exists and contains at least:
-
-.env
-.env.local
-.env.*
-
-Ask the human whether additional patterns should be added, and suggest any that you think are important given the project.
-
-## When to ask for human input
-Ask the human if any of the following is true:
-- A test type appears "not applicable". Use the exact phrase request: **"I AUTHORIZE YOU TO SKIP WRITING TESTS THIS TIME"**.
-- Required anchors conflict or are missing from upstream docs.
-- You need new environment variables or secrets.
-- An external dependency or major architectural change is required.
-- Design files are missing, unsupported or oversized
-
-(End of verbatim block)
-
-Minimal examples for checklist updates (copy/pasteable)
-- After completing a prompt step, add an entry under that step in docs/PROMPT_PLAN.md similar to:
-- [x] Step 5 — Implement POST /api/v1/query — tests green — manual checks: cURL example tested — README Release Notes updated — commit: "query: add /api/v1/query route, adapter integration, tests"
-- If pausing work:
-- - [ ] (Deferred) Step 7.3 — Implement real Pinecone adapter — blocked on PINECONE_API_KEY (reason: waiting for dev key from infra)
-
-If anything is missing
-- If you cannot open docs/PROMPT_PLAN.md, docs/DEV_SPEC.md, idea.md, idea_one_pager.md, or any design file, stop and report exactly which file and why (permission/absent/parse error).
-- Ask for required secrets or permissions rather than guessing. Use the "When to ask for human input" rules above.
-
-Contact & escalation
-- When blocked on infra/secrets/design files, create a short note in docs/PROMPT_PLAN.md under the current step and ping the human with:
-- What I need: (e.g., PINECONE_API_KEY, AWS dev creds)
-- Why I need it: (which step/blocker)
-- Recommended minimal next action & fallback
-
-Notes
-- Keep AGENTS.md and the rest of the repo docs in sync. Update this file if workflow expectations change.
-
-End.
+# StackMemory - Project Configuration
+
+## Project Structure
+
+```
+src/
+cli/ # CLI commands and entry point
+core/ # Core business logic
+context/ # Frame and context management
+database/ # Database adapters (SQLite, ParadeDB)
+digest/ # Digest generation
+query/ # Query parsing and routing
+integrations/ # External integrations (Linear, MCP)
+services/ # Business services
+skills/ # Claude Code skills
+utils/ # Shared utilities
+scripts/ # Build and utility scripts
+config/ # Configuration files
+docs/ # Documentation
+```
+
+## Key Files
+
+- Entry: src/cli/index.ts
+- MCP Server: src/integrations/mcp/server.ts
+- Frame Manager: src/core/context/frame-manager.ts
+- Database: src/core/database/sqlite-adapter.ts
+
+## Detailed Guides
+
+Quick reference (agent_docs/):
+- linear_integration.md - Linear sync
+- mcp_server.md - MCP tools
+- database_storage.md - Storage
+- claude_hooks.md - Hooks
+
+Full documentation (docs/):
+- principles.md - Agent programming paradigm
+- architecture.md - Extension model and browser sandbox
+- SPEC.md - Technical specification
+- API_REFERENCE.md - API docs
+- DEVELOPMENT.md - Dev guide
+- SETUP.md - Installation
+
+## Commands
+
+```bash
+npm run build # Compile TypeScript (esbuild)
+npm run lint # ESLint check
+npm run lint:fix # Auto-fix lint issues
+npm test # Run Vitest (watch)
+npm run test:run # Run tests once
+npm run linear:sync # Sync with Linear
+
+# StackMemory CLI
+stackmemory capture # Save session state for handoff
+stackmemory restore # Restore from captured state
+```
+
+## Working Directory
+
+- PRIMARY: /Users/jwu/Dev/stackmemory
+- ALLOWED: All subdirectories
+- TEMP: /tmp for temporary operations
+
+## Validation (MUST DO)
+
+After code changes:
+1. `npm run lint` - fix any errors AND warnings
+2. `npm run test:run` - verify no regressions
+3. `npm run build` - ensure compilation
+4. Run code to verify it works
+
+Test coverage:
+- New features require tests in `src/**/__tests__/`
+- Maintain or improve coverage (no untested code paths)
+- Critical paths: context management, handoff, Linear sync
+
+Never: Assume success | Skip testing | Use mock data as fallback
+
+## Git Rules (CRITICAL)
+
+- NEVER use `--no-verify` on git push or commit
+- ALWAYS fix lint/test errors before pushing
+- If pre-push hooks fail, fix the underlying issue
+- Run `npm run lint && npm run test:run` before pushing
+- Commit message format: `type(scope): message`
+- Branch naming: `feature/STA-XXX-description` | `fix/STA-XXX-description` | `chore/description`
+
+## Task Management
+
+- Use TodoWrite for 3+ steps or multiple requests
+- Keep one task in_progress at a time
+- Update task status immediately on completion
+
+## Security
+
+NEVER hardcode secrets - use process.env with dotenv/config
+
+```javascript
+import 'dotenv/config';
+const API_KEY = process.env.LINEAR_API_KEY;
+if (!API_KEY) {
+  console.error('LINEAR_API_KEY not set');
+  process.exit(1);
+}
+```
+
+Environment sources (check in order):
+1. .env file
+2. .env.local
+3. ~/.zshrc
+4. Process environment
+
+Secret patterns to block: lin_api_* | lin_oauth_* | sk-* | npm_*
+
+## Deploy
+
+```bash
+# npm publish (uses NPM_TOKEN from .env, no OTP needed)
+git stash -- scripts/gepa/ # stash GEPA state (dirties working tree)
+NPM_TOKEN=$(grep '^NPM_TOKEN=' .env | cut -d= -f2) \
+  npm publish --registry https://registry.npmjs.org/ \
+  --//registry.npmjs.org/:_authToken="$NPM_TOKEN"
+git stash pop # restore GEPA state
+
+# Railway
+railway up
+
+# Pre-publish checks require clean git status — stash GEPA files first
+```
+
+## Workflow
+
+- Check .env for API keys before asking
+- Run npm run linear:sync after task completion
+- Use browser MCP for visual testing
+- Review recent commits and stackmemory.json on session start
+- Use subagents for multi-step tasks
+- Ask 1-3 clarifying questions for complex commands (one at a time)
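
The Security section in the file above lists four glob-style secret patterns to block (`lin_api_*`, `lin_oauth_*`, `sk-*`, `npm_*`) without saying how they are enforced. One way such a check could be sketched is a prefix scan over text. The helper name `findSecrets` and the rule that a token needs at least eight trailing characters are assumptions made for illustration; nothing here is taken from the repository.

```typescript
// Prefixes copied from the "Secret patterns to block" line in the diff.
const SECRET_PREFIXES = ["lin_api_", "lin_oauth_", "sk-", "npm_"];

// Assumption: a secret is a prefix followed by 8+ characters from [A-Za-z0-9_-].
// The leading \b keeps words such as "task-runner" from matching the "sk-" prefix.
const SECRET_RE = new RegExp(
  `\\b(?:${SECRET_PREFIXES.join("|")})[A-Za-z0-9_-]{8,}`,
  "g",
);

// Hypothetical helper: return every substring that looks like a blocked secret.
function findSecrets(text: string): string[] {
  return text.match(SECRET_RE) ?? [];
}
```

A scan like this is deliberately coarse. The `npm_` prefix in particular can also match harmless identifiers, so a real hook would likely need an allowlist on top of it.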

scripts/gepa/config.json

Lines changed: 7 additions & 1 deletion
@@ -24,7 +24,7 @@
 
   "evals": {
     "directory": "./evals",
-    "minSamplesPerVariant": 5,
+    "minSamplesPerVariant": 8,
     "timeout": 120000,
     "metrics": [
       "task_completion",
@@ -34,6 +34,12 @@
     ]
   },
 
+  "judge": {
+    "model": "claude-haiku-4-5-20251001",
+    "maxOutputTokens": 2000,
+    "timeoutMs": 30000
+  },
+
   "scoring": {
     "weights": {
       "task_completion": 0.4,
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+// Simple API endpoint that needs pagination added
+import express from 'express';
+
+const router = express.Router();
+
+interface User {
+  id: number;
+  name: string;
+  email: string;
+}
+
+// In-memory store
+const users: User[] = Array.from({ length: 100 }, (_, i) => ({
+  id: i + 1,
+  name: `User ${i + 1}`,
+  email: `user${i + 1}@example.com`,
+}));
+
+// GET /users - returns ALL users (no pagination)
+router.get('/users', (req, res) => {
+  res.json(users);
+});
+
+// GET /users/:id
+router.get('/users/:id', (req, res) => {
+  const user = users.find((u) => u.id === parseInt(req.params.id));
+  if (!user) return res.status(404).json({ error: 'Not found' });
+  res.json(user);
+});
+
+export default router;
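
The fixture above is described as an endpoint that "needs pagination added", and the commit does not show what a passing solution looks like. As a rough sketch of one possible approach, the slicing logic can live in a pure helper that the route then calls. The `paginate` function, the `Page` shape, the `page`/`limit` parameter names, and the defaults (20 per page, capped at 100) are all assumptions made for illustration, not the expected answer of the eval.

```typescript
// Hypothetical response envelope for a paginated list.
interface Page<T> {
  data: T[];
  page: number;
  limit: number;
  total: number;
  totalPages: number;
}

// Slice `items` into one page. NaN and 0 fall back to the defaults,
// negative values clamp to 1, and a page past the end clamps to the last page.
function paginate<T>(items: T[], page = 1, limit = 20, maxLimit = 100): Page<T> {
  const safeLimit = Math.min(Math.max(Math.trunc(limit) || 20, 1), maxLimit);
  const totalPages = Math.max(Math.ceil(items.length / safeLimit), 1);
  const safePage = Math.min(Math.max(Math.trunc(page) || 1, 1), totalPages);
  const start = (safePage - 1) * safeLimit;
  return {
    data: items.slice(start, start + safeLimit),
    page: safePage,
    limit: safeLimit,
    total: items.length,
    totalPages,
  };
}

// Possible wiring inside the fixture's GET /users handler (not executed here):
// res.json(paginate(users, Number(req.query.page), Number(req.query.limit)));
```

Keeping the helper free of Express types makes it testable without a running server. Clamping an out-of-range page to the last page is one design choice; returning an empty `data` array would be equally defensible.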
