feat(core): EvoSkill task-accuracy evaluator — Waves 0–1 (SMI-3284) by wrsmith108 · Pull Request #310 · smith-horn/skillsmith

wrsmith108 · 2026-03-12T06:57:06Z

Summary

Wave 0 (Schema — SMI-3289–3292): Adds 3 tables (benchmark_results, skill_variants, failure_patterns) via migration v11 with CHECK constraints (correct_count <= task_count, is_frontier IN (0,1), UNIQUE(skill_id, content_hash)). BenchmarkRepository with full CRUD.
Wave 1 (Failure Analyzer — SMI-3293–3295): Heuristic failure categorization into 5 categories (wrong_format, missing_context, reasoning_error, tool_misuse, hallucination) with suggestedFix templates per category. LLM mode stub for future implementation.
48 new tests, 0 regressions (3020 total tests pass)
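The three constraints named above can also be mirrored as application-side pre-checks. The following is a minimal sketch, not the PR's `BenchmarkRepository`: the camelCase field names and the helper names (`checkBenchmarkResult`, `checkSkillVariant`, `findDuplicateVariant`, `isFailureCategory`) are assumptions made for illustration, and only the columns mentioned in the summary are modeled.

```typescript
// Hypothetical row shapes: only the constrained columns from the summary are modeled.
interface BenchmarkResultRow {
  taskCount: number // assumed mapping of task_count
  correctCount: number // assumed mapping of correct_count
}

interface SkillVariantRow {
  skillId: string // assumed mapping of skill_id
  contentHash: string // assumed mapping of content_hash
  isFrontier: number // assumed mapping of is_frontier, stored as 0 or 1
}

// The five categories listed for the failure analyzer.
const FAILURE_CATEGORIES: readonly string[] = [
  'wrong_format',
  'missing_context',
  'reasoning_error',
  'tool_misuse',
  'hallucination',
]

// Mirrors CHECK (correct_count <= task_count). Returns a message, or null if the row is fine.
function checkBenchmarkResult(row: BenchmarkResultRow): string | null {
  return row.correctCount <= row.taskCount ? null : 'correct_count exceeds task_count'
}

// Mirrors CHECK (is_frontier IN (0,1)).
function checkSkillVariant(row: SkillVariantRow): string | null {
  return row.isFrontier === 0 || row.isFrontier === 1 ? null : 'is_frontier must be 0 or 1'
}

// Mirrors UNIQUE(skill_id, content_hash): returns the first row that repeats an earlier pair.
function findDuplicateVariant(rows: SkillVariantRow[]): SkillVariantRow | null {
  const seen = new Set<string>()
  for (const row of rows) {
    const key = JSON.stringify([row.skillId, row.contentHash])
    if (seen.has(key)) return row
    seen.add(key)
  }
  return null
}

// Mirrors the category CHECK on failure_patterns.
function isFailureCategory(value: string): boolean {
  return FAILURE_CATEGORIES.indexOf(value) !== -1
}
```

Checks like these only give earlier, friendlier errors; the database constraints remain the source of truth.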

Dependency gate: Waves 2–3 depend on Study A (benchmark harness). This PR covers the independent Waves 0–1.

Test plan

All 48 new tests pass (27 repository CRUD + constraint enforcement, 21 analyzer categorization)
All 3020 existing tests pass (0 regressions)
TypeScript strict mode — no errors
ESLint — no warnings
Prettier — all files formatted
All files under 500 lines
DB constraint enforcement tested: CHECK, UNIQUE, FK
Hallucination false-positive guard tested (must not dominate mixed-category sets)

🤖 Generated with claude-flow

…MI-3255

Wave 1 of the EvoSkill Benchmark Harness (Study A):
- IR metrics: nDCG, MRR, MAP@k, Precision@k, Recall@k (SMI-3263)
- Multi-tolerance exact-match scorer + LLM-judge DI interface (SMI-3266)
- Core types: BenchmarkTask, ConditionConfig, HarnessConfig (SMI-3269)
- Barrel exports from evoskill/index.ts + parent benchmarks/index.ts (SMI-3276)
- 41 tests: known-answer IR metrics + scorer edge cases (SMI-3265)
- Worktree Docker compose override for port isolation

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
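For readers unfamiliar with the IR metrics listed in this commit, here is a minimal sketch of four of them using binary relevance. It is not the code from this PR: the function names and signatures are assumptions, and conventions differ between implementations (for example, whether Precision@k divides by `k` or by the number of results actually returned).

```typescript
// Precision@k: fraction of the top-k slots filled by relevant items (divides by k).
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length
  return hits / k
}

// Recall@k: fraction of all relevant items that appear in the top k.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length
  return hits / relevant.size
}

// Reciprocal rank of the first relevant item; MRR is the mean of this over queries.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex((id) => relevant.has(id))
  return idx === -1 ? 0 : 1 / (idx + 1)
}

// nDCG@k with binary gains: DCG of the ranking divided by DCG of an ideal ranking.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  let dcg = 0
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2)
  })
  let idcg = 0
  const idealHits = Math.min(relevant.size, k)
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2)
  return idcg === 0 ? 0 : dcg / idcg
}

// Known-answer example: relevant items sit at ranks 2 and 4.
const ranked = ['a', 'b', 'c', 'd']
const relevant = new Set(['b', 'd'])
console.log(precisionAtK(ranked, relevant, 2)) // 0.5
console.log(recallAtK(ranked, relevant, 2)) // 0.5
console.log(reciprocalRank(ranked, relevant)) // 0.5
console.log(ndcgAtK(ranked, relevant, 4).toFixed(4)) // 0.6509
```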

…284 Waves 0–1

Wave 0 (Schema & Data Model — SMI-3289, SMI-3290, SMI-3291, SMI-3292):
- Add benchmark_results table with CHECK (correct_count <= task_count)
- Add skill_variants table with UNIQUE(skill_id, content_hash) and CHECK (is_frontier IN (0,1))
- Add failure_patterns table with category CHECK constraint
- Migration v11, SCHEMA_VERSION bumped from 10 to 11
- BenchmarkRepository with full CRUD for all 3 tables

Wave 1 (Failure Analyzer — SMI-3293, SMI-3294, SMI-3295):
- FailureAnalyzer with heuristic mode: 5 categories (wrong_format, missing_context, reasoning_error, tool_misuse, hallucination)
- suggestedFix templates per category
- LLM mode stub (falls back to heuristic for now)
- Hallucination documented as best-effort (detection-by-absence)

Tests: 48 new tests (27 repository + 21 analyzer), 0 regressions.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>

Wave 2 of the EvoSkill Benchmark Harness (Study A):
- Dataset loader: CSV/JSON parsing, seeded train/val/test splits (SMI-3269)
- Skill selector: conditions 1–6, 8–9 factories; condition 7 throws NotImplementedError (SMI-3270)
- Agent runner: Claude API DI interface, exponential backoff [1s,2s,4s], per-task timeout (SMI-3271)
- Evaluator: scorer-based accuracy, token cost aggregation, IR metrics (SMI-3272)
- Harness orchestrator: concurrent conditions per seed, serial seeds (SMI-3273)
- Report generator: markdown tables, JSON export, Pareto frontier (SMI-3274)
- Barrel exports updated for all new modules (SMI-3276)
- 34 new tests: dataset-loader (16), evaluator (9), skill-selector (9)

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
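A seeded train/val/test split, as mentioned for the dataset loader, can be sketched as below. This is an illustration under stated assumptions rather than the loader in this PR: the PRNG (mulberry32), the Fisher-Yates shuffle, the `seededSplit` name, and the default 70/15/15 fractions are all hypothetical choices.

```typescript
// mulberry32: a small deterministic PRNG; the same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface Split<T> {
  train: T[]
  val: T[]
  test: T[]
}

// Shuffles a copy of the items with a seeded Fisher-Yates pass, then slices it three ways.
// The fractions are hypothetical defaults; whatever is left after train and val goes to test.
function seededSplit<T>(items: T[], seed: number, trainFrac = 0.7, valFrac = 0.15): Split<T> {
  const rand = mulberry32(seed)
  const shuffled = items.slice()
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1))
    const tmp = shuffled[i]
    shuffled[i] = shuffled[j]
    shuffled[j] = tmp
  }
  const trainEnd = Math.floor(shuffled.length * trainFrac)
  const valEnd = trainEnd + Math.floor(shuffled.length * valFrac)
  return {
    train: shuffled.slice(0, trainEnd),
    val: shuffled.slice(trainEnd, valEnd),
    test: shuffled.slice(valEnd),
  }
}
```

Because the shuffle depends only on the seed, calling `seededSplit(tasks, seed)` inside a per-seed loop gives reproducible splits for each seed while still varying the split across seeds.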

- CLI command: `skillsmith benchmark evoskill` with options for benchmark, condition, seeds, sample fraction, dry-run, model, and output directory
- Register benchmark command group in CLI main entry
- Add all EvoSkill exports to @skillsmith/core index.ts for public API
- Placeholder AgentClient/LlmJudgeClient — real SDK integration in Wave 3

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2026-03-12T07:09:55Z

Performance Benchmark Results

╔═══════════════════════════════════════════════════════════════╗
║ SMI-1537: V3 Migration Performance Benchmarks ║
╠═══════════════════════════════════════════════════════════════╣
║ Memory Operations: 40x target ║
║ Embedding Search: 150x target ║
║ Recommendation Pipeline: 4x target ║
╚═══════════════════════════════════════════════════════════════╝

Running 50 iterations with 10 warmup...

--- Memory Operations ---

--- Embedding Search ---
Indexing 10K vectors... done

--- Recommendation Pipeline ---
Initializing recommendation pipeline with 1000 skills... done

═══════════════════════════════════════════════════════════════

V3 Migration Benchmark Report

Date: 2026-03-12T07:09:55.011Z
Node.js: v22.22.1

Results

Operation	V2 Baseline	V3 Result	Speedup	Target	Status
Memory Store	200ms	0.00ms	174672x	40x	✅
Memory Get	150ms	0.00ms	270758x	40x	✅
Memory Delete	180ms	0.00ms	318584x	40x	✅
Embedding Search (10K vectors)	500ms	0.14ms	3463x	150x	✅
Recommendation Pipeline	800ms	0.28ms	2841x	4x	✅

Summary

Total Benchmarks: 5
Passed: 5
Failed: 0
All Targets Met: ✅ Yes

Notes

V2 baselines are from pre-migration measurements (simulated for this benchmark)
Target threshold includes 20% tolerance for environmental variance
Memory operations use in-memory Map (real V3 uses optimized SQLite)
Embedding search simulates HNSW algorithm efficiency (real V3 uses onnxruntime-node)

github-actions · 2026-03-12T07:17:43Z

E2E Test Results

E2E Test Results - March 12, 2026

Summary

Status: ✅ PASSED
Total Duration: 0.00s
Generated: 2026-03-12T07:17:41.777Z

Test Results

Phase	Status	Duration
CLI E2E	⏭️ Skipped	-
MCP E2E	⏭️ Skipped	-

Generated by skillsmith E2E test suite

Critical:
- C1: Validate JSON dataset fields (type + non-empty) in loadJSONDataset
- C2: Per-benchmark scorer via getScorer() instead of single shared scorer

High:
- H1: Pass seed to loadDataset() inside seed loop for proper split variance
- H2: Add explicit SkillSelectorFn type annotation
- H3: Use fs/promises.readFile + path traversal guard in evolvedSelector
- H4: Add 21 tests for agent-runner (8), harness (4), report (9)

Medium:
- M2: Configurable scoreThreshold in EvaluatorConfig (default 0.5)
- M3: Throw on unconfigured conditions 2/5/6/8 instead of silent fallback
- M4: Rename runTask→runEvoSkillTask, runBatch→runEvoSkillBatch
- M5: Replace mutable object accumulator with primitive variables

Low:
- L1: Include typeof in JSON parse error message
- L2: Add datasetDir to HarnessConfig for absolute path resolution
- L3: Add tests for conditions 5, 8, and path traversal guard

100 tests pass across 8 test files. M1 (docker-compose.override.yml) is worktree-local and will be excluded from PR.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>

… — SMI-3266

Add 200 fixture samples (100 OfficeQA via reward.py, 100 DABStep via dabstep_scorer.py) generated from EvoSkill's Python scorers with real dataset entries. Cross-validation test asserts ≤5% divergence excluding documented known divergences (substring matching, list reordering, fuzzy string similarity).

Fix: scorer normalize() now strips surrounding quotes to match Python behavior.

Documented divergences:
- Python supports substring matching ("Paris" in "The answer is Paris")
- Python strips parentheticals for text comparison
- Python dabstep_scorer supports list reordering (semicolon/comma)
- Python dabstep_scorer uses SequenceMatcher (>0.95 similarity)

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>

…-evaluator

…MI-3297

Wave 2 of Study B (task-accuracy evaluator):
- SkillVariantGenerator: 4 strategies (decompose, augment, specialize, LLM rewrite) with SHA-256 dedup
- VariantSelector: Pareto frontier selection (accuracy vs cost, tiebreak on skillSize)
- 25 tests covering all strategies, dedup, edge cases, and dominance logic

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
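Pareto frontier selection over accuracy (higher is better) and cost (lower is better) can be sketched as follows. This is not the `VariantSelector` from the PR: the `VariantScore` shape and function names are assumptions, and the tiebreak is interpreted here as "among variants with identical accuracy and cost, keep the one with the smallest `skillSize`", which may differ from the actual rule.

```typescript
// Hypothetical score record for one skill variant.
interface VariantScore {
  id: string
  accuracy: number // higher is better
  cost: number // lower is better
  skillSize: number // used only to break exact ties
}

// a dominates b if it is at least as good on both axes and strictly better on one.
function dominates(a: VariantScore, b: VariantScore): boolean {
  const noWorse = a.accuracy >= b.accuracy && a.cost <= b.cost
  const strictlyBetter = a.accuracy > b.accuracy || a.cost < b.cost
  return noWorse && strictlyBetter
}

// Keeps every non-dominated variant, then collapses exact (accuracy, cost) ties
// to the variant with the smallest skillSize (assumed tiebreak).
function paretoFrontier(variants: VariantScore[]): VariantScore[] {
  const frontier = variants.filter(
    (candidate) => !variants.some((other) => other !== candidate && dominates(other, candidate))
  )
  const best = new Map<string, VariantScore>()
  for (const v of frontier) {
    const key = `${v.accuracy}|${v.cost}`
    const current = best.get(key)
    if (current === undefined || v.skillSize < current.skillSize) best.set(key, v)
  }
  return Array.from(best.values())
}

const frontierIds = paretoFrontier([
  { id: 'A', accuracy: 0.9, cost: 10, skillSize: 100 },
  { id: 'B', accuracy: 0.8, cost: 5, skillSize: 100 },
  { id: 'C', accuracy: 0.7, cost: 8, skillSize: 100 }, // dominated by B
  { id: 'D', accuracy: 0.9, cost: 10, skillSize: 50 }, // ties with A, smaller skill
])
  .map((v) => v.id)
  .sort()
console.log(frontierIds) // [ 'B', 'D' ]
```

The pairwise filter is quadratic in the number of variants, which is usually acceptable for the small candidate pools an iterative evaluator generates.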

…00, SMI-3301, SMI-3302

Wave 3 of Study B (task-accuracy evaluator):
- IterativeEvaluator: evaluate → analyze → generate → select loop with early stopping and cost guard
- Condition 7 (Skillsmith-Iterative) integrated into skill-selector.ts
- 7 tests covering baseline seeding, early stopping, budget enforcement, convergence tracking

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
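The evaluate → analyze → generate → select loop described in this commit can be sketched roughly as below. This is a simplified, synchronous illustration, not the PR's `IterativeEvaluator`: the dependency interface, config fields (`maxIterations`, `costBudget`, `minImprovement`), and stop reasons are all assumed names.

```typescript
interface EvalResult {
  accuracy: number
  cost: number
}

// Hypothetical injected dependencies, kept synchronous for brevity.
interface LoopDeps {
  evaluate: (skill: string) => EvalResult
  analyze: (skill: string, result: EvalResult) => string[] // failure notes
  generate: (skill: string, failures: string[]) => string[] // candidate variants
}

interface LoopConfig {
  maxIterations: number
  costBudget: number // cost guard: no evaluation starts once this much is spent
  minImprovement: number // early stopping: smaller gains count as converged
}

type StopReason = 'max_iterations' | 'budget' | 'converged'

function runIterativeLoop(baseline: string, deps: LoopDeps, cfg: LoopConfig) {
  let best = baseline
  let bestResult = deps.evaluate(baseline) // baseline seeding
  let spent = bestResult.cost
  const history = [bestResult.accuracy] // convergence tracking
  let stopReason: StopReason = 'max_iterations'

  for (let i = 0; i < cfg.maxIterations; i++) {
    const failures = deps.analyze(best, bestResult)
    const candidates = deps.generate(best, failures)
    let improved = false
    let budgetHit = false
    for (const candidate of candidates) {
      if (spent >= cfg.costBudget) {
        budgetHit = true
        break
      }
      const result = deps.evaluate(candidate)
      spent += result.cost
      // Select: keep the candidate only if it beats the current best by the threshold.
      if (result.accuracy - bestResult.accuracy >= cfg.minImprovement) {
        best = candidate
        bestResult = result
        improved = true
      }
    }
    history.push(bestResult.accuracy)
    if (budgetHit) {
      stopReason = 'budget'
      break
    }
    if (!improved) {
      stopReason = 'converged'
      break
    }
  }
  return { best, accuracy: bestResult.accuracy, spent, history, stopReason }
}
```

Checking the budget before each evaluation means the loop can overshoot by at most one evaluation's cost, since costs are only known after a run completes.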

packages/core/src/benchmarks/evoskill/scorers.ts

+  return s
+    .trim()
+    .toLowerCase()
+    .replace(/^["']+|["']+$/g, '') // strip surrounding quotes
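To show the effect of the quote-stripping line in context, here is a minimal sketch of a normalizer built from the three calls in the excerpt above, plus a hypothetical `exactMatch` helper. The rest of the real `normalize()` is not shown in the excerpt and may do more, and the numeric tolerance parameter is an assumed simplification of the "multi-tolerance" scorer.

```typescript
// Built only from the calls visible in the diff excerpt; the real function may continue further.
function normalize(s: string): string {
  return s
    .trim()
    .toLowerCase()
    .replace(/^["']+|["']+$/g, '') // strip surrounding quotes
}

// Hypothetical scorer: exact match after normalization, with an optional
// absolute tolerance when both sides parse as finite numbers.
function exactMatch(predicted: string, expected: string, tolerance = 0): boolean {
  const p = normalize(predicted)
  const e = normalize(expected)
  if (p === e) return true
  if (p === '' || e === '') return false // Number('') is 0, so guard empty strings
  const pn = Number(p)
  const en = Number(e)
  if (isFinite(pn) && isFinite(en)) return Math.abs(pn - en) <= tolerance
  return false
}

console.log(normalize('  "Paris" ')) // paris
console.log(exactMatch("'Paris'", 'paris')) // true
console.log(exactMatch('3.14', '3.1400')) // true (numerically equal)
console.log(exactMatch('The answer is Paris', 'Paris')) // false
```

The last call illustrates one of the documented divergences: without substring matching, an answer embedded in a sentence does not count as a match.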


github-actions · 2026-03-12T16:40:20Z

Performance Benchmark Results

⚠️ Benchmark results not available

github-actions · 2026-03-12T16:43:38Z

E2E Test Results

Phase	Status
CLI E2E	❌ failure
MCP E2E	❌ skipped

Export FailureAnalyzer, SkillVariantGenerator, VariantSelector, IterativeEvaluator and all evaluation types from @skillsmith/core.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>

packages/core/src/evaluation/SkillVariantGenerator.ts

+    // Replace existing section if present
+    const sectionRegex = /\n## Skill Improvement Notes\n[\s\S]*?(?=\n## |\n*$)/
+    if (sectionRegex.test(content)) {
+      return content.replace(sectionRegex, section).trim()
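The excerpt does not show what the code-scanning bot flagged on these lines, so the following is only one possible alternative, not a fix confirmed by the PR. It replaces the section with plain `indexOf` and `slice` calls instead of a lazy `[\s\S]*?` pattern. The function name `replaceNotesSection` and the append-when-absent fallback are assumptions, since the original `else` branch is not visible.

```typescript
const NOTES_HEADING = '\n## Skill Improvement Notes\n'

// Replaces the "Skill Improvement Notes" section (up to the next "## " heading or
// the end of the document) with `section`. Appends when the section is absent (assumed).
function replaceNotesSection(content: string, section: string): string {
  const start = content.indexOf(NOTES_HEADING)
  if (start === -1) return (content + section).trim()

  // Search from the heading's trailing newline so an empty section that is
  // followed directly by another heading is still bounded correctly.
  const next = content.indexOf('\n## ', start + NOTES_HEADING.length - 1)
  const end = next === -1 ? content.length : next

  // Concatenation inserts `section` literally; String.prototype.replace with a
  // string replacement would expand patterns such as "$&" inside it.
  return (content.slice(0, start) + section + content.slice(end)).trim()
}

const updated = replaceNotesSection(
  '# Skill\n\nBody\n\n## Skill Improvement Notes\nold note\n\n## Usage\nrun it\n',
  '\n## Skill Improvement Notes\nnew note\n'
)
console.log(updated)
```

With this input the old note is replaced and the `## Usage` section that follows is left untouched.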


github-actions · 2026-03-12T17:21:17Z

Performance Benchmark Results

⚠️ Benchmark results not available

github-actions · 2026-03-12T17:24:43Z

E2E Test Results

Phase	Status
CLI E2E	❌ failure
MCP E2E	❌ skipped

wrsmith108 and others added 4 commits March 11, 2026 23:55

wrsmith108 and others added 5 commits March 12, 2026 08:27

Merge branch 'evoskill-benchmark-harness' into evoskill-task-accuracy-evaluator (280da78)

github-advanced-security bot found potential problems Mar 12, 2026

wrsmith108 wants to merge 10 commits into main from evoskill-task-accuracy-evaluator
