feat(core): EvoSkill task-accuracy evaluator — Waves 0–1 (SMI-3284) #310

wrsmith108 wants to merge 10 commits into main from
Conversation
…MI-3255 Wave 1 of the EvoSkill Benchmark Harness (Study A):

- IR metrics: nDCG, MRR, MAP@k, Precision@k, Recall@k (SMI-3263)
- Multi-tolerance exact-match scorer + LLM-judge DI interface (SMI-3266)
- Core types: BenchmarkTask, ConditionConfig, HarnessConfig (SMI-3269)
- Barrel exports from evoskill/index.ts + parent benchmarks/index.ts (SMI-3276)
- 41 tests: known-answer IR metrics + scorer edge cases (SMI-3265)
- Worktree Docker compose override for port isolation

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
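The IR metrics listed above follow their standard definitions. As a minimal TypeScript sketch (function names are illustrative, not the PR's actual API), MRR and Precision@k can be computed from per-query relevance lists:

```typescript
// Mean Reciprocal Rank: for each query's ranked results, take 1/rank of the
// first relevant item (0 if none is relevant), then average over queries.
function mrr(rankedRelevance: boolean[][]): number {
  const scores = rankedRelevance.map((ranks) => {
    const i = ranks.findIndex(Boolean);
    return i === -1 ? 0 : 1 / (i + 1);
  });
  return scores.reduce((a, b) => a + b, 0) / (scores.length || 1);
}

// Precision@k: fraction of the top-k results that are relevant.
function precisionAtK(ranks: boolean[], k: number): number {
  return ranks.slice(0, k).filter(Boolean).length / k;
}
```

The "known-answer" tests mentioned for SMI-3265 presumably pin metrics like these against hand-computed values.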
…284 Waves 0–1

Wave 0 (Schema & Data Model — SMI-3289, SMI-3290, SMI-3291, SMI-3292):

- Add benchmark_results table with CHECK (correct_count <= task_count)
- Add skill_variants table with UNIQUE(skill_id, content_hash) and CHECK (is_frontier IN (0,1))
- Add failure_patterns table with category CHECK constraint
- Migration v11, SCHEMA_VERSION bumped from 10 to 11
- BenchmarkRepository with full CRUD for all 3 tables

Wave 1 (Failure Analyzer — SMI-3293, SMI-3294, SMI-3295):

- FailureAnalyzer with heuristic mode: 5 categories (wrong_format, missing_context, reasoning_error, tool_misuse, hallucination)
- suggestedFix templates per category
- LLM mode stub (falls back to heuristic for now)
- Hallucination documented as best-effort (detection-by-absence)

Tests: 48 new tests (27 repository + 21 analyzer), 0 regressions.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
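The CHECK constraints above can also be mirrored in application code so invalid rows fail fast before reaching SQLite. A sketch under assumed names (this is not the PR's actual BenchmarkRepository API; is_frontier belongs to skill_variants in the schema):

```typescript
// Mirrors the migration-v11 constraints: correct_count <= task_count
// (benchmark_results) and is_frontier IN (0,1) (skill_variants).
interface ResultRow {
  taskCount: number;
  correctCount: number;
  isFrontier: number;
}

function validateRow(row: ResultRow): void {
  // CHECK (correct_count <= task_count)
  if (row.correctCount > row.taskCount) {
    throw new Error(
      `correct_count ${row.correctCount} exceeds task_count ${row.taskCount}`,
    );
  }
  // CHECK (is_frontier IN (0,1))
  if (row.isFrontier !== 0 && row.isFrontier !== 1) {
    throw new Error("is_frontier must be 0 or 1");
  }
}
```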
Wave 2 of the EvoSkill Benchmark Harness (Study A):

- Dataset loader: CSV/JSON parsing, seeded train/val/test splits (SMI-3269)
- Skill selector: conditions 1–6, 8–9 factories; condition 7 throws NotImplementedError (SMI-3270)
- Agent runner: Claude API DI interface, exponential backoff [1s, 2s, 4s], per-task timeout (SMI-3271)
- Evaluator: scorer-based accuracy, token cost aggregation, IR metrics (SMI-3272)
- Harness orchestrator: concurrent conditions per seed, serial seeds (SMI-3273)
- Report generator: markdown tables, JSON export, Pareto frontier (SMI-3274)
- Barrel exports updated for all new modules (SMI-3276)
- 34 new tests: dataset-loader (16), evaluator (9), skill-selector (9)

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
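The agent runner's retry behavior can be sketched as follows. This is a minimal illustration of the [1s, 2s, 4s] backoff schedule, not the PR's actual runner (which also enforces a per-task timeout); the sleep parameter is injected so tests can run without real delays:

```typescript
// Backoff schedule from SMI-3271: 1s, 2s, 4s between retries,
// clamped to the last delay for any further attempts.
const BACKOFF_MS = [1000, 2000, 4000];

function backoffDelay(attempt: number): number {
  return BACKOFF_MS[Math.min(attempt, BACKOFF_MS.length - 1)];
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = BACKOFF_MS.length,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < maxRetries) await sleep(backoffDelay(attempt));
    }
  }
  throw lastErr;
}
```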
- CLI command: `skillsmith benchmark evoskill` with options for benchmark, condition, seeds, sample fraction, dry-run, model, and output directory
- Register the benchmark command group in the CLI main entry
- Add all EvoSkill exports to @skillsmith/core index.ts for the public API
- Placeholder AgentClient/LlmJudgeClient — real SDK integration in Wave 3

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
Performance Benchmark Results

V3 Migration Benchmark Report
Date: 2026-03-12T07:09:55.011Z
Running 50 iterations with 10 warmup
Sections: Memory Operations, Embedding Search, Recommendation Pipeline

E2E Test Results - March 12, 2026

Generated by skillsmith E2E test suite
Critical:

- C1: Validate JSON dataset fields (type + non-empty) in loadJSONDataset
- C2: Per-benchmark scorer via getScorer() instead of single shared scorer

High:

- H1: Pass seed to loadDataset() inside seed loop for proper split variance
- H2: Add explicit SkillSelectorFn type annotation
- H3: Use fs/promises.readFile + path traversal guard in evolvedSelector
- H4: Add 21 tests for agent-runner (8), harness (4), report (9)

Medium:

- M2: Configurable scoreThreshold in EvaluatorConfig (default 0.5)
- M3: Throw on unconfigured conditions 2/5/6/8 instead of silent fallback
- M4: Rename runTask → runEvoSkillTask, runBatch → runEvoSkillBatch
- M5: Replace mutable object accumulator with primitive variables

Low:

- L1: Include typeof in JSON parse error message
- L2: Add datasetDir to HarnessConfig for absolute path resolution
- L3: Add tests for conditions 5, 8, and path traversal guard

100 tests pass across 8 test files. M1 (docker-compose.override.yml) is worktree-local and will be excluded from the PR.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
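The path traversal guard from H3 can be sketched as below. This is an illustration of the general technique, not the PR's evolvedSelector code; resolveSkillPath is a hypothetical name:

```typescript
import * as path from "node:path";

// Resolve a requested skill file against a base directory and reject any
// path that escapes it (e.g. "../etc/passwd" or an absolute path outside).
function resolveSkillPath(baseDir: string, requested: string): string {
  const base = path.resolve(baseDir);
  const resolved = path.resolve(base, requested);
  // Allow the base dir itself, or anything strictly inside it.
  if (resolved !== base && !resolved.startsWith(base + path.sep)) {
    throw new Error(`Path escapes base directory: ${requested}`);
  }
  return resolved;
}
```

The resolved path would then be read with fs/promises.readFile as the review item suggests.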
… — SMI-3266
Add 200 fixture samples (100 OfficeQA via reward.py, 100 DABStep via
dabstep_scorer.py) generated from EvoSkill's Python scorers with real
dataset entries. Cross-validation test asserts ≤5% divergence excluding
documented known divergences (substring matching, list reordering,
fuzzy string similarity).
Fix: scorer normalize() now strips surrounding quotes to match Python
behavior.
Documented divergences:
- Python supports substring matching ("Paris" in "The answer is Paris")
- Python strips parentheticals for text comparison
- Python dabstep_scorer supports list reordering (semicolon/comma)
- Python dabstep_scorer uses SequenceMatcher (>0.95 similarity)
Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
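The quote-stripping fix to normalize() can be sketched as follows. This is an illustrative stand-alone helper, not the scorer's actual normalize() implementation, which applies further multi-tolerance normalization:

```typescript
// Strip one matching pair of surrounding quotes so `"Paris"` and `'Paris'`
// normalize to `Paris`, matching the Python scorers' behavior.
function stripSurroundingQuotes(value: string): string {
  const trimmed = value.trim();
  const quotes = ['"', "'"];
  for (const q of quotes) {
    if (trimmed.length >= 2 && trimmed.startsWith(q) && trimmed.endsWith(q)) {
      return trimmed.slice(1, -1).trim();
    }
  }
  return trimmed;
}
```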
…MI-3297 Wave 2 of Study B (task-accuracy evaluator):

- SkillVariantGenerator: 4 strategies (decompose, augment, specialize, LLM rewrite) with SHA-256 dedup
- VariantSelector: Pareto frontier selection (accuracy vs cost, tiebreak on skillSize)
- 25 tests covering all strategies, dedup, edge cases, and dominance logic

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
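The Pareto frontier selection (maximize accuracy, minimize cost) can be sketched as below. Names and shapes are illustrative, not the VariantSelector API, and the skillSize tiebreak is omitted:

```typescript
// A variant is dominated if another variant is at least as accurate and at
// least as cheap, and strictly better on at least one of the two axes.
interface Variant {
  id: string;
  accuracy: number; // higher is better
  cost: number;     // lower is better
}

function dominates(a: Variant, b: Variant): boolean {
  return (
    a.accuracy >= b.accuracy &&
    a.cost <= b.cost &&
    (a.accuracy > b.accuracy || a.cost < b.cost)
  );
}

// The frontier keeps every variant not dominated by any other.
function paretoFrontier(variants: Variant[]): Variant[] {
  return variants.filter((v) => !variants.some((other) => dominates(other, v)));
}
```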
…00, SMI-3301, SMI-3302 Wave 3 of Study B (task-accuracy evaluator):

- IterativeEvaluator: evaluate → analyze → generate → select loop with early stopping and cost guard
- Condition 7 (Skillsmith-Iterative) integrated into skill-selector.ts
- 7 tests covering baseline seeding, early stopping, budget enforcement, convergence tracking

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
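The loop's control flow can be sketched as follows. This is a simplified stand-in for the IterativeEvaluator, with evalFn abstracting the whole evaluate/analyze/generate/select step; all names and the exact stopping conditions are assumptions:

```typescript
interface IterationResult {
  accuracy: number;
  costUsd: number;
}

// Run up to maxIterations, stopping early after `patience` iterations
// without improvement (early stopping) or once spend reaches the budget
// (cost guard).
async function iterate(
  evalFn: (iteration: number) => Promise<IterationResult>,
  opts: { maxIterations: number; budgetUsd: number; patience: number },
): Promise<{ best: number; iterations: number }> {
  let best = -Infinity;
  let spent = 0;
  let sinceImprovement = 0;
  let i = 0;
  for (; i < opts.maxIterations; i++) {
    const { accuracy, costUsd } = await evalFn(i);
    spent += costUsd;
    if (accuracy > best) {
      best = accuracy;
      sinceImprovement = 0;
    } else if (++sinceImprovement >= opts.patience) {
      i++; // count this iteration, then early-stop
      break;
    }
    if (spent >= opts.budgetUsd) {
      i++; // cost guard
      break;
    }
  }
  return { best, iterations: i };
}
```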
Export FailureAnalyzer, SkillVariantGenerator, VariantSelector, IterativeEvaluator and all evaluation types from @skillsmith/core.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
Check failure: Code scanning / CodeQL — Polynomial regular expression used on uncontrolled data (High)

```js
// Replace existing section if present
const sectionRegex = /\n## Skill Improvement Notes\n[\s\S]*?(?=\n## |\n*$)/
if (sectionRegex.test(content)) {
  return content.replace(sectionRegex, section).trim()
}
```
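The alert is triggered by the lazy `[\s\S]*?` combined with the `(?=\n## |\n*$)` lookahead, which can backtrack polynomially on crafted input. One way to address it is to drop the regex and scan with indexOf; this is a sketch of that approach (the heading string is taken from the flagged code, the function name is hypothetical):

```typescript
// Replace the "## Skill Improvement Notes" section without a backtracking
// regex: locate the heading, then scan forward to the next "## " heading.
function replaceSection(content: string, section: string): string {
  const heading = "\n## Skill Improvement Notes\n";
  const start = content.indexOf(heading);
  if (start === -1) return content; // section absent; caller appends instead
  const next = content.indexOf("\n## ", start + heading.length);
  const end = next === -1 ? content.length : next;
  return (content.slice(0, start) + section + content.slice(end)).trim();
}
```

Both passes are linear in the input length, so the CodeQL finding no longer applies.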
Summary
Adds three new tables (benchmark_results, skill_variants, failure_patterns) via migration v11, with CHECK constraints (correct_count <= task_count, is_frontier IN (0,1)) and a UNIQUE(skill_id, content_hash) constraint, plus a BenchmarkRepository with full CRUD.

Dependency gate: Waves 2–3 depend on Study A (benchmark harness). This PR covers the independent Waves 0–1.
Test plan
🤖 Generated with claude-flow