feat(core): EvoSkill task-accuracy evaluator — Waves 0–1 (SMI-3284)#310

Open
wrsmith108 wants to merge 10 commits into main from evoskill-task-accuracy-evaluator

@wrsmith108
Member

Summary

  • Wave 0 (Schema, SMI-3289–3292): Adds three tables (benchmark_results, skill_variants, failure_patterns) via migration v11, with constraints CHECK (correct_count <= task_count), CHECK (is_frontier IN (0,1)), and UNIQUE(skill_id, content_hash). Adds BenchmarkRepository with full CRUD for all three tables.
  • Wave 1 (Failure Analyzer, SMI-3293–3295): Heuristic failure categorization into 5 categories (wrong_format, missing_context, reasoning_error, tool_misuse, hallucination) with a suggestedFix template per category. LLM mode is a stub that falls back to the heuristic for now.
  • 48 new tests, 0 regressions (3020 total tests pass)

Dependency gate: Waves 2–3 depend on Study A (benchmark harness). This PR covers the independent Waves 0–1.

Test plan

  • All 48 new tests pass (27 repository CRUD + constraint enforcement, 21 analyzer categorization)
  • All 3020 existing tests pass (0 regressions)
  • TypeScript strict mode — no errors
  • ESLint — no warnings
  • Prettier — all files formatted
  • All files under 500 lines
  • DB constraint enforcement tested: CHECK, UNIQUE, FK
  • Hallucination false-positive guard tested (must not dominate mixed-category sets)

🤖 Generated with claude-flow

wrsmith108 and others added 4 commits March 11, 2026 23:55
…MI-3255

Wave 1 of the EvoSkill Benchmark Harness (Study A):
- IR metrics: nDCG, MRR, MAP@k, Precision@k, Recall@k (SMI-3263)
- Multi-tolerance exact-match scorer + LLM-judge DI interface (SMI-3266)
- Core types: BenchmarkTask, ConditionConfig, HarnessConfig (SMI-3269)
- Barrel exports from evoskill/index.ts + parent benchmarks/index.ts (SMI-3276)
- 41 tests: known-answer IR metrics + scorer edge cases (SMI-3265)
- Worktree Docker compose override for port isolation
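
For readers unfamiliar with the IR metrics listed above, here is a minimal sketch of Precision@k, Recall@k, reciprocal rank (the per-query term of MRR) and nDCG@k under binary relevance. The function names and signatures are illustrative assumptions, not the harness's actual exports.

```typescript
// Hypothetical helpers; binary relevance (an item is relevant or it is not).
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length
  return hits / k
}

function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length
  return hits / relevant.size
}

// Reciprocal rank of the first relevant item; MRR is the mean over queries.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const index = ranked.findIndex((id) => relevant.has(id))
  return index === -1 ? 0 : 1 / (index + 1)
}

// nDCG@k with gain 1 for relevant items and a log2 position discount.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0)
  const idealHits = Math.min(k, relevant.size)
  let idcg = 0
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2)
  return idcg === 0 ? 0 : dcg / idcg
}
```

With the ranking `['a', 'b', 'c', 'd']` and relevant set `{b, d}`, Precision@2 and Recall@2 are both 0.5, the reciprocal rank is 0.5, and nDCG@4 is roughly 0.651.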

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
…284 Waves 0–1

Wave 0 (Schema & Data Model — SMI-3289, SMI-3290, SMI-3291, SMI-3292):
- Add benchmark_results table with CHECK (correct_count <= task_count)
- Add skill_variants table with UNIQUE(skill_id, content_hash) and CHECK (is_frontier IN (0,1))
- Add failure_patterns table with category CHECK constraint
- Migration v11, SCHEMA_VERSION bumped from 10 to 11
- BenchmarkRepository with full CRUD for all 3 tables
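
The constraints listed above can be mirrored in application code to reject bad rows before they reach the database. The following is only a sketch: the column names come from the commit message, while the row interfaces and function names are assumptions made for illustration.

```typescript
// Hypothetical row shapes: only the column names that appear in the commit
// message are taken from the source; everything else is illustrative.
interface BenchmarkResultRow {
  correct_count: number
  task_count: number
}

interface SkillVariantRow {
  skill_id: string
  content_hash: string
  is_frontier: number
}

// Mirrors CHECK (correct_count <= task_count).
function checkBenchmarkResult(row: BenchmarkResultRow): boolean {
  return row.correct_count <= row.task_count
}

// Mirrors CHECK (is_frontier IN (0,1)) and UNIQUE(skill_id, content_hash),
// evaluated against the rows that are already stored.
function checkSkillVariant(row: SkillVariantRow, existing: SkillVariantRow[]): string[] {
  const violations: string[] = []
  if (row.is_frontier !== 0 && row.is_frontier !== 1) {
    violations.push('is_frontier must be 0 or 1')
  }
  const duplicate = existing.some(
    (e) => e.skill_id === row.skill_id && e.content_hash === row.content_hash
  )
  if (duplicate) {
    violations.push('(skill_id, content_hash) must be unique')
  }
  return violations
}
```

The database constraints remain the source of truth; a guard like this only produces friendlier error messages earlier.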

Wave 1 (Failure Analyzer — SMI-3293, SMI-3294, SMI-3295):
- FailureAnalyzer with heuristic mode: 5 categories (wrong_format,
  missing_context, reasoning_error, tool_misuse, hallucination)
- suggestedFix templates per category
- LLM mode stub (falls back to heuristic for now)
- Hallucination documented as best-effort (detection-by-absence)
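
To make the heuristic mode concrete, here is one way a rule-based categorizer over the five categories could look. The rules, the input shape, and the fix templates below are all assumptions; the PR's actual heuristics are not shown on this page. Hallucination is treated as the fallback when no other signal fires, which is one reading of "detection-by-absence".

```typescript
type FailureCategory =
  | 'wrong_format'
  | 'missing_context'
  | 'reasoning_error'
  | 'tool_misuse'
  | 'hallucination'

// Hypothetical input shape for a single failed task.
interface FailureInput {
  expected: string
  actual: string
  errorMessage?: string
}

interface FailureReport {
  category: FailureCategory
  suggestedFix: string
}

// Illustrative templates; the real wording is not shown in the source.
const SUGGESTED_FIX: Record<FailureCategory, string> = {
  wrong_format: 'State the required output format explicitly.',
  missing_context: 'Add the missing background or file references.',
  reasoning_error: 'Ask for intermediate steps before the final answer.',
  tool_misuse: 'Document the correct tool invocation.',
  hallucination: 'Require answers to cite the provided material.',
}

// Compare answers while ignoring case, whitespace and punctuation.
function squash(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9]/g, '')
}

function isNumeric(s: string): boolean {
  const t = s.trim()
  return t !== '' && Number.isFinite(Number(t))
}

function categorizeFailure(input: FailureInput): FailureReport {
  let category: FailureCategory
  if (/tool|command/i.test(input.errorMessage ?? '')) category = 'tool_misuse'
  else if (/not enough information|insufficient|cannot find/i.test(input.actual)) category = 'missing_context'
  else if (squash(input.expected) === squash(input.actual)) category = 'wrong_format'
  else if (isNumeric(input.expected) && isNumeric(input.actual)) category = 'reasoning_error'
  else category = 'hallucination' // assigned only when nothing else matched
  return { category, suggestedFix: SUGGESTED_FIX[category] }
}
```

Because hallucination is only reached when every other rule misses, it cannot crowd out the other categories in a mixed set of failures.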

Tests: 48 new tests (27 repository + 21 analyzer), 0 regressions.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
Wave 2 of the EvoSkill Benchmark Harness (Study A):
- Dataset loader: CSV/JSON parsing, seeded train/val/test splits (SMI-3269)
- Skill selector: conditions 1–6, 8–9 factories; condition 7 throws NotImplementedError (SMI-3270)
- Agent runner: Claude API DI interface, exponential backoff [1s,2s,4s], per-task timeout (SMI-3271)
- Evaluator: scorer-based accuracy, token cost aggregation, IR metrics (SMI-3272)
- Harness orchestrator: concurrent conditions per seed, serial seeds (SMI-3273)
- Report generator: markdown tables, JSON export, Pareto frontier (SMI-3274)
- Barrel exports updated for all new modules (SMI-3276)
- 34 new tests: dataset-loader (16), evaluator (9), skill-selector (9)
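
The retry behaviour described for the agent runner (backoff of 1s, 2s, 4s plus a timeout) can be sketched as below. This is not the PR's code: the function names and the options object are hypothetical, and applying the timeout to each attempt rather than to the task as a whole is an assumption.

```typescript
// Delay schedule base * 2^attempt, e.g. [1000, 2000, 4000] for base 1000, 3 retries.
function backoffDelays(baseMs: number, retries: number): number[] {
  return Array.from({ length: retries }, (_, attempt) => baseMs * 2 ** attempt)
}

// Rejects if the task does not settle within timeoutMs.
function withTimeout<T>(task: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs)
    task.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      }
    )
  })
}

interface RetryOptions {
  delaysMs: number[]
  timeoutMs: number
  // Injectable so that tests do not have to wait for real timers.
  sleep?: (ms: number) => Promise<void>
}

// Runs attemptFn once, then once more after each delay, until one attempt succeeds.
async function runWithRetry<T>(attemptFn: () => Promise<T>, options: RetryOptions): Promise<T> {
  const sleep = options.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)))
  let lastError: unknown
  for (let attempt = 0; attempt <= options.delaysMs.length; attempt++) {
    try {
      return await withTimeout(attemptFn(), options.timeoutMs)
    } catch (error) {
      lastError = error
      const delay = options.delaysMs[attempt]
      if (delay === undefined) break // schedule exhausted
      await sleep(delay)
    }
  }
  throw lastError
}
```

A call such as `runWithRetry(callModel, { delaysMs: backoffDelays(1000, 3), timeoutMs: 30000 })` would make up to four attempts, where `callModel` and the 30 second timeout are placeholders.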

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
- CLI command: `skillsmith benchmark evoskill` with options for benchmark,
  condition, seeds, sample fraction, dry-run, model, and output directory
- Register benchmark command group in CLI main entry
- Add all EvoSkill exports to @skillsmith/core index.ts for public API
- Placeholder AgentClient/LlmJudgeClient — real SDK integration in Wave 3

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

╔═══════════════════════════════════════════════════════════════╗
║ SMI-1537: V3 Migration Performance Benchmarks ║
╠═══════════════════════════════════════════════════════════════╣
║ Memory Operations: 40x target ║
║ Embedding Search: 150x target ║
║ Recommendation Pipeline: 4x target ║
╚═══════════════════════════════════════════════════════════════╝

Running 50 iterations with 10 warmup...

--- Memory Operations ---

--- Embedding Search ---
Indexing 10K vectors... done

--- Recommendation Pipeline ---
Initializing recommendation pipeline with 1000 skills... done

═══════════════════════════════════════════════════════════════

V3 Migration Benchmark Report

Date: 2026-03-12T07:09:55.011Z
Node.js: v22.22.1

Results

| Operation | V2 Baseline | V3 Result | Speedup | Target | Status |
|---|---|---|---|---|---|
| Memory Store | 200ms | 0.00ms | 174672x | 40x | ✅ |
| Memory Get | 150ms | 0.00ms | 270758x | 40x | ✅ |
| Memory Delete | 180ms | 0.00ms | 318584x | 40x | ✅ |
| Embedding Search (10K vectors) | 500ms | 0.14ms | 3463x | 150x | ✅ |
| Recommendation Pipeline | 800ms | 0.28ms | 2841x | 4x | ✅ |

Summary

  • Total Benchmarks: 5
  • Passed: 5
  • Failed: 0
  • All Targets Met: ✅ Yes

Notes

  • V2 baselines are from pre-migration measurements (simulated for this benchmark)
  • Target threshold includes 20% tolerance for environmental variance
  • Memory operations use in-memory Map (real V3 uses optimized SQLite)
  • Embedding search simulates HNSW algorithm efficiency (real V3 uses onnxruntime-node)

@github-actions

E2E Test Results

E2E Test Results - March 12, 2026

Summary

  • Status: ✅ PASSED
  • Total Duration: 0.00s
  • Generated: 2026-03-12T07:17:41.777Z

Test Results

| Phase | Status | Duration |
|---|---|---|
| CLI E2E | ⏭️ Skipped | - |
| MCP E2E | ⏭️ Skipped | - |

Generated by skillsmith E2E test suite

wrsmith108 and others added 5 commits March 12, 2026 08:27
Critical:
- C1: Validate JSON dataset fields (type + non-empty) in loadJSONDataset
- C2: Per-benchmark scorer via getScorer() instead of single shared scorer

High:
- H1: Pass seed to loadDataset() inside seed loop for proper split variance
- H2: Add explicit SkillSelectorFn type annotation
- H3: Use fs/promises.readFile + path traversal guard in evolvedSelector
- H4: Add 21 tests for agent-runner (8), harness (4), report (9)
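
H1 matters because a split only varies between seeds if the seed actually reaches the shuffle. A minimal sketch of a seeded train/val/test split follows; the generator, the function names, and the 60/20/20 fractions are assumptions and not the loader's real implementation.

```typescript
// Small linear congruential generator (Numerical Recipes constants), used
// only to make the shuffle reproducible. Not suitable for cryptography.
function createRng(seed: number): () => number {
  let state = seed >>> 0
  return () => {
    state = (Math.imul(1664525, state) + 1013904223) >>> 0
    return state / 4294967296
  }
}

// Fisher-Yates shuffle on a copy, driven by the seeded generator.
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const copy = items.slice()
  const rand = createRng(seed)
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1))
    const tmp = copy[i] as T
    copy[i] = copy[j] as T
    copy[j] = tmp
  }
  return copy
}

interface Splits<T> {
  train: T[]
  val: T[]
  test: T[]
}

// Hypothetical default fractions; the test split receives the remainder.
function splitDataset<T>(items: readonly T[], seed: number, trainFrac = 0.6, valFrac = 0.2): Splits<T> {
  const shuffled = seededShuffle(items, seed)
  const trainEnd = Math.floor(shuffled.length * trainFrac)
  const valEnd = trainEnd + Math.floor(shuffled.length * valFrac)
  return {
    train: shuffled.slice(0, trainEnd),
    val: shuffled.slice(trainEnd, valEnd),
    test: shuffled.slice(valEnd),
  }
}
```

Calling `splitDataset(tasks, seed)` inside the seed loop gives each seed its own reproducible partition, whereas loading once outside the loop would reuse a single partition for every seed.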

Medium:
- M2: Configurable scoreThreshold in EvaluatorConfig (default 0.5)
- M3: Throw on unconfigured conditions 2/5/6/8 instead of silent fallback
- M4: Rename runTask→runEvoSkillTask, runBatch→runEvoSkillBatch
- M5: Replace mutable object accumulator with primitive variables

Low:
- L1: Include typeof in JSON parse error message
- L2: Add datasetDir to HarnessConfig for absolute path resolution
- L3: Add tests for conditions 5, 8, and path traversal guard

100 tests pass across 8 test files. M1 (docker-compose.override.yml)
is worktree-local and will be excluded from PR.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
… — SMI-3266

Add 200 fixture samples (100 OfficeQA via reward.py, 100 DABStep via
dabstep_scorer.py) generated from EvoSkill's Python scorers with real
dataset entries. Cross-validation test asserts ≤5% divergence excluding
documented known divergences (substring matching, list reordering,
fuzzy string similarity).

Fix: scorer normalize() now strips surrounding quotes to match Python
behavior.

Documented divergences:
- Python supports substring matching ("Paris" in "The answer is Paris")
- Python strips parentheticals for text comparison
- Python dabstep_scorer supports list reordering (semicolon/comma)
- Python dabstep_scorer uses SequenceMatcher (>0.95 similarity)

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
…MI-3297

Wave 2 of Study B (task-accuracy evaluator):
- SkillVariantGenerator: 4 strategies (decompose, augment, specialize, LLM rewrite) with SHA-256 dedup
- VariantSelector: Pareto frontier selection (accuracy vs cost, tiebreak on skillSize)
- 25 tests covering all strategies, dedup, edge cases, and dominance logic
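
Pareto selection over accuracy and cost can be sketched as follows. The `Variant` shape and the exact tiebreak rule (on an exact accuracy/cost tie, the smaller skill dominates) are assumptions based on the one-line description above, not the VariantSelector source.

```typescript
// Hypothetical variant record; higher accuracy and lower cost are better.
interface Variant {
  id: string
  accuracy: number
  cost: number
  skillSize: number
}

// a dominates b if it is no worse on both objectives and strictly better on
// at least one; exact ties on both objectives fall back to skillSize.
function dominates(a: Variant, b: Variant): boolean {
  const noWorse = a.accuracy >= b.accuracy && a.cost <= b.cost
  if (!noWorse) return false
  const strictlyBetter = a.accuracy > b.accuracy || a.cost < b.cost
  return strictlyBetter || a.skillSize < b.skillSize
}

// Keeps every variant that no other variant dominates (O(n^2) comparison).
function paretoFrontier(variants: Variant[]): Variant[] {
  return variants.filter(
    (candidate) => !variants.some((other) => other !== candidate && dominates(other, candidate))
  )
}
```

The quadratic scan is fine for the handful of variants generated per skill; a sort-based sweep would be worth considering only for much larger candidate sets.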

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
…00, SMI-3301, SMI-3302

Wave 3 of Study B (task-accuracy evaluator):
- IterativeEvaluator: evaluate → analyze → generate → select loop with early stopping and cost guard
- Condition 7 (Skillsmith-Iterative) integrated into skill-selector.ts
- 7 tests covering baseline seeding, early stopping, budget enforcement, convergence tracking
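
The loop shape described above (seed with a baseline, iterate, stop early on stagnation, stop when the budget is spent) can be sketched generically. Everything here is an assumption for illustration: the names, the synchronous signatures, and the merging of the analyze and generate steps into a single `propose` callback.

```typescript
interface Evaluation {
  accuracy: number
  cost: number
}

interface LoopDeps {
  evaluate: (skill: string) => Evaluation
  // Stands in for analyze + generate: proposes variants of the current best skill.
  propose: (skill: string, evaluation: Evaluation) => string[]
}

interface LoopOptions {
  maxIterations: number
  patience: number // iterations without improvement before stopping early
  budget: number // cost guard, in the same unit as Evaluation.cost
}

type StopReason = 'budget' | 'early_stop' | 'max_iterations'

interface LoopResult {
  bestSkill: string
  bestAccuracy: number
  totalCost: number
  iterations: number
  stoppedBy: StopReason
}

function runIterativeLoop(baseline: string, deps: LoopDeps, options: LoopOptions): LoopResult {
  let bestSkill = baseline
  let bestEval = deps.evaluate(baseline) // the baseline seeds the loop
  let totalCost = bestEval.cost
  let stale = 0
  let iterations = 0
  let stoppedBy: StopReason = 'max_iterations'

  while (iterations < options.maxIterations) {
    if (totalCost >= options.budget) {
      stoppedBy = 'budget'
      break
    }
    iterations++
    let improved = false
    for (const candidate of deps.propose(bestSkill, bestEval)) {
      if (totalCost >= options.budget) break
      const evaluation = deps.evaluate(candidate)
      totalCost += evaluation.cost
      // Select: keep the candidate only if it beats the current best.
      if (evaluation.accuracy > bestEval.accuracy) {
        bestSkill = candidate
        bestEval = evaluation
        improved = true
      }
    }
    stale = improved ? 0 : stale + 1
    if (stale >= options.patience) {
      stoppedBy = 'early_stop'
      break
    }
  }
  return { bestSkill, bestAccuracy: bestEval.accuracy, totalCost, iterations, stoppedBy }
}
```

Returning the stop reason alongside the result makes it easy to test early stopping and budget enforcement separately.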

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
Comment on lines +23 to +26
return s
  .trim()
  .toLowerCase()
  .replace(/^["']+|["']+$/g, '') // strip surrounding quotes

Check failure

Code scanning / CodeQL

Polynomial regular expression used on uncontrolled data High test

This regular expression that depends on library input may run slow on strings with many repetitions of '"'.
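
One possible way to address this finding is to strip the quotes with two index scans instead of a regular expression, which keeps the work linear in the input length. This is a sketch only: `stripSurroundingQuotes` is a hypothetical helper, and `normalize` here covers just the three steps visible in the excerpt, not whatever the real function does afterwards.

```typescript
function isQuote(ch: string): boolean {
  return ch === '"' || ch === "'"
}

// Removes leading and trailing runs of quote characters in a single pass
// from each end, so there is no backtracking.
function stripSurroundingQuotes(s: string): string {
  let start = 0
  let end = s.length
  while (start < end && isQuote(s.charAt(start))) start++
  while (end > start && isQuote(s.charAt(end - 1))) end--
  return s.slice(start, end)
}

// Same order of operations as the excerpt: trim, lowercase, strip quotes.
function normalize(s: string): string {
  return stripSurroundingQuotes(s.trim().toLowerCase())
}
```

For example, `normalize('  "Paris"  ')` yields `paris`, and quotes in the middle of a string are left untouched.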
@github-actions

Performance Benchmark Results

⚠️ Benchmark results not available

@github-actions

E2E Test Results

| Phase | Status |
|---|---|
| CLI E2E | ❌ failure |
| MCP E2E | ❌ skipped |

Export FailureAnalyzer, SkillVariantGenerator, VariantSelector,
IterativeEvaluator and all evaluation types from @skillsmith/core.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>

// Replace existing section if present
const sectionRegex = /\n## Skill Improvement Notes\n[\s\S]*?(?=\n## |\n*$)/
if (sectionRegex.test(content)) {

Check failure

Code scanning / CodeQL

Polynomial regular expression used on uncontrolled data High

This regular expression that depends on library input may run slow on strings starting with '\n## Skill Improvement Notes\n' and with many repetitions of '\n## Skill Improvement Notes\n'.
// Replace existing section if present
const sectionRegex = /\n## Skill Improvement Notes\n[\s\S]*?(?=\n## |\n*$)/
if (sectionRegex.test(content)) {
return content.replace(sectionRegex, section).trim()

Check failure

Code scanning / CodeQL

Polynomial regular expression used on uncontrolled data High

This regular expression that depends on library input may run slow on strings starting with '\n## Skill Improvement Notes\n' and with many repetitions of '\n## Skill Improvement Notes\n'.
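
A regex-free alternative for this replacement is to locate the section with `indexOf`, which scans the input at most twice. The sketch below assumes that `section` already begins with the heading line and that the caller handles the case where the section is absent; neither is confirmed by the excerpt, and `replaceNotesSection` is a hypothetical name.

```typescript
const NOTES_HEADING = '\n## Skill Improvement Notes\n'

// Replaces the existing notes section (heading through the line before the
// next "## " heading, or through the end of the file) with `section`.
// Returns null when no such section exists.
function replaceNotesSection(content: string, section: string): string | null {
  const start = content.indexOf(NOTES_HEADING)
  if (start === -1) return null
  const bodyStart = start + NOTES_HEADING.length
  const next = content.indexOf('\n## ', bodyStart)
  const end = next === -1 ? content.length : next
  return (content.slice(0, start) + section + content.slice(end)).trim()
}
```

Unlike `String.prototype.replace` with a string replacement, plain concatenation does not interpret `$` sequences in `section`, so notes containing text such as `$1` are inserted literally.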
@github-actions

Performance Benchmark Results

⚠️ Benchmark results not available

@github-actions

E2E Test Results

| Phase | Status |
|---|---|
| CLI E2E | ❌ failure |
| MCP E2E | ❌ skipped |
