
Add EvoSkill benchmark evaluation research plan#307

Open
wrsmith108 wants to merge 6 commits into main from claude/skillsmith-benchmark-research-0EKAX

Conversation

@wrsmith108
Member

Summary

Add a comprehensive research plan for evaluating Skillsmith against the EvoSkill benchmark framework. This document outlines the methodology, experimental design, and implementation roadmap for comparing Skillsmith's curated skill discovery approach against EvoSkill's evolutionary skill generation on three established benchmarks (OfficeQA, SEAL-QA, BrowseComp).

Ticket

[To be assigned]

Changes

  • Add .claude/development/evoskill-benchmark-plan.md with complete research plan including:
    • Executive summary and EvoSkill methodology overview
    • Mapping of EvoSkill concepts to Skillsmith equivalents
    • Experimental design with 6 test conditions (baseline, EvoSkill-evolved, Skillsmith-search, Skillsmith-recommend, Skillsmith-curated, hybrid)
    • 5-phase implementation plan (environment setup, harness development, coverage audit, execution, analysis)
    • Technical integration points and required Skillsmith APIs
    • Codebase audit identifying 4 infrastructure gaps (offline evaluation dataset, IR metrics, embedding fallback behavior, existing mock benchmarks)
    • Risk mitigation strategies and success criteria
    • Open questions for stakeholder review

Context

This plan establishes the foundation for a rigorous comparative evaluation of Skillsmith's skill discovery capabilities. The research addresses a critical question: does curated skill discovery (Skillsmith's registry + semantic search) match or exceed evolutionary skill discovery (EvoSkill's automated generation) for downstream task accuracy?

The document identifies that Skillsmith and EvoSkill occupy the same problem space but use fundamentally different approaches:

  • EvoSkill: Generates new skills through iterative failure analysis and evolutionary optimization
  • Skillsmith: Retrieves existing skills from a curated registry via semantic search and recommendation

The plan is structured to enable fair comparison by:

  1. Using identical benchmark datasets and evaluation splits as EvoSkill
  2. Testing multiple discovery strategies (search, recommend, curated)
  3. Measuring both accuracy and cost (API calls, tokens, wall-clock time)
  4. Assessing zero-shot transfer capabilities
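As a rough illustration of steps 3 and 4, the per-condition measurements could be captured in a record like the following. This is a minimal sketch only; the `BenchmarkRun` type, its field names, and the condition strings are hypothetical and not part of the plan or the Skillsmith codebase:

```python
from dataclasses import dataclass

# Hypothetical per-(condition, benchmark) result record for the harness.
@dataclass
class BenchmarkRun:
    condition: str        # e.g. "baseline", "skillsmith-search", "hybrid"
    benchmark: str        # "OfficeQA", "SEAL-QA", or "BrowseComp"
    correct: int = 0
    total: int = 0
    api_calls: int = 0
    tokens_used: int = 0
    wall_clock_s: float = 0.0

    @property
    def accuracy(self) -> float:
        # Task accuracy: fraction of benchmark items answered correctly.
        return self.correct / self.total if self.total else 0.0

    @property
    def tokens_per_correct(self) -> float:
        # Cost-normalized view for comparing conditions at equal accuracy.
        return self.tokens_used / self.correct if self.correct else float("inf")

run = BenchmarkRun("skillsmith-search", "OfficeQA",
                   correct=42, total=60, api_calls=120,
                   tokens_used=350_000, wall_clock_s=910.0)
print(f"{run.accuracy:.2%}")  # 70.00%
```

Keeping accuracy and cost in one record makes it straightforward to report cost-normalized comparisons (e.g. tokens per correct answer) across the six conditions rather than accuracy alone.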

Checklist

Code Quality

  • No code changes (documentation only)

Documentation

  • Comprehensive research plan with clear methodology
  • Identified infrastructure gaps and remediation strategies
  • Success criteria and deliverables defined
  • Open questions for review documented

Testing

No testing needed — this is a research planning document. Implementation will follow in subsequent PRs with corresponding test coverage.

Notes for Reviewers

This plan is marked as "Draft — pending review" and includes an "Open Questions for Review" section (§11) with three key decisions:

  1. Registry seeding strategy: Should benchmark-domain skills be added to the registry beforehand, or test against the current registry as-is?
  2. Hybrid condition scope: Is the EvoSkill-seeded-with-Skillsmith condition worth the compute cost?
  3. Model selection: Should we benchmark both Opus and Sonnet, or standardize on one?

The codebase audit (§7) surfaces four infrastructure gaps that will need to be addressed during Phase 2 implementation. These are not blockers but define the scope of work required to build a production-quality benchmark harness.

https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq

claude and others added 3 commits March 12, 2026 04:45
Analyzes EvoSkill (arxiv 2603.02766) methodology for automated skill
discovery and designs an apples-to-apples comparison against Skillsmith's
registry-based search and recommendation pipeline using the same datasets
(OfficeQA, SEAL-QA, BrowseComp) and scoring methods.

https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq
Codebase exploration identified 4 gaps that affect benchmark design:
no offline evaluation dataset, no IR metrics (nDCG/MRR/MAP),
SkillMatcher always uses mock embeddings in offline path, and
existing benchmarks measure latency not quality. Added Section 7
documenting these with remediation steps.
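The missing IR metrics named above are standard and compact to implement. A minimal sketch of MRR and binary-relevance nDCG@k, illustrative only and not taken from the Skillsmith codebase:

```python
import math

def mrr(ranked_relevance: list[list[int]]) -> float:
    """Mean Reciprocal Rank over queries; each inner list holds the 0/1
    relevance of one query's results in ranked order."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank  # credit the first relevant hit
                break
    return total / len(ranked_relevance)

def ndcg_at_k(rels: list[int], k: int) -> float:
    """nDCG@k with binary relevance for a single query's ranked results."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)  # best possible ordering
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))  # 0.75
```

With these in place, the harness can score retrieval quality directly instead of relying on the latency-oriented benchmarks noted in the audit.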

https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq
Relocate from .claude/development/ to docs/internal/research/ where
research documents belong.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

E2E Test Results

E2E Test Results - March 12, 2026

Summary

  • Status: ✅ PASSED
  • Total Duration: 0.00s
  • Generated: 2026-03-12T05:19:30.383Z

Test Results

Phase Status Duration
CLI E2E ⏭️ Skipped -
MCP E2E ⏭️ Skipped -

Generated by skillsmith E2E test suite

wrsmith108 and others added 3 commits March 11, 2026 22:35
…ion research

Update submodule to include two new research documents:
- Skillsmith generative capabilities vs EvoSkill comparison
- Task-accuracy evaluation design (Study A harness + Study B iterative evaluator)

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
Update submodule: registry seeding (as-is first), hybrid (include),
model choice (Sonnet primary + Opus ablation), publication (arXiv + blog).

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
Three wave-based implementation plans with cross-links to research:
- Benchmark harness (Study A): 4 waves, 12-18 days
- Task-accuracy evaluator (Study B): 4 waves, 11-17 days
- Paper & publication: 4 waves, 7-10 days
Total: 22-30 days with parallelization, $250-500 budget

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
