
Add EvoSkill benchmark evaluation research plan#307

Open
wrsmith108 wants to merge 6 commits into main from claude/skillsmith-benchmark-research-0EKAX

Conversation

@wrsmith108
Member

Summary

Add a comprehensive research plan for evaluating Skillsmith against the EvoSkill benchmark framework. This document outlines the methodology, experimental design, and implementation roadmap for comparing Skillsmith's curated skill discovery approach against EvoSkill's evolutionary skill generation on three established benchmarks (OfficeQA, SEAL-QA, BrowseComp).

Ticket

[To be assigned]

Changes

  • Add .claude/development/evoskill-benchmark-plan.md with complete research plan including:
    • Executive summary and EvoSkill methodology overview
    • Mapping of EvoSkill concepts to Skillsmith equivalents
    • Experimental design with 6 test conditions (baseline, EvoSkill-evolved, Skillsmith-search, Skillsmith-recommend, Skillsmith-curated, hybrid)
    • 5-phase implementation plan (environment setup, harness development, coverage audit, execution, analysis)
    • Technical integration points and required Skillsmith APIs
    • Codebase audit identifying 4 infrastructure gaps (offline evaluation dataset, IR metrics, embedding fallback behavior, existing mock benchmarks)
    • Risk mitigation strategies and success criteria
    • Open questions for stakeholder review

Context

This plan establishes the foundation for a rigorous comparative evaluation of Skillsmith's skill discovery capabilities. The research addresses a critical question: does curated skill discovery (Skillsmith's registry + semantic search) match or exceed evolutionary skill discovery (EvoSkill's automated generation) for downstream task accuracy?

The document identifies that Skillsmith and EvoSkill occupy the same problem space but use fundamentally different approaches:

  • EvoSkill: Generates new skills through iterative failure analysis and evolutionary optimization
  • Skillsmith: Retrieves existing skills from a curated registry via semantic search and recommendation

The plan is structured to enable fair comparison by:

  1. Using identical benchmark datasets and evaluation splits as EvoSkill
  2. Testing multiple discovery strategies (search, recommend, curated)
  3. Measuring both accuracy and cost (API calls, tokens, wall-clock time)
  4. Assessing zero-shot transfer capabilities
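As a rough illustration of steps 3 and 4, the per-condition measurements could be captured in a record like the following. This is a minimal sketch only; the `BenchmarkRun` type, its field names, and the condition strings are hypothetical and not part of the plan or the Skillsmith codebase:

```python
from dataclasses import dataclass

# Hypothetical per-(condition, benchmark) result record for the harness.
@dataclass
class BenchmarkRun:
    condition: str        # e.g. "baseline", "skillsmith-search", "hybrid"
    benchmark: str        # "OfficeQA", "SEAL-QA", or "BrowseComp"
    correct: int = 0
    total: int = 0
    api_calls: int = 0
    tokens_used: int = 0
    wall_clock_s: float = 0.0

    @property
    def accuracy(self) -> float:
        # Task accuracy: fraction of benchmark items answered correctly.
        return self.correct / self.total if self.total else 0.0

    @property
    def tokens_per_correct(self) -> float:
        # Cost-normalized view for comparing conditions at equal accuracy.
        return self.tokens_used / self.correct if self.correct else float("inf")

run = BenchmarkRun("skillsmith-search", "OfficeQA",
                   correct=42, total=60, api_calls=120,
                   tokens_used=350_000, wall_clock_s=910.0)
print(f"{run.accuracy:.2%}")  # 70.00%
```

Keeping accuracy and cost in one record makes it straightforward to report cost-normalized comparisons (e.g. tokens per correct answer) across the six conditions rather than accuracy alone.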

Checklist

Code Quality

  • No code changes (documentation only)

Documentation

  • Comprehensive research plan with clear methodology
  • Identified infrastructure gaps and remediation strategies
  • Success criteria and deliverables defined
  • Open questions for review documented

Testing

No testing needed — this is a research planning document. Implementation will follow in subsequent PRs with corresponding test coverage.

Notes for Reviewers

This plan is marked as "Draft — pending review" and includes an "Open Questions for Review" section (§11) with three key decisions:

  1. Registry seeding strategy: Should benchmark-domain skills be added to the registry beforehand, or test against the current registry as-is?
  2. Hybrid condition scope: Is the EvoSkill-seeded-with-Skillsmith condition worth the compute cost?
  3. Model selection: Should we benchmark both Opus and Sonnet, or standardize on one?

The codebase audit (§7) surfaces four infrastructure gaps that will need to be addressed during Phase 2 implementation. These are not blockers but define the scope of work required to build a production-quality benchmark harness.

https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq

claude and others added 3 commits March 12, 2026 04:45
Analyzes EvoSkill (arxiv 2603.02766) methodology for automated skill
discovery and designs an apples-to-apples comparison against Skillsmith's
registry-based search and recommendation pipeline using the same datasets
(OfficeQA, SEAL-QA, BrowseComp) and scoring methods.

https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq
Codebase exploration identified 4 gaps that affect benchmark design:
no offline evaluation dataset, no IR metrics (nDCG/MRR/MAP),
SkillMatcher always uses mock embeddings in offline path, and
existing benchmarks measure latency not quality. Added Section 7
documenting these with remediation steps.
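The missing IR metrics named above are standard and compact to implement. A minimal sketch of MRR and binary-relevance nDCG@k, illustrative only and not taken from the Skillsmith codebase:

```python
import math

def mrr(ranked_relevance: list[list[int]]) -> float:
    """Mean Reciprocal Rank over queries; each inner list holds the 0/1
    relevance of one query's results in ranked order."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank  # credit the first relevant hit
                break
    return total / len(ranked_relevance)

def ndcg_at_k(rels: list[int], k: int) -> float:
    """nDCG@k with binary relevance for a single query's ranked results."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)  # best possible ordering
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))  # 0.75
```

With these in place, the harness can score retrieval quality directly instead of relying on the latency-oriented benchmarks noted in the audit.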

https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq
Relocate from .claude/development/ to docs/internal/research/ where
research documents belong.

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

E2E Test Results

E2E Test Results - March 12, 2026

Summary

  • Status: ✅ PASSED
  • Total Duration: 0.00s
  • Generated: 2026-03-12T05:19:30.383Z

Test Results

Phase Status Duration
CLI E2E ⏭️ Skipped -
MCP E2E ⏭️ Skipped -

Generated by skillsmith E2E test suite

wrsmith108 and others added 3 commits March 11, 2026 22:35
…ion research

Update submodule to include two new research documents:
- Skillsmith generative capabilities vs EvoSkill comparison
- Task-accuracy evaluation design (Study A harness + Study B iterative evaluator)

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
Update submodule: registry seeding (as-is first), hybrid (include),
model choice (Sonnet primary + Opus ablation), publication (arXiv + blog).

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
Three wave-based implementation plans with cross-links to research:
- Benchmark harness (Study A): 4 waves, 12-18 days
- Task-accuracy evaluator (Study B): 4 waves, 11-17 days
- Paper & publication: 4 waves, 7-10 days
Total: 22-30 days with parallelization, $250-500 budget

Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: Claude <noreply@anthropic.com>
