Add EvoSkill benchmark evaluation research plan #307
Open
wrsmith108 wants to merge 6 commits into main from
Conversation
Analyzes EvoSkill (arxiv 2603.02766) methodology for automated skill discovery and designs an apples-to-apples comparison against Skillsmith's registry-based search and recommendation pipeline using the same datasets (OfficeQA, SEAL-QA, BrowseComp) and scoring methods. https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq
Codebase exploration identified 4 gaps that affect benchmark design: no offline evaluation dataset, no IR metrics (nDCG/MRR/MAP), SkillMatcher always uses mock embeddings in offline path, and existing benchmarks measure latency not quality. Added Section 7 documenting these with remediation steps. https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq
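The missing IR metrics (nDCG, MRR, MAP) can be computed with no external dependencies. A minimal sketch, assuming binary relevance labels; function names and signatures are illustrative, not Skillsmith's API:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: position i is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_ids, relevant_ids, k):
    # Normalize DCG of the observed ranking by the DCG of an ideal ranking.
    rels = [1.0 if d in relevant_ids else 0.0 for d in ranked_ids[:k]]
    n_rel = min(len(relevant_ids), k)
    return dcg(rels) / dcg([1.0] * n_rel) if n_rel else 0.0

def mrr(ranked_ids, relevant_ids):
    # Reciprocal rank of the first relevant result.
    for i, d in enumerate(ranked_ids):
        if d in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    # Mean of precision values at each rank where a relevant item appears.
    hits, score = 0, 0.0
    for i, d in enumerate(ranked_ids):
        if d in relevant_ids:
            hits += 1
            score += hits / (i + 1)
    return score / len(relevant_ids) if relevant_ids else 0.0
```

For example, with `ranked_ids = ["a", "b", "c", "d"]` and `relevant_ids = {"b", "d"}`, MRR is 0.5 and average precision is 0.5.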
Relocate from .claude/development/ to docs/internal/research/ where research documents belong. Co-Authored-By: claude-flow <ruv@ruv.net> Co-Authored-By: Claude <noreply@anthropic.com>
E2E Test Results - March 12, 2026
Generated by skillsmith E2E test suite
…ion research

Update submodule to include two new research documents:
- Skillsmith generative capabilities vs EvoSkill comparison
- Task-accuracy evaluation design (Study A harness + Study B iterative evaluator)

Co-Authored-By: claude-flow <ruv@ruv.net> Co-Authored-By: Claude <noreply@anthropic.com>
Update submodule: registry seeding (as-is first), hybrid (include), model choice (Sonnet primary + Opus ablation), publication (arXiv + blog). Co-Authored-By: claude-flow <ruv@ruv.net> Co-Authored-By: Claude <noreply@anthropic.com>
Three wave-based implementation plans with cross-links to research:
- Benchmark harness (Study A): 4 waves, 12-18 days
- Task-accuracy evaluator (Study B): 4 waves, 11-17 days
- Paper & publication: 4 waves, 7-10 days

Total: 22-30 days with parallelization, $250-500 budget

Co-Authored-By: claude-flow <ruv@ruv.net> Co-Authored-By: Claude <noreply@anthropic.com>
Summary
Add a comprehensive research plan for evaluating Skillsmith against the EvoSkill benchmark framework. This document outlines the methodology, experimental design, and implementation roadmap for comparing Skillsmith's curated skill discovery approach against EvoSkill's evolutionary skill generation on three established benchmarks (OfficeQA, SEAL-QA, BrowseComp).
Ticket
[To be assigned]
Changes
Added `.claude/development/evoskill-benchmark-plan.md` with complete research plan including:

Context
This plan establishes the foundation for a rigorous comparative evaluation of Skillsmith's skill discovery capabilities. The research addresses a critical question: does curated skill discovery (Skillsmith's registry + semantic search) match or exceed evolutionary skill discovery (EvoSkill's automated generation) for downstream task accuracy?
The document identifies that Skillsmith and EvoSkill occupy the same problem space but use fundamentally different approaches:
The plan is structured to enable fair comparison by:
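One way the "same datasets, same scoring" constraint above could be operationalized is a harness that feeds identical questions to both systems and grades them with a single shared scorer. A hedged sketch; the dataset names come from the plan, but every function and type here is hypothetical, not Skillsmith's or EvoSkill's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class BenchmarkResult:
    system: str
    dataset: str
    accuracy: float

def run_comparison(
    datasets: Dict[str, List[Tuple[str, str]]],   # name -> [(question, gold)]
    systems: Dict[str, Callable[[str], str]],     # name -> answer function
    score: Callable[[str, str], bool],            # shared scorer for all systems
) -> List[BenchmarkResult]:
    # Every system sees the identical question set and is graded by the
    # same score function, which is what keeps the comparison
    # apples-to-apples across discovery approaches.
    results = []
    for ds_name, examples in datasets.items():
        for sys_name, answer_fn in systems.items():
            correct = sum(score(answer_fn(q), gold) for q, gold in examples)
            results.append(BenchmarkResult(sys_name, ds_name, correct / len(examples)))
    return results

# Usage with mock systems on a toy dataset (illustrative only):
datasets = {"OfficeQA": [("q1", "a1"), ("q2", "a2")]}
systems = {
    "skillsmith": lambda q: "a1",                       # always answers "a1"
    "evoskill": lambda q: {"q1": "a1", "q2": "a2"}[q],  # answers correctly
}
results = run_comparison(datasets, systems, lambda pred, gold: pred == gold)
```

Exact-match scoring is the simplest shared scorer; the plan's per-dataset scoring methods would slot in through the same `score` parameter.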
Checklist
Code Quality
Documentation
Testing
No testing needed — this is a research planning document. Implementation will follow in subsequent PRs with corresponding test coverage.
Notes for Reviewers
This plan is marked as "Draft — pending review" and includes an "Open Questions for Review" section (§11) with three key decisions:
The codebase audit (§7) surfaces four infrastructure gaps that will need to be addressed during Phase 2 implementation. These are not blockers but define the scope of work required to build a production-quality benchmark harness.
https://claude.ai/code/session_01X9mTCSoACzJNWQPv2HdeGq