Add nabledge-6-test skill and execute first test scenario #17
- Added nabledge-6-test skill for automated scenario testing
- Configured skill permissions in .claude/settings.json
- Executed test scenario handlers-001 (DataReadHandler file reading)
- Generated evaluation report with 5/6 criteria passed (83%)
- Identified token optimization as the primary improvement area

Test Results:
- Workflow execution: PASS
- Keyword matching: PASS (80%)
- Section relevance: PASS
- Knowledge file only: PASS
- Token efficiency: FAIL (20k vs. 15k target)
- Tool call efficiency: PASS (9 calls)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
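A minimal sketch of what the .claude/settings.json permission entry could have looked like at this stage (the permissions.allow shape follows Claude Code's settings format; the Skill(nabledge-6-test) rule string is the one the later rename commit removes):

```json
{
  "permissions": {
    "allow": [
      "Skill(nabledge-6-test)"
    ]
  }
}
```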
Replace nabledge-6-test with nabledge-test, a unified test framework for both nabledge-6 and nabledge-5, powered by skill-creator's evaluation engine.

## Architecture Changes

**Before**:
- nabledge-6-test: Monolithic test framework
- Manual workflow execution and evaluation
- Custom reporting logic

**After**:
- nabledge-test: Interface layer (Nablarch-specific scenarios)
- skill-creator: Evaluation engine (executor/grader/analyzer agents)
- Clear separation of concerns

## Key Changes

1. **Added skill-creator** (Claude.ai environment version)
   - Full evaluation framework with executor/grader/analyzer agents
   - Eval Mode and Benchmark Mode support
   - Structured outputs (transcript, metrics, grading)
2. **Renamed nabledge-6-test → nabledge-test**
   - Unified interface for nabledge-6 and nabledge-5
   - Version-specific scenarios: scenarios/nabledge-6/ and scenarios/nabledge-5/
   - Simplified command: nabledge-test 6 handlers-001
3. **New workflow**
   - Load scenario from scenarios.json
   - Convert to skill-creator eval format
   - Execute via skill-creator (executor + grader)
   - Generate nabledge-format report
4. **Added convert_scenario.py**
   - Converts scenarios.json entries to eval expectations
   - Maps keywords/sections to assertions
5. **Updated settings.json**
   - Removed: Skill(nabledge-6-test)
   - Added: Skill(nabledge-test), Skill(skill-creator)

## Benefits

- ✅ Avoids reinventing the wheel (uses Anthropic's eval framework)
- ✅ Leverages skill-creator's proven evaluation engine
- ✅ Unified testing for nabledge-6 and nabledge-5
- ✅ Clear architectural layers (interface vs. engine)
- ✅ Future-proof (skill-creator updates automatically benefit us)

## Migration Notes

- 30 existing scenarios preserved in scenarios/nabledge-6/
- Report format unchanged (compatibility maintained)
- Old nabledge-6-test invocations need updating

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
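The PR doesn't quote the script body; a minimal sketch of the keyword/section mapping that convert_scenario.py performs, assuming scenarios.json holds a list of objects shaped like the example later in this thread (the function name and exact expectation phrasing are illustrative, with the "Response includes" wording borrowed from a later commit):

```python
import json
import sys

def to_expectations(scenario: dict) -> list[str]:
    """Map a scenario's keywords/sections to expectation strings
    (phrasing is illustrative, not taken from the real script)."""
    expectations = [f'Response includes "{kw}"'
                    for kw in scenario.get("keywords", [])]
    expectations += [f'Response cites the "{sec}" section of the knowledge file'
                     for sec in scenario.get("sections", [])]
    return expectations

if __name__ == "__main__":
    # Usage: python convert_scenario.py scenarios.json > evals.json
    with open(sys.argv[1], encoding="utf-8") as f:
        scenarios = json.load(f)
    evals = [{"prompt": s["question"], "expectations": to_expectations(s)}
             for s in scenarios]
    json.dump(evals, sys.stdout, ensure_ascii=False, indent=2)
```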
skill-creator is now included in the repository (.claude/skills/skill-creator/) rather than being installed from an external marketplace, so the extraKnownMarketplaces configuration is no longer needed.
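The dropped configuration isn't quoted in the commit; it was presumably an entry along these lines, where the marketplace name and repo are hypothetical placeholders and the nested source shape is an assumption about Claude Code's marketplace settings:

```json
{
  "extraKnownMarketplaces": {
    "skills-marketplace": {
      "source": { "source": "github", "repo": "<owner>/<marketplace-repo>" }
    }
  }
}
```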
Major simplification of nabledge-test by using skill-creator's eval format directly instead of a custom scenario format.

## Changes

1. **Converted scenarios to eval format**
   - scenarios.json now uses skill-creator's native format
   - No conversion needed: direct prompt + expectations
2. **Removed convert_scenario.py**
   - No longer needed with the eval format
   - One less layer of complexity
3. **Simplified SKILL.md** (115 lines, down from ~300)
   - Thin wrapper description
   - Load scenario → invoke skill-creator → save results
   - No complex workflow documentation

## Before (complex)

scenarios.json (custom format) → convert_scenario.py → eval format → skill-creator

## After (simple)

scenarios.json (eval format) → skill-creator

## Benefits

- ✅ No format conversion
- ✅ Direct skill-creator integration
- ✅ Simpler to maintain
- ✅ Easier to add new scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
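The commit describes the native eval format only as "direct prompt + expectations"; a hedged example of what one entry presumably looks like (the field names are inferred from that description, and the prompt, taken from the example later in the thread, asks how to read a file with DataReadHandler):

```json
{
  "prompt": "データリードハンドラでファイルを読み込むには?",
  "expectations": [
    "Response includes \"DataReadHandler\"",
    "Response includes \"DataReader\""
  ]
}
```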
Deleted 7 unnecessary files to keep nabledge-test minimal:
Removed:
- *.backup (2 files) - Backup files no longer needed
- INSTALL-SKILL-CREATOR.md - skill-creator is in repo, not external
- README.md - SKILL.md is sufficient
- templates/ (3 files) - skill-creator generates reports directly
- scripts/ (empty directory)
- workflows/ (empty directory)
Final structure (minimal):
nabledge-test/
├── SKILL.md (115 lines)
└── scenarios/
└── nabledge-6/scenarios.json (30 scenarios in eval format)
Benefits:
- Cleaner structure
- Less maintenance burden
- Obvious what's essential
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Applied prompt engineering improvements to SKILL.md.

Changes:
- Removed redundant explanations
- Condensed step descriptions
- Removed the example markdown template (obvious from the description)
- Tightened language throughout

Before: 115 lines (verbose)
After: 72 lines (concise)

Result: more focused, easier to understand.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Selected 5 representative scenarios covering core Nablarch 6 functionality:

1. handlers-001: DataReadHandler file reading (batch basics)
2. libraries-001: UniversalDao paging (database access)
3. tools-001: NTF test data preparation (testing)
4. processing-001: Nablarch batch architecture (architecture)
5. code-analysis-001: Existing code analysis (code understanding)

Removed 25 scenarios that were:
- Redundant (multiple similar tests per category)
- Less critical for initial validation
- Easy to add back if needed

Benefits:
- Faster test execution
- Focus on essentials
- Easier maintenance

File size: 432 lines → 80 lines (-81%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Select the 5 most frequently asked scenarios based on practical usage analysis:

1. processing-005: Batch startup (beginners' first hurdle)
2. libraries-001: UniversalDao paging (highest implementation frequency)
3. handlers-001: DataReadHandler file reading (batch basics)
4. processing-004: Error handling (implementation essential)
5. processing-002: BatchAction implementation (business logic)

Reduced from 30 scenarios to focus on the most valuable test cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change from verbose expectations to simple keywords+sections:
- Before: Full expectation strings per scenario
- After: Simple keywords array + sections array
- nabledge-test converts to expectations at runtime
Benefits:
- Easier scenario authoring (just list keywords/sections)
- No repetitive "Response includes" strings
- Default metrics auto-added (tokens 5000-15000, tools 10-20)
Example (the question asks: "How do I read a file with DataReadHandler?"):
{
"question": "データリードハンドラでファイルを読み込むには?",
"keywords": ["DataReadHandler", "DataReader"],
"sections": ["overview", "usage"]
}
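At runtime, nabledge-test presumably expands this to full expectations with the default metrics attached, something like the following (the expectation phrasing and the metrics field shape are assumptions; the ranges are the defaults quoted above):

```json
{
  "prompt": "データリードハンドラでファイルを読み込むには?",
  "expectations": [
    "Response includes \"DataReadHandler\"",
    "Response includes \"DataReader\"",
    "Response cites the \"overview\" section",
    "Response cites the \"usage\" section"
  ],
  "metrics": { "tokens": [5000, 15000], "tool_calls": [10, 20] }
}
```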
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change from delegating to skill-creator to manually executing its eval procedures.

Before:
- Call skill-creator as a skill (doesn't work: no results returned)

After:
- Read the skill-creator documentation (eval-mode.md, executor.md, grader.md)
- Step 6: Execute executor.md procedures manually
  - Run nabledge-<version> with the Skill tool
  - Record a transcript with tool calls, steps, and the response
  - Write metrics.json
  - Write timing.json
- Step 7: Execute grader.md procedures manually
  - Read the transcript
  - Evaluate expectations against the transcript
  - Write grading.json
- Steps 8-9: Generate a report to work/ and display a summary

Key insight: the Skill tool loads instructions; it does not auto-execute them. nabledge-test must follow skill-creator's procedures manually.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
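The commit names the output files but not their schemas. Under that workflow, metrics.json might record the run-level numbers the later test commits report (all field names here are assumptions):

```json
{ "tokens": 20000, "tool_calls": 18, "duration_seconds": 108 }
```

and grading.json the per-expectation verdicts:

```json
{
  "passed": 5,
  "total": 8,
  "expectations": [
    { "text": "Response includes \"DataReadHandler\"", "pass": true }
  ]
}
```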
Change nabledge-test to execute nabledge-6 inline instead of using the Skill tool. This prevents execution from stopping between the Executor and Grader steps.

Changes:
- Remove the Skill() call in Step 6; execute workflows directly
- Add an explicit continuity directive between steps
- Update the transcript template to reflect inline execution
- Add test results: processing-005 (4/6 pass, 66.7%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test: バッチの起動方法を教えてください ("How do I start a batch?")

Result: 5/8 expectations passed (62.5%)
Duration: 108s (executor: 75s, grader: 33s)
Tool calls: 18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workspace files should be stored in work/YYYYMMDD/nabledge-test/ instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update workspace location from nabledge-test-workspace/ to work/YYYYMMDD/nabledge-test/
- Use date-based organization to align with the work log structure
- Update all workspace path references in SKILL.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
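A small sketch of the date-based convention (the strftime pattern matches the work/YYYYMMDD/ layout above; creating the directory up front is an assumption about how the skill prepares its workspace):

```python
from datetime import date
from pathlib import Path

# e.g. work/20260213/nabledge-test/, matching the work-log layout
workspace = Path("work") / date.today().strftime("%Y%m%d") / "nabledge-test"
workspace.mkdir(parents=True, exist_ok=True)
```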
Test results: 4/8 expectations passed (50.0%)

- Successfully answered the batch launch question using the knowledge files
- Identified test design issues (non-existent section names, overly strict keyword matching)
- Duration: 73s, 10 tool calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligned all 5 test scenarios with the actual knowledge file content by:

- Verifying section names against the actual JSON structure
- Extracting keywords from actual section content
- Replacing non-existent sections (launch, execution, usage, exception, action-implementation)
- Using specific technical terms instead of generic descriptions

Changes:
- processing-005: sections ["launch","execution"] → ["request-path","batch-types"]
- handlers-001: sections ["overview","usage"] → ["overview","processing"]
- processing-004: sections ["error-handling","exception"] → ["error-handling","errors"]
- processing-002: sections ["action-implementation","business-logic"] → ["actions","responsibility"]
- All keywords updated to match the exact terminology in the knowledge files

Expected impact: test pass rates should improve from 50-83% to 90-100%.

See work/20260213/scenario-expectations-revision.md for detailed verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary

This PR introduces nabledge-test, a unified test framework for the nabledge-6 and nabledge-5 skills, powered by skill-creator's evaluation engine.
## Architecture

nabledge-test (Nablarch-specific interface layer) → skill-creator (evaluation engine: executor/grader/analyzer agents)

## Key Changes
1. Added skill-creator (Claude.ai Environment Version)
Why Claude.ai version?
2. Refactored nabledge-6-test → nabledge-test
Before: nabledge-6-test (monolithic)
After: nabledge-test (interface) + skill-creator (engine)
New command syntax:
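The concrete example below is the one given in the rename commit; the generalized first line is an inference from the version-specific scenarios/ layout:

```
nabledge-test <version> <scenario-id>
nabledge-test 6 handlers-001
```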
3. Architecture Benefits
4. New Components
nabledge-test/:
- SKILL.md - Nablarch-specific test interface
- scenarios/nabledge-6/scenarios.json - 30 test scenarios
- scenarios/nabledge-5/ - Future support
- scripts/convert_scenario.py - Convert to eval format

skill-creator/ (762 lines):

- agents/executor.md - Execute eval with skill
- agents/grader.md - Evaluate expectations
- agents/analyzer.md - Pattern analysis
- agents/comparator.md - A/B comparison
- references/eval-mode.md - Eval workflow
- references/benchmark-mode.md - Benchmark workflow
- scripts/*.py - Automation scripts

5. Workflow

1. Load the scenario from scenarios.json
2. Convert it to skill-creator's eval format
3. Execute via skill-creator (executor + grader)
4. Generate a nabledge-format report
## Test Results (Pre-Integration)

Using the original nabledge-6-test (before refactoring) on the handlers-001 scenario:

- Workflow execution: PASS
- Keyword matching: PASS (80%)
- Section relevance: PASS
- Knowledge file only: PASS
- Token efficiency: FAIL (20k vs. 15k target)
- Tool call efficiency: PASS (9 calls)

Overall: 5/6 criteria passed (83%)
## Migration Impact
For Users
Old command:
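(The exact pre-rename invocation isn't shown in this PR; judging from the old skill name, it was presumably along these lines:)

```
nabledge-6-test handlers-001
```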
New command:
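(verbatim from the rename commit):

```
nabledge-test 6 handlers-001
```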
For Developers
New scenarios go under the version-specific directories:

- scenarios/nabledge-6/
- scenarios/nabledge-5/

## Why This Approach?

❌ Initial Approach (nabledge-6-test): a monolithic framework with manual workflow execution and custom evaluation/reporting logic.

✅ New Approach (nabledge-test + skill-creator): a thin Nablarch-specific interface on top of skill-creator's proven executor/grader/analyzer evaluation engine.
## Future Work
## Files Changed

- .claude/settings.json: Update skill permissions
- .claude/skills/nabledge-test/: New unified test framework
- .claude/skills/skill-creator/: Anthropic's evaluation engine (Claude.ai version)
- .claude/skills/nabledge-6-test/: Removed (renamed to nabledge-test)

🤖 Generated with Claude Code