
Conversation

@kiyotis (Contributor) commented Feb 13, 2026

Summary

This PR introduces nabledge-test, a unified test framework for the nabledge-6 and nabledge-5 skills, powered by skill-creator's evaluation engine.

Architecture

nabledge-test (Interface Layer)
  ├── Nablarch-specific scenarios
  ├── Scenario → Eval conversion
  └── Report generation
          ↓ delegates to
skill-creator (Evaluation Engine)
  ├── Executor agent (run skill)
  ├── Grader agent (evaluate expectations)
  └── Analyzer agent (pattern analysis)

Key Changes

1. Added skill-creator (Claude.ai Environment Version)

  • Source: Claude.ai production environment (not GitHub anthropics/skills)
  • Features:
    • Eval Mode: Execute individual evals with executor/grader agents
    • Benchmark Mode: Statistical analysis with 3x runs per configuration
    • Complete evaluation framework (transcript, metrics, grading)
    • Structured outputs for reproducible testing

Why the Claude.ai version?

  • GitHub version lacks eval/benchmark modes
  • Claude.ai version has production-tested evaluation agents
  • Full feature set: executor, grader, comparator, analyzer agents

2. Refactored nabledge-6-test → nabledge-test

Before: nabledge-6-test (monolithic)
After: nabledge-test (interface) + skill-creator (engine)

New command syntax:

# nabledge-6
nabledge-test 6 handlers-001
nabledge-test 6 --all
nabledge-test 6 --category handlers

# nabledge-5 (future)
nabledge-test 5 libraries-001

3. Architecture Benefits

| Aspect | Before | After |
| --- | --- | --- |
| Evaluation | Manual workflow | skill-creator agents |
| Scope | nabledge-6 only | nabledge-6 + nabledge-5 |
| Engine | Custom implementation | Anthropic's proven framework |
| Maintenance | Reinventing the wheel | Leveraging existing tools |
| Features | Basic testing | Advanced analysis (variance, comparison) |

4. New Components

nabledge-test/:

  • SKILL.md - Nablarch-specific test interface
  • scenarios/nabledge-6/scenarios.json - 30 test scenarios
  • scenarios/nabledge-5/ - Future support
  • scripts/convert_scenario.py - Convert to eval format

skill-creator/ (762 lines):

  • agents/executor.md - Execute eval with skill
  • agents/grader.md - Evaluate expectations
  • agents/analyzer.md - Pattern analysis
  • agents/comparator.md - A/B comparison
  • references/eval-mode.md - Eval workflow
  • references/benchmark-mode.md - Benchmark workflow
  • scripts/*.py - Automation scripts

5. Workflow

1. Load scenario (scenarios/nabledge-6/scenarios.json)
   ↓
2. Convert to eval format
   {
     "prompt": "データリードハンドラでファイルを読み込むには?",
     "expectations": [
       "Response includes keyword 'DataReadHandler'",
       "Token usage is between 5000 and 15000"
     ]
   }
   ↓
3. Execute via skill-creator
   - Executor agent: Run nabledge-6 with prompt
   - Grader agent: Evaluate expectations
   ↓
4. Generate nabledge report
   work/YYYYMMDD/test-handlers-001-timestamp.md
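For illustration, the grader step's expectation checks could look roughly like the following. This is a minimal sketch only: skill-creator's grader is an agent following grader.md, not this script, and the transcript text and token count passed in are hypothetical. (The example prompt above asks, in Japanese, how to read a file with DataReadHandler.)

```python
# Illustrative sketch only: the real grading is performed by skill-creator's
# grader agent. Function names and the transcript format here are hypothetical.

def grade(transcript_text: str, total_tokens: int) -> list[dict]:
    """Evaluate the two example expectations against an executor transcript."""
    results = []

    # Expectation 1: "Response includes keyword 'DataReadHandler'"
    results.append({
        "expectation": "Response includes keyword 'DataReadHandler'",
        "passed": "DataReadHandler" in transcript_text,
    })

    # Expectation 2: "Token usage is between 5000 and 15000"
    results.append({
        "expectation": "Token usage is between 5000 and 15000",
        "passed": 5000 <= total_tokens <= 15000,
    })
    return results


if __name__ == "__main__":
    # Hypothetical transcript excerpt and token count, for illustration only.
    print(grade("... uses DataReadHandler to read the file ...", 12000))
```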

Test Results (Pre-Integration)

Using the original nabledge-6-test (before refactoring) on the handlers-001 scenario:

| Criterion | Result | Details |
| --- | --- | --- |
| Workflow Execution | ✅ PASS | keyword-search, section-judgement executed |
| Keyword Matching | ✅ PASS | 80% (4/5) |
| Section Relevance | ✅ PASS | Expected sections identified |
| Knowledge File Only | ✅ PASS | Proper citations |
| Token Efficiency | ❌ FAIL | 20k tokens (target: 15k) |
| Tool Call Efficiency | ✅ PASS | 9 calls (target: 10-20) |

Overall: 5/6 criteria passed (83%)

Migration Impact

For Users

Old command:

/nabledge-6-test execute handlers-001

New command:

/nabledge-test 6 handlers-001

For Developers

  • 30 scenarios preserved in scenarios/nabledge-6/
  • Report format unchanged (compatibility)
  • Can now add nabledge-5 scenarios in scenarios/nabledge-5/

Why This Approach?

❌ Initial Approach (nabledge-6-test)

  • Reimplemented test execution from scratch
  • Custom evaluation logic
  • Monolithic design
  • Single version support

✅ New Approach (nabledge-test + skill-creator)

  • Leverage Anthropic's evaluation framework
  • Proven executor/grader/analyzer agents
  • Layered architecture (interface vs engine)
  • Multi-version support (6 & 5)
  • Future-proof (skill-creator updates benefit us)

Future Work

  1. nabledge-5 support: Add scenarios/nabledge-5/scenarios.json
  2. Benchmark mode: Use skill-creator's statistical analysis (3x runs, variance); see the sketch after this list
  3. A/B testing: Compare skill versions using comparator agent
  4. Continuous improvement: Leverage analyzer agent patterns
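As a rough illustration of item 2, benchmark mode's per-configuration statistics could be aggregated like this. This is a sketch under the "3x runs per configuration" assumption stated earlier; skill-creator's actual procedure is defined in references/benchmark-mode.md and may differ.

```python
from statistics import mean, stdev

# Hypothetical pass rates (passed expectations / total) from 3 runs of one
# configuration, following the "3x runs per configuration" design above.
pass_rates = [5 / 6, 4 / 6, 5 / 6]

print(f"mean pass rate: {mean(pass_rates):.2%}")
print(f"std deviation:  {stdev(pass_rates):.2%}")  # run-to-run variance signal
```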

Files Changed

  • .claude/settings.json: Update skill permissions
  • .claude/skills/nabledge-test/: New unified test framework
  • .claude/skills/skill-creator/: Anthropic's evaluation engine (Claude.ai version)
  • Removed: .claude/skills/nabledge-6-test/

🤖 Generated with Claude Code

- Added nabledge-6-test skill for automated scenario testing
- Configured skill permissions in .claude/settings.json
- Executed test scenario handlers-001 (DataReadHandler file reading)
- Generated evaluation report with 5/6 criteria passed (83%)
- Identified token optimization as primary improvement area

Test Results:
- Workflow execution: PASS
- Keyword matching: PASS (80%)
- Section relevance: PASS
- Knowledge file only: PASS
- Token efficiency: FAIL (20k/15k target)
- Tool call efficiency: PASS (9 calls)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kiyotis kiyotis changed the base branch from main to develop February 13, 2026 05:50
kiyotis and others added 18 commits February 13, 2026 15:33
Replace nabledge-6-test with nabledge-test, a unified test framework
for both nabledge-6 and nabledge-5, powered by skill-creator's
evaluation engine.

## Architecture Changes

**Before**:
- nabledge-6-test: Monolithic test framework
- Manual workflow execution and evaluation
- Custom reporting logic

**After**:
- nabledge-test: Interface layer (Nablarch-specific scenarios)
- skill-creator: Evaluation engine (executor/grader/analyzer agents)
- Clear separation of concerns

## Key Changes

1. **Added skill-creator** (Claude.ai environment version)
   - Full evaluation framework with executor/grader/analyzer agents
   - Eval Mode and Benchmark Mode support
   - Structured outputs (transcript, metrics, grading)

2. **Renamed nabledge-6-test → nabledge-test**
   - Unified interface for nabledge-6 and nabledge-5
   - Version-specific scenarios: scenarios/nabledge-6/, nabledge-5/
   - Simplified command: nabledge-test 6 handlers-001

3. **New workflow**:
   - Load scenario from scenarios.json
   - Convert to skill-creator eval format
   - Execute via skill-creator (executor + grader)
   - Generate nabledge-format report

4. **Added convert_scenario.py**
   - Converts scenarios.json to eval expectations
   - Maps keywords/sections to assertions

5. **Updated settings.json**
   - Removed: Skill(nabledge-6-test)
   - Added: Skill(nabledge-test), Skill(skill-creator)

## Benefits

- ✅ Avoids reinventing the wheel (uses Anthropic's eval framework)
- ✅ Leverages skill-creator's proven evaluation engine
- ✅ Unified testing for nabledge-6 and nabledge-5
- ✅ Clear architectural layers (interface vs engine)
- ✅ Future-proof (skill-creator updates automatically benefit us)

## Migration Notes

- 30 existing scenarios preserved in scenarios/nabledge-6/
- Report format unchanged (compatibility maintained)
- Old nabledge-6-test invocations need to be updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
skill-creator is now included in the repository (.claude/skills/skill-creator/)
rather than being installed from an external marketplace, so the
extraKnownMarketplaces configuration is no longer needed.
Major simplification of nabledge-test by using skill-creator's
eval format directly instead of custom scenario format.

## Changes

1. **Converted scenarios to eval format**
   - scenarios.json now uses skill-creator's native format
   - No conversion needed
   - Direct prompt + expectations

2. **Removed convert_scenario.py**
   - No longer needed with eval format
   - One less layer of complexity

3. **Simplified SKILL.md** (115 lines, down from ~300)
   - Thin wrapper description
   - Load scenario → invoke skill-creator → save results
   - No complex workflow documentation

## Before (complex)

scenarios.json (custom format)
  → convert_scenario.py
  → eval format
  → skill-creator

## After (simple)

scenarios.json (eval format)
  → skill-creator

## Benefits

- ✅ No format conversion
- ✅ Direct skill-creator integration
- ✅ Simpler to maintain
- ✅ Easier to add new scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deleted 7 unnecessary files to keep nabledge-test minimal:

Removed:
- *.backup (2 files) - Backup files no longer needed
- INSTALL-SKILL-CREATOR.md - skill-creator is in repo, not external
- README.md - SKILL.md is sufficient
- templates/ (3 files) - skill-creator generates reports directly
- scripts/ (empty directory)
- workflows/ (empty directory)

Final structure (minimal):
nabledge-test/
├── SKILL.md (115 lines)
└── scenarios/
    └── nabledge-6/scenarios.json (30 scenarios in eval format)

Benefits:
- Cleaner structure
- Less maintenance burden
- Obvious what's essential

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Applied prompt engineering improvements:

Changes:
- Removed redundant explanations
- Condensed step descriptions
- Removed example markdown template (obvious from description)
- Tightened language throughout

Before: 115 lines (verbose)
After: 72 lines (concise)

Result: More focused, easier to understand.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Selected 5 representative scenarios covering core Nablarch 6 functionality:

1. handlers-001: DataReadHandler file reading (batch basics)
2. libraries-001: UniversalDao paging (database access)
3. tools-001: NTF test data preparation (testing)
4. processing-001: Nablarch batch architecture (architecture)
5. code-analysis-001: Existing code analysis (code understanding)

Removed 25 scenarios that were:
- Redundant (multiple similar tests per category)
- Less critical for initial validation
- Can be added back if needed

Benefits:
- Faster test execution
- Focus on essentials
- Easier maintenance

File size: 432 lines → 80 lines (-81%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Select 5 most frequently asked scenarios based on practical usage analysis:
1. processing-005: Batch startup (beginners' first hurdle)
2. libraries-001: UniversalDao paging (highest implementation frequency)
3. handlers-001: DataReadHandler file reading (batch basics)
4. processing-004: Error handling (implementation essential)
5. processing-002: BatchAction implementation (business logic)

Reduced from 30 scenarios to focus on most valuable test cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change from verbose expectations to simple keywords+sections:
- Before: Full expectation strings per scenario
- After: Simple keywords array + sections array
- nabledge-test converts to expectations at runtime

Benefits:
- Easier scenario authoring (just list keywords/sections)
- No repetitive "Response includes" strings
- Default metrics auto-added (tokens 5000-15000, tools 10-20)

Example:
{
  "question": "データリードハンドラでファイルを読み込むには?",
  "keywords": ["DataReadHandler", "DataReader"],
  "sections": ["overview", "usage"]
}
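A minimal sketch of the runtime conversion described above, assuming the expectation wording shown earlier in this PR and the stated default metric ranges; the phrasing of the section expectations is illustrative.

```python
def scenario_to_expectations(scenario: dict) -> list[str]:
    """Expand a keywords+sections scenario into expectation strings,
    adding the default metric ranges (tokens 5000-15000, tools 10-20)."""
    expectations = [
        f"Response includes keyword '{kw}'" for kw in scenario["keywords"]
    ]
    expectations += [
        # Wording of the section expectation is an assumption for illustration.
        f"Relevant section '{sec}' is identified" for sec in scenario["sections"]
    ]
    expectations += [
        "Token usage is between 5000 and 15000",
        "Tool call count is between 10 and 20",
    ]
    return expectations


example = {
    "question": "データリードハンドラでファイルを読み込むには?",
    "keywords": ["DataReadHandler", "DataReader"],
    "sections": ["overview", "usage"],
}
print(scenario_to_expectations(example))
```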

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change from delegating to skill-creator to manually executing eval procedures:

Before:
- Call skill-creator as a skill (doesn't work; no results returned)

After:
- Read skill-creator documentation (eval-mode.md, executor.md, grader.md)
- Step 6: Execute executor.md procedures manually
  - Run nabledge-<version> with Skill tool
  - Record transcript with tool calls, steps, response
  - Write metrics.json
  - Write timing.json
- Step 7: Execute grader.md procedures manually
  - Read transcript
  - Evaluate expectations against transcript
  - Write grading.json
- Step 8-9: Generate report to work/ and display summary

Key insight: the Skill tool loads instructions but doesn't auto-execute them.
nabledge-test must follow skill-creator's procedures manually.
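For concreteness, the workspace artifacts named above (metrics.json, timing.json, grading.json) could be written roughly like this. The field names and values are illustrative assumptions; the actual schemas come from skill-creator's executor.md and grader.md, which are not shown in this PR.

```python
import json
from pathlib import Path

# Illustrative workspace path; see the later commit on work/YYYYMMDD/nabledge-test/.
workspace = Path("work/20260213/nabledge-test")
workspace.mkdir(parents=True, exist_ok=True)

# Step 6 (executor): record run metrics and timing. Field names are assumptions.
(workspace / "metrics.json").write_text(json.dumps(
    {"total_tokens": 20000, "tool_calls": 9}, indent=2))
(workspace / "timing.json").write_text(json.dumps(
    {"executor_seconds": 75, "grader_seconds": 33}, indent=2))

# Step 7 (grader): evaluate expectations against the transcript, record results.
(workspace / "grading.json").write_text(json.dumps(
    {"passed": 5, "total": 8, "pass_rate": 5 / 8}, indent=2))
```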

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change nabledge-test to execute nabledge-6 inline instead of using Skill tool.
This prevents execution from stopping between Executor and Grader steps.

Changes:
- Remove Skill() call in Step 6, execute workflows directly
- Add explicit continuity directive between steps
- Update transcript template to reflect inline execution
- Add test results: processing-005 (4/6 pass, 66.7%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test: バッチの起動方法を教えてください ("How do I launch a batch?")
Result: 5/8 expectations passed (62.5%)
Duration: 108s (executor: 75s, grader: 33s)
Tool calls: 18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workspace files should be stored in work/YYYYMMDD/nabledge-test/ instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update workspace location from nabledge-test-workspace/ to work/YYYYMMDD/nabledge-test/
- Use date-based organization to align with work log structure
- Update all workspace path references in SKILL.md
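
A small sketch of the date-based workspace convention described above (the helper name is hypothetical):

```python
from datetime import date
from pathlib import Path

def workspace_dir(today: date | None = None) -> Path:
    """Return work/YYYYMMDD/nabledge-test/ for the given (or current) date."""
    d = today or date.today()
    return Path("work") / d.strftime("%Y%m%d") / "nabledge-test"

print(workspace_dir(date(2026, 2, 13)))  # work/20260213/nabledge-test
```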

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test results: 4/8 expectations passed (50.0%)
- Successfully answered batch launch method using knowledge files
- Identified test design issues (non-existent sections, strict keyword matching)
- Duration: 73s, 10 tool calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligned all 5 test scenarios with actual knowledge file content by:
- Verifying section names against actual JSON structure
- Extracting keywords from actual section content
- Replacing non-existent sections (launch, execution, usage, exception, action-implementation)
- Using specific technical terms instead of generic descriptions

Changes:
- processing-005: sections ["launch","execution"] → ["request-path","batch-types"]
- handlers-001: sections ["overview","usage"] → ["overview","processing"]
- processing-004: sections ["error-handling","exception"] → ["error-handling","errors"]
- processing-002: sections ["action-implementation","business-logic"] → ["actions","responsibility"]
- All keywords updated to match exact terminology in knowledge files

Expected impact: Test pass rates should improve from 50-83% to 90-100%

See work/20260213/scenario-expectations-revision.md for detailed verification

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>