
Conversation

@kiyotis (Contributor) commented Feb 13, 2026

Summary

This PR introduces nabledge-test, a unified test framework for the nabledge-6 and nabledge-5 skills, powered by skill-creator's evaluation engine.

Architecture

nabledge-test (Interface Layer)
  ├── Nablarch-specific scenarios
  ├── Scenario → Eval conversion
  └── Report generation
          ↓ delegates to
skill-creator (Evaluation Engine)
  ├── Executor agent (run skill)
  ├── Grader agent (evaluate expectations)
  └── Analyzer agent (pattern analysis)

Key Changes

1. Added skill-creator (Claude.ai Environment Version)

  • Source: Claude.ai production environment (not GitHub anthropics/skills)
  • Features:
    • Eval Mode: Execute individual evals with executor/grader agents
    • Benchmark Mode: Statistical analysis with 3x runs per configuration
    • Complete evaluation framework (transcript, metrics, grading)
    • Structured outputs for reproducible testing

Why the Claude.ai version?

  • GitHub version lacks eval/benchmark modes
  • Claude.ai version has production-tested evaluation agents
  • Full feature set: executor, grader, comparator, analyzer agents

2. Refactored nabledge-6-test → nabledge-test

Before: nabledge-6-test (monolithic)
After: nabledge-test (interface) + skill-creator (engine)

New command syntax:

# nabledge-6
nabledge-test 6 handlers-001
nabledge-test 6 --all
nabledge-test 6 --category handlers

# nabledge-5 (future)
nabledge-test 5 libraries-001

3. Architecture Benefits

| Aspect | Before | After |
| --- | --- | --- |
| Evaluation | Manual workflow | skill-creator agents |
| Scope | nabledge-6 only | nabledge-6 + nabledge-5 |
| Engine | Custom implementation | Anthropic's proven framework |
| Maintenance | Reinventing the wheel | Leveraging existing tools |
| Features | Basic testing | Advanced analysis (variance, comparison) |

4. New Components

nabledge-test/:

  • SKILL.md - Nablarch-specific test interface
  • scenarios/nabledge-6/scenarios.json - 30 test scenarios
  • scenarios/nabledge-5/ - Future support
  • scripts/convert_scenario.py - Convert to eval format

skill-creator/ (762 lines):

  • agents/executor.md - Execute eval with skill
  • agents/grader.md - Evaluate expectations
  • agents/analyzer.md - Pattern analysis
  • agents/comparator.md - A/B comparison
  • references/eval-mode.md - Eval workflow
  • references/benchmark-mode.md - Benchmark workflow
  • scripts/*.py - Automation scripts

5. Workflow

1. Load scenario (scenarios/nabledge-6/scenarios.json)
   ↓
2. Convert to eval format
   {
     "prompt": "データリードハンドラでファイルを読み込むには?",
     "expectations": [
       "Response includes keyword 'DataReadHandler'",
       "Token usage is between 5000 and 15000"
     ]
   }
   ↓
3. Execute via skill-creator
   - Executor agent: Run nabledge-6 with prompt
   - Grader agent: Evaluate expectations
   ↓
4. Generate nabledge report
   work/YYYYMMDD/test-handlers-001-timestamp.md
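For illustration, the grader step's expectation checks could look roughly like the following. This is a minimal sketch only: skill-creator's grader is an agent following grader.md, not this script, and the transcript text and token count passed in are hypothetical. (The example prompt above asks, in Japanese, how to read a file with DataReadHandler.)

```python
# Illustrative sketch only: the real grading is performed by skill-creator's
# grader agent. Function names and the transcript format here are hypothetical.

def grade(transcript_text: str, total_tokens: int) -> list[dict]:
    """Evaluate the two example expectations against an executor transcript."""
    results = []

    # Expectation 1: "Response includes keyword 'DataReadHandler'"
    results.append({
        "expectation": "Response includes keyword 'DataReadHandler'",
        "passed": "DataReadHandler" in transcript_text,
    })

    # Expectation 2: "Token usage is between 5000 and 15000"
    results.append({
        "expectation": "Token usage is between 5000 and 15000",
        "passed": 5000 <= total_tokens <= 15000,
    })
    return results


if __name__ == "__main__":
    # Hypothetical transcript excerpt and token count, for illustration only.
    print(grade("... uses DataReadHandler to read the file ...", 12000))
```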

Test Results (Pre-Integration)

Using the original nabledge-6-test (before refactoring) on the handlers-001 scenario:

| Criterion | Result | Details |
| --- | --- | --- |
| Workflow Execution | ✅ PASS | keyword-search, section-judgement executed |
| Keyword Matching | ✅ PASS | 80% (4/5) |
| Section Relevance | ✅ PASS | Expected sections identified |
| Knowledge File Only | ✅ PASS | Proper citations |
| Token Efficiency | ❌ FAIL | 20k tokens (target: 15k) |
| Tool Call Efficiency | ✅ PASS | 9 calls (target: 10-20) |

Overall: 5/6 criteria passed (83%)

Migration Impact

For Users

Old command:

/nabledge-6-test execute handlers-001

New command:

/nabledge-test 6 handlers-001

For Developers

  • 30 scenarios preserved in scenarios/nabledge-6/
  • Report format unchanged (compatibility)
  • Can now add nabledge-5 scenarios in scenarios/nabledge-5/

Why This Approach?

❌ Initial Approach (nabledge-6-test)

  • Reimplemented test execution from scratch
  • Custom evaluation logic
  • Monolithic design
  • Single version support

✅ New Approach (nabledge-test + skill-creator)

  • Leverage Anthropic's evaluation framework
  • Proven executor/grader/analyzer agents
  • Layered architecture (interface vs engine)
  • Multi-version support (6 & 5)
  • Future-proof (skill-creator updates benefit us)

Future Work

  1. nabledge-5 support: Add scenarios/nabledge-5/scenarios.json
  2. Benchmark mode: Use skill-creator's statistical analysis (3x runs, variance); see the sketch after this list
  3. A/B testing: Compare skill versions using comparator agent
  4. Continuous improvement: Leverage analyzer agent patterns
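As a rough illustration of item 2, benchmark mode's per-configuration statistics could be aggregated like this. This is a sketch under the "3x runs per configuration" assumption stated earlier; skill-creator's actual procedure is defined in references/benchmark-mode.md and may differ.

```python
from statistics import mean, stdev

# Hypothetical pass rates (passed expectations / total) from 3 runs of one
# configuration, following the "3x runs per configuration" design above.
pass_rates = [5 / 6, 4 / 6, 5 / 6]

print(f"mean pass rate: {mean(pass_rates):.2%}")
print(f"std deviation:  {stdev(pass_rates):.2%}")  # run-to-run variance signal
```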

Files Changed

  • .claude/settings.json: Update skill permissions
  • .claude/skills/nabledge-test/: New unified test framework
  • .claude/skills/skill-creator/: Anthropic's evaluation engine (Claude.ai version)
  • Removed: .claude/skills/nabledge-6-test/

🤖 Generated with Claude Code

- Added nabledge-6-test skill for automated scenario testing
- Configured skill permissions in .claude/settings.json
- Executed test scenario handlers-001 (DataReadHandler file reading)
- Generated evaluation report with 5/6 criteria passed (83%)
- Identified token optimization as primary improvement area

Test Results:
- Workflow execution: PASS
- Keyword matching: PASS (80%)
- Section relevance: PASS
- Knowledge file only: PASS
- Token efficiency: FAIL (20k/15k target)
- Tool call efficiency: PASS (9 calls)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kiyotis kiyotis changed the base branch from main to develop February 13, 2026 05:50
kiyotis and others added 18 commits February 13, 2026 15:33
Replace nabledge-6-test with nabledge-test, a unified test framework
for both nabledge-6 and nabledge-5, powered by skill-creator's
evaluation engine.

## Architecture Changes

**Before**:
- nabledge-6-test: Monolithic test framework
- Manual workflow execution and evaluation
- Custom reporting logic

**After**:
- nabledge-test: Interface layer (Nablarch-specific scenarios)
- skill-creator: Evaluation engine (executor/grader/analyzer agents)
- Clear separation of concerns

## Key Changes

1. **Added skill-creator** (Claude.ai environment version)
   - Full evaluation framework with executor/grader/analyzer agents
   - Eval Mode and Benchmark Mode support
   - Structured outputs (transcript, metrics, grading)

2. **Renamed nabledge-6-test → nabledge-test**
   - Unified interface for nabledge-6 and nabledge-5
   - Version-specific scenarios: scenarios/nabledge-6/, nabledge-5/
   - Simplified command: nabledge-test 6 handlers-001

3. **New workflow**:
   - Load scenario from scenarios.json
   - Convert to skill-creator eval format
   - Execute via skill-creator (executor + grader)
   - Generate nabledge-format report

4. **Added convert_scenario.py**
   - Converts scenarios.json to eval expectations
   - Maps keywords/sections to assertions

5. **Updated settings.json**
   - Removed: Skill(nabledge-6-test)
   - Added: Skill(nabledge-test), Skill(skill-creator)

## Benefits

- ✅ Avoids reinventing the wheel (uses Anthropic's eval framework)
- ✅ Leverages skill-creator's proven evaluation engine
- ✅ Unified testing for nabledge-6 and nabledge-5
- ✅ Clear architectural layers (interface vs engine)
- ✅ Future-proof (skill-creator updates automatically benefit us)

## Migration Notes

- 30 existing scenarios preserved in scenarios/nabledge-6/
- Report format unchanged (compatibility maintained)
- Old nabledge-6-test invocations need to be updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
skill-creator is now included in the repository (.claude/skills/skill-creator/)
rather than being installed from an external marketplace, so the
extraKnownMarketplaces configuration is no longer needed.
Major simplification of nabledge-test by using skill-creator's
eval format directly instead of custom scenario format.

## Changes

1. **Converted scenarios to eval format**
   - scenarios.json now uses skill-creator's native format
   - No conversion needed
   - Direct prompt + expectations

2. **Removed convert_scenario.py**
   - No longer needed with eval format
   - One less layer of complexity

3. **Simplified SKILL.md** (115 lines, down from ~300)
   - Thin wrapper description
   - Load scenario → invoke skill-creator → save results
   - No complex workflow documentation

## Before (complex)

scenarios.json (custom format)
  → convert_scenario.py
  → eval format
  → skill-creator

## After (simple)

scenarios.json (eval format)
  → skill-creator

## Benefits

- ✅ No format conversion
- ✅ Direct skill-creator integration
- ✅ Simpler to maintain
- ✅ Easier to add new scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deleted 7 unnecessary files to keep nabledge-test minimal:

Removed:
- *.backup (2 files) - Backup files no longer needed
- INSTALL-SKILL-CREATOR.md - skill-creator is in repo, not external
- README.md - SKILL.md is sufficient
- templates/ (3 files) - skill-creator generates reports directly
- scripts/ (empty directory)
- workflows/ (empty directory)

Final structure (minimal):
nabledge-test/
├── SKILL.md (115 lines)
└── scenarios/
    └── nabledge-6/scenarios.json (30 scenarios in eval format)

Benefits:
- Cleaner structure
- Less maintenance burden
- Obvious what's essential

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Applied prompt engineering improvements:

Changes:
- Removed redundant explanations
- Condensed step descriptions
- Removed example markdown template (obvious from description)
- Tightened language throughout

Before: 115 lines (verbose)
After: 72 lines (concise)

Result: More focused, easier to understand.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Selected 5 representative scenarios covering core Nablarch 6 functionality:

1. handlers-001: DataReadHandler file reading (batch basics)
2. libraries-001: UniversalDao paging (database access)
3. tools-001: NTF test data preparation (testing)
4. processing-001: Nablarch batch architecture (architecture)
5. code-analysis-001: Existing code analysis (code understanding)

Removed 25 scenarios that were:
- Redundant (multiple similar tests per category)
- Less critical for initial validation
- Can be added back if needed

Benefits:
- Faster test execution
- Focus on essentials
- Easier maintenance

File size: 432 lines → 80 lines (-81%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Select 5 most frequently asked scenarios based on practical usage analysis:
1. processing-005: Batch startup (beginners' first hurdle)
2. libraries-001: UniversalDao paging (highest implementation frequency)
3. handlers-001: DataReadHandler file reading (batch basics)
4. processing-004: Error handling (implementation essential)
5. processing-002: BatchAction implementation (business logic)

Reduced from 30 scenarios to focus on most valuable test cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change from verbose expectations to simple keywords+sections:
- Before: Full expectation strings per scenario
- After: Simple keywords array + sections array
- nabledge-test converts to expectations at runtime

Benefits:
- Easier scenario authoring (just list keywords/sections)
- No repetitive "Response includes" strings
- Default metrics auto-added (tokens 5000-15000, tools 10-20)

Example:
{
  "question": "データリードハンドラでファイルを読み込むには?",
  "keywords": ["DataReadHandler", "DataReader"],
  "sections": ["overview", "usage"]
}
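A minimal sketch of the runtime conversion described above, assuming the expectation wording shown earlier in this PR and the stated default metric ranges; the phrasing of the section expectations is illustrative.

```python
def scenario_to_expectations(scenario: dict) -> list[str]:
    """Expand a keywords+sections scenario into expectation strings,
    adding the default metric ranges (tokens 5000-15000, tools 10-20)."""
    expectations = [
        f"Response includes keyword '{kw}'" for kw in scenario["keywords"]
    ]
    expectations += [
        # Wording of the section expectation is an assumption for illustration.
        f"Relevant section '{sec}' is identified" for sec in scenario["sections"]
    ]
    expectations += [
        "Token usage is between 5000 and 15000",
        "Tool call count is between 10 and 20",
    ]
    return expectations


example = {
    "question": "データリードハンドラでファイルを読み込むには?",
    "keywords": ["DataReadHandler", "DataReader"],
    "sections": ["overview", "usage"],
}
print(scenario_to_expectations(example))
```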

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change from delegating to skill-creator to manually executing eval procedures:

Before:
- Call skill-creator as a skill (doesn't work; no results returned)

After:
- Read skill-creator documentation (eval-mode.md, executor.md, grader.md)
- Step 6: Execute executor.md procedures manually
  - Run nabledge-<version> with Skill tool
  - Record transcript with tool calls, steps, response
  - Write metrics.json
  - Write timing.json
- Step 7: Execute grader.md procedures manually
  - Read transcript
  - Evaluate expectations against transcript
  - Write grading.json
- Step 8-9: Generate report to work/ and display summary

Key insight: the Skill tool loads instructions but doesn't auto-execute them.
nabledge-test must follow skill-creator's procedures manually.
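For concreteness, the workspace artifacts named above (metrics.json, timing.json, grading.json) could be written roughly like this. The field names and values are illustrative assumptions; the actual schemas come from skill-creator's executor.md and grader.md, which are not shown in this PR.

```python
import json
from pathlib import Path

# Illustrative workspace path; see the later commit on work/YYYYMMDD/nabledge-test/.
workspace = Path("work/20260213/nabledge-test")
workspace.mkdir(parents=True, exist_ok=True)

# Step 6 (executor): record run metrics and timing. Field names are assumptions.
(workspace / "metrics.json").write_text(json.dumps(
    {"total_tokens": 20000, "tool_calls": 9}, indent=2))
(workspace / "timing.json").write_text(json.dumps(
    {"executor_seconds": 75, "grader_seconds": 33}, indent=2))

# Step 7 (grader): evaluate expectations against the transcript, record results.
(workspace / "grading.json").write_text(json.dumps(
    {"passed": 5, "total": 8, "pass_rate": 5 / 8}, indent=2))
```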

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change nabledge-test to execute nabledge-6 inline instead of using Skill tool.
This prevents execution from stopping between Executor and Grader steps.

Changes:
- Remove Skill() call in Step 6, execute workflows directly
- Add explicit continuity directive between steps
- Update transcript template to reflect inline execution
- Add test results: processing-005 (4/6 pass, 66.7%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test: バッチの起動方法を教えてください ("How do I launch a batch?")
Result: 5/8 expectations passed (62.5%)
Duration: 108s (executor: 75s, grader: 33s)
Tool calls: 18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workspace files should be stored in work/YYYYMMDD/nabledge-test/ instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update workspace location from nabledge-test-workspace/ to work/YYYYMMDD/nabledge-test/
- Use date-based organization to align with work log structure
- Update all workspace path references in SKILL.md
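
A small sketch of the date-based workspace convention described above (the helper name is hypothetical):

```python
from datetime import date
from pathlib import Path

def workspace_dir(today: date | None = None) -> Path:
    """Return work/YYYYMMDD/nabledge-test/ for the given (or current) date."""
    d = today or date.today()
    return Path("work") / d.strftime("%Y%m%d") / "nabledge-test"

print(workspace_dir(date(2026, 2, 13)))  # work/20260213/nabledge-test
```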

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test results: 4/8 expectations passed (50.0%)
- Successfully answered batch launch method using knowledge files
- Identified test design issues (non-existent sections, strict keyword matching)
- Duration: 73s, 10 tool calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligned all 5 test scenarios with actual knowledge file content by:
- Verifying section names against actual JSON structure
- Extracting keywords from actual section content
- Replacing non-existent sections (launch, execution, usage, exception, action-implementation)
- Using specific technical terms instead of generic descriptions

Changes:
- processing-005: sections ["launch","execution"] → ["request-path","batch-types"]
- handlers-001: sections ["overview","usage"] → ["overview","processing"]
- processing-004: sections ["error-handling","exception"] → ["error-handling","errors"]
- processing-002: sections ["action-implementation","business-logic"] → ["actions","responsibility"]
- All keywords updated to match exact terminology in knowledge files

Expected impact: Test pass rates should improve from 50-83% to 90-100%

See work/20260213/scenario-expectations-revision.md for detailed verification

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>