Summary
Improve the JSON evaluator to compare full JSON objects (expected answer vs LLM output) and provide per-field match scores instead of requiring users to create separate evaluators for each field.
Problem Statement
Current limitations with JSON evaluators:
- The Field Match evaluator compares a single field in the LLM output against the entire ground-truth column (not a field within it)
- Users must create one evaluator per field they want to validate
- No visibility into which specific fields passed or failed, only aggregate scores
Proposed Solution (Checkpoint 1)
Modify the JSON evaluator to:
- Accept the full expected answer column as JSON (not just a single value)
- Compare each field in the expected JSON against the corresponding field in the LLM output
- Return a per-field score breakdown (e.g., `{"name": 1.0, "email": 1.0, "phone": 0.0}`)
- Calculate an aggregate score as the average of the field scores (see the sketch after this list)
Success Criteria
- User can configure a single evaluator that validates multiple JSON fields
- Evaluation results show per-field pass/fail status
- Aggregate score reflects percentage of matching fields
- Works with nested JSON (at least one level deep), as in the usage example below
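Assuming the hypothetical sketch above, these criteria could play out as follows (all values are illustrative):

```python
expected = {"name": "Ada", "email": "ada@example.com", "contact": {"phone": "555-0100"}}
llm_output = {"name": "Ada", "email": "ada@example.com", "contact": {"phone": "555-9999"}}

field_scores = compare_json_fields(expected, llm_output)
# {"name": 1.0, "email": 1.0, "contact.phone": 0.0}  -> per-field pass/fail

overall = aggregate_score(field_scores)
# 0.666... -> roughly 67% of the expected fields matched
```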
Future Checkpoints (Out of Scope)
- Checkpoint 2: Field-to-field mapping UI (when output keys ≠ expected keys)
- Checkpoint 3: Per-field match type configuration (exact, semantic, numeric tolerance)
- Checkpoint 4: Evaluator playground for testing configurations
Technical Notes
Current implementation is in:
- Backend: `api/oss/src/core/evaluators/utils.py` (functions: `field_match_test`, `compare_jsons`)
- Config: `api/oss/src/resources/evaluators/evaluators.py`
- Frontend: `web/oss/src/components/Evaluators/`