diff --git a/concepts/data_structur_concept.md b/concepts/data_structur_concept.md
new file mode 100644
index 0000000..c56b5d2
--- /dev/null
+++ b/concepts/data_structur_concept.md
@@ -0,0 +1,1116 @@
+# Data Structure Concept: Migration from JSONL to JSON Schema
+
+## Executive Summary
+
+### Problem Statement
+
+The current testbench pipeline uses flat JSONL (JSON Lines) files for storing test data throughout the 3-phase evaluation workflow. Because every line is an independent sample, the format cannot express multi-turn scenarios, relationships between test steps, or scenario-level metrics, and no schema validation catches malformed test data before a run.
+
+### Solution
+
+Replace the flat JSONL format with a hierarchical JSON Schema-based structure that organizes tests into a three-level hierarchy:
+
+```
+Experiment → Scenarios → Steps
+```
+
+This migration introduces formal JSON schemas validated at each pipeline phase, custom RAGAS backend implementations for bidirectional data transformation, and content-based deterministic ID generation for reproducibility.
+
+### Key Benefits
+
+1. **Scenario Organization**: Group related test steps into named scenarios (e.g., "weather_queries", "booking_flow")
+2. **Hierarchical Evaluation**: Support both step-level and scenario-level metrics
+3. **Deterministic IDs**: Content-based hashing ensures reproducible identifiers across runs
+4. **Schema Validation**: Catches data contract violations early
+
+### Impact
+
+This is a **breaking change** requiring:
+- New `JsonSchemaBackend` implementation
+- Updates to run.py, evaluate.py, publish.py, visualize.py
+- Migration of existing JSONL test data to JSON format
+- Updated Grafana dashboards
+- No backwards compatibility with the JSONL format
+
+---
+
+## Current State: JSONL-Based Pipeline
+
+### Overview
+
+The testbench currently uses the RAGAS framework's `LocalJSONLBackend` to store evaluation data in flat JSONL format. Each line in the file represents a single, independent test sample with no relationship to other samples.
+
+### RAGAS Backend Architecture
+
+RAGAS (Retrieval-Augmented Generation Assessment) uses a backend abstraction layer for data persistence.
+The `@experiment()` decorator processes datasets row by row, asynchronously. Each decorated function:
+1. Receives a single row (dict) as input
+2. Processes it (queries the agent, evaluates metrics, etc.)
+3. Returns an enriched row (dict) with added fields
+4. RAGAS collects all rows into a list and passes them to the backend for serialization
+
+The `LocalJSONLBackend` simply writes each dict as a JSON line without understanding relationships between samples.
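+
+As a concrete illustration of this row-by-row contract, here is a minimal sketch of a Phase 1 experiment function. It is only a sketch: `query_agent`, its return values, and the hash truncation are illustrative assumptions, not the actual run.py implementation, and the real function is additionally wrapped by RAGAS's `@experiment()` decorator.
+
+```python
+import asyncio
+import hashlib
+
+
+async def query_agent(user_input: str) -> tuple[str, str]:
+    # Stand-in for the real A2A call; returns (response, trace_id).
+    return f"echo: {user_input}", "0" * 32
+
+
+async def run_single_sample(row: dict) -> dict:
+    """One JSONL row in, one enriched row out."""
+    response, trace_id = await query_agent(row["user_input"])
+    row["response"] = response
+    row["trace_id"] = trace_id
+    # Content-based hash used for deduplication across runs.
+    row["sample_hash"] = hashlib.sha256(row["user_input"].encode()).hexdigest()[:32]
+    return row  # RAGAS collects these rows and hands them to LocalJSONLBackend
+
+
+if __name__ == "__main__":
+    sample = {"user_input": "What's the weather in NYC?"}
+    print(asyncio.run(run_single_sample(sample)))
+```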
+ +### Data Flow with Concrete Examples + +**Phase 1: Run (run.py)** + +Input: `data/datasets/ragas_dataset.jsonl` + Agent URL +Output: `data/experiments/ragas_experiment.jsonl` + +Example output row: +```json +{ + "user_input": "What's the weather in NYC?", + "retrieved_contexts": ["Weather API returned: sunny, 70F"], + "reference": "The weather in NYC is sunny and 70F.", + "response": "It's currently sunny and 70F in New York City.", + "trace_id": "a1b2c3d4e5f6789012345678901234ab", + "sample_hash": "7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a" +} +``` + +Added fields: +- **`response`** (string): Agent's actual response from A2A protocol +- **`trace_id`** (string): OpenTelemetry trace ID for distributed tracing +- **`sample_hash`** (string): SHA256 hash of sample content for deduplication + +**Phase 2: Evaluate (evaluate.py)** + +Input: `data/experiments/ragas_experiment.jsonl` + LLM model + metrics config +Output: `data/experiments/ragas_evaluation.jsonl` + +Example row: +```json +{ + "user_input": "What's the weather in NYC?", + "retrieved_contexts": ["Weather API returned: sunny, 70F"], + "reference": "The weather in NYC is sunny and 70F.", + "response": "It's currently sunny and 70F in New York City.", + "trace_id": "a1b2c3d4e5f6789012345678901234ab", + "sample_hash": "7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a", + "individual_results": { + "faithfulness": 0.95, + "answer_relevancy": 0.92, + "context_precision": 0.88 + } +} +``` + +Added field: +- **`individual_results`** (dict): Metric name → score (0.0-1.0) + +**Phase 3: Publish (publish.py)** + +Input: `data/experiments/ragas_evaluation.jsonl` +Output: OTLP metrics published to observability backend + +Example OTLP metric: +``` +testbench_evaluation_metric{ + name="faithfulness", + workflow_name="weather-agent-test", + execution_id="exec-001", + execution_number="1", + trace_id="a1b2c3d4e5f6789012345678901234ab", + sample_hash="7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a", + user_input_truncated="What's the weather in NYC?" +} = 0.95 +``` + +--- + +## Target State: JSON Schema-Based Pipeline + +### Overview + +The new system uses hierarchical JSON format validated against formal schemas at each phase. Data transitions through three schemas: + +1. **experiment.schema.json** - User input (test definitions) +2. **executed_experiment.schema.json** - After agent execution +3. **evaluated_experiment.schema.json** - After metric evaluation + +### Schema Hierarchy Visualization + +``` +Experiment +├── id (string) - Unique experiment identifier +├── llm_as_a_judge_model (string) - LLM for evaluation +├── default_threshold (number) - Fallback threshold (0.0-1.0) +└── scenarios[] (array) + ├── id (string) - Unique scenario identifier + ├── trace_id (string) - OpenTelemetry trace + ├── name (string) - Human-readable scenario name + ├── steps[] (array) + │ ├── id (string) - Unique step identifier + │ ├── input (string) - User query to agent + │ ├── turns[] (array) - A2A conversation history + │ │ ├── content (string) - Message content + │ │ ├── type (enum) - "human" | "ai" | "tool" + │ │ └── tool_calls[] (array, optional) - Tool invocations + │ ├── reference (object, optional) + │ │ ├── response (string) - Expected answer + │ │ ├── tool_calls[] (array) - Expected tool usage + │ │ ├── topics[] (array) - Expected topics covered + │ │ └── ... 
(other reference fields)
+    │   ├── custom_values (object, optional) - Custom metadata
+    │   └── evaluations[] (array) - Metric configurations/results
+    │       ├── metric_name (string) - Metric identifier
+    │       ├── threshold (number, optional) - Override threshold
+    │       ├── parameters (object, optional) - Metric config
+    │       └── result (object) - Evaluation result (added by evaluate.py)
+    │           ├── result (enum) - "pass" | "fail"
+    │           ├── score (number) - 0.0-1.0
+    │           └── details (object) - Additional breakdown
+    └── evaluations[] (array, optional) - Scenario-level metrics
+```
+
+### experiment.schema.json is User Input
+
+The `experiment.schema.json` defines the **starting point** of the pipeline - the test definitions created manually by users. This is analogous to a test suite configuration file.
+
+Users create `data/datasets/experiment.json` conforming to this schema BEFORE running the pipeline. This file contains:
+- Test scenario definitions
+- Expected inputs to the agent
+- Reference data (ground truth)
+- Metric configurations (which metrics to evaluate, thresholds)
+
+**Example user-created experiment.json:**
+```json
+{
+  "llm_as_a_judge_model": "gemini-2.5-flash-lite",
+  "default_threshold": 0.9,
+  "scenarios": [
+    {
+      "name": "weather_queries",
+      "steps": [
+        {
+          "input": "What's the weather in NYC?",
+          "reference": {
+            "response": "The weather in NYC is sunny and 70F.",
+            "tool_calls": [{"name": "get_weather", "arguments": {"city": "NYC"}}]
+          },
+          "evaluations": [
+            {"metric_name": "AnswerAccuracy", "threshold": 0.9},
+            {"metric_name": "ToolCallAccuracy", "threshold": 1.0}
+          ]
+        }
+      ]
+    }
+  ]
+}
+```
+
+### Detailed Attribute Descriptions
+
+#### run.py execution
+
+**ADDED at experiment level:**
+- `id` (string) - Unique experiment identifier generated via content hash
+
+**ADDED at scenario level:**
+- `id` (string) - Unique scenario identifier (hash of experiment_id + scenario_name)
+- `trace_id` (string) - OpenTelemetry trace ID for distributed tracing
+
+**ADDED at step level:**
+- `id` (string) - Unique step identifier (hash of scenario_id + step_input + step_index)
+- `turns[]` (array) - Full A2A conversation history with message objects:
+  - `content` (string) - Message text or stringified tool result
+  - `type` (enum) - "human" | "ai" | "tool"
+  - `tool_calls[]` (array, optional) - Tool invocations made by the agent
+    - `name` (string) - Tool name
+    - `args` (object) - Tool arguments
+
+**PRESERVED:**
+- All user input data (llm_as_a_judge_model, default_threshold, scenarios, steps, evaluations as metric configurations)
+
+#### evaluate.py execution
+
+**ADDED within each metric in evaluations[]:**
+- `result` (object) - Evaluation result containing:
+  - `result` (string) - "pass" or "fail" based on threshold comparison
+  - `score` (number) - 0.0-1.0 computed metric score from LLM-as-judge
+  - `details` (object) - Additional evaluation breakdown and reasoning
+
+**PRESERVED:**
+- All data from executed_experiment including IDs, trace_id, turns
+
+**TRANSFORMED:**
+- evaluations[] changes from metric configurations → metric configurations WITH results
+- The original metric config fields (metric_name, threshold, parameters) remain intact
+- The `result` object is added alongside them
+
+### Complete Data Flow with Concrete Examples
+
+#### Example 1: User Input (experiment.json)
+
+**File**: `data/datasets/experiment.json`
+**Conforms to**: `experiment.schema.json`
+
+```json
+{
+  "llm_as_a_judge_model": "gemini-2.5-flash-lite",
+  "default_threshold": 0.9,
+  "scenarios": [
+    {
+      "name": "weather_queries",
"steps": [ + { + "input": "What's the weather in NYC?", + "reference": { + "response": "The weather in NYC is sunny and 70F.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "NYC"} + } + ] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy", + "threshold": 0.9 + }, + { + "metric_name": "ToolCallAccuracy", + "threshold": 1.0, + "parameters": {"exact_match": true} + } + ] + }, + { + "input": "And in London?", + "reference": { + "response": "London is rainy with 12C.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "London"} + } + ] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy" + } + ] + } + ] + }, + { + "name": "booking_flow", + "steps": [ + { + "input": "Book a flight to Paris", + "reference": { + "response": "I'll help you book a flight to Paris. What date would you like to travel?", + "topics": ["booking", "flight", "Paris"] + }, + "custom_values": { + "expected_intent": "flight_booking", + "priority": "high" + }, + "evaluations": [ + { + "metric_name": "IntentClassification", # Custom metric + "threshold": 0.95 + } + ] + } + ], + "evaluations": [ + { + "metric_name": "ScenarioCoherence", # Custom Metric + "threshold": 0.85, + "parameters": {"check_continuity": true} + } + ] + } + ] +} +``` + +#### Example 2: After run.py (executed_experiment.json) + +**File**: `data/experiments/executed_experiment.json` +**Conforms to**: `executed_experiment.schema.json` + +```json +{ + "id": "exp_a7f3d2e9c1b4a8f6", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9, + "scenarios": [ + { + "id": "scn_b2c4e8f1d3a5c7e9", + "trace_id": "a1b2c3d4e5f6789012345678901234ab", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5a9f2e4b6d8fa", + "input": "What's the weather in NYC?", + "turns": [ + { + "content": "What's the weather in NYC?", + "type": "human" + }, + { + "content": "Let me check the current weather in New York City for you.", + "type": "ai", + "tool_calls": [ + { + "name": "get_weather", + "args": {"city": "NYC", "units": "imperial"} + } + ] + }, + { + "content": "{\"temperature\": 70, \"condition\": \"sunny\", \"humidity\": 45}", + "type": "tool" + }, + { + "content": "It's currently sunny and 70°F in New York City with 45% humidity.", + "type": "ai" + } + ], + "reference": { + "response": "The weather in NYC is sunny and 70F.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "NYC"} + } + ], + "topics": ["weather", "temperature", "NYC"] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy", + "threshold": 0.9 + }, + { + "metric_name": "ToolCallAccuracy", + "threshold": 1.0, + "parameters": {"exact_match": true} + } + ] + }, + { + "id": "stp_d4e6b0a3f5c7d9eb", + "input": "And in London?", + "turns": [ + { + "content": "And in London?", + "type": "human" + }, + { + "content": "I'll get the weather information for London.", + "type": "ai", + "tool_calls": [ + { + "name": "get_weather", + "args": {"city": "London", "units": "metric"} + } + ] + }, + { + "content": "{\"temperature\": 12, \"condition\": \"rainy\", \"humidity\": 85}", + "type": "tool" + }, + { + "content": "London is currently experiencing rainy weather with a temperature of 12°C and 85% humidity.", + "type": "ai" + } + ], + "reference": { + "response": "London is rainy with 12C.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "London"} + } + ] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy" + } + ] + } + ] + }, + { + "id": "scn_e5f7c1b4d6a8e0fc", + 
"trace_id": "b2c3d4e5f6a7890123456789012345bc", + "name": "booking_flow", + "steps": [ + { + "id": "stp_f6a8d2c5e7b9f1ad", + "input": "Book a flight to Paris", + "turns": [ + { + "content": "Book a flight to Paris", + "type": "human" + }, + { + "content": "I'd be happy to help you book a flight to Paris! To find the best options, could you please tell me what date you'd like to travel?", + "type": "ai" + } + ], + "reference": { + "response": "I'll help you book a flight to Paris. What date would you like to travel?", + "topics": ["booking", "flight", "Paris"] + }, + "custom_values": { + "expected_intent": "flight_booking", + "priority": "high" + }, + "evaluations": [ + { + "metric_name": "IntentClassification", + "threshold": 0.95 + } + ] + } + ], + "evaluations": [ + { + "metric_name": "ScenarioCoherence", + "threshold": 0.85, + "parameters": {"check_continuity": true} + } + ] + } + ] +} +``` + +**Key Changes from Example 1:** +- Added `id` at experiment, scenario, and step levels (content-based SHA256 hashes) +- Added `trace_id` at scenario level (OpenTelemetry distributed tracing) +- Added `turns[]` at step level with full A2A conversation history +- Preserved all user input data unchanged + +#### Example 3: After evaluate.py (evaluated_experiment.json) + +**File**: `data/experiments/evaluated_experiment.json` +**Conforms to**: `evaluated_experiment.schema.json` + +```json +{ + "id": "exp_a7f3d2e9c1b4a8f6", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9, + "scenarios": [ + { + "id": "scn_b2c4e8f1d3a5c7e9", + "trace_id": "a1b2c3d4e5f6789012345678901234ab", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5a9f2e4b6d8fa", + "input": "What's the weather in NYC?", + "turns": [ + { + "content": "What's the weather in NYC?", + "type": "human" + }, + { + "content": "Let me check the current weather in New York City for you.", + "type": "ai", + "tool_calls": [ + { + "name": "get_weather", + "args": {"city": "NYC", "units": "imperial"} + } + ] + }, + { + "content": "{\"temperature\": 70, \"condition\": \"sunny\", \"humidity\": 45}", + "type": "tool" + }, + { + "content": "It's currently sunny and 70°F in New York City with 45% humidity.", + "type": "ai" + } + ], + "reference": { + "response": "The weather in NYC is sunny and 70F.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "NYC"} + } + ], + "topics": ["weather", "temperature", "NYC"] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy", + "threshold": 0.9, + "result": { + "result": "pass", + "score": 0.92, + "details": { + "semantic_similarity": 0.94, + "factual_consistency": 0.90, + "reasoning": "Response accurately conveys weather information with additional helpful details" + } + } + }, + { + "metric_name": "ToolCallAccuracy", + "threshold": 1.0, + "parameters": {"exact_match": true}, + "result": { + "result": "pass", + "score": 1.0, + "details": { + "tool_name_match": true, + "required_args_match": true, + "extra_args": ["units"] + } + } + } + ] + }, + { + "id": "stp_d4e6b0a3f5c7d9eb", + "input": "And in London?", + "turns": [ + { + "content": "And in London?", + "type": "human" + }, + { + "content": "I'll get the weather information for London.", + "type": "ai", + "tool_calls": [ + { + "name": "get_weather", + "args": {"city": "London", "units": "metric"} + } + ] + }, + { + "content": "{\"temperature\": 12, \"condition\": \"rainy\", \"humidity\": 85}", + "type": "tool" + }, + { + "content": "London is currently experiencing rainy weather with a temperature of 
12°C and 85% humidity.", + "type": "ai" + } + ], + "reference": { + "response": "London is rainy with 12C.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "London"} + } + ] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy", + "result": { + "result": "fail", + "score": 0.87, + "details": { + "semantic_similarity": 0.91, + "factual_consistency": 0.83, + "reasoning": "Response contains correct information but fails to meet default threshold of 0.9" + } + } + } + ] + } + ] + }, + { + "id": "scn_e5f7c1b4d6a8e0fc", + "trace_id": "b2c3d4e5f6a7890123456789012345bc", + "name": "booking_flow", + "steps": [ + { + "id": "stp_f6a8d2c5e7b9f1ad", + "input": "Book a flight to Paris", + "turns": [ + { + "content": "Book a flight to Paris", + "type": "human" + }, + { + "content": "I'd be happy to help you book a flight to Paris! To find the best options, could you please tell me what date you'd like to travel?", + "type": "ai" + } + ], + "reference": { + "response": "I'll help you book a flight to Paris. What date would you like to travel?", + "topics": ["booking", "flight", "Paris"] + }, + "custom_values": { + "expected_intent": "flight_booking", + "priority": "high" + }, + "evaluations": [ + { + "metric_name": "IntentClassification", + "threshold": 0.95, + "result": { + "result": "pass", + "score": 0.98, + "details": { + "predicted_intent": "flight_booking", + "confidence": 0.98, + "alternative_intents": [] + } + } + } + ] + } + ], + "evaluations": [ + { + "metric_name": "ScenarioCoherence", + "threshold": 0.85, + "parameters": {"check_continuity": true}, + "result": { + "result": "pass", + "score": 0.90, + "details": { + "conversation_flow": 0.92, + "topic_consistency": 0.88, + "reasoning": "Scenario maintains coherent booking flow with appropriate agent responses" + } + } + } + ] + } + ] +} +``` + +**Key Changes from Example 2:** +- Added `result` object to each metric in `evaluations[]` arrays +- Each result contains: + - `result`: "pass" or "fail" based on threshold comparison + - `score`: 0.0-1.0 metric score from LLM-as-judge + - `details`: Additional breakdown and reasoning +- Metric configurations (metric_name, threshold, parameters) remain intact +- Step with score 0.87 fails because it's below default_threshold (0.9) +- All IDs, trace_id, turns preserved unchanged + +### JsonSchemaBackend Architecture + +The new `JsonSchemaBackend` class provides a direct mapping between hierarchical JSON storage and RAGAS processing. + +**Key Design Decision**: Steps within a scenario must run sequentially to maintain conversation context (via `context_id` in A2A protocol). The backend passes scenarios to RAGAS. + +#### Scenario-Level Processing + +The backend loads experiments and passes each scenario to RAGAS. The `@experiment()` decorated functions process all steps within a scenario sequentially. 
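+
+The sequential constraint is easiest to see as a sketch. The code below is illustrative only: `send_to_agent`, its return shape, and the context handling are assumptions standing in for the real A2A client, and the actual run.py function is additionally wrapped by RAGAS's `@experiment()` decorator. The Input/Output description that follows shows the scenario rows such a function receives.
+
+```python
+import asyncio
+
+
+async def send_to_agent(user_input: str, context_id: str | None) -> tuple[list[dict], str]:
+    # Stand-in for the real A2A client call; returns the turns recorded for
+    # this step plus the context_id that keeps the conversation alive.
+    turns = [
+        {"content": user_input, "type": "human"},
+        {"content": f"echo: {user_input}", "type": "ai"},
+    ]
+    return turns, context_id or "ctx-0001"
+
+
+async def run_scenario(scenario: dict) -> dict:
+    """Process one scenario row: steps run strictly in order so the agent
+    keeps conversational context across steps via a shared A2A context_id."""
+    context_id: str | None = None
+    for step in scenario["steps"]:
+        turns, context_id = await send_to_agent(step["input"], context_id)
+        step["turns"] = turns  # recorded for evaluate.py and the schemas
+    return scenario  # RAGAS hands the enriched scenario back to the backend
+
+
+if __name__ == "__main__":
+    demo = {"name": "weather_queries",
+            "steps": [{"input": "What's the weather?"}, {"input": "In NYC?"}]}
+    print(asyncio.run(run_scenario(demo)))
+```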
+ +**Input**: Hierarchical JSON structure +**Output**: List of scenario dictionaries + +**Example - Loading Scenarios**: + +Input (hierarchical JSON file): +```json +{ + "id": "exp_a7f3", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9, + "scenarios": [ + { + "id": "scn_b2c4", + "trace_id": "a1b2c3d4", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5", + "input": "What's the weather?", + "reference": {"response": "Sunny"}, + "evaluations": [{"metric_name": "AnswerAccuracy"}] + }, + { + "id": "stp_d4e6", + "input": "In NYC?", + "reference": {"response": "70F"} + } + ] + } + ] +} +``` + +Output (list of scenario rows for RAGAS): +```python +[ + { + # Complete scenario structure preserved + "id": "scn_b2c4", + "trace_id": "a1b2c3d4", + "name": "weather_queries", + "steps": [ # Steps stay nested inside scenario + { + "id": "stp_c3d5", + "input": "What's the weather?", + "reference": {"response": "Sunny"}, + "evaluations": [{"metric_name": "AnswerAccuracy"}] + }, + { + "id": "stp_d4e6", + "input": "In NYC?", + "reference": {"response": "70F"} + } + ], + + # Experiment metadata added for experiment function access + "_experiment_meta": { + "id": "exp_a7f3", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9 + } + } +] +``` + +**Key Points**: +- Each scenario becomes a row +- Steps remain nested within the scenario structure +- Experiment metadata added via `_experiment_meta` for easy access + +#### Saving Strategy (Scenarios → JSON) + +When saving data via `save_experiment()`, the backend simply writes scenarios back to JSON, adding any missing IDs. + +**Input**: List of scenario dictionaries (after processing by experiment functions) +**Output**: Hierarchical JSON structure + +**Example - Saving Scenarios**: + +Input (list of scenario rows after RAGAS processing): +```python +[ + { + # Complete scenario with all steps (now with added data) + "id": "scn_b2c4", + "trace_id": "a1b2c3d4", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5", + "input": "What's the weather?", + "turns": [{"content": "What's the weather?", "type": "human"}, {"content": "Sunny!", "type": "ai"}], + "reference": {"response": "Sunny"}, + "evaluations": [{"metric_name": "AnswerAccuracy", "result": {"result": "pass", "score": 0.95, "details": {}}}] + }, + { + "id": "stp_d4e6", + "input": "In NYC?", + "turns": [{"content": "In NYC?", "type": "human"}, {"content": "70F", "type": "ai"}], + "reference": {"response": "70F"} + } + ], + + # Experiment metadata (will be extracted and removed) + "_experiment_meta": { + "id": "exp_a7f3", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9 + } + } +] +``` + +Output (hierarchical JSON file): +```json +{ + "id": "exp_a7f3", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9, + "scenarios": [ + { + "id": "scn_b2c4", + "trace_id": "a1b2c3d4", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5", + "input": "What's the weather?", + "turns": [{"content": "What's the weather?", "type": "human"}, {"content": "Sunny!", "type": "ai"}], + "reference": {"response": "Sunny"}, + "evaluations": [{"metric_name": "AnswerAccuracy", "result": {"result": "pass", "score": 0.95, "details": {}}}] + }, + { + "id": "stp_d4e6", + "input": "In NYC?", + "turns": [{"content": "In NYC?", "type": "human"}, {"content": "70F", "type": "ai"}], + "reference": {"response": "70F"} + } + ] + } + ] +} +``` + +**Key Points**: +- Scenarios maintain their structure +- Only 
`_experiment_meta` removed (not part of scenario schema)
+- No grouping or sorting needed - scenarios already complete
+- Missing IDs generated via content hashing
+
+#### Content-Based ID Generation
+
+IDs are generated deterministically using hashing, ensuring reproducibility across runs.
+
+**ID Generation Strategy**:
+
+- **experiment_id**: Hash of the experiment content
+- **scenario_id**: Hash of (experiment_id + scenario_name)
+- **step_id**: Hash of (scenario_id + step_input + step_index)
+
+**Benefits**:
+- Deterministic: Same content always produces same ID
+- Traceable: Can verify ID matches content by re-hashing
+- Reproducible: Re-running same test produces same IDs for comparison
+
+
+## Grafana Visualization Strategy
+
+### Overview
+
+The testbench uses two complementary Grafana dashboards for monitoring and debugging agent quality:
+
+1. **Trends Dashboard** - Monitor quality trends over time, spot regressions after deployments
+2. **Execution Details Dashboard** - Investigate specific execution failures, identify root causes
+
+This two-dashboard approach separates high-level monitoring from deep debugging, enabling efficient quality assurance workflows.
+
+### User Workflow: Monitoring → Investigation → Debugging
+
+```
+Trends Dashboard
+  │
+  ├─ Spot drop in scores after deployment
+  │
+  └─ Click execution row → Execution Details Dashboard
+       │
+       ├─ See which scenarios/steps failed
+       │
+       └─ Click [View Trace] → Tempo
+            │
+            └─ See full agent behavior
+```
+
+### Current OTLP Metrics
+
+The current system publishes flat metrics with sample-level labels:
+
+Example OTLP metrics:
+```
+testbench_evaluation_metric{
+  name="faithfulness",
+  workflow_name="weather-agent-test",
+  execution_id="exec-001",
+  execution_number="1",
+  trace_id="a1b2c3d4e5f6789012345678901234ab",
+  sample_hash="7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a",
+  user_input_truncated="What's the weather in NYC?"
+} = 0.95
+
+testbench_evaluation_metric{
+  name="answer_relevancy",
+  workflow_name="weather-agent-test",
+  execution_id="exec-001",
+  execution_number="1",
+  trace_id="a1b2c3d4e5f6789012345678901234ab",
+  sample_hash="8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3",
+  user_input_truncated="And in London?"
+} = 0.92
+```
+
+### Enhanced OTLP Metrics
+
+The new system publishes metrics with rich hierarchical labels:
+
+Example OTLP metrics:
+```
+testbench_evaluation_metric{
+  name="AnswerAccuracy",
+  workflow_name="weather-agent-test",
+  execution_id="exec-001",
+  execution_number="1",
+  experiment_id="exp_a7f3d2e9c1b4a8f6",
+  scenario_id="scn_b2c4e8f1d3a5c7e9",
+  scenario_name="weather_queries",
+  step_id="stp_c3d5a9f2e4b6d8fa",
+  step_index="0",
+  trace_id="a1b2c3d4e5f6789012345678901234ab",
+  threshold="0.9",
+  result="pass",
+  user_input_truncated="What's the weather in NYC?"
+} = 0.92
+
+testbench_evaluation_metric{
+  name="AnswerAccuracy",
+  workflow_name="weather-agent-test",
+  execution_id="exec-001",
+  execution_number="1",
+  experiment_id="exp_a7f3d2e9c1b4a8f6",
+  scenario_id="scn_b2c4e8f1d3a5c7e9",
+  scenario_name="weather_queries",
+  step_id="stp_d4e6b0a3f5c7d9eb",
+  step_index="1",
+  trace_id="a1b2c3d4e5f6789012345678901234ab",
+  threshold="0.9",
+  result="fail",
+  user_input_truncated="And in London?"
+} = 0.87
+```
+
+### Dashboard 1: Trends Over Time
+
+**Purpose**: Monitor agent quality trends, identify regressions correlated with deployments.
+ +**Use Cases**: +- Continuous monitoring of agent quality metrics +- Spotting degradation after code deployments +- Comparing scenario performance over time +- Identifying which executions need investigation + +**Dashboard Mockup**: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ RAGAS Trends Dashboard - weather-agent-test │ +│ Time Range: [Last 30 Days ▼] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Overall Pass Rate Over Time │ +│ │ +│ 100% ┤ │ +│ │ ┌─────────────┐ │ +│ 90% │ │ v2.1.0 │ ●─────────● weather_queries │ +│ │ │ deployed │ ╱ ╲ │ +│ 80% │ └─────────────┘ ● ● booking_flow │ +│ │ ╱ ╲ │ +│ 70% ├──────────────────● ● error_handling │ +│ │ │ +│ 60% ┤ │ +│ └───────────────────────────────────────── │ +│ Jan 15 Jan 22 Jan 29 Feb 5 │ +│ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Average Metric Scores Over Time │ +│ │ +│ 1.0 ┤ │ +│ │ ●─────●─────● AnswerAccuracy │ +│ 0.9 │ ●─────●───────╱ │ +│ │ ╱ ╲ │ +│ 0.8 ├───● ●─────────────── ToolCallAccuracy │ +│ │ │ +│ 0.7 ┤ │ +│ └───────────────────────────────────────── │ +│ Jan 15 Jan 22 Jan 29 Feb 5 │ +│ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Recent Executions [Filter ▼] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Exec ID │ Time │ Pass Rate │ Avg Score │ Failed │ Status │ +│──────────┼────────────┼───────────┼───────────┼────────┼──────────│ +│ exec-005 │ Feb 5 14:30│ 85% │ 0.89 │ 3 │ [View] │ ← Click to drill down +│ exec-004 │ Feb 5 12:15│ 90% │ 0.91 │ 2 │ [View] │ +│ exec-003 │ Feb 4 18:45│ 75% │ 0.82 │ 5 │ [View] │ +│ exec-002 │ Feb 4 14:20│ 95% │ 0.94 │ 1 │ [View] │ +│ │ +│ 📊 Click [View] to investigate execution details │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +**Key Features**: +- **Time-series focus**: All data aggregated over time windows +- **Deployment annotations**: Visual markers showing when agent versions deployed +- **Scenario-level trends**: Compare different test scenarios (weather_queries, booking_flow, etc.) +- **Metric-level trends**: Track specific metrics (AnswerAccuracy, ToolCallAccuracy, etc.) +- **Quick drill-down**: Click execution ID to jump to details dashboard + +### Dashboard 2: Execution Details + +**Purpose**: Deep-dive into a specific execution to diagnose failures. 
+ +**Use Cases**: +- Investigating why an execution failed +- Identifying which scenarios/steps caused failures +- Finding patterns in failed evaluations +- Linking to distributed traces for root cause analysis + +**Dashboard Mockup**: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Execution Details - exec-005 (Feb 5, 2026 14:30) │ +│ Workflow: weather-agent-test | Experiment: exp_a7f3d2e9 │ +│ [← Back to Trends] │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Scenarios │ │ Pass Rate │ │ Avg Score │ │ +│ │ 3 │ │ 85% │ │ 0.89 │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Failed Steps │ │ Total Steps │ │ Metrics │ │ +│ │ 3 │ │ 20 │ │ 5 │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +│ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Pass Rate by Scenario (This Execution) │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ weather_queries ██████████ 100% (6/6 passed) │ +│ booking_flow ████████░░ 80% (4/5 passed) ⚠ │ +│ error_handling ███████░░░ 70% (7/10 passed) ⚠ │ +│ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Failed Steps (This Execution) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Scenario │ Step │ Metric │ Score │ Thresh │ Trace │ +│──────────────────┼──────┼─────────────────┼───────┼────────┼───────│ +│ booking_flow │ 2 │ IntentAccuracy │ 0.87 │ 0.90 │[View] │ +│ error_handling │ 1 │ ErrorRecovery │ 0.75 │ 0.80 │[View] │ +│ error_handling │ 5 │ Faithfulness │ 0.82 │ 0.85 │[View] │ +│ │ +│ 🔍 Click [View] to see full trace in Tempo │ +├─────────────────────────────────────────────────────────────────────┤ +│ All Steps Results [Search: ___] [Filter] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Scenario │ Step │ Metric │ Result │ Score │ +│──────────────────┼──────┼─────────────────┼────────┼───────────────│ +│ weather_queries │ 0 │ AnswerAccuracy │ PASS ✓ │ 0.95 │ +│ weather_queries │ 1 │ AnswerAccuracy │ PASS ✓ │ 0.92 │ +│ booking_flow │ 0 │ IntentAccuracy │ PASS ✓ │ 0.98 │ +│ booking_flow │ 2 │ IntentAccuracy │ FAIL ✗ │ 0.87 │ +│ ... (showing 10 of 20) │ +│ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +**Key Features**: +- **Single execution focus**: All data filtered by `execution_id` +- **Summary cards**: Quick overview of execution health +- **Scenario breakdown**: See which scenarios passed/failed +- **Failed steps detail**: Pinpoint exact failures with threshold comparisons +- **All steps results**: Searchable/filterable table of every evaluation +- **Trace linking**: Direct links to Tempo for deep debugging + +### Trace Linking to Tempo + +Each scenario has a `trace_id` that links to OpenTelemetry traces in Tempo, enabling deep debugging of agent behavior. + +**User Workflow**: +1. Identify failed step in Grafana "Execution Details" dashboard +2. Click [View Trace] link next to failed step +3. Opens Tempo showing full execution trace for that scenario +4. Drill down into agent spans, tool calls, LLM requests +5. 
Correlate evaluation failure with runtime behavior diff --git a/concepts/generic-metrics-registry-concept.md b/concepts/generic-metrics-registry-concept.md new file mode 100644 index 0000000..23e607f --- /dev/null +++ b/concepts/generic-metrics-registry-concept.md @@ -0,0 +1,223 @@ +# Generic MetricsRegistry Architecture Concept + +## Purpose + +This document describes a concept for transforming MetricsRegistry from a RAGAS-specific implementation into a framework-agnostic architecture that can support multiple metric frameworks (RAGAS, DeepEval, etc.). + +## Problem Statement + +The current `MetricsRegistry` is tightly coupled to RAGAS: + +```python +class MetricsRegistry: + def __init__(self): + self._classes: dict[str, type[BaseMetric]] = {} # RAGAS BaseMetric + self._discover_metrics() + + def _discover_metrics(self) -> None: + # Hardcoded to ragas.metrics.collections + for name, obj in inspect.getmembers(metrics_module): + if inspect.isclass(obj) and issubclass(obj, BaseMetric): + self._classes[name] = obj +``` + +**Limitations**: +- Cannot use metrics from DeepEval or other frameworks +- Returns framework-specific instances (RAGAS `BaseMetric`) +- Assumes RAGAS conventions (llm injection, ascore method) +- Not extensible without modifying core code + +## User Requirements + +1. **Generic registry**: Not limited to RAGAS metrics +2. **Callable interface**: Registry returns a callable, not a metric instance +3. **Callable signature**: `async def(sample: Step, **metric_args) -> Result` (where `Step` is defined in `scripts/schema/executed_experiment.schema.json` and `Result` contains `score: float` and `reason: str | None`) +4. **Easy extensibility**: Adding new frameworks should be straightforward +5. **Configurable naming**: Support framework-prefixed names with optional aliases + +## Framework Comparison + +### RAGAS vs DeepEval + +| Aspect | RAGAS | DeepEval | +|--------|-------|----------| +| **Base Class** | `ragas.metrics.collections.BaseMetric` | `deepeval.metrics.BaseMetric` | +| **Input Format** | Dict with flexible keys | `LLMTestCase` typed object | +| **Async Method** | `async ascore(**kwargs)` | `async a_measure(test_case)` | +| **Result Format** | `MetricResult` with `.value` | Sets `metric.score` attribute | +| **LLM Injection** | Constructor param: `__init__(llm=...)` | Per-metric: `evaluation_model=...` | +| **Discovery** | Import from collections module | Import from metrics module | + +**Key Insight**: Frameworks have fundamentally different APIs, requiring an adapter pattern to unify them. + +--- + +## Proposed Architecture + +### Core Concept: Adapter Pattern + +Instead of the registry working directly with framework-specific metrics, introduce an **adapter layer** that wraps framework-specific metrics behind a unified interface. + +``` +┌─────────────────────────────────────┐ +│ GenericMetricsRegistry │ +│ │ +│ get_metric_callable() → Callable │ +└──────────────┬──────────────────────┘ + │ + ┌───────┴──────────┐ + │ │ +┌──────▼──────┐ ┌──────▼──────┐ +│ RAGAS │ │ DeepEval │ +│ Adapter │ │ Adapter │ +└──────┬──────┘ └──────┬──────┘ + │ │ +┌──────▼──────┐ ┌──────▼──────┐ +│ RAGAS │ │ DeepEval │ +│ Metrics │ │ Metrics │ +└─────────────┘ └─────────────┘ +``` + +### 1. MetricCallable Protocol + +Define a **protocol** (not abstract class) that all adapters must conform to: + +```python +from dataclasses import dataclass +from typing import Protocol, Any + +@dataclass +class Result: + """ + Unified result from a metric evaluation. 
+ + Attributes: + score: Float score between 0.0 and 1.0 + reason: Optional explanation for the score (LLM-generated reasoning) + """ + score: float + reason: str | None = None + +class MetricCallable(Protocol): + """ + Unified interface for executing metrics across all frameworks. + + This is the callable that the registry returns to users. + """ + + async def __call__( + self, + sample: Step, + **metric_args: Any + ) -> Result: + """ + Evaluate a single sample. + + Args: + sample: A Step object as defined in executed_experiment.schema.json, + containing input, turns, reference, custom_values, and evaluations. + **metric_args: Additional runtime arguments for the metric + + Returns: + Result containing score (0.0-1.0) and optional reason + """ + ... +``` + +**Design Rationale**: +- **Protocol (not ABC)**: Enables structural subtyping - any object with `async __call__(sample, **args) -> Result` satisfies the protocol +- **Structured return type**: `Result` dataclass with `score: float` and `reason: str | None` — provides both the numeric result and the LLM-generated reasoning behind it +- **Step as input**: Uses the `Step` type defined in `executed_experiment.schema.json`, providing a structured input with `input`, `turns`, `reference`, `custom_values`, and `evaluations` fields +- **Runtime args**: Allows passing additional parameters at evaluation time + +### 2. FrameworkAdapter Abstract Base Class + +Define an **ABC** that all framework adapters must implement: + +```python +from abc import ABC, abstractmethod +from typing import Any, Type + +class FrameworkAdapter(ABC): + """ + Base class for integrating a metric framework into the registry. + + Each framework (RAGAS, DeepEval, etc.) implements this interface. + """ + + @abstractmethod + def discover_metrics(self) -> dict[str, Type[Any]]: + """ + Discover available metric classes from this framework. + + Returns: + Dict mapping metric class names to their types + Example: {"Faithfulness": } + """ + pass + + @abstractmethod + def create_callable( + self, + class_name: str, + parameters: dict[str, Any], + llm: Any + ) -> MetricCallable: + """ + Create a MetricCallable for the specified metric. + + This method: + 1. Gets the metric class from discovered metrics + 2. Instantiates it with framework-specific logic + 3. Wraps it in an adapter that conforms to MetricCallable + + Args: + class_name: Name of metric class (e.g., "Faithfulness") + parameters: Constructor parameters for the metric + llm: LLM instance (may be used differently per framework) + + Returns: + A callable conforming to MetricCallable protocol + """ + pass + + @property + @abstractmethod + def framework_name(self) -> str: + """ + Identifier for this framework (e.g., 'ragas', 'deepeval'). + + Used in: + - Config files: {"framework": "ragas", ...} + - Result keys: "ragas.Faithfulness" + - Error messages + """ + pass +``` + +**Design Rationale**: +- **ABC (not Protocol)**: Enforces implementation inheritance +- **Discovery method**: Each framework knows how to find its metrics +- **Factory method**: Encapsulates framework-specific instantiation logic +- **LLM parameter**: Passed to adapter even though frameworks use it differently + +### 3. 
RAGAS Adapter Implementation + +#### RagasMetricCallable + +Wraps a RAGAS metric instance to conform to `MetricCallable` + +**Parameter filtering**: Only passes parameters the metric expects +**Format translation**: Converts generic sample → RAGAS conventions +**Result extraction**: Unwraps `MetricResult` into `Result(score, reason)` +**No metric_args usage yet**: Reserved for future use + +#### RagasFrameworkAdapter + +Implements `FrameworkAdapter` for RAGAS + +### 4. GenericMetricsRegistry + +The registry manages framework adapters. + +--- \ No newline at end of file diff --git a/scripts/schema/common.schema.json b/scripts/schema/common.schema.json new file mode 100644 index 0000000..2f21b34 --- /dev/null +++ b/scripts/schema/common.schema.json @@ -0,0 +1,85 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Common Schema Definitions", + "description": "Shared definitions used across experiment schemas to eliminate redundancy.", + "definitions": { + "reference": { + "type": "object", + "description": "Expected reference data for evaluation", + "properties": { + "response": { + "type": "string", + "description": "Expected final response from the agent." + }, + "tool_calls": { + "type": "array", + "description": "Expected tool calls the agent should make", + "items": { + "type": "object", + "required": ["name", "arguments"], + "properties": { + "name": { "type": "string" }, + "arguments": { + "type": "object", + "description": "Key-value pairs of tool arguments." + } + } + } + }, + "topics": { + "type": "array", + "items": { "type": "string" }, + "description": "Expected topics that should be covered in the agent's response." + } + } + }, + "turn": { + "type": "object", + "description": "A single turn in a multi-turn conversation", + "properties": { + "content": { + "type": "string", + "description": "The text content of the message or the stringified result of a tool call." + }, + "type": { + "type": "string", + "enum": ["human", "ai", "tool"], + "description": "The role of the entity generating the content." + }, + "tool_calls": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { "type": "string" }, + "args": { "type": "object" } + }, + "required": ["name", "args"] + } + } + }, + "required": ["content", "type"] + }, + "metric": { + "type": "object", + "description": "Metric configuration for evaluations", + "required": ["metric_name"], + "properties": { + "metric_name": { + "type": "string", + "description": "Registry ID (e.g., 'ragas_faithfulness', 'tool_check')" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1, + "description": "Minimum acceptable score for this metric (0.0 to 1.0). Values below this threshold indicate test failure." + }, + "parameters": { + "type": "object", + "description": "Arguments passed to the generic metric adapter." + } + } + } + } +} diff --git a/scripts/schema/evaluated_experiment.schema.json b/scripts/schema/evaluated_experiment.schema.json new file mode 100644 index 0000000..365fd2f --- /dev/null +++ b/scripts/schema/evaluated_experiment.schema.json @@ -0,0 +1,123 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Testbench Experiment", + "description": "Schema for defining experiments to evaluate agent responses.", + "type": "object", + "required": ["scenarios"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the experiment." 
+ }, + "llm_as_a_judge_model": { + "type": "string", + "description": "The specific LLM used to grade responses (e.g., 'gpt-4o', 'gemini-1.5-pro')." + }, + "default_threshold": { + "type": "number", + "default": 0.9, + "minimum": 0, + "maximum": 1, + "description": "The fallback threshold used across all evaluations if not specified at the metric level." + }, + "scenarios": { + "type": "array", + "items": { "$ref": "#/definitions/scenario" } + } + }, + "definitions": { + "scenario": { + "type": "object", + "required": ["name", "steps"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the scenario." + }, + "trace_id": { + "type": "string", + "description": "Trace identifier linking to the execution trace of this scenario." + }, + "name": { "type": "string" }, + "steps": { + "type": "array", + "items": { "$ref": "#/definitions/step" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "evaluations": { + "type": "array", + "items": { "$ref": "#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + }, + "step": { + "type": "object", + "required": ["input"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the step." + }, + "input": { + "type": "string", + "description": "User input to the agent at this step." + }, + "turns": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/turn" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "custom_values": { + "type": "object", + "description": "Additional key-value pairs to include in the test results for this step." + }, + "evaluations": { + "type": "array", + "items": { "$ref": "#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + }, + "metric": { + "type": "object", + "required": ["metric_name"], + "properties": { + "metric_name": { + "type": "string", + "description": "Registry ID (e.g., 'ragas_faithfulness', 'tool_check')" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1, + "description": "Minimum acceptable score for this metric (0.0 to 1.0). Values below this threshold indicate test failure." + }, + "parameters": { + "type": "object", + "description": "Arguments passed to the generic metric adapter." + }, + "result": { + "type": "object", + "properties": { + "result": { + "type": "string", + "enum": ["pass", "fail"], + "description": "Indicates whether the metric evaluation passed or failed based on the threshold." + }, + "score": { + "type": "number", + "minimum": 0, + "maximum": 1, + "description": "The computed score for this metric." + }, + "details": { + "type": "object", + "description": "Additional details or breakdown of the metric evaluation." + } + } + } + } + } + } +} \ No newline at end of file diff --git a/scripts/schema/executed_experiment.schema.json b/scripts/schema/executed_experiment.schema.json new file mode 100644 index 0000000..343e4cc --- /dev/null +++ b/scripts/schema/executed_experiment.schema.json @@ -0,0 +1,83 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Testbench Experiment", + "description": "Schema for defining experiments to evaluate agent responses.", + "type": "object", + "required": ["scenarios"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the experiment." 
+ }, + "llm_as_a_judge_model": { + "type": "string", + "description": "The specific LLM used to grade responses (e.g., 'gpt-4o', 'gemini-1.5-pro')." + }, + "default_threshold": { + "type": "number", + "default": 0.9, + "minimum": 0, + "maximum": 1, + "description": "The fallback threshold used across all evaluations if not specified at the metric level." + }, + "scenarios": { + "type": "array", + "items": { "$ref": "#/definitions/scenario" } + } + }, + "definitions": { + "scenario": { + "type": "object", + "required": ["name", "steps"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the scenario." + }, + "trace_id": { + "type": "string", + "description": "Trace identifier linking to the execution trace of this scenario." + }, + "name": { "type": "string" }, + "steps": { + "type": "array", + "items": { "$ref": "#/definitions/step" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "evaluations": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + }, + "step": { + "type": "object", + "required": ["input"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the step." + }, + "input": { + "type": "string", + "description": "User input to the agent at this step." + }, + "turns": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/turn" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "custom_values": { + "type": "object", + "description": "Additional key-value pairs to include in the test results for this step." + }, + "evaluations": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + } + } +} \ No newline at end of file diff --git a/scripts/schema/experiment.schema.json b/scripts/schema/experiment.schema.json new file mode 100644 index 0000000..4a0ed93 --- /dev/null +++ b/scripts/schema/experiment.schema.json @@ -0,0 +1,63 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Testbench Experiment", + "description": "Schema for defining experiments to evaluate agent responses.", + "type": "object", + "required": ["scenarios"], + "properties": { + "llm_as_a_judge_model": { + "type": "string", + "description": "The specific LLM used to grade responses (e.g., 'gpt-4o', 'gemini-1.5-pro')." + }, + "default_threshold": { + "type": "number", + "default": 0.9, + "minimum": 0, + "maximum": 1, + "description": "The fallback threshold used across all evaluations if not specified at the metric level." + }, + "scenarios": { + "type": "array", + "items": { "$ref": "#/definitions/scenario" } + } + }, + "definitions": { + "scenario": { + "type": "object", + "required": ["name", "steps"], + "properties": { + "name": { "type": "string" }, + "steps": { + "type": "array", + "items": { "$ref": "#/definitions/step" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "evaluations": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + }, + "step": { + "type": "object", + "required": ["input"], + "properties": { + "input": { + "type": "string", + "description": "User input to the agent at this step." 
+ }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "custom_values": { + "type": "object", + "description": "Additional key-value pairs to include in the test results for this step." + }, + "evaluations": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + } + } +} \ No newline at end of file