diff --git a/concepts/data_structur_concept.md b/concepts/data_structur_concept.md
new file mode 100644
index 0000000..c56b5d2
--- /dev/null
+++ b/concepts/data_structur_concept.md
@@ -0,0 +1,1116 @@
+# Data Structure Concept: Migration from JSONL to JSON Schema
+
+## Executive Summary
+
+### Problem Statement
+
+The current testbench pipeline uses flat JSONL (JSON Lines) files for storing test data throughout the 3-phase evaluation workflow. Because every line is an independent sample, the format cannot express multi-turn scenarios, relationships between test steps, or scenario-level metrics, and no schema validation catches malformed test data before a run.
+
+### Solution
+
+Replace the flat JSONL format with a hierarchical JSON Schema-based structure that organizes tests into a three-level hierarchy:
+
+```
+Experiment → Scenarios → Steps
+```
+
+This migration introduces formal JSON schemas validated at each pipeline phase, custom RAGAS backend implementations for bidirectional data transformation, and content-based deterministic ID generation for reproducibility.
+
+### Key Benefits
+
+1. **Scenario Organization**: Group related test steps into named scenarios (e.g., "weather_queries", "booking_flow")
+2. **Hierarchical Evaluation**: Support both step-level and scenario-level metrics
+3. **Deterministic IDs**: Content-based hashing ensures reproducible identifiers across runs
+4. **Schema Validation**: Catches data contract violations early
+
+### Impact
+
+This is a **breaking change** requiring:
+- New `JsonSchemaBackend` implementation
+- Updates to run.py, evaluate.py, publish.py, visualize.py
+- Migration of existing JSONL test data to JSON format
+- Updated Grafana dashboards
+- No backwards compatibility with the JSONL format
+
+---
+
+## Current State: JSONL-Based Pipeline
+
+### Overview
+
+The testbench currently uses the RAGAS framework's `LocalJSONLBackend` to store evaluation data in flat JSONL format. Each line in the file represents a single, independent test sample with no relationship to other samples.
+
+### RAGAS Backend Architecture
+
+RAGAS (Retrieval-Augmented Generation Assessment) uses a backend abstraction layer for data persistence.
+The `@experiment()` decorator processes datasets row by row, asynchronously. Each decorated function:
+1. Receives a single row (dict) as input
+2. Processes it (queries the agent, evaluates metrics, etc.)
+3. Returns an enriched row (dict) with added fields
+4. RAGAS collects all rows into a list and passes them to the backend for serialization
+
+The `LocalJSONLBackend` simply writes each dict as a JSON line without understanding relationships between samples.
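+
+As a concrete illustration of this row-by-row contract, here is a minimal sketch of a Phase 1 experiment function. It is only a sketch: `query_agent`, its return values, and the hash truncation are illustrative assumptions, not the actual run.py implementation, and the real function is additionally wrapped by RAGAS's `@experiment()` decorator.
+
+```python
+import asyncio
+import hashlib
+
+
+async def query_agent(user_input: str) -> tuple[str, str]:
+    # Stand-in for the real A2A call; returns (response, trace_id).
+    return f"echo: {user_input}", "0" * 32
+
+
+async def run_single_sample(row: dict) -> dict:
+    """One JSONL row in, one enriched row out."""
+    response, trace_id = await query_agent(row["user_input"])
+    row["response"] = response
+    row["trace_id"] = trace_id
+    # Content-based hash used for deduplication across runs.
+    row["sample_hash"] = hashlib.sha256(row["user_input"].encode()).hexdigest()[:32]
+    return row  # RAGAS collects these rows and hands them to LocalJSONLBackend
+
+
+if __name__ == "__main__":
+    sample = {"user_input": "What's the weather in NYC?"}
+    print(asyncio.run(run_single_sample(sample)))
+```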
+ +### Data Flow with Concrete Examples + +**Phase 1: Run (run.py)** + +Input: `data/datasets/ragas_dataset.jsonl` + Agent URL +Output: `data/experiments/ragas_experiment.jsonl` + +Example output row: +```json +{ + "user_input": "What's the weather in NYC?", + "retrieved_contexts": ["Weather API returned: sunny, 70F"], + "reference": "The weather in NYC is sunny and 70F.", + "response": "It's currently sunny and 70F in New York City.", + "trace_id": "a1b2c3d4e5f6789012345678901234ab", + "sample_hash": "7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a" +} +``` + +Added fields: +- **`response`** (string): Agent's actual response from A2A protocol +- **`trace_id`** (string): OpenTelemetry trace ID for distributed tracing +- **`sample_hash`** (string): SHA256 hash of sample content for deduplication + +**Phase 2: Evaluate (evaluate.py)** + +Input: `data/experiments/ragas_experiment.jsonl` + LLM model + metrics config +Output: `data/experiments/ragas_evaluation.jsonl` + +Example row: +```json +{ + "user_input": "What's the weather in NYC?", + "retrieved_contexts": ["Weather API returned: sunny, 70F"], + "reference": "The weather in NYC is sunny and 70F.", + "response": "It's currently sunny and 70F in New York City.", + "trace_id": "a1b2c3d4e5f6789012345678901234ab", + "sample_hash": "7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a", + "individual_results": { + "faithfulness": 0.95, + "answer_relevancy": 0.92, + "context_precision": 0.88 + } +} +``` + +Added field: +- **`individual_results`** (dict): Metric name → score (0.0-1.0) + +**Phase 3: Publish (publish.py)** + +Input: `data/experiments/ragas_evaluation.jsonl` +Output: OTLP metrics published to observability backend + +Example OTLP metric: +``` +testbench_evaluation_metric{ + name="faithfulness", + workflow_name="weather-agent-test", + execution_id="exec-001", + execution_number="1", + trace_id="a1b2c3d4e5f6789012345678901234ab", + sample_hash="7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a", + user_input_truncated="What's the weather in NYC?" +} = 0.95 +``` + +--- + +## Target State: JSON Schema-Based Pipeline + +### Overview + +The new system uses hierarchical JSON format validated against formal schemas at each phase. Data transitions through three schemas: + +1. **experiment.schema.json** - User input (test definitions) +2. **executed_experiment.schema.json** - After agent execution +3. **evaluated_experiment.schema.json** - After metric evaluation + +### Schema Hierarchy Visualization + +``` +Experiment +├── id (string) - Unique experiment identifier +├── llm_as_a_judge_model (string) - LLM for evaluation +├── default_threshold (number) - Fallback threshold (0.0-1.0) +└── scenarios[] (array) + ├── id (string) - Unique scenario identifier + ├── trace_id (string) - OpenTelemetry trace + ├── name (string) - Human-readable scenario name + ├── steps[] (array) + │ ├── id (string) - Unique step identifier + │ ├── input (string) - User query to agent + │ ├── turns[] (array) - A2A conversation history + │ │ ├── content (string) - Message content + │ │ ├── type (enum) - "human" | "ai" | "tool" + │ │ └── tool_calls[] (array, optional) - Tool invocations + │ ├── reference (object, optional) + │ │ ├── response (string) - Expected answer + │ │ ├── tool_calls[] (array) - Expected tool usage + │ │ ├── topics[] (array) - Expected topics covered + │ │ └── ... 
(other reference fields)
+    │   ├── custom_values (object, optional) - Custom metadata
+    │   └── evaluations[] (array) - Metric configurations/results
+    │       ├── metric_name (string) - Metric identifier
+    │       ├── threshold (number, optional) - Override threshold
+    │       ├── parameters (object, optional) - Metric config
+    │       └── result (object) - Evaluation result (added by evaluate.py)
+    │           ├── result (enum) - "pass" | "fail"
+    │           ├── score (number) - 0.0-1.0
+    │           └── details (object) - Additional breakdown
+    └── evaluations[] (array, optional) - Scenario-level metrics
+```
+
+### experiment.schema.json is User Input
+
+The `experiment.schema.json` defines the **starting point** of the pipeline - the test definitions created manually by users. This is analogous to a test suite configuration file.
+
+Users create `data/datasets/experiment.json` conforming to this schema BEFORE running the pipeline. This file contains:
+- Test scenario definitions
+- Expected inputs to the agent
+- Reference data (ground truth)
+- Metric configurations (which metrics to evaluate, thresholds)
+
+**Example user-created experiment.json:**
+```json
+{
+  "llm_as_a_judge_model": "gemini-2.5-flash-lite",
+  "default_threshold": 0.9,
+  "scenarios": [
+    {
+      "name": "weather_queries",
+      "steps": [
+        {
+          "input": "What's the weather in NYC?",
+          "reference": {
+            "response": "The weather in NYC is sunny and 70F.",
+            "tool_calls": [{"name": "get_weather", "arguments": {"city": "NYC"}}]
+          },
+          "evaluations": [
+            {"metric_name": "AnswerAccuracy", "threshold": 0.9},
+            {"metric_name": "ToolCallAccuracy", "threshold": 1.0}
+          ]
+        }
+      ]
+    }
+  ]
+}
+```
+
+### Detailed Attribute Descriptions
+
+#### run.py execution
+
+**ADDED at experiment level:**
+- `id` (string) - Unique experiment identifier generated via content hash
+
+**ADDED at scenario level:**
+- `id` (string) - Unique scenario identifier (hash of experiment_id + scenario_name)
+- `trace_id` (string) - OpenTelemetry trace ID for distributed tracing
+
+**ADDED at step level:**
+- `id` (string) - Unique step identifier (hash of scenario_id + step_input + step_index)
+- `turns[]` (array) - Full A2A conversation history with message objects:
+  - `content` (string) - Message text or stringified tool result
+  - `type` (enum) - "human" | "ai" | "tool"
+  - `tool_calls[]` (array, optional) - Tool invocations made by the agent
+    - `name` (string) - Tool name
+    - `args` (object) - Tool arguments
+
+**PRESERVED:**
+- All user input data (llm_as_a_judge_model, default_threshold, scenarios, steps, evaluations as metric configurations)
+
+#### evaluate.py execution
+
+**ADDED within each metric in evaluations[]:**
+- `result` (object) - Evaluation result containing:
+  - `result` (string) - "pass" or "fail" based on threshold comparison
+  - `score` (number) - 0.0-1.0 computed metric score from LLM-as-judge
+  - `details` (object) - Additional evaluation breakdown and reasoning
+
+**PRESERVED:**
+- All data from executed_experiment including IDs, trace_id, turns
+
+**TRANSFORMED:**
+- evaluations[] changes from metric configurations → metric configurations WITH results
+- The original metric config fields (metric_name, threshold, parameters) remain intact
+- The `result` object is added alongside them
+
+### Complete Data Flow with Concrete Examples
+
+#### Example 1: User Input (experiment.json)
+
+**File**: `data/datasets/experiment.json`
+**Conforms to**: `experiment.schema.json`
+
+```json
+{
+  "llm_as_a_judge_model": "gemini-2.5-flash-lite",
+  "default_threshold": 0.9,
+  "scenarios": [
+    {
+      "name": "weather_queries",
"steps": [ + { + "input": "What's the weather in NYC?", + "reference": { + "response": "The weather in NYC is sunny and 70F.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "NYC"} + } + ] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy", + "threshold": 0.9 + }, + { + "metric_name": "ToolCallAccuracy", + "threshold": 1.0, + "parameters": {"exact_match": true} + } + ] + }, + { + "input": "And in London?", + "reference": { + "response": "London is rainy with 12C.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "London"} + } + ] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy" + } + ] + } + ] + }, + { + "name": "booking_flow", + "steps": [ + { + "input": "Book a flight to Paris", + "reference": { + "response": "I'll help you book a flight to Paris. What date would you like to travel?", + "topics": ["booking", "flight", "Paris"] + }, + "custom_values": { + "expected_intent": "flight_booking", + "priority": "high" + }, + "evaluations": [ + { + "metric_name": "IntentClassification", # Custom metric + "threshold": 0.95 + } + ] + } + ], + "evaluations": [ + { + "metric_name": "ScenarioCoherence", # Custom Metric + "threshold": 0.85, + "parameters": {"check_continuity": true} + } + ] + } + ] +} +``` + +#### Example 2: After run.py (executed_experiment.json) + +**File**: `data/experiments/executed_experiment.json` +**Conforms to**: `executed_experiment.schema.json` + +```json +{ + "id": "exp_a7f3d2e9c1b4a8f6", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9, + "scenarios": [ + { + "id": "scn_b2c4e8f1d3a5c7e9", + "trace_id": "a1b2c3d4e5f6789012345678901234ab", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5a9f2e4b6d8fa", + "input": "What's the weather in NYC?", + "turns": [ + { + "content": "What's the weather in NYC?", + "type": "human" + }, + { + "content": "Let me check the current weather in New York City for you.", + "type": "ai", + "tool_calls": [ + { + "name": "get_weather", + "args": {"city": "NYC", "units": "imperial"} + } + ] + }, + { + "content": "{\"temperature\": 70, \"condition\": \"sunny\", \"humidity\": 45}", + "type": "tool" + }, + { + "content": "It's currently sunny and 70°F in New York City with 45% humidity.", + "type": "ai" + } + ], + "reference": { + "response": "The weather in NYC is sunny and 70F.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "NYC"} + } + ], + "topics": ["weather", "temperature", "NYC"] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy", + "threshold": 0.9 + }, + { + "metric_name": "ToolCallAccuracy", + "threshold": 1.0, + "parameters": {"exact_match": true} + } + ] + }, + { + "id": "stp_d4e6b0a3f5c7d9eb", + "input": "And in London?", + "turns": [ + { + "content": "And in London?", + "type": "human" + }, + { + "content": "I'll get the weather information for London.", + "type": "ai", + "tool_calls": [ + { + "name": "get_weather", + "args": {"city": "London", "units": "metric"} + } + ] + }, + { + "content": "{\"temperature\": 12, \"condition\": \"rainy\", \"humidity\": 85}", + "type": "tool" + }, + { + "content": "London is currently experiencing rainy weather with a temperature of 12°C and 85% humidity.", + "type": "ai" + } + ], + "reference": { + "response": "London is rainy with 12C.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "London"} + } + ] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy" + } + ] + } + ] + }, + { + "id": "scn_e5f7c1b4d6a8e0fc", + 
"trace_id": "b2c3d4e5f6a7890123456789012345bc", + "name": "booking_flow", + "steps": [ + { + "id": "stp_f6a8d2c5e7b9f1ad", + "input": "Book a flight to Paris", + "turns": [ + { + "content": "Book a flight to Paris", + "type": "human" + }, + { + "content": "I'd be happy to help you book a flight to Paris! To find the best options, could you please tell me what date you'd like to travel?", + "type": "ai" + } + ], + "reference": { + "response": "I'll help you book a flight to Paris. What date would you like to travel?", + "topics": ["booking", "flight", "Paris"] + }, + "custom_values": { + "expected_intent": "flight_booking", + "priority": "high" + }, + "evaluations": [ + { + "metric_name": "IntentClassification", + "threshold": 0.95 + } + ] + } + ], + "evaluations": [ + { + "metric_name": "ScenarioCoherence", + "threshold": 0.85, + "parameters": {"check_continuity": true} + } + ] + } + ] +} +``` + +**Key Changes from Example 1:** +- Added `id` at experiment, scenario, and step levels (content-based SHA256 hashes) +- Added `trace_id` at scenario level (OpenTelemetry distributed tracing) +- Added `turns[]` at step level with full A2A conversation history +- Preserved all user input data unchanged + +#### Example 3: After evaluate.py (evaluated_experiment.json) + +**File**: `data/experiments/evaluated_experiment.json` +**Conforms to**: `evaluated_experiment.schema.json` + +```json +{ + "id": "exp_a7f3d2e9c1b4a8f6", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9, + "scenarios": [ + { + "id": "scn_b2c4e8f1d3a5c7e9", + "trace_id": "a1b2c3d4e5f6789012345678901234ab", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5a9f2e4b6d8fa", + "input": "What's the weather in NYC?", + "turns": [ + { + "content": "What's the weather in NYC?", + "type": "human" + }, + { + "content": "Let me check the current weather in New York City for you.", + "type": "ai", + "tool_calls": [ + { + "name": "get_weather", + "args": {"city": "NYC", "units": "imperial"} + } + ] + }, + { + "content": "{\"temperature\": 70, \"condition\": \"sunny\", \"humidity\": 45}", + "type": "tool" + }, + { + "content": "It's currently sunny and 70°F in New York City with 45% humidity.", + "type": "ai" + } + ], + "reference": { + "response": "The weather in NYC is sunny and 70F.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "NYC"} + } + ], + "topics": ["weather", "temperature", "NYC"] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy", + "threshold": 0.9, + "result": { + "result": "pass", + "score": 0.92, + "details": { + "semantic_similarity": 0.94, + "factual_consistency": 0.90, + "reasoning": "Response accurately conveys weather information with additional helpful details" + } + } + }, + { + "metric_name": "ToolCallAccuracy", + "threshold": 1.0, + "parameters": {"exact_match": true}, + "result": { + "result": "pass", + "score": 1.0, + "details": { + "tool_name_match": true, + "required_args_match": true, + "extra_args": ["units"] + } + } + } + ] + }, + { + "id": "stp_d4e6b0a3f5c7d9eb", + "input": "And in London?", + "turns": [ + { + "content": "And in London?", + "type": "human" + }, + { + "content": "I'll get the weather information for London.", + "type": "ai", + "tool_calls": [ + { + "name": "get_weather", + "args": {"city": "London", "units": "metric"} + } + ] + }, + { + "content": "{\"temperature\": 12, \"condition\": \"rainy\", \"humidity\": 85}", + "type": "tool" + }, + { + "content": "London is currently experiencing rainy weather with a temperature of 
12°C and 85% humidity.", + "type": "ai" + } + ], + "reference": { + "response": "London is rainy with 12C.", + "tool_calls": [ + { + "name": "get_weather", + "arguments": {"city": "London"} + } + ] + }, + "evaluations": [ + { + "metric_name": "AnswerAccuracy", + "result": { + "result": "fail", + "score": 0.87, + "details": { + "semantic_similarity": 0.91, + "factual_consistency": 0.83, + "reasoning": "Response contains correct information but fails to meet default threshold of 0.9" + } + } + } + ] + } + ] + }, + { + "id": "scn_e5f7c1b4d6a8e0fc", + "trace_id": "b2c3d4e5f6a7890123456789012345bc", + "name": "booking_flow", + "steps": [ + { + "id": "stp_f6a8d2c5e7b9f1ad", + "input": "Book a flight to Paris", + "turns": [ + { + "content": "Book a flight to Paris", + "type": "human" + }, + { + "content": "I'd be happy to help you book a flight to Paris! To find the best options, could you please tell me what date you'd like to travel?", + "type": "ai" + } + ], + "reference": { + "response": "I'll help you book a flight to Paris. What date would you like to travel?", + "topics": ["booking", "flight", "Paris"] + }, + "custom_values": { + "expected_intent": "flight_booking", + "priority": "high" + }, + "evaluations": [ + { + "metric_name": "IntentClassification", + "threshold": 0.95, + "result": { + "result": "pass", + "score": 0.98, + "details": { + "predicted_intent": "flight_booking", + "confidence": 0.98, + "alternative_intents": [] + } + } + } + ] + } + ], + "evaluations": [ + { + "metric_name": "ScenarioCoherence", + "threshold": 0.85, + "parameters": {"check_continuity": true}, + "result": { + "result": "pass", + "score": 0.90, + "details": { + "conversation_flow": 0.92, + "topic_consistency": 0.88, + "reasoning": "Scenario maintains coherent booking flow with appropriate agent responses" + } + } + } + ] + } + ] +} +``` + +**Key Changes from Example 2:** +- Added `result` object to each metric in `evaluations[]` arrays +- Each result contains: + - `result`: "pass" or "fail" based on threshold comparison + - `score`: 0.0-1.0 metric score from LLM-as-judge + - `details`: Additional breakdown and reasoning +- Metric configurations (metric_name, threshold, parameters) remain intact +- Step with score 0.87 fails because it's below default_threshold (0.9) +- All IDs, trace_id, turns preserved unchanged + +### JsonSchemaBackend Architecture + +The new `JsonSchemaBackend` class provides a direct mapping between hierarchical JSON storage and RAGAS processing. + +**Key Design Decision**: Steps within a scenario must run sequentially to maintain conversation context (via `context_id` in A2A protocol). The backend passes scenarios to RAGAS. + +#### Scenario-Level Processing + +The backend loads experiments and passes each scenario to RAGAS. The `@experiment()` decorated functions process all steps within a scenario sequentially. 
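+
+The sequential constraint is easiest to see as a sketch. The code below is illustrative only: `send_to_agent`, its return shape, and the context handling are assumptions standing in for the real A2A client, and the actual run.py function is additionally wrapped by RAGAS's `@experiment()` decorator. The Input/Output description that follows shows the scenario rows such a function receives.
+
+```python
+import asyncio
+
+
+async def send_to_agent(user_input: str, context_id: str | None) -> tuple[list[dict], str]:
+    # Stand-in for the real A2A client call; returns the turns recorded for
+    # this step plus the context_id that keeps the conversation alive.
+    turns = [
+        {"content": user_input, "type": "human"},
+        {"content": f"echo: {user_input}", "type": "ai"},
+    ]
+    return turns, context_id or "ctx-0001"
+
+
+async def run_scenario(scenario: dict) -> dict:
+    """Process one scenario row: steps run strictly in order so the agent
+    keeps conversational context across steps via a shared A2A context_id."""
+    context_id: str | None = None
+    for step in scenario["steps"]:
+        turns, context_id = await send_to_agent(step["input"], context_id)
+        step["turns"] = turns  # recorded for evaluate.py and the schemas
+    return scenario  # RAGAS hands the enriched scenario back to the backend
+
+
+if __name__ == "__main__":
+    demo = {"name": "weather_queries",
+            "steps": [{"input": "What's the weather?"}, {"input": "In NYC?"}]}
+    print(asyncio.run(run_scenario(demo)))
+```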
+ +**Input**: Hierarchical JSON structure +**Output**: List of scenario dictionaries + +**Example - Loading Scenarios**: + +Input (hierarchical JSON file): +```json +{ + "id": "exp_a7f3", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9, + "scenarios": [ + { + "id": "scn_b2c4", + "trace_id": "a1b2c3d4", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5", + "input": "What's the weather?", + "reference": {"response": "Sunny"}, + "evaluations": [{"metric_name": "AnswerAccuracy"}] + }, + { + "id": "stp_d4e6", + "input": "In NYC?", + "reference": {"response": "70F"} + } + ] + } + ] +} +``` + +Output (list of scenario rows for RAGAS): +```python +[ + { + # Complete scenario structure preserved + "id": "scn_b2c4", + "trace_id": "a1b2c3d4", + "name": "weather_queries", + "steps": [ # Steps stay nested inside scenario + { + "id": "stp_c3d5", + "input": "What's the weather?", + "reference": {"response": "Sunny"}, + "evaluations": [{"metric_name": "AnswerAccuracy"}] + }, + { + "id": "stp_d4e6", + "input": "In NYC?", + "reference": {"response": "70F"} + } + ], + + # Experiment metadata added for experiment function access + "_experiment_meta": { + "id": "exp_a7f3", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9 + } + } +] +``` + +**Key Points**: +- Each scenario becomes a row +- Steps remain nested within the scenario structure +- Experiment metadata added via `_experiment_meta` for easy access + +#### Saving Strategy (Scenarios → JSON) + +When saving data via `save_experiment()`, the backend simply writes scenarios back to JSON, adding any missing IDs. + +**Input**: List of scenario dictionaries (after processing by experiment functions) +**Output**: Hierarchical JSON structure + +**Example - Saving Scenarios**: + +Input (list of scenario rows after RAGAS processing): +```python +[ + { + # Complete scenario with all steps (now with added data) + "id": "scn_b2c4", + "trace_id": "a1b2c3d4", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5", + "input": "What's the weather?", + "turns": [{"content": "What's the weather?", "type": "human"}, {"content": "Sunny!", "type": "ai"}], + "reference": {"response": "Sunny"}, + "evaluations": [{"metric_name": "AnswerAccuracy", "result": {"result": "pass", "score": 0.95, "details": {}}}] + }, + { + "id": "stp_d4e6", + "input": "In NYC?", + "turns": [{"content": "In NYC?", "type": "human"}, {"content": "70F", "type": "ai"}], + "reference": {"response": "70F"} + } + ], + + # Experiment metadata (will be extracted and removed) + "_experiment_meta": { + "id": "exp_a7f3", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9 + } + } +] +``` + +Output (hierarchical JSON file): +```json +{ + "id": "exp_a7f3", + "llm_as_a_judge_model": "gemini-2.5-flash-lite", + "default_threshold": 0.9, + "scenarios": [ + { + "id": "scn_b2c4", + "trace_id": "a1b2c3d4", + "name": "weather_queries", + "steps": [ + { + "id": "stp_c3d5", + "input": "What's the weather?", + "turns": [{"content": "What's the weather?", "type": "human"}, {"content": "Sunny!", "type": "ai"}], + "reference": {"response": "Sunny"}, + "evaluations": [{"metric_name": "AnswerAccuracy", "result": {"result": "pass", "score": 0.95, "details": {}}}] + }, + { + "id": "stp_d4e6", + "input": "In NYC?", + "turns": [{"content": "In NYC?", "type": "human"}, {"content": "70F", "type": "ai"}], + "reference": {"response": "70F"} + } + ] + } + ] +} +``` + +**Key Points**: +- Scenarios maintain their structure +- Only 
`_experiment_meta` removed (not part of scenario schema)
+- No grouping or sorting needed - scenarios already complete
+- Missing IDs generated via content hashing
+
+#### Content-Based ID Generation
+
+IDs are generated deterministically using hashing, ensuring reproducibility across runs.
+
+**ID Generation Strategy**:
+
+- **experiment_id**: Hash of the experiment content
+- **scenario_id**: Hash of (experiment_id + scenario_name)
+- **step_id**: Hash of (scenario_id + step_input + step_index)
+
+**Benefits**:
+- Deterministic: Same content always produces same ID
+- Traceable: Can verify ID matches content by re-hashing
+- Reproducible: Re-running same test produces same IDs for comparison
+
+
+## Grafana Visualization Strategy
+
+### Overview
+
+The testbench uses two complementary Grafana dashboards for monitoring and debugging agent quality:
+
+1. **Trends Dashboard** - Monitor quality trends over time, spot regressions after deployments
+2. **Execution Details Dashboard** - Investigate specific execution failures, identify root causes
+
+This two-dashboard approach separates high-level monitoring from deep debugging, enabling efficient quality assurance workflows.
+
+### User Workflow: Monitoring → Investigation → Debugging
+
+```
+Trends Dashboard
+  │
+  ├─ Spot drop in scores after deployment
+  │
+  └─ Click execution row → Execution Details Dashboard
+       │
+       ├─ See which scenarios/steps failed
+       │
+       └─ Click [View Trace] → Tempo
+            │
+            └─ See full agent behavior
+```
+
+### Current OTLP Metrics
+
+The current system publishes flat metrics with sample-level labels:
+
+Example OTLP metrics:
+```
+testbench_evaluation_metric{
+  name="faithfulness",
+  workflow_name="weather-agent-test",
+  execution_id="exec-001",
+  execution_number="1",
+  trace_id="a1b2c3d4e5f6789012345678901234ab",
+  sample_hash="7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a",
+  user_input_truncated="What's the weather in NYC?"
+} = 0.95
+
+testbench_evaluation_metric{
+  name="answer_relevancy",
+  workflow_name="weather-agent-test",
+  execution_id="exec-001",
+  execution_number="1",
+  trace_id="a1b2c3d4e5f6789012345678901234ab",
+  sample_hash="8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3",
+  user_input_truncated="And in London?"
+} = 0.92
+```
+
+### Enhanced OTLP Metrics
+
+The new system publishes metrics with rich hierarchical labels:
+
+Example OTLP metrics:
+```
+testbench_evaluation_metric{
+  name="AnswerAccuracy",
+  workflow_name="weather-agent-test",
+  execution_id="exec-001",
+  execution_number="1",
+  experiment_id="exp_a7f3d2e9c1b4a8f6",
+  scenario_id="scn_b2c4e8f1d3a5c7e9",
+  scenario_name="weather_queries",
+  step_id="stp_c3d5a9f2e4b6d8fa",
+  step_index="0",
+  trace_id="a1b2c3d4e5f6789012345678901234ab",
+  threshold="0.9",
+  result="pass",
+  user_input_truncated="What's the weather in NYC?"
+} = 0.92
+
+testbench_evaluation_metric{
+  name="AnswerAccuracy",
+  workflow_name="weather-agent-test",
+  execution_id="exec-001",
+  execution_number="1",
+  experiment_id="exp_a7f3d2e9c1b4a8f6",
+  scenario_id="scn_b2c4e8f1d3a5c7e9",
+  scenario_name="weather_queries",
+  step_id="stp_d4e6b0a3f5c7d9eb",
+  step_index="1",
+  trace_id="a1b2c3d4e5f6789012345678901234ab",
+  threshold="0.9",
+  result="fail",
+  user_input_truncated="And in London?"
+} = 0.87
+```
+
+### Dashboard 1: Trends Over Time
+
+**Purpose**: Monitor agent quality trends, identify regressions correlated with deployments.
+ +**Use Cases**: +- Continuous monitoring of agent quality metrics +- Spotting degradation after code deployments +- Comparing scenario performance over time +- Identifying which executions need investigation + +**Dashboard Mockup**: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ RAGAS Trends Dashboard - weather-agent-test │ +│ Time Range: [Last 30 Days ▼] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Overall Pass Rate Over Time │ +│ │ +│ 100% ┤ │ +│ │ ┌─────────────┐ │ +│ 90% │ │ v2.1.0 │ ●─────────● weather_queries │ +│ │ │ deployed │ ╱ ╲ │ +│ 80% │ └─────────────┘ ● ● booking_flow │ +│ │ ╱ ╲ │ +│ 70% ├──────────────────● ● error_handling │ +│ │ │ +│ 60% ┤ │ +│ └───────────────────────────────────────── │ +│ Jan 15 Jan 22 Jan 29 Feb 5 │ +│ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Average Metric Scores Over Time │ +│ │ +│ 1.0 ┤ │ +│ │ ●─────●─────● AnswerAccuracy │ +│ 0.9 │ ●─────●───────╱ │ +│ │ ╱ ╲ │ +│ 0.8 ├───● ●─────────────── ToolCallAccuracy │ +│ │ │ +│ 0.7 ┤ │ +│ └───────────────────────────────────────── │ +│ Jan 15 Jan 22 Jan 29 Feb 5 │ +│ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Recent Executions [Filter ▼] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Exec ID │ Time │ Pass Rate │ Avg Score │ Failed │ Status │ +│──────────┼────────────┼───────────┼───────────┼────────┼──────────│ +│ exec-005 │ Feb 5 14:30│ 85% │ 0.89 │ 3 │ [View] │ ← Click to drill down +│ exec-004 │ Feb 5 12:15│ 90% │ 0.91 │ 2 │ [View] │ +│ exec-003 │ Feb 4 18:45│ 75% │ 0.82 │ 5 │ [View] │ +│ exec-002 │ Feb 4 14:20│ 95% │ 0.94 │ 1 │ [View] │ +│ │ +│ 📊 Click [View] to investigate execution details │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +**Key Features**: +- **Time-series focus**: All data aggregated over time windows +- **Deployment annotations**: Visual markers showing when agent versions deployed +- **Scenario-level trends**: Compare different test scenarios (weather_queries, booking_flow, etc.) +- **Metric-level trends**: Track specific metrics (AnswerAccuracy, ToolCallAccuracy, etc.) +- **Quick drill-down**: Click execution ID to jump to details dashboard + +### Dashboard 2: Execution Details + +**Purpose**: Deep-dive into a specific execution to diagnose failures. 
+ +**Use Cases**: +- Investigating why an execution failed +- Identifying which scenarios/steps caused failures +- Finding patterns in failed evaluations +- Linking to distributed traces for root cause analysis + +**Dashboard Mockup**: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Execution Details - exec-005 (Feb 5, 2026 14:30) │ +│ Workflow: weather-agent-test | Experiment: exp_a7f3d2e9 │ +│ [← Back to Trends] │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Scenarios │ │ Pass Rate │ │ Avg Score │ │ +│ │ 3 │ │ 85% │ │ 0.89 │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Failed Steps │ │ Total Steps │ │ Metrics │ │ +│ │ 3 │ │ 20 │ │ 5 │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +│ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Pass Rate by Scenario (This Execution) │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ weather_queries ██████████ 100% (6/6 passed) │ +│ booking_flow ████████░░ 80% (4/5 passed) ⚠ │ +│ error_handling ███████░░░ 70% (7/10 passed) ⚠ │ +│ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Failed Steps (This Execution) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Scenario │ Step │ Metric │ Score │ Thresh │ Trace │ +│──────────────────┼──────┼─────────────────┼───────┼────────┼───────│ +│ booking_flow │ 2 │ IntentAccuracy │ 0.87 │ 0.90 │[View] │ +│ error_handling │ 1 │ ErrorRecovery │ 0.75 │ 0.80 │[View] │ +│ error_handling │ 5 │ Faithfulness │ 0.82 │ 0.85 │[View] │ +│ │ +│ 🔍 Click [View] to see full trace in Tempo │ +├─────────────────────────────────────────────────────────────────────┤ +│ All Steps Results [Search: ___] [Filter] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Scenario │ Step │ Metric │ Result │ Score │ +│──────────────────┼──────┼─────────────────┼────────┼───────────────│ +│ weather_queries │ 0 │ AnswerAccuracy │ PASS ✓ │ 0.95 │ +│ weather_queries │ 1 │ AnswerAccuracy │ PASS ✓ │ 0.92 │ +│ booking_flow │ 0 │ IntentAccuracy │ PASS ✓ │ 0.98 │ +│ booking_flow │ 2 │ IntentAccuracy │ FAIL ✗ │ 0.87 │ +│ ... (showing 10 of 20) │ +│ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +**Key Features**: +- **Single execution focus**: All data filtered by `execution_id` +- **Summary cards**: Quick overview of execution health +- **Scenario breakdown**: See which scenarios passed/failed +- **Failed steps detail**: Pinpoint exact failures with threshold comparisons +- **All steps results**: Searchable/filterable table of every evaluation +- **Trace linking**: Direct links to Tempo for deep debugging + +### Trace Linking to Tempo + +Each scenario has a `trace_id` that links to OpenTelemetry traces in Tempo, enabling deep debugging of agent behavior. + +**User Workflow**: +1. Identify failed step in Grafana "Execution Details" dashboard +2. Click [View Trace] link next to failed step +3. Opens Tempo showing full execution trace for that scenario +4. Drill down into agent spans, tool calls, LLM requests +5. 
Correlate evaluation failure with runtime behavior diff --git a/concepts/generic-metrics-registry-concept.md b/concepts/generic-metrics-registry-concept.md new file mode 100644 index 0000000..23e607f --- /dev/null +++ b/concepts/generic-metrics-registry-concept.md @@ -0,0 +1,223 @@ +# Generic MetricsRegistry Architecture Concept + +## Purpose + +This document describes a concept for transforming MetricsRegistry from a RAGAS-specific implementation into a framework-agnostic architecture that can support multiple metric frameworks (RAGAS, DeepEval, etc.). + +## Problem Statement + +The current `MetricsRegistry` is tightly coupled to RAGAS: + +```python +class MetricsRegistry: + def __init__(self): + self._classes: dict[str, type[BaseMetric]] = {} # RAGAS BaseMetric + self._discover_metrics() + + def _discover_metrics(self) -> None: + # Hardcoded to ragas.metrics.collections + for name, obj in inspect.getmembers(metrics_module): + if inspect.isclass(obj) and issubclass(obj, BaseMetric): + self._classes[name] = obj +``` + +**Limitations**: +- Cannot use metrics from DeepEval or other frameworks +- Returns framework-specific instances (RAGAS `BaseMetric`) +- Assumes RAGAS conventions (llm injection, ascore method) +- Not extensible without modifying core code + +## User Requirements + +1. **Generic registry**: Not limited to RAGAS metrics +2. **Callable interface**: Registry returns a callable, not a metric instance +3. **Callable signature**: `async def(sample: Step, **metric_args) -> Result` (where `Step` is defined in `scripts/schema/executed_experiment.schema.json` and `Result` contains `score: float` and `reason: str | None`) +4. **Easy extensibility**: Adding new frameworks should be straightforward +5. **Configurable naming**: Support framework-prefixed names with optional aliases + +## Framework Comparison + +### RAGAS vs DeepEval + +| Aspect | RAGAS | DeepEval | +|--------|-------|----------| +| **Base Class** | `ragas.metrics.collections.BaseMetric` | `deepeval.metrics.BaseMetric` | +| **Input Format** | Dict with flexible keys | `LLMTestCase` typed object | +| **Async Method** | `async ascore(**kwargs)` | `async a_measure(test_case)` | +| **Result Format** | `MetricResult` with `.value` | Sets `metric.score` attribute | +| **LLM Injection** | Constructor param: `__init__(llm=...)` | Per-metric: `evaluation_model=...` | +| **Discovery** | Import from collections module | Import from metrics module | + +**Key Insight**: Frameworks have fundamentally different APIs, requiring an adapter pattern to unify them. + +--- + +## Proposed Architecture + +### Core Concept: Adapter Pattern + +Instead of the registry working directly with framework-specific metrics, introduce an **adapter layer** that wraps framework-specific metrics behind a unified interface. + +``` +┌─────────────────────────────────────┐ +│ GenericMetricsRegistry │ +│ │ +│ get_metric_callable() → Callable │ +└──────────────┬──────────────────────┘ + │ + ┌───────┴──────────┐ + │ │ +┌──────▼──────┐ ┌──────▼──────┐ +│ RAGAS │ │ DeepEval │ +│ Adapter │ │ Adapter │ +└──────┬──────┘ └──────┬──────┘ + │ │ +┌──────▼──────┐ ┌──────▼──────┐ +│ RAGAS │ │ DeepEval │ +│ Metrics │ │ Metrics │ +└─────────────┘ └─────────────┘ +``` + +### 1. MetricCallable Protocol + +Define a **protocol** (not abstract class) that all adapters must conform to: + +```python +from dataclasses import dataclass +from typing import Protocol, Any + +@dataclass +class Result: + """ + Unified result from a metric evaluation. 
+ + Attributes: + score: Float score between 0.0 and 1.0 + reason: Optional explanation for the score (LLM-generated reasoning) + """ + score: float + reason: str | None = None + +class MetricCallable(Protocol): + """ + Unified interface for executing metrics across all frameworks. + + This is the callable that the registry returns to users. + """ + + async def __call__( + self, + sample: Step, + **metric_args: Any + ) -> Result: + """ + Evaluate a single sample. + + Args: + sample: A Step object as defined in executed_experiment.schema.json, + containing input, turns, reference, custom_values, and evaluations. + **metric_args: Additional runtime arguments for the metric + + Returns: + Result containing score (0.0-1.0) and optional reason + """ + ... +``` + +**Design Rationale**: +- **Protocol (not ABC)**: Enables structural subtyping - any object with `async __call__(sample, **args) -> Result` satisfies the protocol +- **Structured return type**: `Result` dataclass with `score: float` and `reason: str | None` — provides both the numeric result and the LLM-generated reasoning behind it +- **Step as input**: Uses the `Step` type defined in `executed_experiment.schema.json`, providing a structured input with `input`, `turns`, `reference`, `custom_values`, and `evaluations` fields +- **Runtime args**: Allows passing additional parameters at evaluation time + +### 2. FrameworkAdapter Abstract Base Class + +Define an **ABC** that all framework adapters must implement: + +```python +from abc import ABC, abstractmethod +from typing import Any, Type + +class FrameworkAdapter(ABC): + """ + Base class for integrating a metric framework into the registry. + + Each framework (RAGAS, DeepEval, etc.) implements this interface. + """ + + @abstractmethod + def discover_metrics(self) -> dict[str, Type[Any]]: + """ + Discover available metric classes from this framework. + + Returns: + Dict mapping metric class names to their types + Example: {"Faithfulness": } + """ + pass + + @abstractmethod + def create_callable( + self, + class_name: str, + parameters: dict[str, Any], + llm: Any + ) -> MetricCallable: + """ + Create a MetricCallable for the specified metric. + + This method: + 1. Gets the metric class from discovered metrics + 2. Instantiates it with framework-specific logic + 3. Wraps it in an adapter that conforms to MetricCallable + + Args: + class_name: Name of metric class (e.g., "Faithfulness") + parameters: Constructor parameters for the metric + llm: LLM instance (may be used differently per framework) + + Returns: + A callable conforming to MetricCallable protocol + """ + pass + + @property + @abstractmethod + def framework_name(self) -> str: + """ + Identifier for this framework (e.g., 'ragas', 'deepeval'). + + Used in: + - Config files: {"framework": "ragas", ...} + - Result keys: "ragas.Faithfulness" + - Error messages + """ + pass +``` + +**Design Rationale**: +- **ABC (not Protocol)**: Enforces implementation inheritance +- **Discovery method**: Each framework knows how to find its metrics +- **Factory method**: Encapsulates framework-specific instantiation logic +- **LLM parameter**: Passed to adapter even though frameworks use it differently + +### 3. 
RAGAS Adapter Implementation + +#### RagasMetricCallable + +Wraps a RAGAS metric instance to conform to `MetricCallable` + +**Parameter filtering**: Only passes parameters the metric expects +**Format translation**: Converts generic sample → RAGAS conventions +**Result extraction**: Unwraps `MetricResult` into `Result(score, reason)` +**No metric_args usage yet**: Reserved for future use + +#### RagasFrameworkAdapter + +Implements `FrameworkAdapter` for RAGAS + +### 4. GenericMetricsRegistry + +The registry manages framework adapters. + +--- \ No newline at end of file diff --git a/scripts/schema/common.schema.json b/scripts/schema/common.schema.json new file mode 100644 index 0000000..2f21b34 --- /dev/null +++ b/scripts/schema/common.schema.json @@ -0,0 +1,85 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Common Schema Definitions", + "description": "Shared definitions used across experiment schemas to eliminate redundancy.", + "definitions": { + "reference": { + "type": "object", + "description": "Expected reference data for evaluation", + "properties": { + "response": { + "type": "string", + "description": "Expected final response from the agent." + }, + "tool_calls": { + "type": "array", + "description": "Expected tool calls the agent should make", + "items": { + "type": "object", + "required": ["name", "arguments"], + "properties": { + "name": { "type": "string" }, + "arguments": { + "type": "object", + "description": "Key-value pairs of tool arguments." + } + } + } + }, + "topics": { + "type": "array", + "items": { "type": "string" }, + "description": "Expected topics that should be covered in the agent's response." + } + } + }, + "turn": { + "type": "object", + "description": "A single turn in a multi-turn conversation", + "properties": { + "content": { + "type": "string", + "description": "The text content of the message or the stringified result of a tool call." + }, + "type": { + "type": "string", + "enum": ["human", "ai", "tool"], + "description": "The role of the entity generating the content." + }, + "tool_calls": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { "type": "string" }, + "args": { "type": "object" } + }, + "required": ["name", "args"] + } + } + }, + "required": ["content", "type"] + }, + "metric": { + "type": "object", + "description": "Metric configuration for evaluations", + "required": ["metric_name"], + "properties": { + "metric_name": { + "type": "string", + "description": "Registry ID (e.g., 'ragas_faithfulness', 'tool_check')" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1, + "description": "Minimum acceptable score for this metric (0.0 to 1.0). Values below this threshold indicate test failure." + }, + "parameters": { + "type": "object", + "description": "Arguments passed to the generic metric adapter." + } + } + } + } +} diff --git a/scripts/schema/evaluated_experiment.schema.json b/scripts/schema/evaluated_experiment.schema.json new file mode 100644 index 0000000..365fd2f --- /dev/null +++ b/scripts/schema/evaluated_experiment.schema.json @@ -0,0 +1,123 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Testbench Experiment", + "description": "Schema for defining experiments to evaluate agent responses.", + "type": "object", + "required": ["scenarios"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the experiment." 
+ }, + "llm_as_a_judge_model": { + "type": "string", + "description": "The specific LLM used to grade responses (e.g., 'gpt-4o', 'gemini-1.5-pro')." + }, + "default_threshold": { + "type": "number", + "default": 0.9, + "minimum": 0, + "maximum": 1, + "description": "The fallback threshold used across all evaluations if not specified at the metric level." + }, + "scenarios": { + "type": "array", + "items": { "$ref": "#/definitions/scenario" } + } + }, + "definitions": { + "scenario": { + "type": "object", + "required": ["name", "steps"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the scenario." + }, + "trace_id": { + "type": "string", + "description": "Trace identifier linking to the execution trace of this scenario." + }, + "name": { "type": "string" }, + "steps": { + "type": "array", + "items": { "$ref": "#/definitions/step" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "evaluations": { + "type": "array", + "items": { "$ref": "#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + }, + "step": { + "type": "object", + "required": ["input"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the step." + }, + "input": { + "type": "string", + "description": "User input to the agent at this step." + }, + "turns": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/turn" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "custom_values": { + "type": "object", + "description": "Additional key-value pairs to include in the test results for this step." + }, + "evaluations": { + "type": "array", + "items": { "$ref": "#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + }, + "metric": { + "type": "object", + "required": ["metric_name"], + "properties": { + "metric_name": { + "type": "string", + "description": "Registry ID (e.g., 'ragas_faithfulness', 'tool_check')" + }, + "threshold": { + "type": "number", + "minimum": 0, + "maximum": 1, + "description": "Minimum acceptable score for this metric (0.0 to 1.0). Values below this threshold indicate test failure." + }, + "parameters": { + "type": "object", + "description": "Arguments passed to the generic metric adapter." + }, + "result": { + "type": "object", + "properties": { + "result": { + "type": "string", + "enum": ["pass", "fail"], + "description": "Indicates whether the metric evaluation passed or failed based on the threshold." + }, + "score": { + "type": "number", + "minimum": 0, + "maximum": 1, + "description": "The computed score for this metric." + }, + "details": { + "type": "object", + "description": "Additional details or breakdown of the metric evaluation." + } + } + } + } + } + } +} \ No newline at end of file diff --git a/scripts/schema/executed_experiment.schema.json b/scripts/schema/executed_experiment.schema.json new file mode 100644 index 0000000..343e4cc --- /dev/null +++ b/scripts/schema/executed_experiment.schema.json @@ -0,0 +1,83 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Testbench Experiment", + "description": "Schema for defining experiments to evaluate agent responses.", + "type": "object", + "required": ["scenarios"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the experiment." 
+ }, + "llm_as_a_judge_model": { + "type": "string", + "description": "The specific LLM used to grade responses (e.g., 'gpt-4o', 'gemini-1.5-pro')." + }, + "default_threshold": { + "type": "number", + "default": 0.9, + "minimum": 0, + "maximum": 1, + "description": "The fallback threshold used across all evaluations if not specified at the metric level." + }, + "scenarios": { + "type": "array", + "items": { "$ref": "#/definitions/scenario" } + } + }, + "definitions": { + "scenario": { + "type": "object", + "required": ["name", "steps"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the scenario." + }, + "trace_id": { + "type": "string", + "description": "Trace identifier linking to the execution trace of this scenario." + }, + "name": { "type": "string" }, + "steps": { + "type": "array", + "items": { "$ref": "#/definitions/step" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "evaluations": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + }, + "step": { + "type": "object", + "required": ["input"], + "properties": { + "id": { + "type": "string", + "description": "Unique identifier for the step." + }, + "input": { + "type": "string", + "description": "User input to the agent at this step." + }, + "turns": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/turn" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "custom_values": { + "type": "object", + "description": "Additional key-value pairs to include in the test results for this step." + }, + "evaluations": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + } + } +} \ No newline at end of file diff --git a/scripts/schema/experiment.schema.json b/scripts/schema/experiment.schema.json new file mode 100644 index 0000000..4a0ed93 --- /dev/null +++ b/scripts/schema/experiment.schema.json @@ -0,0 +1,63 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "Testbench Experiment", + "description": "Schema for defining experiments to evaluate agent responses.", + "type": "object", + "required": ["scenarios"], + "properties": { + "llm_as_a_judge_model": { + "type": "string", + "description": "The specific LLM used to grade responses (e.g., 'gpt-4o', 'gemini-1.5-pro')." + }, + "default_threshold": { + "type": "number", + "default": 0.9, + "minimum": 0, + "maximum": 1, + "description": "The fallback threshold used across all evaluations if not specified at the metric level." + }, + "scenarios": { + "type": "array", + "items": { "$ref": "#/definitions/scenario" } + } + }, + "definitions": { + "scenario": { + "type": "object", + "required": ["name", "steps"], + "properties": { + "name": { "type": "string" }, + "steps": { + "type": "array", + "items": { "$ref": "#/definitions/step" } + }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "evaluations": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + }, + "step": { + "type": "object", + "required": ["input"], + "properties": { + "input": { + "type": "string", + "description": "User input to the agent at this step." 
+ }, + "reference": { "$ref": "common.schema.json#/definitions/reference" }, + "custom_values": { + "type": "object", + "description": "Additional key-value pairs to include in the test results for this step." + }, + "evaluations": { + "type": "array", + "items": { "$ref": "common.schema.json#/definitions/metric" }, + "description": "List of metric configurations to evaluate this step." + } + } + } + } +} \ No newline at end of file