From 95069214f6b09063969546ca21decf3d3c07c0b2 Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Mon, 9 Feb 2026 16:10:38 -0400 Subject: [PATCH 1/5] T6 M0: Technical plan + analysis notebook for multi-objective vector scores --- docs/T6_technical_plan.md | 837 +++++++++++++++++++++ examples/notebooks/t6_m0_analysis.ipynb | 950 ++++++++++++++++++++++++ 2 files changed, 1787 insertions(+) create mode 100644 docs/T6_technical_plan.md create mode 100644 examples/notebooks/t6_m0_analysis.ipynb diff --git a/docs/T6_technical_plan.md b/docs/T6_technical_plan.md new file mode 100644 index 00000000..e37c0c8c --- /dev/null +++ b/docs/T6_technical_plan.md @@ -0,0 +1,837 @@ +# T6 Technical Plan — Multi-Objective Vector Scores for Trainer Selection + +**Version:** 1.0 (Refined) +**Author:** Carlos Rodriguez +**Date:** February 9, 2025 +**Status:** M0 Deliverable — Analysis + Architecture + Interface Spec + +**Target repos / branches:** +- **Primary (implementation + PR):** [`AgentOpt/OpenTrace@experimental`](https://github.com/AgentOpt/OpenTrace/tree/experimental) +- **Benchmark integration (M3):** [`AgentOpt/Trace-Bench`](https://github.com/AgentOpt/Trace-Bench) + +--- + +## Table of Contents + +1. [Executive Summary](#1-executive-summary) +2. [Goals, Non-Goals, Success Criteria](#2-goals-non-goals-success-criteria) +3. [Current Code Reality (Baseline)](#3-current-code-reality-baseline) +4. [Proposed Architecture (Minimal Delta)](#4-proposed-architecture-minimal-delta) +5. [Public API & Data Contracts](#5-public-api--data-contracts) +6. [Module Modifications (Files to Create / Modify)](#6-module-modifications) +7. [Edge Cases & Defensive Design](#7-edge-cases--defensive-design) +8. [Milestones & Validation Gates](#8-milestones--validation-gates) +9. [Test Plan](#9-test-plan) +10. [Risks & Mitigation](#10-risks--mitigation) +11. [Design Decisions (Resolved)](#11-design-decisions-resolved) +12. [Appendix: Code Touchpoints](#12-appendix-code-touchpoints) + +--- + +## 1. Executive Summary + +Today, trainer selection in Trace is driven by a **single scalar score**. Guides return `Tuple[float, str]` via `get_feedback()`, evaluators produce `np.array` of floats, and trainers (`BasicSearchAlgorithm`, `BeamsearchAlgorithm`) select candidates via scalar comparison (`max(candidates, key=lambda x: x[0])` and `sorted(..., key=lambda x: x[0])` respectively). This blocks trainer-side search from exploiting multiple metrics like `{accuracy, latency_ms, cost}`. + +### What this plan adds + +| Component | Change | +|-----------|--------| +| **Score contract** | `Dict[str, float]` returned by guides (optional), with backward-compatible scalar fallback | +| **ObjectiveConfig** | Frozen dataclass defining selection mode: `scalar` (default), `weighted`, or `pareto` | +| **objectives.py** (new) | All multi-objective logic isolated in pure, testable functions | +| **Evaluators** | Vector-score aggregation helpers (`evaluate_vector`, `aggregate_vector_scores`) | +| **BasicSearchAlgorithm** | Selection via `select_best(candidates, objective_config)` | +| **BeamsearchAlgorithm** | Selection via `select_top_k(candidates, objective_config, k)` | +| **PrioritySearch** (optional) | Scalarize heap priority via ObjectiveConfig; store dict for logging | +| **Benchmarks** (M3) | 3 simple benchmarks integrated into Trace-Bench | + +### Guiding principles + +- **Backward compatibility is non-negotiable.** `mode="scalar"` (the default) preserves identical behavior. 
+- **Isolate complexity.** All multi-objective logic lives in `objectives.py` — pure functions, easy to test. +- **Minimal churn.** Trainers gain an optional `objective_config` parameter; existing call sites are untouched. +- **Determinism.** Fixed `seed` → deterministic selection, especially Pareto tie-breaks. + +--- + +## 2. Goals, Non-Goals, Success Criteria + +### 2.1 Goals + +| ID | Goal | Acceptance Signal | +|----|------|-------------------| +| G1 | **Backward compatibility** | Existing scalar-score guides/trainers produce identical results when `objective_config` is `None` or `mode="scalar"` | +| G2 | **Vector score support** | Guide returns `{"accuracy": 1.0, "latency_ms": 120.0}` and trainers select candidates using weighted or Pareto mode | +| G3 | **Determinism** | Fixed `seed` → identical selection across runs (tested in CI) | +| G4 | **Actionability** | Every milestone: Colab notebook + pytest coverage (M1+) | +| G5 | **Benchmarks** | 3 benchmarks defined, integrated into Trace-Bench, runnable from notebooks | + +### 2.2 Non-goals (explicit) + +- No multi-objective UCB (MO-UCB) — too risky for v1 scope. +- No Pareto archive / non-dominated set management inside PrioritySearch. +- No changes to optimizer internals or new telemetry infrastructure. +- No modification to `get_feedback()` return signature (we use a helper instead). + +### 2.3 Crisp success criteria + +All of the following must be true: + +1. Scalar-only trainers still work and produce same results by default. +2. Multi-objective guide dict works end-to-end for BasicSearch + Beamsearch. +3. Deterministic behavior with fixed seed (tests + notebook). +4. Each milestone delivers a runnable Colab notebook. +5. From M1 onward, new functions have pytest tests and CI is green. +6. M3: three benchmarks exist, run, and Trace-Bench integration works. + +--- + +## 3. Current Code Reality (Baseline) + +### 3.1 Guide — scalar score contract + +```python +# opto/trainer/guide.py + +class Guide: + def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[float, str]: + raise NotImplementedError + + def metric(self, query, response, reference=None, **kwargs) -> float: + return self.get_feedback(query, response, reference)[0] # extracts scalar +``` + +**Implication:** `metric()` always returns `float`. Multi-metric feedback is not usable for selection. + +### 3.2 Evaluators — scalar arrays + +```python +# opto/trainer/evaluators.py + +def evaluate(agent, guide, inputs, infos, ...) -> np.ndarray: + # Calls guide.metric() per example → float + # Returns np.array of shape (N,) or (N, num_samples) +``` + +**Implication:** All scores are numeric scalars aggregated via `np.mean()`. + +### 3.3 BasicSearchAlgorithm — scalar max selection + +```python +# opto/trainer/algorithms/basic_algorithms.py :: BasicSearchAlgorithm.optimizer_step() + +def validate(): + scores = evaluate(self.agent, self.validate_guide, ...) + return np.mean(scores) if all([s is not None for s in scores]) else -np.inf + +# Selection: +candidates.append((score, update_dict)) # score is float +best_score, best_update = max(candidates, key=lambda x: x[0]) # scalar max +``` + +**Insertion point:** Replace `max(candidates, ...)` with `select_best(candidates, objective_config)`. 
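To make the planned swap concrete before any code lands, here is a minimal, self-contained sketch of the target call site. The `ObjectiveConfig` below is trimmed to just `mode` and `weights` (the full dataclass in §5.2 carries more fields), and the function bodies are illustrative stand-ins for the M1 `objectives.py` implementation rather than the final code:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple, Union

ScoreLike = Union[float, Dict[str, float]]

@dataclass(frozen=True)
class ObjectiveConfig:                       # trimmed stand-in for the full §5.2 dataclass
    mode: str = "scalar"                     # "scalar" | "weighted"
    weights: Dict[str, float] = field(default_factory=dict)

def weighted_scalarize(score: Dict[str, float], weights: Dict[str, float]) -> float:
    if not weights:                          # empty weights -> equal weight 1.0 per metric
        weights = {k: 1.0 for k in score}
    return sum(w * score.get(m, float("-inf")) for m, w in weights.items())

def select_best(candidates: List[Tuple[ScoreLike, Any]],
                config: Optional[ObjectiveConfig] = None) -> int:
    def as_scalar(score: ScoreLike) -> float:
        if config is None or config.mode == "scalar":
            # Backward-compatible path: floats pass through, dicts collapse to their mean.
            return sum(score.values()) / len(score) if isinstance(score, dict) else float(score)
        # Weighted path: scalar scores are wrapped as {"score": value} before scalarization.
        return weighted_scalarize(score if isinstance(score, dict) else {"score": float(score)},
                                  config.weights)
    return max(range(len(candidates)), key=lambda i: as_scalar(candidates[i][0]))

# Today (scalar): best_score, best_update = max(candidates, key=lambda x: x[0])
scalar_candidates = [(0.72, "proposal_A"), (0.85, "proposal_B"), (0.78, "current_params")]
assert select_best(scalar_candidates) == 1   # same winner as the scalar max above

# Target (dict scores, weighted mode); latency is pre-negated, i.e. already "higher is better"
vector_candidates = [
    ({"accuracy": 0.90, "latency_ms": -120.0}, "proposal_A"),
    ({"accuracy": 0.85, "latency_ms": -60.0}, "proposal_B"),
]
config = ObjectiveConfig(mode="weighted", weights={"accuracy": 1.0, "latency_ms": 0.005})
best_score, best_update = vector_candidates[select_best(vector_candidates, config)]
print(best_update)                           # proposal_B: trades a little accuracy for lower latency
```

With `config=None` the selection is exactly the existing scalar max, which is the backward-compatibility guarantee stated in §4.3.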
+ +### 3.4 BeamsearchAlgorithm — scalar sort selection + +```python +# opto/trainer/algorithms/beamsearch_algorithm.py :: BeamsearchAlgorithm.select() + +scored_candidates.append((validation_score, candidate_params)) # float +sorted_candidates = sorted(scored_candidates, key=lambda x: x[0], reverse=True) +selected_candidates = sorted_candidates[:beam_width] # take top-k by scalar +``` + +**Insertion point:** Replace scalar sort with `select_top_k(scored_candidates, objective_config, k=beam_width)`. + +### 3.5 Shared patterns across both trainers + +| Pattern | BasicSearch | Beamsearch | +|---------|-------------|------------| +| Validate | `np.mean(scores)` → float | `np.mean(validation_scores)` → float | +| Store | `(score, update_dict)` | `(validation_score, candidate_params)` | +| Select | `max(candidates, key=λ x: x[0])` | `sorted(candidates, key=λ x: x[0])[:k]` | +| Fallback | `-np.inf` | `-np.inf` | + +Both converge to the same abstraction: **given a list of `(score, params)` pairs, select the best or top-k.** This is exactly what `objectives.py` will provide. + +### 3.6 Existing infrastructure we leverage + +- **Logger abstraction:** `BaseLogger` with `log(name, value, step)` — can log each metric in a vector score. +- **StubLLM / DummyLLM:** Wraps deterministic callables — usable for CI and no-keys notebooks. +- **`batch_run` / `async_run`:** Parallelism utilities already in place. + +--- + +## 4. Proposed Architecture (Minimal Delta) + +### 4.1 Core idea + +Isolate all multi-objective logic into one new module (`opto/trainer/objectives.py`) containing **pure functions**: + +``` +normalize_score() → scalar ↔ dict conversion +apply_minimize() → flip signs for minimize metrics +weighted_scalarize()→ dict → float via weighted sum +pareto_rank() → dominance ranking + tie-break +select_best() → given candidates + config, return best index +select_top_k() → given candidates + config, return top-k indices +``` + +Trainers call these functions instead of inline `max()` / `sorted()`. When `objective_config` is `None`, the functions fall through to scalar comparison — **identical to current behavior**. + +### 4.2 Data flow (target) + +``` +Guide.get_feedback() + │ + ├── returns (float, str) ← existing path, unchanged + └── returns (Dict[str,float], str) ← new path (via get_score_dict helper) + │ + ▼ +Evaluator.evaluate_vector() + │ + ├── per-example: List[Dict[str, float]] + └── aggregated: Dict[str, float] (mean per metric) + │ + ▼ +Trainer selection (objectives.py) + │ + ├── mode="scalar" → max(mean_scores) ← unchanged + ├── mode="weighted" → max(weighted_scalarize()) ← new + └── mode="pareto" → pareto_rank() + tie-break ← new +``` + +### 4.3 Backward compatibility guarantee + +The entire vector-score path is **opt-in**: + +1. If `objective_config` is `None` → existing scalar path, no new code executed. +2. If guide returns `float` and `objective_config` is provided → `normalize_score()` wraps it as `{"score": float}`, weights default to `{"score": 1.0}`. +3. If guide returns `Dict[str, float]` and `objective_config` is `None` → `mean(values)` used as scalar fallback, preserving scalar selection. + +--- + +## 5. Public API & Data Contracts + +### 5.1 Score types + +```python +from typing import Union, Dict + +ScalarScore = float +VectorScore = Dict[str, float] # JSON-serializable, all values finite +ScoreLike = Union[int, float, bool, Dict[str, float]] +``` + +**Contract:** +- "Higher is better" by default for all metrics. 
+- Metrics to minimize are declared in `ObjectiveConfig.minimize` (semantics: negate internally). +- All dict values must be finite floats. `NaN` / `±inf` in a dict raises `ValueError`. +- `int` and `bool` scalar scores are accepted and converted to `float` (e.g., `LLMJudge` returns `int` 0/1, test guides return `bool`). + +### 5.2 ObjectiveConfig + +```python +from dataclasses import dataclass, field +from typing import Literal, Optional, Dict, Tuple + +@dataclass(frozen=True) +class ObjectiveConfig: + """Configuration for multi-objective candidate selection. + + Attributes: + mode: Selection strategy. + - "scalar": Use existing scalar comparison (default, backward-compatible). + - "weighted": Scalarize via weighted sum, then select max. + - "pareto": Pareto dominance ranking with configurable tie-break. + weights: Per-metric weights for weighted scalarization. + Missing metrics use missing_value. Metrics not present in the weights dict + are ignored (not included in the weighted sum). + If empty dict in weighted mode, all present metrics get equal weight 1.0. + minimize: Frozenset of metric names where lower is better (users can pass set; auto-converted). + These are negated internally before comparison ("higher-is-better" normalization). + missing_value: Score assigned to missing metrics in a candidate's score dict. + Default: float('-inf') (effectively disqualifies candidates missing required metrics). + pareto_metrics: Subset of metrics to use for Pareto dominance. + If None, all metrics present across candidates are used. + tie_break: Strategy for breaking ties among Pareto-equivalent candidates. + - "weighted": Fall back to weighted scalarization among tied candidates. + - "lexicographic": Sort by metrics in alphabetical order. + - "random_seeded": Seeded random shuffle. + seed: Random seed for deterministic tie-breaking. + """ + mode: Literal["scalar", "weighted", "pareto"] = "scalar" + weights: Dict[str, float] = field(default_factory=dict) + minimize: frozenset = field(default_factory=frozenset) + missing_value: float = float("-inf") + pareto_metrics: Optional[Tuple[str, ...]] = None + tie_break: Literal["weighted", "lexicographic", "random_seeded"] = "weighted" + seed: int = 0 + + def __post_init__(self): + # Convert set → frozenset for true immutability + hashability + if isinstance(self.minimize, set): + object.__setattr__(self, 'minimize', frozenset(self.minimize)) + # Validate weights are non-negative + for k, v in self.weights.items(): + if v < 0: + raise ValueError(f"Weight for '{k}' must be non-negative, got {v}") + # Validate pareto_metrics + if self.pareto_metrics is not None and len(self.pareto_metrics) == 0: + raise ValueError("pareto_metrics must be None (auto) or non-empty tuple") +``` + +**Validation rules (enforced in `__post_init__`):** +- `minimize` is stored as `frozenset` for true immutability (users can pass `set` for convenience; it's auto-converted). +- `mode="weighted"` with empty `weights` → auto-assign equal weight 1.0 to all encountered metrics. +- `mode="pareto"` with `pareto_metrics=None` → use union of all metric keys across candidates. +- `mode="pareto"` with `pareto_metrics=()` → `ValueError`. +- All weight values must be non-negative. +- `minimize` metric names must be valid strings (warning if not found in any candidate). + +### 5.3 Guide helper method + +```python +# Added to Guide base class (non-breaking) + +class Guide: + # ... existing methods unchanged ... 
+ + def get_score_dict(self, query: str, response: str, reference=None, **kwargs) -> Dict[str, float]: + """Return the evaluation score as a dictionary. + + Wraps get_feedback() for backward compatibility: + - If get_feedback returns (float, str): returns {"score": float} + - If get_feedback returns (dict, str): returns dict directly + + Subclasses returning multi-metric scores should override get_feedback() + to return (Dict[str, float], str) instead of (float, str). + """ + score, _ = self.get_feedback(query, response, reference, **kwargs) + if isinstance(score, dict): + return score + return {"score": float(score)} + + def metric(self, query: str, response: str, reference=None, **kwargs) -> float: + """Always returns float. For dict scores, returns mean of values as scalar fallback. + + This ensures evaluate() and the training loop (which call metric()) remain + completely safe. Dict scores only flow through get_score_dict() → evaluate_vector(). + """ + score, _ = self.get_feedback(query, response, reference, **kwargs) + if isinstance(score, dict): + return float(np.mean(list(score.values()))) + return float(score) +``` + +**Why this approach:** +- `get_score_dict()` is a new method — zero risk of breaking existing subclasses. +- `metric()` always returns `float` — the existing `evaluate()` function (which calls `guide.metric()` and passes results to `np.array()`) and the training loop (which calls `np.mean(scores)`) are completely unaffected. +- Dict scores are only accessible via `get_score_dict()` → `evaluate_vector()`, keeping the two data paths cleanly separated. + +### 5.4 Evaluator additions + +```python +# Added to opto/trainer/evaluators.py + +def evaluate_vector(agent, guide, inputs, infos, min_score=None, + num_samples=1, num_threads=None, description=None + ) -> list: + """Like evaluate(), but returns List[ScoreLike] (float or dict per example). + + Uses guide.get_score_dict() to obtain dict scores per example. + When guide returns scalar, get_score_dict() wraps it as {"score": float}. + + When num_samples > 1: for each example, collects num_samples score dicts, + computes per-key mean across the samples, and returns one aggregated dict + per example. Final output is always List[Dict[str, float]] of length N. + """ + ... + +def aggregate_vector_scores(scores: list) -> Union[float, Dict[str, float]]: + """Aggregate per-example scores into a single summary score. + + - If all scores are float: returns np.mean (existing behavior). + - If all scores are dict: returns per-metric mean dict. + - Mixed float/dict: normalizes all to dict via normalize_score(), then averages. + + Args: + scores: List of float or Dict[str, float] values. + + Returns: + float (if all scalar) or Dict[str, float] (if any dicts present). + """ + ... +``` + +### 5.5 objectives.py — complete function signatures + +```python +# opto/trainer/objectives.py (NEW FILE) + +from typing import Union, Dict, List, Set, Optional, Tuple, Literal +from dataclasses import dataclass, field + +# --- ObjectiveConfig defined here (see §5.2) --- + +# --- Score type aliases --- +ScalarScore = float +VectorScore = Dict[str, float] +ScoreLike = Union[float, Dict[str, float]] + +# --- Pure utility functions --- + +def normalize_score(score: ScoreLike) -> Dict[str, float]: + """Convert any score to dict form. + + - int/float/bool → {"score": float(value)} + - Dict[str, float] → returned as-is (validated: all values finite) + + Handles int (LLMJudge returns 0/1) and bool (test guides) via isinstance(score, (int, float, bool)). 
+ + Raises: + TypeError: if score is not int, float, bool, or dict + ValueError: if dict contains non-finite values or is empty + """ + ... + +def apply_minimize(score_dict: Dict[str, float], + minimize: Set[str]) -> Dict[str, float]: + """Negate values for minimize metrics (higher-is-better normalization). + + Returns a new dict with minimize metrics negated. + Metrics not in minimize set are unchanged. + """ + ... + +def weighted_scalarize(score_dict: Dict[str, float], + weights: Dict[str, float], + missing_value: float = float("-inf")) -> float: + """Compute weighted sum of score dict. + + For each metric in weights: + - If present in score_dict: weight * value + - If missing: weight * missing_value + + Metrics in score_dict but NOT in weights are ignored. + If weights is empty, all metrics get equal weight 1.0. + + Returns: + Weighted scalar score. + """ + ... + +def dominates(a: Dict[str, float], b: Dict[str, float], + metrics: Optional[Tuple[str, ...]] = None) -> bool: + """Check if candidate 'a' Pareto-dominates candidate 'b'. + + a dominates b iff: + - a[m] >= b[m] for all metrics m, AND + - a[m] > b[m] for at least one metric m + + Both dicts must already be in "higher-is-better" form (post apply_minimize). + Missing metrics are treated as missing_value (caller should handle before call). + + Args: + a, b: Score dicts (higher-is-better normalized). + metrics: Subset of metrics to compare. If None, use union of keys. + """ + ... + +def pareto_rank(candidates: List[Dict[str, float]], + metrics: Optional[Tuple[str, ...]] = None) -> List[int]: + """Assign Pareto rank to each candidate (0 = non-dominated front). + + Uses standard non-dominated sorting. + + Args: + candidates: List of score dicts (higher-is-better normalized). + metrics: Subset of metrics for dominance. If None, use all present. + + Returns: + List of integer ranks (same length as candidates). Rank 0 = Pareto front. + """ + ... + +def select_best(candidates: List[Tuple[ScoreLike, any]], + objective_config: Optional['ObjectiveConfig'] = None) -> int: + """Select the single best candidate index. + + Args: + candidates: List of (score, payload) tuples. + objective_config: Selection config. If None, uses scalar max (backward-compatible). + + Returns: + Index of best candidate. + + Behavior by mode: + - scalar/None: max(score) where score is float (or mean of dict values). + - weighted: max(weighted_scalarize(normalize(score), config.weights)). + - pareto: rank candidates, tie-break among rank-0 front, return winner. + + Call-site transformation (BasicSearch): + # Current: + best_score, best_update = max(candidates, key=lambda x: x[0]) + # Target: + best_idx = select_best(candidates, objective_config) + best_score, best_update = candidates[best_idx] + """ + ... + +def select_top_k(candidates: List[Tuple[ScoreLike, any]], + objective_config: Optional['ObjectiveConfig'] = None, + k: int = 1) -> List[int]: + """Select the top-k candidate indices. + + Same logic as select_best, but returns k indices. + + For pareto mode: returns rank-0 front (up to k). If front < k, + includes rank-1 candidates by tie-break order, etc. + + Deterministic ordering guaranteed with fixed seed. + """ + ... +``` + +--- + +## 6. 
Module Modifications + +### 6.1 Files to CREATE + +| File | Contents | Milestone | +|------|----------|-----------| +| `opto/trainer/objectives.py` | `ObjectiveConfig`, `normalize_score`, `apply_minimize`, `weighted_scalarize`, `dominates`, `pareto_rank`, `select_best`, `select_top_k` | M1 | +| `tests/test_objectives.py` | Unit tests for all functions in objectives.py | M1 | +| `tests/test_evaluators_vector.py` | Tests for evaluate_vector + aggregate_vector_scores | M1 | +| `tests/test_trainers_multiobjective.py` | Integration tests for BasicSearch + Beamsearch with ObjectiveConfig | M2 | +| `examples/notebooks/t6_m0_analysis.ipynb` | M0 analysis notebook | M0 | +| `examples/notebooks/t6_m1_vector_scores.ipynb` | M1 demo notebook | M1 | +| `examples/notebooks/t6_m2_trainers.ipynb` | M2 demo notebook | M2 | +| `examples/notebooks/t6_m3_benchmarks.ipynb` | M3 benchmark notebook | M3 | +| `docs/T6_technical_plan.md` | This document | M0 | +| `docs/multi_objective_scores.md` | User-facing documentation | M4 | + +### 6.2 Files to MODIFY + +| File | Change | Milestone | +|------|--------|-----------| +| `opto/trainer/guide.py` | Add `get_score_dict()` method to `Guide` base class. Update `metric()` to collapse dict scores to `float` via `mean(values)` (return type stays `float`). | M1 | +| `opto/trainer/evaluators.py` | Add `evaluate_vector()` and `aggregate_vector_scores()`. Existing `evaluate()` unchanged. | M1 | +| `opto/trainer/algorithms/basic_algorithms.py` | Add `objective_config` param to `BasicSearchAlgorithm.train()`. Replace `max(candidates, ...)` with `select_best()` in `optimizer_step()`. | M1 (minimal) / M2 (robust) | +| `opto/trainer/algorithms/beamsearch_algorithm.py` | Add `objective_config` param to `BeamsearchAlgorithm.train()`. Replace scalar sort in `select()` with `select_top_k()`. | M2 | +| `opto/features/priority_search/priority_search.py` | (Optional) Add `objective_config` param. Scalarize heap key via weighted mode. Store dict for logging. Pareto falls back to weighted. | M2 | + +### 6.3 Files NOT modified + +- `opto/trace/` — no changes to trace primitives. +- `opto/optimizers/` — optimizers are upstream of selection; they produce candidates, not rank them. +- Existing tests — no modifications; they validate backward compatibility by continuing to pass. + +--- + +## 7. 
Edge Cases & Defensive Design + +### 7.1 Score validation + +| Case | Behavior | +|------|----------| +| `score = 0.85` (float) | `normalize_score()` → `{"score": 0.85}` | +| `score = 1` (int) | `normalize_score()` → `{"score": 1.0}` (LLMJudge returns int 0/1) | +| `score = True` (bool) | `normalize_score()` → `{"score": 1.0}` (test guides return bool) | +| `score = {"accuracy": 0.9, "latency_ms": 120.0}` | Returned as-is after validation | +| `score = {}` (empty dict) | `ValueError("Score dict must not be empty")` | +| `score = {"accuracy": float('nan')}` | `ValueError("Score dict contains non-finite value")` | +| `score = {"accuracy": float('inf')}` | `ValueError("Score dict contains non-finite value")` | +| `score = "text"` (wrong type) | `TypeError("Score must be int, float, bool, or Dict[str, float]")` | + +### 7.2 Missing metrics across candidates + +| Case | Behavior | +|------|----------| +| Candidate A has `{accuracy, latency}`, B has `{accuracy}` | B gets `latency = missing_value` (default `-inf`) | +| `weights = {"accuracy": 0.7, "latency": 0.3}`, candidate missing `latency` | Weighted sum uses `0.3 * missing_value` | +| All candidates missing a weighted metric | Warning logged; metric still contributes `weight * missing_value` | + +### 7.3 Mixed scalar/dict batches + +| Case | Behavior | +|------|----------| +| All scores are `float` (or `int`/`bool`) | `aggregate_vector_scores()` returns `float` via `np.mean()` (existing behavior) | +| All scores are `dict` with same keys | `aggregate_vector_scores()` returns per-metric mean `Dict[str, float]` | +| Mixed `float` and `dict` in same batch | `ValueError("All scores in a batch must be the same type (all float or all dict)")` | + +A mixed batch most likely indicates a bug in the guide implementation (e.g., returning `float` on some inputs and `dict` on others). Failing loudly prevents silent incorrect aggregation. + +### 7.4 Single-metric dict + +| Case | Behavior | +|------|----------| +| Guide returns `{"accuracy": 0.9}` with `mode="weighted"` | Weighted sum = `weight * 0.9` (trivially correct) | +| Guide returns `{"accuracy": 0.9}` with `mode="pareto"` | Pareto degenerates to scalar max (single dimension — no tradeoffs). Warning logged. | + +### 7.5 Tie-breaking + +| Case | Behavior | +|------|----------| +| Two candidates with identical weighted score | Deterministic: lower original index wins (stable sort) | +| Pareto front with 3 equivalent candidates, `tie_break="weighted"` | Fall back to weighted scalarization among the 3; select max | +| Pareto front with 3 equivalent candidates, `tie_break="lexicographic"` | Sort by metric names alphabetically, compare values in order | +| Pareto front with 3 equivalent candidates, `tie_break="random_seeded"` | Seeded shuffle with `config.seed`; same seed → same order always | + +### 7.7 Training loop safety + +The training loop has a **separate data path** from evaluation/selection. In `standard_optimization_step()` (basic_algorithms.py:46) and `standard_forward()` (sampler.py:130): + +```python +score, feedback = guide(x, target.data, info) +``` + +This `score` flows into `MinibatchAlgorithm.update()` where `np.mean(scores)` is computed (basic_algorithms.py:511). 
**This path must always receive `float`.** + +| Constraint | Enforcement | +|-----------|-------------| +| `guide.__call__()` / `get_feedback()` return type is **NOT widened** | No changes to `get_feedback()` signature; it still returns `Tuple[float, str]` | +| Training loop always receives scalar `score` | `metric()` always returns `float` (collapses dict via `mean(values)` if needed) | +| Dict scores flow through a separate path | `get_score_dict()` → `evaluate_vector()` → `select_best()` / `select_top_k()` | +| A multi-objective guide must return `(float, str)` from `get_feedback()` for the training loop | The float is a collapsed scalar summary; the full dict is extracted via `get_score_dict()` during selection | + +**Two data paths (by design):** +``` +Training loop: guide() → score (float) → np.mean(scores) ← UNCHANGED +Selection path: get_score_dict() → evaluate_vector() → objectives.py ← NEW +``` + +### 7.6 ObjectiveConfig validation + +| Case | Behavior | +|------|----------| +| `mode="weighted"`, `weights={}` | Auto-assign equal weight 1.0 to all metrics encountered at selection time | +| `mode="pareto"`, `pareto_metrics=()` (empty tuple) | `ValueError("pareto_metrics must be None (auto) or non-empty tuple")` | +| `weights={"accuracy": -0.5}` (negative weight) | `ValueError("All weights must be non-negative")` | +| `minimize={"unknown_metric"}` | Warning logged at selection time if metric never appears; no error (tolerant) | + +--- + +## 8. Milestones & Validation Gates + +### Milestone 0 — Analysis + technical plan + interface spec + +**Deliverables:** +- `docs/T6_technical_plan.md` — this document, finalized +- `notebooks/t6_m0_analysis.ipynb` — Colab-ready notebook + +**Notebook demonstrates:** +- Current Guide score contract (`get_feedback` → `Tuple[float, str]`, `metric` → `float`) +- Where scalar selection happens in BasicSearch (`max(candidates, ...)`) and Beamsearch (`sorted(...)[:k]`) +- Planned behavior prototype: deterministic toy guide returning dict metrics, showing weighted vs Pareto selection on dummy candidates + +**SMART validation:** +- Plan includes final API signatures and precise file list (create/modify) ✓ +- Notebook runs without API keys ✓ +- Notebook prints: current score contract, selection touchpoints, planned selection outputs ✓ + +--- + +### Milestone 1 — ObjectiveConfig + utilities + evaluator support + BasicSearch minimal + +**Deliverables:** +- `opto/trainer/objectives.py` (new) +- `opto/trainer/guide.py` (add `get_score_dict`) +- `opto/trainer/evaluators.py` (add `evaluate_vector`, `aggregate_vector_scores`) +- `opto/trainer/algorithms/basic_algorithms.py` (BasicSearch: accept/use ObjectiveConfig) +- `tests/test_objectives.py`, `tests/test_evaluators_vector.py` +- `notebooks/t6_m1_vector_scores.ipynb` + +**Notebook demonstrates:** +- StubLLM mode: BasicSearchAlgorithm on small candidate set (5-10) with deterministic dummy guide returning dict metrics +- Shows: (a) scalar baseline, (b) weighted mode, (c) Pareto mode, (d) deterministic tie-break under fixed seed +- Real LLM mode (optional): tiny dataset (≤5 items) producing ≥2 metrics + +**SMART validation:** +- `pytest -q` passes (all new functions covered) +- Notebook runs in Colab: weighted selection result changes when weights change +- Pareto returns tradeoffs and is deterministic under fixed seed +- Scalar path produces identical results to pre-change behavior + +--- + +### Milestone 2 — Trainer upgrades (Beamsearch + robust BasicSearch) + +**Deliverables:** +- 
`opto/trainer/algorithms/beamsearch_algorithm.py` (accept ObjectiveConfig, vector selection) +- Expanded BasicSearch tests (edge cases, missing metrics, tie-break policies) +- Optional: minimal PrioritySearch support (weighted scalarization for heap, dict stored for logging) +- `tests/test_trainers_multiobjective.py` +- `notebooks/t6_m2_trainers.ipynb` + +**Notebook demonstrates:** +- BasicSearch + Beamsearch in: scalar mode (baseline), weighted mode, Pareto mode +- StubLLM + real LLM sections + +**SMART validation:** +- `pytest -q` green +- Integration test confirms: weighted vs Pareto select different candidates where expected +- Scalar-only example produces same final best score when `objective_config=None` +- Deterministic tie-break is stable across runs + +--- + +### Milestone 3 — Benchmarks (Trace-Bench integration) + +**Deliverables:** +- PR to Trace-Bench: benchmark configs/tasks + notebook +- 3 benchmarks: + 1. **Accuracy vs latency** (toy QA dataset) + 2. **Accuracy vs response length** (penalize verbosity) + 3. **Accuracy vs tool calls** (penalize excessive tool usage) +- `notebooks/t6_m3_benchmarks.ipynb` + +**SMART validation:** +- Notebook outputs per-benchmark table: weighted-mode best candidate metrics + Pareto-mode set of tradeoffs +- Benchmarks run in StubLLM mode (fast/deterministic) and real LLM mode (small sample) +- Trace-Bench run completes without private datasets +- `pytest -q` green (smoke tests for benchmark integration) + +--- + +### Milestone 4 — Documentation + polished notebooks + +**Deliverables:** +- `docs/multi_objective_scores.md` — user-facing documentation +- README update with pointers to docs and notebooks +- Polished "How-to" notebook: installs from GitHub, runs BasicSearch weighted + Pareto, prints metric tradeoffs + +**SMART validation:** +- Fresh Colab runtime runs how-to notebook without manual patching +- CI green, no behavioral changes beyond documentation/polish + +--- + +## 9. 
Test Plan + +### 9.1 Unit tests — `tests/test_objectives.py` (M1) + +| Test | Validates | +|------|-----------| +| `test_normalize_score_from_float` | `0.85` → `{"score": 0.85}` | +| `test_normalize_score_from_dict` | `{"a": 1.0, "b": 2.0}` → same dict | +| `test_normalize_score_empty_dict_raises` | `{}` → `ValueError` | +| `test_normalize_score_nan_raises` | `{"a": float('nan')}` → `ValueError` | +| `test_normalize_score_wrong_type_raises` | `"text"` → `TypeError` | +| `test_apply_minimize` | `{"acc": 0.9, "lat": 100}` with `minimize={"lat"}` → `{"acc": 0.9, "lat": -100}` | +| `test_apply_minimize_empty_set` | No metrics negated | +| `test_weighted_scalarize_basic` | `{"a": 0.8, "b": 0.2}` with `weights={"a": 0.7, "b": 0.3}` → `0.7*0.8 + 0.3*0.2` | +| `test_weighted_scalarize_missing_metric` | Missing metric uses `missing_value` | +| `test_weighted_scalarize_empty_weights` | Equal weight 1.0 for all metrics | +| `test_dominates_true` | A dominates B (all ≥, at least one >) | +| `test_dominates_false_equal` | A == B → does not dominate | +| `test_dominates_false_tradeoff` | A better on one, B better on another | +| `test_pareto_rank_simple` | 3 candidates with clear rank 0, 1, 2 | +| `test_pareto_rank_all_nondominated` | All candidates rank 0 | +| `test_select_best_scalar_mode` | Falls back to scalar max | +| `test_select_best_weighted_mode` | Returns highest weighted score | +| `test_select_best_pareto_mode` | Returns Pareto-optimal by tie-break | +| `test_select_best_none_config` | `objective_config=None` → scalar max (backward compat) | +| `test_select_top_k_weighted` | Returns k highest weighted scores | +| `test_select_top_k_pareto` | Returns k from Pareto front + spillover | +| `test_deterministic_tie_break_seeded` | Same seed → same result across 100 runs | +| `test_deterministic_tie_break_different_seeds` | Different seeds → potentially different result | + +### 9.2 Unit tests — `tests/test_evaluators_vector.py` (M1) + +| Test | Validates | +|------|-----------| +| `test_aggregate_vector_scores_all_scalar` | `[0.8, 0.9, 0.7]` → `np.mean` (backward compat) | +| `test_aggregate_vector_scores_all_dict` | Per-metric mean computed correctly | +| `test_aggregate_vector_scores_mixed` | Scalars normalized to dict, then averaged | +| `test_evaluate_vector_returns_correct_types` | Returns list of ScoreLike matching guide output | + +### 9.3 Integration tests — `tests/test_trainers_multiobjective.py` (M2) + +| Test | Validates | +|------|-----------| +| `test_basicsearch_scalar_unchanged` | Default behavior identical to pre-change | +| `test_basicsearch_weighted_selects_expected` | Weighted mode picks correct candidate | +| `test_basicsearch_pareto_selects_expected` | Pareto mode picks different candidate than weighted | +| `test_beamsearch_scalar_unchanged` | Default behavior identical | +| `test_beamsearch_weighted_selects_top_k` | Weighted mode picks correct top-k | +| `test_beamsearch_pareto_selects_front` | Pareto mode returns non-dominated front | +| `test_deterministic_across_runs` | Fixed seed → same selections in 5 repeated runs | + +### 9.4 Notebook validation (human / Trace team) + +Each notebook contains: +- **StubLLM (no keys) section:** deterministic dummy guide, runs quickly +- **Real LLM section (optional):** small N (5-20 examples), prints cost/latency caveats, requires API key + +--- + +## 10. 
Risks & Mitigation

| Risk | Severity | Mitigation |
|------|----------|------------|
| **R1: Missing metrics across candidates** | Medium | `missing_value` in ObjectiveConfig (default `-inf`). Enforce metric presence for configured weights (or warn). |
| **R2: Pareto nondeterminism** | High | Deterministic ordering via stable sort + explicit tie-break rules. Seeded randomness only when requested. |
| **R3: Multi-thread eval ordering** | Medium | Tests run with `num_threads=1` to guarantee stability. Document thread-safety considerations. |
| **R4: Breaking Guide subclasses** | High | Use `get_score_dict()` helper — never change `get_feedback()` signature. `metric()` keeps returning `float` (dict scores collapse to their mean), so existing callers are unaffected. |
| **R5: Performance regression** | Low | `objectives.py` functions are O(n²) for Pareto ranking on n candidates, but n is typically ≤20 (num_proposals). No concern at this scale. |
| **R6: Mixed scalar/dict in same batch** | Medium | `aggregate_vector_scores()` rejects mixed batches with `ValueError`. A mixed batch indicates a bug in the guide. |
| **R7: Training loop receives dict score** | High | `guide.__call__()` / `get_feedback()` return type is NOT widened. `metric()` always returns `float`. Dict scores only flow through `get_score_dict()` → `evaluate_vector()`. See §7.7. |

---

## 11. Design Decisions (Resolved)

### D1: Where to implement scalar→dict normalization?

**Decision: Option A — `Guide.get_score_dict()` helper + `objectives.normalize_score()`**

- `get_score_dict()` on Guide provides a clean entry point for subclasses.
- `normalize_score()` in objectives.py is the canonical utility (pure function, testable).
- Avoids widening `get_feedback()` return type (higher churn, breaks typing).

### D2: Pareto selection definition

**Decision: Option A — Standard dominance on aggregated metrics, return single best by tie-break.**

- `select_best()` returns one winner. `select_top_k()` returns k winners.
- Trainers don't need to manage a "front" — they just get indices.
- Beamsearch naturally uses `select_top_k(k=beam_width)`.

### D3: PrioritySearch scope

**Decision: Minimal (in-scope).**

- Scalarize heap priority via `weighted_scalarize()`.
- Store full `score_dict` on each candidate for logging.
- `mode="pareto"` falls back to weighted with documented warning.
- Pareto archive is out-of-scope for v1.

---

## 12. 
Appendix: Code Touchpoints + +### OpenTrace / experimental + +| File | URL | +|------|-----| +| Guide base | [guide.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/guide.py) | +| Evaluators | [evaluators.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/evaluators.py) | +| BasicSearch | [basic_algorithms.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/algorithms/basic_algorithms.py) | +| Beamsearch | [beamsearch_algorithm.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/algorithms/beamsearch_algorithm.py) | +| PrioritySearch | [priority_search.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/features/priority_search/priority_search.py) | + +### Trace-Bench + +| File | URL | +|------|-----| +| Repo | [Trace-Bench](https://github.com/AgentOpt/Trace-Bench) | + +### Selection logic summary (current → target) + +| Trainer | Current Code | Target Code | +|---------|-------------|-------------| +| BasicSearch | `max(candidates, key=lambda x: x[0])` | `select_best(candidates, objective_config)` | +| Beamsearch | `sorted(candidates, key=lambda x: x[0], reverse=True)[:k]` | `select_top_k(candidates, objective_config, k)` | +| PrioritySearch | scalar heap key | `weighted_scalarize(score_dict, config)` for heap key | diff --git a/examples/notebooks/t6_m0_analysis.ipynb b/examples/notebooks/t6_m0_analysis.ipynb new file mode 100644 index 00000000..90eefcad --- /dev/null +++ b/examples/notebooks/t6_m0_analysis.ipynb @@ -0,0 +1,950 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "275808ea", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nT6 Milestone 0 — Analysis Notebook\\n\\nThis notebook is the M0 deliverable for the T6 Multi-Objective Vector Scores project.\\nIt demonstrates:\\n 1. Current baseline behavior (Guide score contract, evaluator aggregation, trainer selection)\\n 2. Exact code touchpoints and signatures in the OpenTrace codebase\\n 3. Planned behavior prototype: Pareto front vs weighted selection on deterministic toy candidates\\n\\nRuns end-to-end WITHOUT API keys.\\n'" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\"\"\n", + "T6 Milestone 0 — Analysis Notebook\n", + "\n", + "This notebook is the M0 deliverable for the T6 Multi-Objective Vector Scores project.\n", + "It demonstrates:\n", + " 1. Current baseline behavior (Guide score contract, evaluator aggregation, trainer selection)\n", + " 2. Exact code touchpoints and signatures in the OpenTrace codebase\n", + " 3. Planned behavior prototype: Pareto front vs weighted selection on deterministic toy candidates\n", + "\n", + "Runs end-to-end WITHOUT API keys.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "b1a58d26", + "metadata": {}, + "source": [ + "# T6 Multi-Objective Vector Scores — M0 Analysis\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)\n", + "\n", + "**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n", + "\n", + "This notebook demonstrates:\n", + "1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n", + "2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n", + "3. 
**Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n", + "\n", + "**No API keys required.** All examples use deterministic dummy data.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "a252270b", + "metadata": {}, + "source": [ + "## How to Validate This Milestone\n", + "\n", + "After running all cells, confirm:\n", + "- [ ] Current Guide score contract is printed (`get_feedback → Tuple[float, str]`, `metric → float`)\n", + "- [ ] Scalar selection points in BasicSearch and Beamsearch are identified\n", + "- [ ] Weighted selection produces different results when weights change\n", + "- [ ] Pareto selection returns non-dominated candidates (tradeoff set)\n", + "- [ ] Deterministic tie-break produces identical results across repeated runs with same seed" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "067cd49e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "T6 M0 Analysis — Multi-Objective Vector Scores\n", + "======================================================================\n" + ] + } + ], + "source": [ + "# Setup — no external dependencies beyond numpy\n", + "import numpy as np\n", + "from typing import Dict, List, Tuple, Optional, Set, Union, Literal\n", + "from dataclasses import dataclass, field\n", + "import json\n", + "\n", + "print(\"=\" * 70)\n", + "print(\"T6 M0 Analysis — Multi-Objective Vector Scores\")\n", + "print(\"=\" * 70)" + ] + }, + { + "cell_type": "markdown", + "id": "54b6022f", + "metadata": {}, + "source": [ + "---\n", + "## Part 1: Current Baseline Behavior\n", + "\n", + "### 1.1 Guide Score Contract\n", + "\n", + "The `Guide` base class defines the current score interface:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "2ab12cbf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "PART 1: CURRENT BASELINE BEHAVIOR\n", + "======================================================================\n", + "\n", + "=== Current Guide Score Contract ===\n", + "\n", + "class Guide:\n", + " def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[float, str]:\n", + " raise NotImplementedError\n", + "\n", + " def metric(self, query, response, reference=None, **kwargs) -> float:\n", + " return self.get_feedback(query, response, reference)[0] # extracts scalar\n", + "\n", + "Key observations:\n", + " • get_feedback() returns Tuple[float, str] — a SCALAR score + feedback string\n", + " • metric() returns float — just extracts the first element\n", + " • LLMJudge (subclass) returns binary 0/1 scores\n", + " • No mechanism to return Dict[str, float] for multiple metrics\n", + "\n", + "Example — get_feedback(): score=1.0 (type=float), feedback='Correct!'\n", + "Example — metric(): 1.0 (type=float)\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"PART 1: CURRENT BASELINE BEHAVIOR\")\n", + "print(\"=\" * 70)\n", + "\n", + "print(\"\"\"\n", + "=== Current Guide Score Contract ===\n", + "\n", + "class Guide:\n", + " def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[float, str]:\n", + " raise NotImplementedError\n", + "\n", + " def metric(self, query, response, reference=None, **kwargs) -> float:\n", + " return self.get_feedback(query, response, 
reference)[0] # extracts scalar\n", + "\n", + "Key observations:\n", + " • get_feedback() returns Tuple[float, str] — a SCALAR score + feedback string\n", + " • metric() returns float — just extracts the first element\n", + " • LLMJudge (subclass) returns binary 0/1 scores\n", + " • No mechanism to return Dict[str, float] for multiple metrics\n", + "\"\"\")\n", + "\n", + "# Simulate current behavior\n", + "class CurrentGuide:\n", + " \"\"\"Simulates the current Guide behavior — scalar scores only.\"\"\"\n", + " def get_feedback(self, query, response, reference=None) -> Tuple[float, str]:\n", + " score = 1.0 if response == reference else 0.0\n", + " feedback = \"Correct!\" if score == 1.0 else f\"Expected '{reference}', got '{response}'\"\n", + " return score, feedback\n", + "\n", + " def metric(self, query, response, reference=None) -> float:\n", + " return self.get_feedback(query, response, reference)[0]\n", + "\n", + "guide = CurrentGuide()\n", + "score, feedback = guide.get_feedback(\"What is 2+2?\", \"4\", \"4\")\n", + "print(f\"Example — get_feedback(): score={score} (type={type(score).__name__}), feedback='{feedback}'\")\n", + "print(f\"Example — metric(): {guide.metric('What is 2+2?', '4', '4')} (type={type(guide.metric('What is 2+2?', '4', '4')).__name__})\")" + ] + }, + { + "cell_type": "markdown", + "id": "fcbb5663", + "metadata": {}, + "source": [ + "### 1.2 Evaluator Aggregation" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "55bf7801", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== Current Evaluator Behavior ===\n", + "\n", + "def evaluate(agent, guide, inputs, infos, ...) -> np.ndarray:\n", + " # For each input: calls guide.metric(input, agent(input), info) → float\n", + " # Returns np.array of shape (N,) or (N, num_samples)\n", + " # Aggregated via np.mean(scores)\n", + "\n", + "Key observations:\n", + " • All scores are numeric scalars\n", + " • Aggregation: np.mean() over all examples\n", + " • No support for Dict[str, float] scores\n", + "\n", + "Example — evaluate() returns: [0.9 0.85 0.95 0.7 0.88] (shape=(5,), dtype=float64)\n", + "Example — np.mean(scores): 0.8560 (single scalar used for selection)\n" + ] + } + ], + "source": [ + "print(\"\"\"\n", + "=== Current Evaluator Behavior ===\n", + "\n", + "def evaluate(agent, guide, inputs, infos, ...) 
-> np.ndarray:\n", + " # For each input: calls guide.metric(input, agent(input), info) → float\n", + " # Returns np.array of shape (N,) or (N, num_samples)\n", + " # Aggregated via np.mean(scores)\n", + "\n", + "Key observations:\n", + " • All scores are numeric scalars\n", + " • Aggregation: np.mean() over all examples\n", + " • No support for Dict[str, float] scores\n", + "\"\"\")\n", + "\n", + "# Simulate current evaluator\n", + "scores_array = np.array([0.9, 0.85, 0.95, 0.7, 0.88])\n", + "mean_score = np.mean(scores_array)\n", + "print(f\"Example — evaluate() returns: {scores_array} (shape={scores_array.shape}, dtype={scores_array.dtype})\")\n", + "print(f\"Example — np.mean(scores): {mean_score:.4f} (single scalar used for selection)\")" + ] + }, + { + "cell_type": "markdown", + "id": "7ab684f0", + "metadata": {}, + "source": [ + "### 1.3 Selection Points in Trainers" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "b8b0032f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== BasicSearchAlgorithm — Selection Logic ===\n", + "\n", + "File: opto/trainer/algorithms/basic_algorithms.py\n", + "Method: BasicSearchAlgorithm.optimizer_step()\n", + "\n", + " def validate():\n", + " scores = evaluate(self.agent, self.validate_guide, ...)\n", + " return np.mean(scores) if all([s is not None for s in scores]) else -np.inf\n", + " ^^^^^^^^^^^^^^^^\n", + " Returns: single float\n", + "\n", + " candidates.append((score, update_dict)) # score is float\n", + " candidates.append((self.current_score, backup_dict)) # include current\n", + "\n", + " best_score, best_update = max(candidates, key=lambda x: x[0])\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " SELECTION: scalar max — single metric only\n", + "\n", + ">>> This is the PRIMARY insertion point for multi-objective selection. <<<\n", + "\n", + "BasicSearch candidates: [(0.72, 'proposal_A'), (0.85, 'proposal_B'), (0.78, 'proposal_C'), (0.85, 'current_params')]\n", + "Selected (scalar max): score=0.85, params='proposal_B'\n", + "Note: Tie between proposal_B and current_params — max() picks first occurrence (proposal_B)\n" + ] + } + ], + "source": [ + "print(\"\"\"\n", + "=== BasicSearchAlgorithm — Selection Logic ===\n", + "\n", + "File: opto/trainer/algorithms/basic_algorithms.py\n", + "Method: BasicSearchAlgorithm.optimizer_step()\n", + "\n", + " def validate():\n", + " scores = evaluate(self.agent, self.validate_guide, ...)\n", + " return np.mean(scores) if all([s is not None for s in scores]) else -np.inf\n", + " ^^^^^^^^^^^^^^^^\n", + " Returns: single float\n", + "\n", + " candidates.append((score, update_dict)) # score is float\n", + " candidates.append((self.current_score, backup_dict)) # include current\n", + "\n", + " best_score, best_update = max(candidates, key=lambda x: x[0])\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " SELECTION: scalar max — single metric only\n", + "\n", + ">>> This is the PRIMARY insertion point for multi-objective selection. 
<<<\n", + "\"\"\")\n", + "\n", + "# Simulate current BasicSearch selection\n", + "candidates_basic = [\n", + " (0.72, \"proposal_A\"),\n", + " (0.85, \"proposal_B\"),\n", + " (0.78, \"proposal_C\"),\n", + " (0.85, \"current_params\"), # tie with proposal_B\n", + "]\n", + "best_score, best_update = max(candidates_basic, key=lambda x: x[0])\n", + "print(f\"BasicSearch candidates: {[(s, name) for s, name in candidates_basic]}\")\n", + "print(f\"Selected (scalar max): score={best_score}, params='{best_update}'\")\n", + "print(f\"Note: Tie between proposal_B and current_params — max() picks first occurrence (proposal_B)\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "8db5aa87", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== BeamsearchAlgorithm — Selection Logic ===\n", + "\n", + "File: opto/trainer/algorithms/beamsearch_algorithm.py\n", + "Method: BeamsearchAlgorithm.select()\n", + "\n", + " scored_candidates.append((validation_score, candidate_params)) # float\n", + "\n", + " sorted_candidates = sorted(scored_candidates, key=lambda x: x[0], reverse=True)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " SELECTION: scalar sort descending\n", + "\n", + " selected_candidates = sorted_candidates[:beam_width] # take top-k\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " Top-k by scalar score only\n", + "\n", + ">>> This is the SECONDARY insertion point for multi-objective selection. <<<\n", + "\n", + "Beamsearch candidates: [(0.72, 'candidate_1'), (0.91, 'candidate_2'), (0.85, 'candidate_3'), (0.91, 'candidate_4'), (0.78, 'candidate_5')]\n", + "Selected (top-3 by scalar): [(0.91, 'candidate_2'), (0.91, 'candidate_4'), (0.85, 'candidate_3')]\n", + "Note: Tie between candidate_2 and candidate_4 — sorted() preserves input order (stable sort)\n" + ] + } + ], + "source": [ + "print(\"\"\"\n", + "=== BeamsearchAlgorithm — Selection Logic ===\n", + "\n", + "File: opto/trainer/algorithms/beamsearch_algorithm.py\n", + "Method: BeamsearchAlgorithm.select()\n", + "\n", + " scored_candidates.append((validation_score, candidate_params)) # float\n", + "\n", + " sorted_candidates = sorted(scored_candidates, key=lambda x: x[0], reverse=True)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " SELECTION: scalar sort descending\n", + "\n", + " selected_candidates = sorted_candidates[:beam_width] # take top-k\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " Top-k by scalar score only\n", + "\n", + ">>> This is the SECONDARY insertion point for multi-objective selection. 
<<<\n", + "\"\"\")\n", + "\n", + "# Simulate current Beamsearch selection\n", + "candidates_beam = [\n", + " (0.72, \"candidate_1\"),\n", + " (0.91, \"candidate_2\"),\n", + " (0.85, \"candidate_3\"),\n", + " (0.91, \"candidate_4\"), # tie with candidate_2\n", + " (0.78, \"candidate_5\"),\n", + "]\n", + "beam_width = 3\n", + "sorted_candidates = sorted(candidates_beam, key=lambda x: x[0], reverse=True)\n", + "selected = sorted_candidates[:beam_width]\n", + "print(f\"Beamsearch candidates: {[(s, name) for s, name in candidates_beam]}\")\n", + "print(f\"Selected (top-{beam_width} by scalar): {[(s, name) for s, name in selected]}\")\n", + "print(f\"Note: Tie between candidate_2 and candidate_4 — sorted() preserves input order (stable sort)\")" + ] + }, + { + "cell_type": "markdown", + "id": "7119b4a4", + "metadata": {}, + "source": [ + "### 1.4 Summary: What's Missing\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fbf9e98b", + "metadata": {}, + "outputs": [], + "source": "print(\"\"\"\n=== Summary: Current Limitations ===\n\n1. Guide.metric() → float only (and stays float BY DESIGN)\n metric() will NOT be widened to return dicts.\n Dict scores flow through the NEW get_score_dict() path instead.\n\n2. evaluate() → np.array of floats\n Cannot aggregate per-metric means across examples.\n New evaluate_vector() will handle dict aggregation separately.\n\n3. BasicSearch: max(candidates, key=scalar)\n Cannot do weighted multi-metric selection or Pareto ranking\n\n4. Beamsearch: sorted(candidates, key=scalar)[:k]\n Cannot select top-k by Pareto dominance\n\n5. No ObjectiveConfig\n No way to declare minimize metrics, weights, or selection mode\n\n>>> All of the above will be addressed in M1-M2 without breaking existing behavior. <<<\n>>> Training loop (guide.__call__ → float) is NEVER modified. 
<<<\n\"\"\")" + }, + { + "cell_type": "markdown", + "id": "8e97b2fd", + "metadata": {}, + "source": [ + "---\n", + "## Part 2: Planned Behavior — Prototype\n", + "\n", + "The following cells implement the **planned multi-objective selection** as pure functions.\n", + "This is a standalone prototype (no OpenTrace dependency) demonstrating the exact behavior\n", + "that `opto/trainer/objectives.py` will provide.\n", + "\n", + "### 2.1 ObjectiveConfig" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bad5944d", + "metadata": {}, + "outputs": [], + "source": "print(\"\\n\" + \"=\" * 70)\nprint(\"PART 2: PLANNED BEHAVIOR — PROTOTYPE\")\nprint(\"=\" * 70)\n\n@dataclass(frozen=True)\nclass ObjectiveConfig:\n \"\"\"Configuration for multi-objective candidate selection.\"\"\"\n mode: str = \"scalar\" # \"scalar\", \"weighted\", \"pareto\"\n weights: Dict[str, float] = field(default_factory=dict)\n minimize: frozenset = field(default_factory=frozenset)\n missing_value: float = float(\"-inf\")\n pareto_metrics: Optional[Tuple[str, ...]] = None\n tie_break: str = \"weighted\" # \"weighted\", \"lexicographic\", \"random_seeded\"\n seed: int = 0\n\n def __post_init__(self):\n # Convert set → frozenset for true immutability + hashability\n if isinstance(self.minimize, set):\n object.__setattr__(self, 'minimize', frozenset(self.minimize))\n # Validate weights are non-negative\n for k, v in self.weights.items():\n if v < 0:\n raise ValueError(f\"Weight for '{k}' must be non-negative, got {v}\")\n # Validate pareto_metrics\n if self.pareto_metrics is not None and len(self.pareto_metrics) == 0:\n raise ValueError(\"pareto_metrics must be None (auto) or non-empty tuple\")\n\nprint(\"ObjectiveConfig defined with modes: scalar | weighted | pareto\")\nprint(f\"Default config: {ObjectiveConfig()}\")\n\n# Verify set → frozenset auto-conversion\nconfig_with_set = ObjectiveConfig(minimize={\"latency_s\"})\nprint(f\"minimize=set auto-converts: type={type(config_with_set.minimize).__name__}, value={config_with_set.minimize}\")" + }, + { + "cell_type": "markdown", + "id": "478f806d", + "metadata": {}, + "source": [ + "### 2.2 Core Utility Functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ed7c83c", + "metadata": {}, + "outputs": [], + "source": "# --- Score type aliases ---\nScoreLike = Union[int, float, bool, Dict[str, float]]\n\n\ndef normalize_score(score: ScoreLike) -> Dict[str, float]:\n \"\"\"Convert any score to dict form.\n \n - int/float/bool → {\"score\": float(value)}\n - Dict[str, float] → returned as-is (validated)\n \n Handles int (LLMJudge returns 0/1) and bool (test guides) explicitly.\n \"\"\"\n if isinstance(score, (bool, int, float)):\n # bool check must come before int since bool is subclass of int\n val = float(score)\n if not np.isfinite(val):\n raise ValueError(f\"Score must be finite, got {score}\")\n return {\"score\": val}\n elif isinstance(score, dict):\n if len(score) == 0:\n raise ValueError(\"Score dict must not be empty\")\n for k, v in score.items():\n if not isinstance(v, (int, float)) or not np.isfinite(float(v)):\n raise ValueError(f\"Score dict value for '{k}' must be finite float, got {v}\")\n return {k: float(v) for k, v in score.items()}\n else:\n raise TypeError(f\"Score must be int, float, bool, or Dict[str, float], got {type(score).__name__}\")\n\n\ndef apply_minimize(score_dict: Dict[str, float], minimize: set) -> Dict[str, float]:\n \"\"\"Negate values for minimize metrics (higher-is-better normalization).\"\"\"\n return 
{\n k: -v if k in minimize else v\n for k, v in score_dict.items()\n }\n\n\ndef weighted_scalarize(score_dict: Dict[str, float], weights: Dict[str, float],\n missing_value: float = float(\"-inf\")) -> float:\n \"\"\"Compute weighted sum. Empty weights → equal weight 1.0.\"\"\"\n if not weights:\n weights = {k: 1.0 for k in score_dict}\n total = 0.0\n for metric, weight in weights.items():\n value = score_dict.get(metric, missing_value)\n total += weight * value\n return total\n\n\ndef dominates(a: Dict[str, float], b: Dict[str, float],\n metrics: Optional[Tuple[str, ...]] = None) -> bool:\n \"\"\"Check if candidate 'a' Pareto-dominates candidate 'b'.\n \n a dominates b iff:\n - a[m] >= b[m] for ALL metrics m, AND\n - a[m] > b[m] for AT LEAST ONE metric m\n \"\"\"\n if metrics is None:\n metrics = tuple(sorted(set(a.keys()) | set(b.keys())))\n \n at_least_one_better = False\n for m in metrics:\n va = a.get(m, float(\"-inf\"))\n vb = b.get(m, float(\"-inf\"))\n if va < vb:\n return False # a is worse on this metric\n if va > vb:\n at_least_one_better = True\n return at_least_one_better\n\n\ndef pareto_rank(candidates: List[Dict[str, float]],\n metrics: Optional[Tuple[str, ...]] = None) -> List[int]:\n \"\"\"Assign Pareto rank (0 = non-dominated front).\"\"\"\n n = len(candidates)\n ranks = [0] * n\n assigned = [False] * n\n current_rank = 0\n\n remaining = set(range(n))\n while remaining:\n # Find non-dominated set among remaining\n front = []\n for i in remaining:\n dominated = False\n for j in remaining:\n if i != j and dominates(candidates[j], candidates[i], metrics):\n dominated = True\n break\n if not dominated:\n front.append(i)\n\n for i in front:\n ranks[i] = current_rank\n remaining.remove(i)\n current_rank += 1\n\n return ranks\n\n\ndef select_best(candidates: List[Tuple[ScoreLike, any]],\n config: Optional[ObjectiveConfig] = None) -> int:\n \"\"\"Select the single best candidate index.\"\"\"\n if config is None or config.mode == \"scalar\":\n # Backward-compatible: scalar max\n scores = []\n for score, _ in candidates:\n if isinstance(score, dict):\n scores.append(np.mean(list(score.values())))\n else:\n scores.append(float(score))\n return int(np.argmax(scores))\n\n # Normalize all scores to dict\n score_dicts = [normalize_score(s) for s, _ in candidates]\n\n # Apply minimize\n score_dicts = [apply_minimize(sd, config.minimize) for sd in score_dicts]\n\n if config.mode == \"weighted\":\n weighted_scores = [weighted_scalarize(sd, config.weights, config.missing_value) for sd in score_dicts]\n return int(np.argmax(weighted_scores))\n\n elif config.mode == \"pareto\":\n ranks = pareto_rank(score_dicts, config.pareto_metrics)\n # Get indices of rank-0 (Pareto front)\n front_indices = [i for i, r in enumerate(ranks) if r == 0]\n\n if len(front_indices) == 1:\n return front_indices[0]\n\n # Tie-break among front\n if config.tie_break == \"weighted\":\n front_scores = [weighted_scalarize(score_dicts[i], config.weights, config.missing_value)\n for i in front_indices]\n return front_indices[int(np.argmax(front_scores))]\n elif config.tie_break == \"lexicographic\":\n metrics = sorted(score_dicts[front_indices[0]].keys())\n def lex_key(idx):\n return tuple(score_dicts[idx].get(m, config.missing_value) for m in metrics)\n return max(front_indices, key=lex_key)\n elif config.tie_break == \"random_seeded\":\n rng = np.random.RandomState(config.seed)\n return front_indices[rng.randint(len(front_indices))]\n\n raise ValueError(f\"Unknown mode: {config.mode}\")\n\n\ndef select_top_k(candidates: 
List[Tuple[ScoreLike, any]],\n config: Optional[ObjectiveConfig] = None,\n k: int = 1) -> List[int]:\n \"\"\"Select the top-k candidate indices.\"\"\"\n if config is None or config.mode == \"scalar\":\n scores = []\n for score, _ in candidates:\n if isinstance(score, dict):\n scores.append(np.mean(list(score.values())))\n else:\n scores.append(float(score))\n return list(np.argsort(scores)[::-1][:k])\n\n score_dicts = [normalize_score(s) for s, _ in candidates]\n score_dicts = [apply_minimize(sd, config.minimize) for sd in score_dicts]\n\n if config.mode == \"weighted\":\n weighted_scores = [weighted_scalarize(sd, config.weights, config.missing_value) for sd in score_dicts]\n return list(np.argsort(weighted_scores)[::-1][:k])\n\n elif config.mode == \"pareto\":\n ranks = pareto_rank(score_dicts, config.pareto_metrics)\n # Collect by rank, then tie-break within each rank\n result = []\n max_rank = max(ranks)\n for rank in range(max_rank + 1):\n rank_indices = [i for i, r in enumerate(ranks) if r == rank]\n # Sort within rank by tie-break\n if config.tie_break == \"weighted\":\n rank_indices.sort(\n key=lambda i: weighted_scalarize(score_dicts[i], config.weights, config.missing_value),\n reverse=True\n )\n elif config.tie_break == \"lexicographic\":\n metrics = sorted(score_dicts[rank_indices[0]].keys()) if rank_indices else []\n rank_indices.sort(\n key=lambda i: tuple(score_dicts[i].get(m, config.missing_value) for m in metrics),\n reverse=True\n )\n elif config.tie_break == \"random_seeded\":\n rng = np.random.RandomState(config.seed + rank)\n rng.shuffle(rank_indices)\n result.extend(rank_indices)\n if len(result) >= k:\n break\n return result[:k]\n\n raise ValueError(f\"Unknown mode: {config.mode}\")\n\n\nprint(\"Core utility functions defined:\")\nprint(\" \\u2022 normalize_score() — handles float, int, bool, and dict\")\nprint(\" \\u2022 apply_minimize()\")\nprint(\" \\u2022 weighted_scalarize()\")\nprint(\" \\u2022 dominates()\")\nprint(\" \\u2022 pareto_rank()\")\nprint(\" \\u2022 select_best()\")\nprint(\" \\u2022 select_top_k()\")" + }, + { + "cell_type": "markdown", + "id": "6233d2c7", + "metadata": {}, + "source": [ + "### 2.3 Validation: normalize_score()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25003e79", + "metadata": {}, + "outputs": [], + "source": "print(\"\\n--- normalize_score() examples ---\")\nprint(f\" normalize_score(0.85) = {normalize_score(0.85)}\")\nprint(f\" normalize_score({{'acc': 0.9, 'lat': 50}}) = {normalize_score({'acc': 0.9, 'lat': 50})}\")\n\n# int and bool edge cases (LLMJudge returns int 0/1, test guides return bool)\nprint(f\"\\n --- int / bool edge cases ---\")\nprint(f\" normalize_score(1) = {normalize_score(1)} # LLMJudge returns int 0/1\")\nprint(f\" normalize_score(0) = {normalize_score(0)} # LLMJudge incorrect → int 0\")\nprint(f\" normalize_score(True) = {normalize_score(True)} # test guide correct → bool\")\nprint(f\" normalize_score(False) = {normalize_score(False)} # test guide incorrect → bool\")\n\n# Error edge cases\nprint(f\"\\n --- Error edge cases ---\")\ntry:\n normalize_score({})\nexcept ValueError as e:\n print(f\" normalize_score({{}}) → ValueError: {e}\")\n\ntry:\n normalize_score(\"bad\")\nexcept TypeError as e:\n print(f\" normalize_score('bad') → TypeError: {e}\")" + }, + { + "cell_type": "markdown", + "id": "a5c0fef1", + "metadata": {}, + "source": [ + "### 2.4 Validation: apply_minimize() + weighted_scalarize()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "b9e31bec", + 
"metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--- apply_minimize() examples ---\n", + " Original: {'accuracy': 0.9, 'latency_ms': 120.0, 'cost': 0.05}\n", + " Minimize: {'latency_ms', 'cost'}\n", + " Result: {'accuracy': 0.9, 'latency_ms': -120.0, 'cost': -0.05}\n", + " (latency and cost negated → higher-is-better)\n", + "\n", + "--- weighted_scalarize() examples ---\n", + " Score (normalized): {'accuracy': 0.9, 'latency_ms': -120.0, 'cost': -0.05}\n", + " Weights: {'accuracy': 0.6, 'latency_ms': 0.3, 'cost': 0.1}\n", + " Weighted sum: -35.4650\n", + " = 0.6*0.9 + 0.3*(-120.0) + 0.1*(-0.05) = -35.4650\n" + ] + } + ], + "source": [ + "print(\"\\n--- apply_minimize() examples ---\")\n", + "score = {\"accuracy\": 0.9, \"latency_ms\": 120.0, \"cost\": 0.05}\n", + "minimized = apply_minimize(score, minimize={\"latency_ms\", \"cost\"})\n", + "print(f\" Original: {score}\")\n", + "print(f\" Minimize: {{'latency_ms', 'cost'}}\")\n", + "print(f\" Result: {minimized}\")\n", + "print(f\" (latency and cost negated → higher-is-better)\")\n", + "\n", + "print(\"\\n--- weighted_scalarize() examples ---\")\n", + "weights = {\"accuracy\": 0.6, \"latency_ms\": 0.3, \"cost\": 0.1}\n", + "ws = weighted_scalarize(minimized, weights)\n", + "print(f\" Score (normalized): {minimized}\")\n", + "print(f\" Weights: {weights}\")\n", + "print(f\" Weighted sum: {ws:.4f}\")\n", + "print(f\" = 0.6*0.9 + 0.3*(-120.0) + 0.1*(-0.05) = {0.6*0.9 + 0.3*(-120.0) + 0.1*(-0.05):.4f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f1725c01", + "metadata": {}, + "source": [ + "### 2.5 Demonstration: Weighted vs Pareto Selection\n", + "\n", + "We create 6 candidates with realistic multi-metric scores to show how weighted and Pareto selection differ." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "d3023945", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "DEMONSTRATION: WEIGHTED vs PARETO SELECTION\n", + "======================================================================\n", + "\n", + "Candidate Scores:\n", + " Name Accuracy Latency (s)\n", + " --------------- ---------- ------------\n", + " candidate_A 0.95 0.200\n", + " candidate_B 0.70 0.030\n", + " candidate_C 0.88 0.080\n", + " candidate_D 0.92 0.150\n", + " candidate_E 0.60 0.020\n", + " candidate_F 0.85 0.085\n", + "\n", + "--- Mode: SCALAR (baseline) ---\n", + " Selection: mean of dict values → max\n", + " Winner: candidate_A (index 0)\n", + " Score: {'accuracy': 0.95, 'latency_s': 0.2}\n", + " Note: This is the CURRENT behavior — treats multi-metric as mean scalar.\n", + "\n", + "--- Mode: WEIGHTED (accuracy-heavy) ---\n", + " Weights: accuracy=0.8, latency_s=0.2 (minimized)\n", + " Winner: candidate_A (index 0)\n", + " Weighted score: 0.7200\n", + "\n", + "--- Mode: WEIGHTED (latency-heavy) ---\n", + " Weights: accuracy=0.2, latency_s=0.8 (minimized)\n", + " Winner: candidate_B (index 1)\n", + " Weighted score: 0.1160\n", + "\n", + " >>> Changing weights changes the winner!\n", + " >>> Accuracy-heavy → candidate_A, Latency-heavy → candidate_B\n", + "\n", + "--- Mode: PARETO ---\n", + "\n", + " Pareto Ranking (after minimize normalization):\n", + " Name Accuracy Neg Latency Pareto Rank\n", + " --------------- ---------- ------------ ------------\n", + " candidate_A 0.95 -0.200 0\n", + " candidate_B 0.70 -0.030 0\n", + " candidate_C 0.88 -0.080 0\n", + " candidate_D 0.92 -0.150 0\n", + " candidate_E 0.60 -0.020 0\n", + " candidate_F 0.85 -0.085 1\n", + "\n", + " Pareto Front (Rank 0): ['candidate_A', 'candidate_B', 'candidate_C', 'candidate_D', 'candidate_E']\n", + " These candidates represent TRADEOFFS — none is dominated by another.\n", + "\n", + " After tie-break (weighted, weights={acc: 0.5, lat: 0.5}):\n", + " Winner: candidate_C (index 2)\n", + "\n", + "--- Mode: PARETO (top-k for Beamsearch, k=3) ---\n", + " Selected top-3:\n", + " #1: candidate_C (Pareto rank 0, scores: {'accuracy': 0.88, 'latency_s': 0.08})\n", + " #2: candidate_D (Pareto rank 0, scores: {'accuracy': 0.92, 'latency_s': 0.15})\n", + " #3: candidate_A (Pareto rank 0, scores: {'accuracy': 0.95, 'latency_s': 0.2})\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"DEMONSTRATION: WEIGHTED vs PARETO SELECTION\")\n", + "print(\"=\" * 70)\n", + "\n", + "# 6 candidates with accuracy (higher=better) and latency_s (lower=better)\n", + "# Using latency_s (seconds, 0-1 scale) so metrics are comparable and weight changes matter\n", + "candidates = [\n", + " ({\"accuracy\": 0.95, \"latency_s\": 0.200}, \"candidate_A\"), # high accuracy, high latency\n", + " ({\"accuracy\": 0.70, \"latency_s\": 0.030}, \"candidate_B\"), # low accuracy, low latency\n", + " ({\"accuracy\": 0.88, \"latency_s\": 0.080}, \"candidate_C\"), # balanced\n", + " ({\"accuracy\": 0.92, \"latency_s\": 0.150}, \"candidate_D\"), # good accuracy, moderate latency\n", + " ({\"accuracy\": 0.60, \"latency_s\": 0.020}, \"candidate_E\"), # lowest accuracy, fastest\n", + " ({\"accuracy\": 0.85, \"latency_s\": 0.085}, \"candidate_F\"), # similar to C\n", + "]\n", + "\n", + "print(\"\\nCandidate Scores:\")\n", + "print(f\" {'Name':<15} {'Accuracy':>10} {'Latency 
(s)':>12}\")\n", + "print(f\" {'-'*15} {'-'*10} {'-'*12}\")\n", + "for score, name in candidates:\n", + " print(f\" {name:<15} {score['accuracy']:>10.2f} {score['latency_s']:>12.3f}\")\n", + "\n", + "# --- Scalar mode (baseline) ---\n", + "print(\"\\n--- Mode: SCALAR (baseline) ---\")\n", + "config_scalar = ObjectiveConfig(mode=\"scalar\")\n", + "best_idx = select_best(candidates, config_scalar)\n", + "print(f\" Selection: mean of dict values → max\")\n", + "print(f\" Winner: {candidates[best_idx][1]} (index {best_idx})\")\n", + "print(f\" Score: {candidates[best_idx][0]}\")\n", + "print(f\" Note: This is the CURRENT behavior — treats multi-metric as mean scalar.\")\n", + "\n", + "# --- Weighted mode: accuracy-heavy ---\n", + "print(\"\\n--- Mode: WEIGHTED (accuracy-heavy) ---\")\n", + "config_weighted_acc = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.8, \"latency_s\": 0.2},\n", + " minimize=frozenset({\"latency_s\"})\n", + ")\n", + "best_idx = select_best(candidates, config_weighted_acc)\n", + "print(f\" Weights: accuracy=0.8, latency_s=0.2 (minimized)\")\n", + "print(f\" Winner: {candidates[best_idx][1]} (index {best_idx})\")\n", + "score_dict = apply_minimize(candidates[best_idx][0], config_weighted_acc.minimize)\n", + "ws = weighted_scalarize(score_dict, config_weighted_acc.weights)\n", + "print(f\" Weighted score: {ws:.4f}\")\n", + "\n", + "# --- Weighted mode: latency-heavy ---\n", + "print(\"\\n--- Mode: WEIGHTED (latency-heavy) ---\")\n", + "config_weighted_lat = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.2, \"latency_s\": 0.8},\n", + " minimize=frozenset({\"latency_s\"})\n", + ")\n", + "best_idx_lat = select_best(candidates, config_weighted_lat)\n", + "print(f\" Weights: accuracy=0.2, latency_s=0.8 (minimized)\")\n", + "print(f\" Winner: {candidates[best_idx_lat][1]} (index {best_idx_lat})\")\n", + "score_dict_lat = apply_minimize(candidates[best_idx_lat][0], config_weighted_lat.minimize)\n", + "ws_lat = weighted_scalarize(score_dict_lat, config_weighted_lat.weights)\n", + "print(f\" Weighted score: {ws_lat:.4f}\")\n", + "\n", + "print(f\"\\n >>> Changing weights changes the winner!\")\n", + "print(f\" >>> Accuracy-heavy → {candidates[best_idx][1]}, Latency-heavy → {candidates[best_idx_lat][1]}\")\n", + "\n", + "# --- Pareto mode ---\n", + "print(\"\\n--- Mode: PARETO ---\")\n", + "config_pareto = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5}, # used for tie-breaking\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"weighted\",\n", + " seed=42\n", + ")\n", + "\n", + "# Show full Pareto ranking\n", + "score_dicts_norm = [apply_minimize(normalize_score(s), config_pareto.minimize) for s, _ in candidates]\n", + "ranks = pareto_rank(score_dicts_norm)\n", + "\n", + "print(f\"\\n Pareto Ranking (after minimize normalization):\")\n", + "print(f\" {'Name':<15} {'Accuracy':>10} {'Neg Latency':>12} {'Pareto Rank':>12}\")\n", + "print(f\" {'-'*15} {'-'*10} {'-'*12} {'-'*12}\")\n", + "for i, ((score, name), rank) in enumerate(zip(candidates, ranks)):\n", + " nd = score_dicts_norm[i]\n", + " print(f\" {name:<15} {nd['accuracy']:>10.2f} {nd['latency_s']:>12.3f} {rank:>12}\")\n", + "\n", + "front_indices = [i for i, r in enumerate(ranks) if r == 0]\n", + "print(f\"\\n Pareto Front (Rank 0): {[candidates[i][1] for i in front_indices]}\")\n", + "print(f\" These candidates represent TRADEOFFS — none is dominated by another.\")\n", + "\n", + "best_idx_pareto = 
select_best(candidates, config_pareto)\n", + "print(f\"\\n After tie-break (weighted, weights={{acc: 0.5, lat: 0.5}}):\")\n", + "print(f\" Winner: {candidates[best_idx_pareto][1]} (index {best_idx_pareto})\")\n", + "\n", + "# --- Top-k selection (Beamsearch simulation) ---\n", + "print(\"\\n--- Mode: PARETO (top-k for Beamsearch, k=3) ---\")\n", + "top_k_indices = select_top_k(candidates, config_pareto, k=3)\n", + "print(f\" Selected top-3:\")\n", + "for rank_pos, idx in enumerate(top_k_indices):\n", + " r = ranks[idx]\n", + " print(f\" #{rank_pos+1}: {candidates[idx][1]} (Pareto rank {r}, scores: {candidates[idx][0]})\")" + ] + }, + { + "cell_type": "markdown", + "id": "c1bdf524", + "metadata": {}, + "source": [ + "### 2.6 Deterministic Tie-Break Validation" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "dc6ea71d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "DETERMINISTIC TIE-BREAK VALIDATION\n", + "======================================================================\n", + "\n", + "--- Repeated runs with seed=42 ---\n", + " 10 runs with seed=42: indices = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]\n", + " All identical: True ✓\n", + "\n", + "--- Different seeds with random_seeded tie-break ---\n", + " seed= 0: winner = candidate_E (index 4)\n", + " seed= 1: winner = candidate_D (index 3)\n", + " seed= 2: winner = candidate_A (index 0)\n", + " seed=42: winner = candidate_D (index 3)\n", + " seed=99: winner = candidate_B (index 1)\n", + "\n", + "--- Determinism check for random_seeded (seed=42, 10 runs) ---\n", + " 10 runs: indices = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]\n", + " All identical: True ✓\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"DETERMINISTIC TIE-BREAK VALIDATION\")\n", + "print(\"=\" * 70)\n", + "\n", + "# Run selection 10 times with same seed — must produce identical results\n", + "print(\"\\n--- Repeated runs with seed=42 ---\")\n", + "results = []\n", + "for run in range(10):\n", + " config = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"weighted\",\n", + " seed=42\n", + " )\n", + " idx = select_best(candidates, config)\n", + " results.append(idx)\n", + "\n", + "all_same = len(set(results)) == 1\n", + "print(f\" 10 runs with seed=42: indices = {results}\")\n", + "print(f\" All identical: {all_same} ✓\" if all_same else f\" NOT identical: FAIL ✗\")\n", + "\n", + "# Different seed should potentially give different tie-break (if random_seeded)\n", + "print(\"\\n--- Different seeds with random_seeded tie-break ---\")\n", + "for seed in [0, 1, 2, 42, 99]:\n", + " config = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"random_seeded\",\n", + " seed=seed\n", + " )\n", + " idx = select_best(candidates, config)\n", + " print(f\" seed={seed:>2}: winner = {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Verify same seed is deterministic for random_seeded too\n", + "print(\"\\n--- Determinism check for random_seeded (seed=42, 10 runs) ---\")\n", + "results_random = []\n", + "for _ in range(10):\n", + " config = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " 
tie_break=\"random_seeded\",\n", + " seed=42\n", + " )\n", + " idx = select_best(candidates, config)\n", + " results_random.append(idx)\n", + "all_same_random = len(set(results_random)) == 1\n", + "print(f\" 10 runs: indices = {results_random}\")\n", + "print(f\" All identical: {all_same_random} ✓\" if all_same_random else f\" NOT identical: FAIL ✗\")" + ] + }, + { + "cell_type": "markdown", + "id": "3cc966d2", + "metadata": {}, + "source": [ + "### 2.7 Edge Cases" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4545dc3", + "metadata": {}, + "outputs": [], + "source": "print(\"\\n\" + \"=\" * 70)\nprint(\"EDGE CASES\")\nprint(\"=\" * 70)\n\n# Single-metric dict\nprint(\"\\n--- Single-metric dict with Pareto mode ---\")\nsingle_metric_candidates = [\n ({\"accuracy\": 0.9}, \"A\"),\n ({\"accuracy\": 0.8}, \"B\"),\n ({\"accuracy\": 0.95}, \"C\"),\n]\nconfig_single = ObjectiveConfig(mode=\"pareto\", tie_break=\"weighted\")\nbest = select_best(single_metric_candidates, config_single)\nprint(f\" Candidates: {[s for s, _ in single_metric_candidates]}\")\nprint(f\" Winner: {single_metric_candidates[best][1]} (index {best})\")\nprint(f\" Note: Pareto with 1 metric degenerates to scalar max — expected behavior.\")\n\n# Mixed float and dict\nprint(\"\\n--- Backward compat: float scores with ObjectiveConfig ---\")\nfloat_candidates = [\n (0.85, \"A\"),\n (0.92, \"B\"),\n (0.78, \"C\"),\n]\nconfig_float = ObjectiveConfig(mode=\"weighted\", weights={\"score\": 1.0})\nbest_float = select_best(float_candidates, config_float)\nprint(f\" Float candidates: {[s for s, _ in float_candidates]}\")\nprint(f\" Winner: {float_candidates[best_float][1]} (score={float_candidates[best_float][0]})\")\nprint(f\" Note: Floats normalized to {{'score': val}} — backward-compatible.\")\n\n# None config (pure backward compatibility)\nprint(\"\\n--- None config (current behavior) ---\")\nbest_none = select_best(float_candidates, None)\nprint(f\" config=None → scalar max → {float_candidates[best_none][1]} (score={float_candidates[best_none][0]})\")\nprint(f\" Identical to current max(candidates, key=lambda x: x[0])\")\n\n# Negative weight validation\nprint(\"\\n--- Negative weight validation ---\")\ntry:\n ObjectiveConfig(weights={\"accuracy\": 0.8, \"latency_s\": -0.2})\nexcept ValueError as e:\n print(f\" ObjectiveConfig(weights={{..., 'latency_s': -0.2}}) → ValueError: {e}\")\n print(f\" Note: Use minimize={{'latency_s'}} instead of negative weights.\")\n\n# Empty pareto_metrics validation\nprint(\"\\n--- Empty pareto_metrics validation ---\")\ntry:\n ObjectiveConfig(pareto_metrics=())\nexcept ValueError as e:\n print(f\" ObjectiveConfig(pareto_metrics=()) → ValueError: {e}\")\n print(f\" Note: Use None (auto-detect) or a non-empty tuple of metric names.\")" + }, + { + "cell_type": "markdown", + "id": "b510fdc9", + "metadata": {}, + "source": [ + "### 2.8 Visual Summary: Selection Behavior Comparison" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "c9abcad1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "SELECTION BEHAVIOR COMPARISON\n", + "======================================================================\n", + "\n", + " Mode Winner Reasoning\n", + " ------------------------- --------------- --------------------------------------------------\n", + " scalar (baseline) candidate_A mean of dict values → max\n", + " weighted (acc=0.8) 
candidate_A weighted sum with {'accuracy': 0.8, 'latency_s': 0.2}\n", + " weighted (lat=0.8) candidate_B weighted sum with {'accuracy': 0.2, 'latency_s': 0.8}\n", + " pareto (tie=weighted) candidate_C rank-0 front, tie-break=weighted\n", + "\n", + " >>> Different modes select different candidates from the SAME pool.\n", + " >>> This is exactly the behavior objectives.py will provide to trainers.\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"SELECTION BEHAVIOR COMPARISON\")\n", + "print(\"=\" * 70)\n", + "\n", + "print(f\"\\n {'Mode':<25} {'Winner':<15} {'Reasoning'}\")\n", + "print(f\" {'-'*25} {'-'*15} {'-'*50}\")\n", + "\n", + "modes = [\n", + " (\"scalar (baseline)\", config_scalar),\n", + " (\"weighted (acc=0.8)\", config_weighted_acc),\n", + " (\"weighted (lat=0.8)\", config_weighted_lat),\n", + " (\"pareto (tie=weighted)\", config_pareto),\n", + "]\n", + "\n", + "for mode_name, config in modes:\n", + " idx = select_best(candidates, config)\n", + " name = candidates[idx][1]\n", + " if config.mode == \"scalar\":\n", + " reason = \"mean of dict values → max\"\n", + " elif config.mode == \"weighted\":\n", + " reason = f\"weighted sum with {dict(config.weights)}\"\n", + " elif config.mode == \"pareto\":\n", + " reason = f\"rank-0 front, tie-break={config.tie_break}\"\n", + " print(f\" {mode_name:<25} {name:<15} {reason}\")\n", + "\n", + "print(f\"\\n >>> Different modes select different candidates from the SAME pool.\")\n", + "print(f\" >>> This is exactly the behavior objectives.py will provide to trainers.\")" + ] + }, + { + "cell_type": "markdown", + "id": "3f1ed487", + "metadata": {}, + "source": "---\n## Part 3: Architecture Summary\n\n### Two separate data paths (by design)\n\nThe training loop and selection path are **intentionally separate**. `guide.__call__()` / `get_feedback()` return type is NOT widened — the training loop always receives `float`.\n\n```\nTRAINING LOOP (unchanged):\n guide(x, target.data, info) → (float, str)\n │\n └── score (float) → np.mean(scores) → optimizer backward\n Always float. Never dict. Training loop is completely safe.\n\nSELECTION PATH (new):\n guide.get_score_dict(query, response, reference) → Dict[str, float]\n │\n ▼\n evaluate_vector() → List[Dict[str, float]] (one dict per example)\n │\n ▼\n aggregate_vector_scores() → Dict[str, float] (mean per metric)\n │\n ▼\n objectives.py (select_best / select_top_k)\n │\n ├── mode=\"scalar\" → max(mean_scores) ← unchanged\n ├── mode=\"weighted\" → max(weighted_scalarize()) ← new\n └── mode=\"pareto\" → pareto_rank() + tie-break ← new\n```\n\n**Key safety invariant:** `metric()` always returns `float`. If a guide's `get_feedback()` returns a dict as the score, `metric()` collapses it via `mean(values)`. 
Dict scores are only accessible through `get_score_dict()`.\n\n### Files to create/modify\n\n| Action | File | Milestone |\n|--------|------|-----------|\n| CREATE | `opto/trainer/objectives.py` | M1 |\n| MODIFY | `opto/trainer/guide.py` — add `get_score_dict()`, update `metric()` to collapse dicts to float | M1 |\n| MODIFY | `opto/trainer/evaluators.py` — add `evaluate_vector()`, `aggregate_vector_scores()` | M1 |\n| MODIFY | `basic_algorithms.py` | M1-M2 |\n| MODIFY | `beamsearch_algorithm.py` | M2 |\n| OPTIONAL | `priority_search.py` | M2 |" + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3e97bc57", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "M0 ANALYSIS COMPLETE\n", + "======================================================================\n", + "\n", + "Deliverables verified:\n", + " ✓ Current Guide score contract documented (Tuple[float, str])\n", + " ✓ Scalar selection points identified (BasicSearch max, Beamsearch sorted[:k])\n", + " ✓ Weighted selection produces different results with different weights\n", + " ✓ Pareto selection returns non-dominated tradeoff set\n", + " ✓ Deterministic tie-break verified (same seed → same result, 10 runs)\n", + " ✓ Edge cases validated (empty dict, single metric, float compat, None config)\n", + " ✓ Architecture summary with file list and data flow\n", + "\n", + "See docs/T6_technical_plan.md for the complete refined technical plan.\n", + "\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"M0 ANALYSIS COMPLETE\")\n", + "print(\"=\" * 70)\n", + "print(\"\"\"\n", + "Deliverables verified:\n", + " ✓ Current Guide score contract documented (Tuple[float, str])\n", + " ✓ Scalar selection points identified (BasicSearch max, Beamsearch sorted[:k])\n", + " ✓ Weighted selection produces different results with different weights\n", + " ✓ Pareto selection returns non-dominated tradeoff set\n", + " ✓ Deterministic tie-break verified (same seed → same result, 10 runs)\n", + " ✓ Edge cases validated (empty dict, single metric, float compat, None config)\n", + " ✓ Architecture summary with file list and data flow\n", + "\n", + "See docs/T6_technical_plan.md for the complete refined technical plan.\n", + "\"\"\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file From 3b2a0b29c0b3ba9cea64675340626f74b4f3c4c7 Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Tue, 10 Feb 2026 17:20:55 -0400 Subject: [PATCH 2/5] T6 M0: Apply Xavier's review fixes (paths, dates, motivation, real LLM required) --- examples/notebooks/t6_m0_analysis.ipynb | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-) diff --git a/examples/notebooks/t6_m0_analysis.ipynb b/examples/notebooks/t6_m0_analysis.ipynb index 90eefcad..2549d76a 100644 --- a/examples/notebooks/t6_m0_analysis.ipynb +++ b/examples/notebooks/t6_m0_analysis.ipynb @@ -35,22 +35,7 @@ "cell_type": "markdown", "id": "b1a58d26", "metadata": {}, - "source": [ - "# T6 
Multi-Objective Vector Scores — M0 Analysis\n", - "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)\n", - "\n", - "**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n", - "\n", - "This notebook demonstrates:\n", - "1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n", - "2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n", - "3. **Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n", - "\n", - "**No API keys required.** All examples use deterministic dummy data.\n", - "\n", - "---" - ] + "source": "# T6 Multi-Objective Vector Scores — M0 Analysis\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m0_analysis.ipynb)\n\n**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n\nThis notebook demonstrates:\n1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n3. **Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n\n**Motivation (why score-as-dict):** adding extra metrics into the *feedback dict/text* can help optimizers (OptoPrime/OPRO), but trainers typically only use the scalar score for ranking/UCB and ignore additional feedback structure. To enable Pareto/weighted multi-objective selection at the trainer level, we need vector score (score-as-dict) with backward-compatible scalar reduction.\n\n**No API keys required for M0.** All examples use deterministic dummy data. (From M1 onward, milestone notebooks must validate both StubLLM and real LLM modes.)\n\n---" }, { "cell_type": "markdown", @@ -405,7 +390,7 @@ "id": "fbf9e98b", "metadata": {}, "outputs": [], - "source": "print(\"\"\"\n=== Summary: Current Limitations ===\n\n1. Guide.metric() → float only (and stays float BY DESIGN)\n metric() will NOT be widened to return dicts.\n Dict scores flow through the NEW get_score_dict() path instead.\n\n2. evaluate() → np.array of floats\n Cannot aggregate per-metric means across examples.\n New evaluate_vector() will handle dict aggregation separately.\n\n3. BasicSearch: max(candidates, key=scalar)\n Cannot do weighted multi-metric selection or Pareto ranking\n\n4. Beamsearch: sorted(candidates, key=scalar)[:k]\n Cannot select top-k by Pareto dominance\n\n5. No ObjectiveConfig\n No way to declare minimize metrics, weights, or selection mode\n\n>>> All of the above will be addressed in M1-M2 without breaking existing behavior. <<<\n>>> Training loop (guide.__call__ → float) is NEVER modified. <<<\n\"\"\")" + "source": "print(\"\"\"\n=== Summary: Current Limitations ===\n\n0. Extra metrics in feedback are not usable by trainers today.\n Trainers typically rank/UCB using only the scalar score, and do not inspect feedback structure.\n\n1. Guide.metric() → float only (and stays float BY DESIGN)\n metric() will NOT be widened to return dicts.\n Dict scores flow through the NEW get_score_dict() path instead.\n\n2. evaluate() → np.array of floats\n Cannot aggregate per-metric means across examples.\n New evaluate_vector() will handle dict aggregation separately.\n\n3. 
BasicSearch: max(candidates, key=scalar)\n Cannot do weighted multi-metric selection or Pareto ranking\n\n4. Beamsearch: sorted(candidates, key=scalar)[:k]\n Cannot select top-k by Pareto dominance\n\n5. No ObjectiveConfig\n No way to declare minimize metrics, weights, or selection mode\n\n>>> All of the above will be addressed in M1-M2 without breaking existing behavior. <<<\n>>> Training loop (guide.__call__ → float) is NEVER modified. <<<\n\"\"\")" }, { "cell_type": "markdown", From 249bde6e187b6fc211a344fd328f6bcd35f92ff7 Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Tue, 10 Feb 2026 17:22:59 -0400 Subject: [PATCH 3/5] T6 M0: Apply Xavier's review fixes to technical plan --- docs/T6_technical_plan.md | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/docs/T6_technical_plan.md b/docs/T6_technical_plan.md index e37c0c8c..87f3e764 100644 --- a/docs/T6_technical_plan.md +++ b/docs/T6_technical_plan.md @@ -2,7 +2,7 @@ **Version:** 1.0 (Refined) **Author:** Carlos Rodriguez -**Date:** February 9, 2025 +**Date:** February 9, 2026 **Status:** M0 Deliverable — Analysis + Architecture + Interface Spec **Target repos / branches:** @@ -32,6 +32,9 @@ Today, trainer selection in Trace is driven by a **single scalar score**. Guides return `Tuple[float, str]` via `get_feedback()`, evaluators produce `np.array` of floats, and trainers (`BasicSearchAlgorithm`, `BeamsearchAlgorithm`) select candidates via scalar comparison (`max(candidates, key=lambda x: x[0])` and `sorted(..., key=lambda x: x[0])` respectively). This blocks trainer-side search from exploiting multiple metrics like `{accuracy, latency_ms, cost}`. +**Motivation note (from team discussion):** +Putting multiple metrics into the *feedback dict/text* is useful for optimizers (OptoPrime/OPRO), but trainers (BasicSearch/UCB/PrioritySearch/GEPA) typically only inspect the **scalar score** for ranking/UCB and ignore additional feedback structure. Therefore, enabling **vector score / score-as-dict** (with backward-compatible scalar reduction) is required for multi-objective trainer selection. + ### What this plan adds | Component | Change | @@ -516,7 +519,7 @@ def select_top_k(candidates: List[Tuple[ScoreLike, any]], | File | Change | Milestone | |------|--------|-----------| -| `opto/trainer/guide.py` | Add `get_score_dict()` method to `Guide` base class. Update `metric()` to collapse dict scores to `float` via `mean(values)` (return type stays `float`). | M1 | +| `opto/trainer/guide.py` | Add `get_score_dict()` method to `Guide` base class. Keep training loop scalar-safe (`metric()` returns `float`). Dict/vector scores are accessed via `get_score_dict()` for trainer-side selection. | M1 | | `opto/trainer/evaluators.py` | Add `evaluate_vector()` and `aggregate_vector_scores()`. Existing `evaluate()` unchanged. | M1 | | `opto/trainer/algorithms/basic_algorithms.py` | Add `objective_config` param to `BasicSearchAlgorithm.train()`. Replace `max(candidates, ...)` with `select_best()` in `optimizer_step()`. | M1 (minimal) / M2 (robust) | | `opto/trainer/algorithms/beamsearch_algorithm.py` | Add `objective_config` param to `BeamsearchAlgorithm.train()`. Replace scalar sort in `select()` with `select_top_k()`. 
| M2 | @@ -592,7 +595,7 @@ This `score` flows into `MinibatchAlgorithm.update()` where `np.mean(scores)` is | Constraint | Enforcement | |-----------|-------------| | `guide.__call__()` / `get_feedback()` return type is **NOT widened** | No changes to `get_feedback()` signature; it still returns `Tuple[float, str]` | -| Training loop always receives scalar `score` | `metric()` always returns `float` (collapses dict via `mean(values)` if needed) | +| Training loop always receives scalar `score` | `metric()` always returns `float`. Vector/dict scores are not used by the training loop and are accessed via `get_score_dict()` for trainer-side selection. | | Dict scores flow through a separate path | `get_score_dict()` → `evaluate_vector()` → `select_best()` / `select_top_k()` | | A multi-objective guide must return `(float, str)` from `get_feedback()` for the training loop | The float is a collapsed scalar summary; the full dict is extracted via `get_score_dict()` during selection | @@ -619,7 +622,7 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← **Deliverables:** - `docs/T6_technical_plan.md` — this document, finalized -- `notebooks/t6_m0_analysis.ipynb` — Colab-ready notebook +- `examples/notebooks/t6_m0_analysis.ipynb` — Colab-ready notebook **Notebook demonstrates:** - Current Guide score contract (`get_feedback` → `Tuple[float, str]`, `metric` → `float`) @@ -641,12 +644,12 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← - `opto/trainer/evaluators.py` (add `evaluate_vector`, `aggregate_vector_scores`) - `opto/trainer/algorithms/basic_algorithms.py` (BasicSearch: accept/use ObjectiveConfig) - `tests/test_objectives.py`, `tests/test_evaluators_vector.py` -- `notebooks/t6_m1_vector_scores.ipynb` +- `examples/notebooks/t6_m1_vector_scores.ipynb` **Notebook demonstrates:** - StubLLM mode: BasicSearchAlgorithm on small candidate set (5-10) with deterministic dummy guide returning dict metrics - Shows: (a) scalar baseline, (b) weighted mode, (c) Pareto mode, (d) deterministic tie-break under fixed seed -- Real LLM mode (optional): tiny dataset (≤5 items) producing ≥2 metrics +- Real LLM mode (required): tiny dataset (≤5 items) producing ≥2 metrics **SMART validation:** - `pytest -q` passes (all new functions covered) @@ -663,7 +666,7 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← - Expanded BasicSearch tests (edge cases, missing metrics, tie-break policies) - Optional: minimal PrioritySearch support (weighted scalarization for heap, dict stored for logging) - `tests/test_trainers_multiobjective.py` -- `notebooks/t6_m2_trainers.ipynb` +- `examples/notebooks/t6_m2_trainers.ipynb` **Notebook demonstrates:** - BasicSearch + Beamsearch in: scalar mode (baseline), weighted mode, Pareto mode @@ -681,11 +684,18 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← **Deliverables:** - PR to Trace-Bench: benchmark configs/tasks + notebook + - **Trace-Bench touchpoints (update `main` if default branch differs):** + - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/trainers_benchmark.py + - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/trainers_benchmark_tasks_validation.py + - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/benchmark_tasks/index.json + - https://github.com/AgentOpt/Trace-Bench/tree/main/LLM4AD/benchmark_tasks + - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/llm4ad_loader.py + - 
https://github.com/AgentOpt/Trace-Bench/blob/main/tests/test_lite_optimize_llm4ad.py - 3 benchmarks: 1. **Accuracy vs latency** (toy QA dataset) 2. **Accuracy vs response length** (penalize verbosity) 3. **Accuracy vs tool calls** (penalize excessive tool usage) -- `notebooks/t6_m3_benchmarks.ipynb` +- Trace-Bench notebook: `notebooks/t6_multiobjective_benchmarks.ipynb` (in Trace-Bench repo) **SMART validation:** - Notebook outputs per-benchmark table: weighted-mode best candidate metrics + Pareto-mode set of tradeoffs @@ -763,7 +773,7 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← Each notebook contains: - **StubLLM (no keys) section:** deterministic dummy guide, runs quickly -- **Real LLM section (optional):** small N (5-20 examples), prints cost/latency caveats, requires API key +- **Real LLM section (required):** small N (5-20 examples), prints cost/latency caveats, requires API key --- From 2213a191bc2ba06274df1e657ba9a2e434e308ce Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Thu, 12 Feb 2026 11:50:07 -0400 Subject: [PATCH 4/5] =?UTF-8?q?T6=20M1:=20Multi-objective=20vector=20score?= =?UTF-8?q?s=20=E2=80=94=20ObjectiveConfig,=20objectives.py,=20evaluate=5F?= =?UTF-8?q?vector,=20BasicSearch=20integration,=2059=20tests?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- examples/notebooks/t6_m1_vector_scores.ipynb | 810 +++++++++++++++++++ opto/trainer/algorithms/basic_algorithms.py | 95 ++- opto/trainer/evaluators.py | 74 +- opto/trainer/guide.py | 16 + opto/trainer/objectives.py | 312 +++++++ tests/unit_tests/test_evaluators_vector.py | 154 ++++ tests/unit_tests/test_objectives.py | 383 +++++++++ 7 files changed, 1823 insertions(+), 21 deletions(-) create mode 100644 examples/notebooks/t6_m1_vector_scores.ipynb create mode 100644 opto/trainer/objectives.py create mode 100644 tests/unit_tests/test_evaluators_vector.py create mode 100644 tests/unit_tests/test_objectives.py diff --git a/examples/notebooks/t6_m1_vector_scores.ipynb b/examples/notebooks/t6_m1_vector_scores.ipynb new file mode 100644 index 00000000..637322d0 --- /dev/null +++ b/examples/notebooks/t6_m1_vector_scores.ipynb @@ -0,0 +1,810 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "a0000001", + "metadata": {}, + "outputs": [], + "source": [ + "\"\"\"\n", + "T6 Milestone 1 — Multi-Objective Vector Scores\n", + "\n", + "This notebook is the M1 deliverable for the T6 Multi-Objective Vector Scores project.\n", + "It demonstrates:\n", + " 1. ObjectiveConfig creation and validation\n", + " 2. MultiMetricGuide with get_score_dict()\n", + " 3. evaluate_vector() + aggregate_vector_scores()\n", + " 4. Full BasicSearchAlgorithm.train() with DummyLLM + objective_config\n", + " 5. Scalar baseline comparison (backward compat)\n", + " 6. 
Pareto mode demo + deterministic tiebreak\n", + "\n", + "Part A runs end-to-end WITHOUT API keys (StubLLM / DummyLLM).\n", + "Part B requires an OpenRouter API key (Colab secrets or environment variable).\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "a0000002", + "metadata": {}, + "source": [ + "# T6 Multi-Objective Vector Scores — M1 Implementation\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m1_vector_scores.ipynb)\n", + "\n", + "**Milestone 1 Deliverable** — Core multi-objective infrastructure\n", + "\n", + "This notebook demonstrates the M1 implementation:\n", + "1. **ObjectiveConfig**: Frozen dataclass for multi-objective selection configuration\n", + "2. **Vector score path**: `get_score_dict()` → `evaluate_vector()` → `aggregate_vector_scores()` → `select_best()`\n", + "3. **BasicSearch integration**: Training with `objective_config` parameter (weighted + Pareto modes)\n", + "4. **Backward compatibility**: `objective_config=None` produces identical behavior to baseline\n", + "\n", + "**Part A (StubLLM):** No API keys required. Uses `DummyLLM` for deterministic end-to-end training.\n", + "\n", + "**Part B (Real LLM):** Requires `OPENROUTER_API_KEY` via Colab secrets or env var. Uses `google/gemini-2.0-flash-001`.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "a0000003", + "metadata": {}, + "source": [ + "## How to Validate This Milestone\n", + "\n", + "After running all cells, confirm:\n", + "- [ ] ObjectiveConfig creation and validation work correctly\n", + "- [ ] MultiMetricGuide returns `Dict[str, float]` from `get_score_dict()`\n", + "- [ ] `evaluate_vector()` returns `List[Dict[str, float]]`\n", + "- [ ] `aggregate_vector_scores()` computes per-metric means\n", + "- [ ] BasicSearch with `objective_config=None` (scalar) trains successfully\n", + "- [ ] BasicSearch with weighted `objective_config` selects differently than scalar\n", + "- [ ] Pareto mode produces deterministic results with same seed\n", + "- [ ] Real LLM section (Part B) trains with actual model + multi-metric guide" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000004", + "metadata": {}, + "outputs": [], + "source": "import sys, os\n\n# Ensure OpenTrace root is on the path (needed when running from examples/notebooks/)\n_repo_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))\nif os.path.isdir(os.path.join(_repo_root, 'opto')):\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n# Also handle running directly from the repo root\nif os.path.isdir(os.path.join(os.getcwd(), 'opto')):\n if os.getcwd() not in sys.path:\n sys.path.insert(0, os.getcwd())\n\nimport numpy as np\nfrom typing import Dict, Tuple, Optional\n\nprint(\"=\" * 70)\nprint(\"T6 M1 \\u2014 Multi-Objective Vector Scores\")\nprint(\"=\" * 70)" + }, + { + "cell_type": "markdown", + "id": "a0000005", + "metadata": {}, + "source": [ + "---\n", + "## Part A: StubLLM (No API Key Required)\n", + "\n", + "### A.1 ObjectiveConfig Creation & Validation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000006", + "metadata": {}, + "outputs": [], + "source": [ + "from opto.trainer.objectives import (\n", + " ObjectiveConfig, normalize_score, apply_minimize,\n", + " weighted_scalarize, dominates, pareto_rank, select_best, select_top_k,\n", + ")\n", + "\n", + "print(\"--- ObjectiveConfig defaults 
---\")\n", + "config_default = ObjectiveConfig()\n", + "print(f\" mode={config_default.mode}, weights={config_default.weights}, \"\n", + " f\"minimize={config_default.minimize}\")\n", + "\n", + "print(\"\\n--- ObjectiveConfig: weighted mode ---\")\n", + "config_weighted = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.8, \"latency_s\": 0.2},\n", + " minimize=frozenset({\"latency_s\"}),\n", + ")\n", + "print(f\" mode={config_weighted.mode}\")\n", + "print(f\" weights={config_weighted.weights}\")\n", + "print(f\" minimize={config_weighted.minimize}\")\n", + "\n", + "print(\"\\n--- ObjectiveConfig: Pareto mode ---\")\n", + "config_pareto = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"weighted\",\n", + " seed=42,\n", + ")\n", + "print(f\" mode={config_pareto.mode}, tie_break={config_pareto.tie_break}, seed={config_pareto.seed}\")\n", + "\n", + "print(\"\\n--- ObjectiveConfig: set auto-converts to frozenset ---\")\n", + "config_set = ObjectiveConfig(minimize={\"lat\"})\n", + "print(f\" type(minimize)={type(config_set.minimize).__name__} (auto-converted from set)\")\n", + "\n", + "print(\"\\n--- Validation: negative weight ---\")\n", + "try:\n", + " ObjectiveConfig(weights={\"a\": -0.5})\n", + "except ValueError as e:\n", + " print(f\" Caught: {e}\")\n", + "\n", + "print(\"\\n--- Validation: bad mode ---\")\n", + "try:\n", + " ObjectiveConfig(mode=\"unknown\")\n", + "except ValueError as e:\n", + " print(f\" Caught: {e}\")\n", + "\n", + "print(\"\\n--- Frozen (immutable) ---\")\n", + "try:\n", + " config_default.mode = \"weighted\"\n", + "except AttributeError as e:\n", + " print(f\" Caught: {e}\")\n", + "\n", + "print(\"\\nObjectiveConfig validation: all checks passed.\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000007", + "metadata": {}, + "source": [ + "### A.2 MultiMetricGuide with `get_score_dict()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000008", + "metadata": {}, + "outputs": [], + "source": [ + "from opto.trainer.guide import Guide\n", + "\n", + "\n", + "class MultiMetricGuide(Guide):\n", + " \"\"\"Guide that returns multi-metric score dicts.\n", + "\n", + " Evaluates accuracy (exact match) and brevity (inverse length difference).\n", + " The training loop still calls get_feedback() -> (float, str).\n", + " The selection path calls get_score_dict() -> Dict[str, float].\n", + " \"\"\"\n", + "\n", + " def get_feedback(self, query, response, reference=None, **kwargs):\n", + " accuracy = 1.0 if str(response).strip().lower() == str(reference).strip().lower() else 0.0\n", + " len_diff = abs(len(str(response)) - len(str(reference)))\n", + " brevity = 1.0 / (1.0 + len_diff)\n", + " feedback = f\"Expected '{reference}', got '{response}'. \"\n", + " if accuracy < 1.0:\n", + " feedback += \"Incorrect. 
Please provide the exact expected answer.\"\n", + " else:\n", + " feedback += \"Correct!\"\n", + " # Training loop gets scalar (accuracy) + feedback string\n", + " return accuracy, feedback\n", + "\n", + " def get_score_dict(self, query, response, reference=None, **kwargs):\n", + " accuracy = 1.0 if str(response).strip().lower() == str(reference).strip().lower() else 0.0\n", + " len_diff = abs(len(str(response)) - len(str(reference)))\n", + " brevity = 1.0 / (1.0 + len_diff)\n", + " return {\"accuracy\": accuracy, \"brevity\": brevity}\n", + "\n", + "\n", + "# Demonstrate both paths\n", + "guide = MultiMetricGuide()\n", + "\n", + "print(\"--- Training path: get_feedback() -> (float, str) ---\")\n", + "score, feedback = guide.get_feedback(\"Q: 2+2\", \"4\", \"4\")\n", + "print(f\" score={score} (type={type(score).__name__})\")\n", + "print(f\" feedback='{feedback}'\")\n", + "\n", + "print(\"\\n--- Selection path: get_score_dict() -> Dict[str, float] ---\")\n", + "sd = guide.get_score_dict(\"Q: 2+2\", \"4\", \"4\")\n", + "print(f\" score_dict={sd}\")\n", + "\n", + "print(\"\\n--- metric() still returns float (backward compat) ---\")\n", + "m = guide.metric(\"Q: 2+2\", \"4\", \"4\")\n", + "print(f\" metric()={m} (type={type(m).__name__})\")\n", + "\n", + "print(\"\\n--- Base Guide without get_score_dict override wraps scalar ---\")\n", + "class ScalarOnlyGuide(Guide):\n", + " def get_feedback(self, query, response, reference=None, **kwargs):\n", + " return 0.75, \"some feedback\"\n", + "\n", + "fallback = ScalarOnlyGuide()\n", + "print(f\" get_score_dict()={fallback.get_score_dict('q', 'r', 'ref')}\")\n", + "print(\" (wrapped as {{'score': 0.75}} automatically)\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000009", + "metadata": {}, + "source": [ + "### A.3 `evaluate_vector()` + `aggregate_vector_scores()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000010", + "metadata": {}, + "outputs": [], + "source": [ + "from opto import trace\n", + "from opto.trainer.evaluators import evaluate_vector, aggregate_vector_scores\n", + "\n", + "\n", + "@trace.model\n", + "class StubAgent:\n", + " \"\"\"Agent with a trainable string parameter. 
Returns it directly.\"\"\"\n", + " def __init__(self, answer):\n", + " self.answer = trace.node(answer, trainable=True)\n", + "\n", + " def forward(self, x):\n", + " return self.answer\n", + "\n", + "\n", + "agent = StubAgent(\"4\")\n", + "guide = MultiMetricGuide()\n", + "\n", + "inputs = [\"What is 2+2?\", \"What is 3+1?\", \"What is 5-1?\"]\n", + "infos = [\"4\", \"4\", \"4\" ] # all expect \"4\"\n", + "\n", + "print(\"--- evaluate_vector() ---\")\n", + "score_dicts = evaluate_vector(agent, guide, inputs, infos, num_threads=1)\n", + "for i, sd in enumerate(score_dicts):\n", + " print(f\" Example {i}: {sd}\")\n", + "\n", + "print(\"\\n--- aggregate_vector_scores() ---\")\n", + "agg = aggregate_vector_scores(score_dicts)\n", + "print(f\" Aggregated (per-metric mean): {agg}\")\n", + "\n", + "# Now test with wrong answer\n", + "agent_wrong = StubAgent(\"five\")\n", + "print(\"\\n--- Wrong answer agent ---\")\n", + "score_dicts_wrong = evaluate_vector(agent_wrong, guide, inputs, infos, num_threads=1)\n", + "for i, sd in enumerate(score_dicts_wrong):\n", + " print(f\" Example {i}: {sd}\")\n", + "agg_wrong = aggregate_vector_scores(score_dicts_wrong)\n", + "print(f\" Aggregated: {agg_wrong}\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000011", + "metadata": {}, + "source": [ + "### A.4 Selection with `select_best()` and `select_top_k()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000012", + "metadata": {}, + "outputs": [], + "source": [ + "# Candidates: (score_dict, payload) tuples\n", + "candidates = [\n", + " ({\"accuracy\": 0.95, \"latency_s\": 0.200}, \"prompt_A\"),\n", + " ({\"accuracy\": 0.70, \"latency_s\": 0.030}, \"prompt_B\"),\n", + " ({\"accuracy\": 0.88, \"latency_s\": 0.080}, \"prompt_C\"),\n", + " ({\"accuracy\": 0.60, \"latency_s\": 0.020}, \"prompt_D\"),\n", + "]\n", + "\n", + "print(\"Candidates:\")\n", + "for s, name in candidates:\n", + " print(f\" {name}: {s}\")\n", + "\n", + "# Scalar mode (backward-compat)\n", + "print(\"\\n--- select_best(config=None) [scalar, backward-compat] ---\")\n", + "idx = select_best(candidates, None)\n", + "print(f\" Winner: {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Weighted: accuracy-heavy\n", + "print(\"\\n--- select_best(weighted, accuracy=0.8) ---\")\n", + "config_acc = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.8, \"latency_s\": 0.2},\n", + " minimize=frozenset({\"latency_s\"}),\n", + ")\n", + "idx = select_best(candidates, config_acc)\n", + "print(f\" Winner: {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Weighted: latency-heavy\n", + "print(\"\\n--- select_best(weighted, latency_s=0.8) ---\")\n", + "config_lat = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.2, \"latency_s\": 0.8},\n", + " minimize=frozenset({\"latency_s\"}),\n", + ")\n", + "idx = select_best(candidates, config_lat)\n", + "print(f\" Winner: {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Pareto mode\n", + "print(\"\\n--- select_best(pareto, tie_break=weighted) ---\")\n", + "config_par = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"weighted\",\n", + ")\n", + "score_dicts_norm = [apply_minimize(normalize_score(s), config_par.minimize) for s, _ in candidates]\n", + "ranks = pareto_rank(score_dicts_norm)\n", + "print(f\" Pareto ranks: {ranks}\")\n", + "print(f\" Front (rank 0): {[candidates[i][1] for i, r in enumerate(ranks) if 
r == 0]}\")\n", + "idx = select_best(candidates, config_par)\n", + "print(f\" Winner (after tie-break): {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Deterministic check\n", + "print(\"\\n--- Determinism: 10 runs with same config ---\")\n", + "results = [select_best(candidates, config_par) for _ in range(10)]\n", + "print(f\" Results: {results}\")\n", + "print(f\" All identical: {len(set(results)) == 1}\")\n", + "\n", + "# Top-k\n", + "print(\"\\n--- select_top_k(pareto, k=2) ---\")\n", + "top2 = select_top_k(candidates, config_par, k=2)\n", + "print(f\" Top 2: {[candidates[i][1] for i in top2]}\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000013", + "metadata": {}, + "source": [ + "### A.5 Full Training: BasicSearch with DummyLLM (scalar baseline)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000014", + "metadata": {}, + "outputs": [], + "source": [ + "from opto.utils.llm import DummyLLM\n", + "from opto.optimizers import OptoPrimeV2\n", + "from opto.trainer.algorithms.basic_algorithms import BasicSearchAlgorithm\n", + "\n", + "# --- Dataset: simple Q&A ---\n", + "dataset = dict(\n", + " inputs=[\"What is 2+2?\", \"What is 3+1?\", \"What is 10-6?\"],\n", + " infos= [\"4\", \"4\", \"4\" ],\n", + ")\n", + "\n", + "# --- DummyLLM: always proposes the same system prompt ---\n", + "def stub_llm_fn(*args, **kwargs):\n", + " \"\"\"Deterministic LLM stub: always returns a fixed response.\"\"\"\n", + " return \"You are a math assistant. Always answer with just the number.\"\n", + "\n", + "dummy_llm = DummyLLM(stub_llm_fn)\n", + "\n", + "# --- Agent ---\n", + "@trace.model\n", + "class MathAgent:\n", + " def __init__(self, llm):\n", + " self.system_prompt = trace.node(\n", + " \"You are a helpful assistant.\", trainable=True\n", + " )\n", + " self.llm = llm\n", + "\n", + " @trace.bundle()\n", + " def call_llm(self, system_prompt, question):\n", + " resp = self.llm(\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": question},\n", + " ]\n", + " )\n", + " return resp.choices[0].message.content\n", + "\n", + " def forward(self, x):\n", + " return self.call_llm(self.system_prompt, x)\n", + "\n", + "# --- Scalar baseline (objective_config=None) ---\n", + "print(\"=\" * 70)\n", + "print(\"TRAINING: Scalar baseline (objective_config=None)\")\n", + "print(\"=\" * 70)\n", + "\n", + "agent_scalar = MathAgent(dummy_llm)\n", + "optimizer_scalar = OptoPrimeV2(agent_scalar.parameters(), llm=dummy_llm)\n", + "trainer_scalar = BasicSearchAlgorithm(\n", + " agent=agent_scalar, optimizer=optimizer_scalar\n", + ")\n", + "\n", + "guide_scalar = MultiMetricGuide()\n", + "scores_scalar, test_score_scalar = trainer_scalar.train(\n", + " guide=guide_scalar,\n", + " train_dataset=dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=None, # scalar baseline\n", + ")\n", + "\n", + "print(f\"\\nScalar training scores: {scores_scalar}\")\n", + "print(f\"current_score: {trainer_scalar.current_score}\")\n", + "print(f\"current_score_dict: {trainer_scalar.current_score_dict}\")\n", + "print(\"(current_score_dict is None because scalar mode does not use vector path)\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000015", + "metadata": {}, + "source": [ + "### A.6 Full Training: BasicSearch with DummyLLM (weighted mode)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000016", + "metadata": {}, + "outputs": [], 
+ "source": [ + "print(\"=\" * 70)\n", + "print(\"TRAINING: Weighted mode (objective_config.mode='weighted')\")\n", + "print(\"=\" * 70)\n", + "\n", + "config_weighted_train = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.7, \"brevity\": 0.3},\n", + ")\n", + "\n", + "agent_weighted = MathAgent(dummy_llm)\n", + "optimizer_weighted = OptoPrimeV2(agent_weighted.parameters(), llm=dummy_llm)\n", + "trainer_weighted = BasicSearchAlgorithm(\n", + " agent=agent_weighted, optimizer=optimizer_weighted\n", + ")\n", + "\n", + "guide_weighted = MultiMetricGuide()\n", + "scores_weighted, test_score_weighted = trainer_weighted.train(\n", + " guide=guide_weighted,\n", + " train_dataset=dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=config_weighted_train,\n", + ")\n", + "\n", + "print(f\"\\nWeighted training scores: {scores_weighted}\")\n", + "print(f\"current_score (float): {trainer_weighted.current_score}\")\n", + "print(f\"current_score_dict: {trainer_weighted.current_score_dict}\")\n", + "print(\"(current_score_dict stores the vector score selected by weighted mode)\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000017", + "metadata": {}, + "source": [ + "### A.7 Full Training: BasicSearch with DummyLLM (Pareto mode)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000018", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"=\" * 70)\n", + "print(\"TRAINING: Pareto mode (objective_config.mode='pareto')\")\n", + "print(\"=\" * 70)\n", + "\n", + "config_pareto_train = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"brevity\": 0.5},\n", + " tie_break=\"weighted\",\n", + " seed=42,\n", + ")\n", + "\n", + "agent_pareto = MathAgent(dummy_llm)\n", + "optimizer_pareto = OptoPrimeV2(agent_pareto.parameters(), llm=dummy_llm)\n", + "trainer_pareto = BasicSearchAlgorithm(\n", + " agent=agent_pareto, optimizer=optimizer_pareto\n", + ")\n", + "\n", + "guide_pareto = MultiMetricGuide()\n", + "scores_pareto, test_score_pareto = trainer_pareto.train(\n", + " guide=guide_pareto,\n", + " train_dataset=dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=config_pareto_train,\n", + ")\n", + "\n", + "print(f\"\\nPareto training scores: {scores_pareto}\")\n", + "print(f\"current_score (float): {trainer_pareto.current_score}\")\n", + "print(f\"current_score_dict: {trainer_pareto.current_score_dict}\")\n", + "\n", + "# Verify determinism: run again with same seed\n", + "print(\"\\n--- Determinism: re-run with same seed ---\")\n", + "agent_pareto2 = MathAgent(dummy_llm)\n", + "optimizer_pareto2 = OptoPrimeV2(agent_pareto2.parameters(), llm=dummy_llm)\n", + "trainer_pareto2 = BasicSearchAlgorithm(\n", + " agent=agent_pareto2, optimizer=optimizer_pareto2\n", + ")\n", + "scores_pareto2, _ = trainer_pareto2.train(\n", + " guide=MultiMetricGuide(),\n", + " train_dataset=dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=config_pareto_train,\n", + ")\n", + "print(f\"Run 1 current_score_dict: {trainer_pareto.current_score_dict}\")\n", + "print(f\"Run 2 current_score_dict: {trainer_pareto2.current_score_dict}\")\n", + "match = trainer_pareto.current_score_dict == trainer_pareto2.current_score_dict\n", + "print(f\"Deterministic: {match}\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000019", + 
"metadata": {}, + "source": [ + "### A.8 Summary: StubLLM Section" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000020", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"PART A COMPLETE — StubLLM Section\")\n", + "print(\"=\" * 70)\n", + "print(\"\"\"\n", + "Verified:\n", + " ✓ ObjectiveConfig creation, validation, and immutability\n", + " ✓ MultiMetricGuide: get_feedback() -> (float, str) for training loop\n", + " ✓ MultiMetricGuide: get_score_dict() -> Dict[str, float] for selection path\n", + " ✓ evaluate_vector() returns List[Dict[str, float]]\n", + " ✓ aggregate_vector_scores() computes per-metric means\n", + " ✓ select_best(): scalar, weighted, Pareto modes all work\n", + " ✓ BasicSearch training: scalar baseline (objective_config=None)\n", + " ✓ BasicSearch training: weighted mode with vector score selection\n", + " ✓ BasicSearch training: Pareto mode with deterministic tie-break\n", + " ✓ current_score stays float, current_score_dict stores vector\n", + "\"\"\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000021", + "metadata": {}, + "source": [ + "---\n", + "## Part B: Real LLM (API Key Required)\n", + "\n", + "This section trains a real LLM agent using `CustomLLM` with OpenRouter.\n", + "\n", + "**Requirements:**\n", + "- **Colab:** Set `OPENROUTER_API_KEY` in Colab Secrets (key icon in sidebar)\n", + "- **Local:** `export OPENROUTER_API_KEY=sk-or-v1-...` in your shell, or set in `.env`\n", + "\n", + "Uses model `google/gemini-2.0-flash-001` via OpenRouter (very cheap, fast)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000022", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# Try Colab secrets first, then environment variable\n", + "api_key = None\n", + "try:\n", + " from google.colab import userdata\n", + " api_key = userdata.get('OPENROUTER_API_KEY')\n", + " print(\"API key loaded from Colab secrets.\")\n", + "except (ImportError, Exception):\n", + " pass\n", + "\n", + "if not api_key:\n", + " api_key = os.environ.get('OPENROUTER_API_KEY')\n", + " if api_key:\n", + " print(\"API key loaded from environment variable.\")\n", + "\n", + "if not api_key:\n", + " # Try loading from .env file in project root\n", + " env_path = os.path.join(os.getcwd(), '.env')\n", + " if not os.path.exists(env_path):\n", + " env_path = os.path.join(os.path.dirname(os.getcwd()), '.env')\n", + " if os.path.exists(env_path):\n", + " with open(env_path) as f:\n", + " for line in f:\n", + " line = line.strip()\n", + " if line.startswith('OPENROUTER_API_KEY='):\n", + " api_key = line.split('=', 1)[1].strip()\n", + " print(f\"API key loaded from {env_path}\")\n", + " break\n", + "\n", + "if not api_key:\n", + " print(\"WARNING: No OPENROUTER_API_KEY found. Part B cells will be skipped.\")\n", + " print(\"Set it via: Colab Secrets, env var, or .env file.\")\n", + "else:\n", + " # Configure CustomLLM environment\n", + " os.environ['TRACE_CUSTOMLLM_URL'] = 'https://openrouter.ai/api/v1'\n", + " os.environ['TRACE_CUSTOMLLM_API_KEY'] = api_key\n", + " os.environ['TRACE_CUSTOMLLM_MODEL'] = 'google/gemini-2.0-flash-001'\n", + " print(\"CustomLLM configured for OpenRouter (google/gemini-2.0-flash-001).\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000023", + "metadata": {}, + "outputs": [], + "source": [ + "# Skip this cell if no API key\n", + "if not api_key:\n", + " print(\"Skipping: no API key. 
Set OPENROUTER_API_KEY to run Part B.\")\n", + "else:\n", + " from opto.utils.llm import CustomLLM\n", + "\n", + " real_llm = CustomLLM(model='google/gemini-2.0-flash-001')\n", + "\n", + " # Quick smoke test\n", + " print(\"--- Smoke test: real LLM call ---\")\n", + " resp = real_llm(messages=[\n", + " {\"role\": \"user\", \"content\": \"What is 2+2? Answer with just the number.\"}\n", + " ])\n", + " print(f\" Response: {resp.choices[0].message.content}\")\n", + " print(\" LLM connection verified.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000024", + "metadata": {}, + "outputs": [], + "source": [ + "# Real LLM training with weighted multi-objective selection\n", + "if not api_key:\n", + " print(\"Skipping: no API key.\")\n", + "else:\n", + " print(\"=\" * 70)\n", + " print(\"REAL LLM TRAINING: Weighted mode with multi-metric guide\")\n", + " print(\"=\" * 70)\n", + "\n", + " # Small dataset to keep costs low\n", + " real_dataset = dict(\n", + " inputs=[\"What is 7+3?\", \"What is 15-9?\", \"What is 4*3?\"],\n", + " infos= [\"10\", \"6\", \"12\" ],\n", + " )\n", + "\n", + " real_config = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.7, \"brevity\": 0.3},\n", + " )\n", + "\n", + " real_agent = MathAgent(real_llm)\n", + " real_optimizer = OptoPrimeV2(real_agent.parameters(), llm=real_llm)\n", + " real_trainer = BasicSearchAlgorithm(\n", + " agent=real_agent, optimizer=real_optimizer\n", + " )\n", + "\n", + " real_guide = MultiMetricGuide()\n", + " real_scores, real_test = real_trainer.train(\n", + " guide=real_guide,\n", + " train_dataset=real_dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=real_config,\n", + " )\n", + "\n", + " print(f\"\\nReal LLM training scores: {real_scores}\")\n", + " print(f\"current_score (float): {real_trainer.current_score}\")\n", + " print(f\"current_score_dict: {real_trainer.current_score_dict}\")\n", + " print(f\"\\nFinal system prompt: {real_agent.system_prompt.data}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000025", + "metadata": {}, + "outputs": [], + "source": [ + "# Real LLM: Pareto mode comparison\n", + "if not api_key:\n", + " print(\"Skipping: no API key.\")\n", + "else:\n", + " print(\"=\" * 70)\n", + " print(\"REAL LLM TRAINING: Pareto mode for comparison\")\n", + " print(\"=\" * 70)\n", + "\n", + " pareto_config = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"brevity\": 0.5},\n", + " tie_break=\"weighted\",\n", + " seed=42,\n", + " )\n", + "\n", + " pareto_agent = MathAgent(real_llm)\n", + " pareto_optimizer = OptoPrimeV2(pareto_agent.parameters(), llm=real_llm)\n", + " pareto_trainer = BasicSearchAlgorithm(\n", + " agent=pareto_agent, optimizer=pareto_optimizer\n", + " )\n", + "\n", + " pareto_scores, _ = pareto_trainer.train(\n", + " guide=MultiMetricGuide(),\n", + " train_dataset=real_dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=pareto_config,\n", + " )\n", + "\n", + " print(f\"\\nPareto training scores: {pareto_scores}\")\n", + " print(f\"current_score_dict: {pareto_trainer.current_score_dict}\")\n", + "\n", + " print(\"\\n--- Comparison ---\")\n", + " print(f\"Weighted mode final: {real_trainer.current_score_dict}\")\n", + " print(f\"Pareto mode final: {pareto_trainer.current_score_dict}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"id": "a0000026", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"M1 NOTEBOOK COMPLETE\")\n", + "print(\"=\" * 70)\n", + "print(\"\"\"\n", + "Deliverables verified:\n", + " ✓ Part A (StubLLM): All cells run without API keys\n", + " - ObjectiveConfig creation + validation\n", + " - MultiMetricGuide with get_score_dict()\n", + " - evaluate_vector() + aggregate_vector_scores()\n", + " - BasicSearch: scalar, weighted, and Pareto modes\n", + " - Backward compatibility (objective_config=None)\n", + " - Deterministic tie-break verification\n", + "\n", + " ✓ Part B (Real LLM): Trained with actual model via OpenRouter\n", + " - Weighted and Pareto mode with real LLM proposals\n", + " - Multi-metric selection (accuracy + brevity)\n", + " - current_score_dict populated with real scores\n", + "\"\"\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/opto/trainer/algorithms/basic_algorithms.py b/opto/trainer/algorithms/basic_algorithms.py index 691b14a8..50ea7842 100644 --- a/opto/trainer/algorithms/basic_algorithms.py +++ b/opto/trainer/algorithms/basic_algorithms.py @@ -6,7 +6,8 @@ from opto.trainer.loader import DataLoader from opto.trainer.utils import batch_run, async_run from opto.optimizers.utils import print_color -from opto.trainer.evaluators import evaluate +from opto.trainer.evaluators import evaluate, evaluate_vector, aggregate_vector_scores +from opto.trainer.objectives import ObjectiveConfig, select_best def standard_optimization_step(agent, x, guide, info, min_score=0): @@ -533,6 +534,7 @@ def train(self, validate_dataset = None, # dataset of (x, info) pairs to evaluate the agent for candidate selection validate_guide = None, # to provide scores for the validation set num_proposals = 4, # number of proposals to get from the optimizer + objective_config = None, # optional ObjectiveConfig for multi-objective selection num_epochs = 1, # number of training epochs batch_size = 1, # batch size for updating the agent test_dataset = None, # dataset of (x, info) pairs to evaluate the agent @@ -549,6 +551,8 @@ def train(self, self.validate_guide = validate_guide or guide self.min_score = min_score self.current_score = None + self.objective_config = objective_config + self.current_score_dict = None # stores vector score when using multi-objective return super().train(guide, train_dataset, num_epochs=num_epochs, batch_size=batch_size, test_dataset=test_dataset, test_frequency=test_frequency, log_frequency=log_frequency, @@ -571,6 +575,21 @@ def validate(): description="Validating proposals") return np.mean(scores) if all([s is not None for s in scores]) else -np.inf + def validate_vector(): + """ Validate and return aggregated vector score dict. 
""" + score_dicts = evaluate_vector(self.agent, + self.validate_guide, + self.validate_dataset['inputs'], + self.validate_dataset['infos'], + min_score=self.min_score, + num_threads=num_threads, + description="Validating proposals (vector)") + return aggregate_vector_scores(score_dicts) + + # Determine whether to use vector scoring for selection + use_vector = (self.objective_config is not None + and self.objective_config.mode != "scalar") + # TODO perhaps we can ask for multiple updates in one query or use different temperatures in different queries # Generate different proposals step_kwargs = dict(bypassing=True, verbose='output' if verbose else False) # we don't print the inner full message @@ -582,25 +601,57 @@ def validate(): kwargs_list=[step_kwargs] * self.num_proposals, max_workers=num_threads, description=f"Generating {self.num_proposals} proposals") # async step + # Validate the proposals candidates = [] backup_dict = {p: copy.deepcopy(p.data) for p in self.agent.parameters()} # backup the current value - for update_dict in update_dicts: - if len(update_dict) == 0: - continue - self.optimizer.update(update_dict) # set the agent with update_dict - score = validate() # check the score on the validation set - candidates.append((score, update_dict)) - self.optimizer.update(backup_dict) # restore the backup - - # Include the current parameter as a candidate - if self.current_score is None: - self.current_score = validate() - candidates.append((self.current_score, backup_dict)) - - # Find the candidate with the best score - best_score, best_update = max(candidates, key=lambda x: x[0]) - self.current_score = best_score + + if use_vector: + # Vector path: collect (score_dict, update_dict) for multi-objective selection + vector_candidates = [] + for update_dict in update_dicts: + if len(update_dict) == 0: + continue + self.optimizer.update(update_dict) + score_dict = validate_vector() + scalar_score = float(np.mean(list(score_dict.values()))) + candidates.append((scalar_score, update_dict)) + vector_candidates.append((score_dict, update_dict)) + self.optimizer.update(backup_dict) + + # Include current parameters as a candidate + if self.current_score_dict is None: + self.current_score_dict = validate_vector() + if self.current_score is None: + self.current_score = float(np.mean(list(self.current_score_dict.values()))) + candidates.append((self.current_score, backup_dict)) + vector_candidates.append((self.current_score_dict, backup_dict)) + + # Select best via multi-objective config + best_idx = select_best(vector_candidates, self.objective_config) + best_score_dict = vector_candidates[best_idx][0] + best_update = vector_candidates[best_idx][1] + best_score = float(np.mean(list(best_score_dict.values()))) + self.current_score = best_score + self.current_score_dict = best_score_dict + else: + # Scalar path: unchanged from original behavior + for update_dict in update_dicts: + if len(update_dict) == 0: + continue + self.optimizer.update(update_dict) # set the agent with update_dict + score = validate() # check the score on the validation set + candidates.append((score, update_dict)) + self.optimizer.update(backup_dict) # restore the backup + + # Include the current parameter as a candidate + if self.current_score is None: + self.current_score = validate() + candidates.append((self.current_score, backup_dict)) + + # Find the candidate with the best score + best_score, best_update = max(candidates, key=lambda x: x[0]) + self.current_score = best_score if verbose: print_color(f"Best score: 
{best_score} out of scores {[c[0] for c in candidates]}", 'green') @@ -609,5 +660,11 @@ def validate(): # Make the best update self.optimizer.update(best_update) - # Logging - self.logger.log('Validation score', best_score, self.n_iters, color='green') \ No newline at end of file + # Logging — always log scalar for backward compatibility + self.logger.log('Validation score', best_score, self.n_iters, color='green') + + # Log individual vector metrics if available + if use_vector and isinstance(best_score_dict, dict): + for metric_name, metric_value in best_score_dict.items(): + self.logger.log(f'Validation score/{metric_name}', metric_value, + self.n_iters, color='green') \ No newline at end of file diff --git a/opto/trainer/evaluators.py b/opto/trainer/evaluators.py index d1e99c8e..d1271fe8 100644 --- a/opto/trainer/evaluators.py +++ b/opto/trainer/evaluators.py @@ -39,6 +39,76 @@ def _evaluate(agent, guide, i): scores = np.array(scores) if num_samples > 1: # scores will be of length N * num_samples - # Reshape scores into an array of shape (N, num_samples) + # Reshape scores into an array of shape (N, num_samples) scores = scores.reshape(N, num_samples) - return scores \ No newline at end of file + return scores + + +def evaluate_vector(agent, guide, inputs, infos, min_score=None, + num_threads=None, description=None): + """Evaluate the agent and return per-example score dicts. + + Like evaluate(), but calls guide.get_score_dict() instead of + guide.metric(), returning a list of Dict[str, float]. + + Args: + agent: The agent to evaluate + guide: The guide (must have get_score_dict method) + inputs: List of inputs to evaluate on + infos: List of additional information for each input + min_score: Fallback on exception. Dict or float (wrapped as + {"score": val}). None -> {"score": -inf}. + num_threads: Maximum threads for parallel evaluation + description: Progress bar description + + Returns: + List[Dict[str, float]] of length len(inputs) + """ + assert len(inputs) == len(infos), "Inputs and infos must have the same length" + N = len(inputs) + eval_description = description or f"Evaluating {N} examples (vector)" + + if min_score is None: + _fallback = {"score": float("-inf")} + elif isinstance(min_score, dict): + _fallback = min_score + else: + _fallback = {"score": float(min_score)} + + @batch_run(max_workers=num_threads, description=eval_description) + def _evaluate_vector(agent, guide, i): + try: + output = agent(inputs[i]).data + score_dict = guide.get_score_dict(inputs[i], output, infos[i]) + except ExecutionError: + score_dict = copy.copy(_fallback) + return score_dict + + indices = list(range(N)) + return _evaluate_vector(agent, guide, indices) + + +def aggregate_vector_scores(score_dicts): + """Compute the per-metric mean across a list of score dicts. + + Args: + score_dicts: List[Dict[str, float]] + + Returns: + Dict[str, float] with the mean value for each metric key. + Empty dict if input is empty. 
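+
+    Example (illustrative, per this implementation):
+
+        aggregate_vector_scores([{"acc": 1.0}, {"acc": 0.0, "lat": 0.5}])
+        -> {"acc": 0.5, "lat": 0.5}
+        ("lat" is averaged only over the dicts that report it.)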
+ """ + if not score_dicts: + return {} + + all_keys = set() + for sd in score_dicts: + all_keys.update(sd.keys()) + + result = {} + for key in sorted(all_keys): + values = [sd[key] for sd in score_dicts + if key in sd and sd[key] is not None] + if values: + result[key] = float(np.mean(values)) + return result \ No newline at end of file diff --git a/opto/trainer/guide.py b/opto/trainer/guide.py index 19b6d3b2..4906c831 100644 --- a/opto/trainer/guide.py +++ b/opto/trainer/guide.py @@ -47,6 +47,22 @@ def metric(self, query: str, response: str, reference: Optional[str] = None, **k """ Exact match metric """ return self.get_feedback(query, response, reference)[0] + def get_score_dict(self, query: str, response: str, reference: Optional[str] = None, **kwargs) -> Dict[str, float]: + """Return the evaluation score as a dictionary. + + Default implementation wraps the scalar from get_feedback() as + {"score": float_value}. Subclasses returning multi-metric scores + should override this method to return e.g. + {"accuracy": 0.9, "fluency": 0.8, "latency_s": 0.05}. + + If get_feedback() returns a dict as its first element, that dict + is returned directly (with values cast to float). + """ + score = self.get_feedback(query, response, reference, **kwargs)[0] + if isinstance(score, dict): + return {k: float(v) for k, v in score.items()} + return {"score": float(score)} + def copy(self): """ Create a copy of the guide instance. diff --git a/opto/trainer/objectives.py b/opto/trainer/objectives.py new file mode 100644 index 00000000..3c21ca67 --- /dev/null +++ b/opto/trainer/objectives.py @@ -0,0 +1,312 @@ +"""Multi-objective configuration and selection utilities. + +Provides ObjectiveConfig and pure functions for multi-objective candidate +selection: weighted scalarization, Pareto ranking, and backward-compatible +scalar max. + +All functions are pure (no side effects) and depend only on numpy, typing, +and dataclasses. No imports from opto.trainer to avoid circular dependencies. +""" +from dataclasses import dataclass, field +from typing import Any, Dict, List, Optional, Tuple, Union +import numpy as np + + +# --- Type aliases --- +ScalarScore = float +VectorScore = Dict[str, float] +ScoreLike = Union[int, float, bool, Dict[str, float]] + + +@dataclass(frozen=True) +class ObjectiveConfig: + """Immutable configuration for multi-objective candidate selection. + + Attributes: + mode: Selection strategy. + - "scalar": existing scalar comparison (default, backward-compatible). + - "weighted": scalarize via weighted sum, then select max. + - "pareto": Pareto dominance ranking with configurable tie-break. + weights: Per-metric weights for weighted scalarization. + Missing metrics use missing_value. Metrics not in weights are ignored. + Empty dict in weighted mode -> equal weight 1.0 for all metrics. + minimize: Frozenset of metric names where lower is better. + These are negated internally ("higher-is-better" normalization). + Users can pass a set; it is auto-converted to frozenset. + missing_value: Score assigned to missing metrics (default: -inf). + pareto_metrics: Subset of metrics for Pareto dominance. + None -> use all metrics present across candidates. + tie_break: Strategy for Pareto-equivalent candidates. + - "weighted": fall back to weighted scalarization. + - "lexicographic": sort by metric names alphabetically. + - "random_seeded": seeded random shuffle. + seed: Random seed for deterministic tie-breaking. 
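+
+    Example (illustrative):
+
+        ObjectiveConfig(
+            mode="weighted",
+            weights={"accuracy": 0.7, "latency_s": 0.3},
+            minimize={"latency_s"},  # plain set is auto-converted to frozenset
+        )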
+ """ + mode: str = "scalar" + weights: Dict[str, float] = field(default_factory=dict) + minimize: frozenset = field(default_factory=frozenset) + missing_value: float = float("-inf") + pareto_metrics: Optional[Tuple[str, ...]] = None + tie_break: str = "weighted" + seed: int = 0 + + def __post_init__(self): + if isinstance(self.minimize, set): + object.__setattr__(self, 'minimize', frozenset(self.minimize)) + if self.mode not in ("scalar", "weighted", "pareto"): + raise ValueError( + f"mode must be 'scalar', 'weighted', or 'pareto', got '{self.mode}'" + ) + if self.tie_break not in ("weighted", "lexicographic", "random_seeded"): + raise ValueError( + f"tie_break must be 'weighted', 'lexicographic', or " + f"'random_seeded', got '{self.tie_break}'" + ) + for k, v in self.weights.items(): + if v < 0: + raise ValueError(f"Weight for '{k}' must be non-negative, got {v}") + if self.pareto_metrics is not None and len(self.pareto_metrics) == 0: + raise ValueError( + "pareto_metrics must be None (auto) or non-empty tuple" + ) + + +# --------------------------------------------------------------------------- +# Pure utility functions +# --------------------------------------------------------------------------- + +def normalize_score(score: ScoreLike) -> Dict[str, float]: + """Convert any score to dict form. + + - bool/int/float -> {"score": float(value)} + - Dict[str, float] -> returned as-is (validated: all values finite) + + Raises: + TypeError: if score is not int, float, bool, or dict. + ValueError: if dict is empty or contains non-finite values. + """ + if isinstance(score, bool): + return {"score": float(score)} + if isinstance(score, (int, float)): + val = float(score) + if not np.isfinite(val): + raise ValueError(f"Score must be finite, got {score}") + return {"score": val} + if isinstance(score, dict): + if len(score) == 0: + raise ValueError("Score dict must not be empty") + for k, v in score.items(): + if not isinstance(v, (int, float)) or not np.isfinite(float(v)): + raise ValueError( + f"Score dict value for '{k}' must be finite float, got {v}" + ) + return {k: float(v) for k, v in score.items()} + raise TypeError( + f"Score must be int, float, bool, or Dict[str, float], " + f"got {type(score).__name__}" + ) + + +def apply_minimize(score_dict: Dict[str, float], + minimize: frozenset) -> Dict[str, float]: + """Negate values for minimize metrics (higher-is-better normalization). + + Returns a new dict; metrics not in *minimize* are unchanged. + """ + return {k: -v if k in minimize else v for k, v in score_dict.items()} + + +def weighted_scalarize(score_dict: Dict[str, float], + weights: Dict[str, float], + missing_value: float = float("-inf")) -> float: + """Compute weighted sum of score dict. + + If *weights* is empty, all present metrics get equal weight 1.0. + Metrics in *score_dict* but NOT in *weights* are ignored. + """ + if not weights: + weights = {k: 1.0 for k in score_dict} + total = 0.0 + for metric, weight in weights.items(): + value = score_dict.get(metric, missing_value) + total += weight * value + return total + + +def dominates(a: Dict[str, float], b: Dict[str, float], + metrics: Optional[Tuple[str, ...]] = None) -> bool: + """Check if candidate *a* Pareto-dominates candidate *b*. + + a dominates b iff: + - a[m] >= b[m] for ALL metrics m, AND + - a[m] > b[m] for AT LEAST ONE metric m + + Both dicts must be in "higher-is-better" form (post apply_minimize). 
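+
+    Example (illustrative):
+
+        dominates({"acc": 0.9, "f1": 0.8}, {"acc": 0.7, "f1": 0.8})  # True
+        dominates({"acc": 0.9, "f1": 0.5}, {"acc": 0.7, "f1": 0.8})  # False (trade-off on f1)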
+ """ + if metrics is None: + metrics = tuple(sorted(set(a.keys()) | set(b.keys()))) + at_least_one_better = False + for m in metrics: + va = a.get(m, float("-inf")) + vb = b.get(m, float("-inf")) + if va < vb: + return False + if va > vb: + at_least_one_better = True + return at_least_one_better + + +def pareto_rank(candidates: List[Dict[str, float]], + metrics: Optional[Tuple[str, ...]] = None) -> List[int]: + """Assign Pareto rank to each candidate (0 = non-dominated front). + + Uses standard non-dominated sorting. + """ + n = len(candidates) + ranks = [0] * n + remaining = set(range(n)) + current_rank = 0 + + while remaining: + front = [] + for i in remaining: + dominated = False + for j in remaining: + if i != j and dominates(candidates[j], candidates[i], metrics): + dominated = True + break + if not dominated: + front.append(i) + for i in front: + ranks[i] = current_rank + remaining.remove(i) + current_rank += 1 + + return ranks + + +def select_best(candidates: List[Tuple[ScoreLike, Any]], + config: Optional[ObjectiveConfig] = None) -> int: + """Select index of the single best candidate. + + Args: + candidates: List of (score, payload) tuples. + config: Selection config. None -> scalar max (backward-compatible). + + Returns: + Index of the best candidate. + """ + if config is None or config.mode == "scalar": + scores = [] + for score, _ in candidates: + if isinstance(score, dict): + scores.append(np.mean(list(score.values()))) + else: + scores.append(float(score)) + return int(np.argmax(scores)) + + score_dicts = [normalize_score(s) for s, _ in candidates] + score_dicts = [apply_minimize(sd, config.minimize) for sd in score_dicts] + + if config.mode == "weighted": + weighted = [ + weighted_scalarize(sd, config.weights, config.missing_value) + for sd in score_dicts + ] + return int(np.argmax(weighted)) + + if config.mode == "pareto": + ranks = pareto_rank(score_dicts, config.pareto_metrics) + front_indices = [i for i, r in enumerate(ranks) if r == 0] + + if len(front_indices) == 1: + return front_indices[0] + + # Tie-break among front + if config.tie_break == "weighted": + front_scores = [ + weighted_scalarize( + score_dicts[i], config.weights, config.missing_value + ) + for i in front_indices + ] + return front_indices[int(np.argmax(front_scores))] + + if config.tie_break == "lexicographic": + metrics = sorted(score_dicts[front_indices[0]].keys()) + + def lex_key(idx): + return tuple( + score_dicts[idx].get(m, config.missing_value) for m in metrics + ) + + return max(front_indices, key=lex_key) + + if config.tie_break == "random_seeded": + rng = np.random.RandomState(config.seed) + return front_indices[rng.randint(len(front_indices))] + + raise ValueError(f"Unknown mode: {config.mode}") + + +def select_top_k(candidates: List[Tuple[ScoreLike, Any]], + config: Optional[ObjectiveConfig] = None, + k: int = 1) -> List[int]: + """Select the top-k candidate indices. + + Same logic as select_best but returns *k* indices. + For Pareto mode: rank-0 front first (up to k), then rank-1, etc. 
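+
+    Example (illustrative, scalar path):
+
+        select_top_k([(0.5, "A"), (0.9, "B"), (0.7, "C")], None, k=2)
+        -> indices [1, 2] ("B" first, then "C")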
+ """ + if config is None or config.mode == "scalar": + scores = [] + for score, _ in candidates: + if isinstance(score, dict): + scores.append(np.mean(list(score.values()))) + else: + scores.append(float(score)) + return list(np.argsort(scores)[::-1][:k]) + + score_dicts = [normalize_score(s) for s, _ in candidates] + score_dicts = [apply_minimize(sd, config.minimize) for sd in score_dicts] + + if config.mode == "weighted": + weighted = [ + weighted_scalarize(sd, config.weights, config.missing_value) + for sd in score_dicts + ] + return list(np.argsort(weighted)[::-1][:k]) + + if config.mode == "pareto": + ranks = pareto_rank(score_dicts, config.pareto_metrics) + result: List[int] = [] + max_rank = max(ranks) + for rank in range(max_rank + 1): + rank_indices = [i for i, r in enumerate(ranks) if r == rank] + if config.tie_break == "weighted": + rank_indices.sort( + key=lambda i: weighted_scalarize( + score_dicts[i], config.weights, config.missing_value + ), + reverse=True, + ) + elif config.tie_break == "lexicographic": + metrics = ( + sorted(score_dicts[rank_indices[0]].keys()) + if rank_indices else [] + ) + rank_indices.sort( + key=lambda i: tuple( + score_dicts[i].get(m, config.missing_value) + for m in metrics + ), + reverse=True, + ) + elif config.tie_break == "random_seeded": + rng = np.random.RandomState(config.seed + rank) + rng.shuffle(rank_indices) + result.extend(rank_indices) + if len(result) >= k: + break + return result[:k] + + raise ValueError(f"Unknown mode: {config.mode}") diff --git a/tests/unit_tests/test_evaluators_vector.py b/tests/unit_tests/test_evaluators_vector.py new file mode 100644 index 00000000..61cfa1f1 --- /dev/null +++ b/tests/unit_tests/test_evaluators_vector.py @@ -0,0 +1,154 @@ +"""Tests for evaluate_vector and aggregate_vector_scores in opto.trainer.evaluators.""" +import pytest +import numpy as np +from opto import trace +from opto.trainer.evaluators import evaluate_vector, aggregate_vector_scores +from opto.trainer.guide import Guide + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +@trace.model +class SimpleAgent: + """Deterministic agent: returns input + param.""" + def __init__(self, param): + self.param = trace.node(param, trainable=True) + + def forward(self, x): + return x + self.param + + +class MultiMetricGuide(Guide): + """Guide returning multi-metric score dict.""" + def __init__(self, target): + super().__init__() + self.target = target + + def get_feedback(self, query, response, reference=None, **kwargs): + accuracy = float(response == self.target) + brevity = 1.0 / max(abs(response - self.target) + 1, 1) + feedback = f"response={response}, target={self.target}" + return accuracy, feedback + + def get_score_dict(self, query, response, reference=None, **kwargs): + accuracy = float(response == self.target) + brevity = 1.0 / max(abs(response - self.target) + 1, 1) + return {"accuracy": accuracy, "brevity": brevity} + + +class ScalarGuide(Guide): + """Guide using only scalar get_feedback (no get_score_dict override).""" + def __init__(self, target): + super().__init__() + self.target = target + + def get_feedback(self, query, response, reference=None, **kwargs): + score = float(response == self.target) + feedback = f"response={response}" + return score, feedback + + +# --------------------------------------------------------------------------- +# evaluate_vector +# 
--------------------------------------------------------------------------- + +def test_evaluate_vector_basic(): + """evaluate_vector returns a list of dicts with correct metric values.""" + agent = SimpleAgent(10) + guide = MultiMetricGuide(target=11) + inputs = [1, 2, 3] + infos = [None, None, None] + + results = evaluate_vector(agent, guide, inputs, infos, num_threads=1) + + assert len(results) == 3 + assert isinstance(results[0], dict) + # input=1 + param=10 = 11 == target=11 -> accuracy=1.0, brevity=1.0 + assert results[0]["accuracy"] == 1.0 + assert results[0]["brevity"] == 1.0 + # input=2 + param=10 = 12 != target=11 -> accuracy=0.0 + assert results[1]["accuracy"] == 0.0 + assert results[1]["brevity"] == pytest.approx(0.5) # 1/(|12-11|+1) = 0.5 + # input=3 + param=10 = 13 != target=11 -> accuracy=0.0 + assert results[2]["accuracy"] == 0.0 + assert results[2]["brevity"] == pytest.approx(1.0 / 3.0) # 1/(|13-11|+1) + + +def test_evaluate_vector_all_keys_present(): + """Every result dict contains the same set of metric keys.""" + agent = SimpleAgent(5) + guide = MultiMetricGuide(target=10) + inputs = [1, 2, 3, 4, 5] + infos = [None] * 5 + + results = evaluate_vector(agent, guide, inputs, infos, num_threads=1) + + expected_keys = {"accuracy", "brevity"} + for rd in results: + assert set(rd.keys()) == expected_keys + + +def test_evaluate_vector_scalar_guide_fallback(): + """Guide without get_score_dict override returns {"score": float}.""" + agent = SimpleAgent(10) + guide = ScalarGuide(target=11) + inputs = [1, 2] + infos = [None, None] + + results = evaluate_vector(agent, guide, inputs, infos, num_threads=1) + + assert len(results) == 2 + # input=1 + param=10 = 11 == target=11 -> score=1.0 + assert results[0] == {"score": 1.0} + # input=2 + param=10 = 12 != target=11 -> score=0.0 + assert results[1] == {"score": 0.0} + + +def test_evaluate_vector_empty_inputs(): + """Empty inputs produce empty results.""" + agent = SimpleAgent(0) + guide = MultiMetricGuide(target=0) + + results = evaluate_vector(agent, guide, [], [], num_threads=1) + assert results == [] + + +# --------------------------------------------------------------------------- +# aggregate_vector_scores +# --------------------------------------------------------------------------- + +def test_aggregate_basic(): + """Per-metric mean is computed correctly.""" + score_dicts = [ + {"accuracy": 1.0, "brevity": 0.5}, + {"accuracy": 0.0, "brevity": 1.0}, + ] + agg = aggregate_vector_scores(score_dicts) + assert agg["accuracy"] == pytest.approx(0.5) + assert agg["brevity"] == pytest.approx(0.75) + + +def test_aggregate_empty(): + """Empty input returns empty dict.""" + assert aggregate_vector_scores([]) == {} + + +def test_aggregate_single(): + """Single dict returns the same values.""" + score_dicts = [{"a": 0.42, "b": 0.99}] + agg = aggregate_vector_scores(score_dicts) + assert agg == {"a": pytest.approx(0.42), "b": pytest.approx(0.99)} + + +def test_aggregate_missing_keys(): + """Handles dicts with partially overlapping keys.""" + score_dicts = [ + {"accuracy": 1.0}, + {"accuracy": 0.0, "brevity": 0.8}, + ] + agg = aggregate_vector_scores(score_dicts) + assert agg["accuracy"] == pytest.approx(0.5) + # brevity only present in one dict + assert agg["brevity"] == pytest.approx(0.8) diff --git a/tests/unit_tests/test_objectives.py b/tests/unit_tests/test_objectives.py new file mode 100644 index 00000000..04fbccc2 --- /dev/null +++ b/tests/unit_tests/test_objectives.py @@ -0,0 +1,383 @@ +"""Tests for opto.trainer.objectives — 
ObjectiveConfig and selection utilities.""" +import pytest +import numpy as np +from opto.trainer.objectives import ( + ObjectiveConfig, normalize_score, apply_minimize, weighted_scalarize, + dominates, pareto_rank, select_best, select_top_k, +) + + +# --------------------------------------------------------------------------- +# normalize_score +# --------------------------------------------------------------------------- + +def test_normalize_score_float(): + assert normalize_score(0.85) == {"score": 0.85} + + +def test_normalize_score_zero(): + assert normalize_score(0.0) == {"score": 0.0} + + +def test_normalize_score_int(): + assert normalize_score(1) == {"score": 1.0} + + +def test_normalize_score_int_zero(): + assert normalize_score(0) == {"score": 0.0} + + +def test_normalize_score_bool_true(): + assert normalize_score(True) == {"score": 1.0} + + +def test_normalize_score_bool_false(): + assert normalize_score(False) == {"score": 0.0} + + +def test_normalize_score_dict(): + result = normalize_score({"acc": 0.9, "lat": 50.0}) + assert result == {"acc": 0.9, "lat": 50.0} + + +def test_normalize_score_dict_with_int_values(): + result = normalize_score({"acc": 1, "lat": 0}) + assert result == {"acc": 1.0, "lat": 0.0} + + +def test_normalize_score_empty_dict_raises(): + with pytest.raises(ValueError, match="must not be empty"): + normalize_score({}) + + +def test_normalize_score_nan_raises(): + with pytest.raises(ValueError, match="finite"): + normalize_score({"a": float("nan")}) + + +def test_normalize_score_inf_raises(): + with pytest.raises(ValueError, match="finite"): + normalize_score(float("inf")) + + +def test_normalize_score_neg_inf_raises(): + with pytest.raises(ValueError, match="finite"): + normalize_score(float("-inf")) + + +def test_normalize_score_string_raises(): + with pytest.raises(TypeError, match="str"): + normalize_score("bad") + + +def test_normalize_score_none_raises(): + with pytest.raises(TypeError): + normalize_score(None) + + +# --------------------------------------------------------------------------- +# apply_minimize +# --------------------------------------------------------------------------- + +def test_apply_minimize_negates(): + result = apply_minimize({"acc": 0.9, "lat": 100.0}, frozenset({"lat"})) + assert result == {"acc": 0.9, "lat": -100.0} + + +def test_apply_minimize_empty_set(): + result = apply_minimize({"acc": 0.9, "lat": 100.0}, frozenset()) + assert result == {"acc": 0.9, "lat": 100.0} + + +def test_apply_minimize_all(): + result = apply_minimize({"a": 1.0, "b": 2.0}, frozenset({"a", "b"})) + assert result == {"a": -1.0, "b": -2.0} + + +# --------------------------------------------------------------------------- +# weighted_scalarize +# --------------------------------------------------------------------------- + +def test_weighted_scalarize_basic(): + result = weighted_scalarize({"a": 0.8, "b": 0.2}, {"a": 0.7, "b": 0.3}) + assert result == pytest.approx(0.7 * 0.8 + 0.3 * 0.2) + + +def test_weighted_scalarize_empty_weights(): + result = weighted_scalarize({"a": 1.0, "b": 2.0}, {}) + assert result == pytest.approx(3.0) # equal weight 1.0 each + + +def test_weighted_scalarize_missing_metric(): + result = weighted_scalarize({"a": 1.0}, {"a": 0.5, "b": 0.5}, missing_value=0.0) + assert result == pytest.approx(0.5) # 0.5*1.0 + 0.5*0.0 + + +def test_weighted_scalarize_ignores_extra_metrics(): + result = weighted_scalarize({"a": 1.0, "b": 2.0, "c": 99.0}, {"a": 1.0}) + assert result == pytest.approx(1.0) # only "a" is weighted + + +# 
--------------------------------------------------------------------------- +# dominates +# --------------------------------------------------------------------------- + +def test_dominates_yes(): + assert dominates({"a": 2.0, "b": 2.0}, {"a": 1.0, "b": 1.0}) is True + + +def test_dominates_yes_one_equal(): + assert dominates({"a": 2.0, "b": 1.0}, {"a": 1.0, "b": 1.0}) is True + + +def test_dominates_no_equal(): + assert dominates({"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0}) is False + + +def test_dominates_no_tradeoff(): + assert dominates({"a": 2.0, "b": 0.5}, {"a": 1.0, "b": 1.0}) is False + + +def test_dominates_with_metric_subset(): + assert dominates({"a": 2.0, "b": 0.5}, {"a": 1.0, "b": 1.0}, + metrics=("a",)) is True + + +# --------------------------------------------------------------------------- +# pareto_rank +# --------------------------------------------------------------------------- + +def test_pareto_rank_clear_hierarchy(): + candidates = [ + {"a": 3.0, "b": 3.0}, # dominates everything -> rank 0 + {"a": 2.0, "b": 2.0}, # dominated by [0] -> rank 1 + {"a": 1.0, "b": 1.0}, # dominated by [0],[1] -> rank 2 + ] + ranks = pareto_rank(candidates) + assert ranks == [0, 1, 2] + + +def test_pareto_rank_all_nondominated(): + candidates = [ + {"a": 3.0, "b": 1.0}, + {"a": 1.0, "b": 3.0}, + {"a": 2.0, "b": 2.0}, + ] + ranks = pareto_rank(candidates) + # All are tradeoffs — none dominates another + assert ranks == [0, 0, 0] + + +def test_pareto_rank_mixed(): + candidates = [ + {"a": 3.0, "b": 1.0}, # front 0 + {"a": 1.0, "b": 3.0}, # front 0 + {"a": 0.5, "b": 0.5}, # dominated by both -> rank 1 + ] + ranks = pareto_rank(candidates) + assert ranks[0] == 0 + assert ranks[1] == 0 + assert ranks[2] == 1 + + +# --------------------------------------------------------------------------- +# select_best +# --------------------------------------------------------------------------- + +def test_select_best_none_config(): + candidates = [(0.5, "A"), (0.9, "B"), (0.7, "C")] + assert select_best(candidates, None) == 1 + + +def test_select_best_scalar_mode(): + config = ObjectiveConfig(mode="scalar") + candidates = [(0.5, "A"), (0.9, "B"), (0.7, "C")] + assert select_best(candidates, config) == 1 + + +def test_select_best_scalar_with_dict_scores(): + """Scalar mode with dict scores uses mean of values.""" + config = ObjectiveConfig(mode="scalar") + candidates = [ + ({"a": 0.5, "b": 0.3}, "X"), # mean = 0.4 + ({"a": 0.8, "b": 0.6}, "Y"), # mean = 0.7 + ] + assert select_best(candidates, config) == 1 + + +def test_select_best_weighted(): + config = ObjectiveConfig( + mode="weighted", + weights={"accuracy": 0.8, "latency_s": 0.2}, + minimize=frozenset({"latency_s"}), + ) + candidates = [ + ({"accuracy": 0.95, "latency_s": 0.200}, "A"), # 0.8*0.95 + 0.2*(-0.2) = 0.72 + ({"accuracy": 0.70, "latency_s": 0.030}, "B"), # 0.8*0.70 + 0.2*(-0.03) = 0.554 + ] + assert select_best(candidates, config) == 0 + + +def test_select_best_weighted_latency_heavy(): + config = ObjectiveConfig( + mode="weighted", + weights={"accuracy": 0.2, "latency_s": 0.8}, + minimize=frozenset({"latency_s"}), + ) + candidates = [ + ({"accuracy": 0.95, "latency_s": 0.200}, "A"), # 0.2*0.95 + 0.8*(-0.2) = 0.03 + ({"accuracy": 0.70, "latency_s": 0.030}, "B"), # 0.2*0.70 + 0.8*(-0.03) = 0.116 + ] + assert select_best(candidates, config) == 1 + + +def test_select_best_pareto_tiebreak_weighted(): + config = ObjectiveConfig( + mode="pareto", + weights={"a": 0.5, "b": 0.5}, + tie_break="weighted", + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, 
"X"), # front 0, weighted = 0.5 + ({"a": 0.1, "b": 0.9}, "Y"), # front 0, weighted = 0.5 + ({"a": 0.6, "b": 0.6}, "Z"), # front 0, weighted = 0.6 -> winner + ] + assert select_best(candidates, config) == 2 + + +def test_select_best_pareto_deterministic(): + config = ObjectiveConfig( + mode="pareto", + weights={"a": 0.5, "b": 0.5}, + tie_break="weighted", + seed=42, + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, "X"), + ({"a": 0.1, "b": 0.9}, "Y"), + ] + results = [select_best(candidates, config) for _ in range(10)] + assert len(set(results)) == 1 # same result every time + + +def test_select_best_pareto_random_seeded_deterministic(): + config = ObjectiveConfig( + mode="pareto", + tie_break="random_seeded", + seed=42, + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, "X"), + ({"a": 0.1, "b": 0.9}, "Y"), + ] + results = [select_best(candidates, config) for _ in range(20)] + assert len(set(results)) == 1 + + +def test_select_best_pareto_different_seeds_may_differ(): + results = set() + for seed in range(50): + config = ObjectiveConfig( + mode="pareto", + tie_break="random_seeded", + seed=seed, + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, "X"), + ({"a": 0.1, "b": 0.9}, "Y"), + ] + results.add(select_best(candidates, config)) + # With 50 different seeds across 2 candidates, we expect both to appear + assert len(results) == 2 + + +# --------------------------------------------------------------------------- +# select_top_k +# --------------------------------------------------------------------------- + +def test_select_top_k_scalar_none_config(): + candidates = [(0.5, "A"), (0.9, "B"), (0.7, "C")] + indices = select_top_k(candidates, None, k=2) + assert len(indices) == 2 + assert indices[0] == 1 # B is best + assert indices[1] == 2 # C is second + + +@pytest.mark.parametrize("k", [1, 2, 3]) +def test_select_top_k_scalar_k(k): + candidates = [(0.5, "A"), (0.9, "B"), (0.7, "C")] + indices = select_top_k(candidates, None, k=k) + assert len(indices) == k + assert indices[0] == 1 # B always best + + +def test_select_top_k_weighted(): + config = ObjectiveConfig( + mode="weighted", + weights={"a": 1.0, "b": 1.0}, + ) + candidates = [ + ({"a": 0.5, "b": 0.5}, "X"), # weighted = 1.0 + ({"a": 0.9, "b": 0.1}, "Y"), # weighted = 1.0 + ({"a": 0.8, "b": 0.8}, "Z"), # weighted = 1.6 + ] + indices = select_top_k(candidates, config, k=2) + assert indices[0] == 2 # Z is best + + +def test_select_top_k_pareto(): + config = ObjectiveConfig( + mode="pareto", + weights={"a": 0.5, "b": 0.5}, + tie_break="weighted", + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, "X"), # front 0 + ({"a": 0.1, "b": 0.9}, "Y"), # front 0 + ({"a": 0.05, "b": 0.05}, "Z"), # front 1 (dominated) + ] + indices = select_top_k(candidates, config, k=2) + assert set(indices) == {0, 1} # both front-0 candidates + + +# --------------------------------------------------------------------------- +# ObjectiveConfig validation +# --------------------------------------------------------------------------- + +def test_config_default(): + config = ObjectiveConfig() + assert config.mode == "scalar" + assert config.weights == {} + assert config.minimize == frozenset() + + +def test_config_set_to_frozenset(): + config = ObjectiveConfig(minimize={"lat"}) + assert isinstance(config.minimize, frozenset) + assert "lat" in config.minimize + + +def test_config_negative_weight_raises(): + with pytest.raises(ValueError, match="non-negative"): + ObjectiveConfig(weights={"a": -1.0}) + + +def test_config_bad_mode_raises(): + with pytest.raises(ValueError, match="mode"): + 
ObjectiveConfig(mode="unknown") + + +def test_config_bad_tie_break_raises(): + with pytest.raises(ValueError, match="tie_break"): + ObjectiveConfig(tie_break="bad") + + +def test_config_empty_pareto_metrics_raises(): + with pytest.raises(ValueError, match="non-empty"): + ObjectiveConfig(pareto_metrics=()) + + +def test_config_frozen(): + config = ObjectiveConfig() + with pytest.raises(AttributeError): + config.mode = "weighted" From 45901029613159d83f18819ff6062541d316d734 Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Thu, 12 Feb 2026 12:18:05 -0400 Subject: [PATCH 5/5] T6 M1: Fix Colab install cell for Python 3.12 compatibility --- examples/notebooks/t6_m1_vector_scores.ipynb | 576 +++++++++++++++++-- 1 file changed, 530 insertions(+), 46 deletions(-) diff --git a/examples/notebooks/t6_m1_vector_scores.ipynb b/examples/notebooks/t6_m1_vector_scores.ipynb index 637322d0..6363aee2 100644 --- a/examples/notebooks/t6_m1_vector_scores.ipynb +++ b/examples/notebooks/t6_m1_vector_scores.ipynb @@ -6,23 +6,7 @@ "id": "a0000001", "metadata": {}, "outputs": [], - "source": [ - "\"\"\"\n", - "T6 Milestone 1 — Multi-Objective Vector Scores\n", - "\n", - "This notebook is the M1 deliverable for the T6 Multi-Objective Vector Scores project.\n", - "It demonstrates:\n", - " 1. ObjectiveConfig creation and validation\n", - " 2. MultiMetricGuide with get_score_dict()\n", - " 3. evaluate_vector() + aggregate_vector_scores()\n", - " 4. Full BasicSearchAlgorithm.train() with DummyLLM + objective_config\n", - " 5. Scalar baseline comparison (backward compat)\n", - " 6. Pareto mode demo + deterministic tiebreak\n", - "\n", - "Part A runs end-to-end WITHOUT API keys (StubLLM / DummyLLM).\n", - "Part B requires an OpenRouter API key (Colab secrets or environment variable).\n", - "\"\"\"" - ] + "source": "!git clone https://github.com/carlosrod723/OpenTrace.git Trace\n%cd Trace\n!git checkout t6-multi-objective-m0\n!sed -i 's/python_requires=\">=3.13\"/python_requires=\">=3.12\"/' setup.py\n!pip install -e ." 
}, { "cell_type": "markdown", @@ -72,7 +56,7 @@ "id": "a0000004", "metadata": {}, "outputs": [], - "source": "import sys, os\n\n# Ensure OpenTrace root is on the path (needed when running from examples/notebooks/)\n_repo_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))\nif os.path.isdir(os.path.join(_repo_root, 'opto')):\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n# Also handle running directly from the repo root\nif os.path.isdir(os.path.join(os.getcwd(), 'opto')):\n if os.getcwd() not in sys.path:\n sys.path.insert(0, os.getcwd())\n\nimport numpy as np\nfrom typing import Dict, Tuple, Optional\n\nprint(\"=\" * 70)\nprint(\"T6 M1 \\u2014 Multi-Objective Vector Scores\")\nprint(\"=\" * 70)" + "source": "import numpy as np\nfrom typing import Dict, Tuple, Optional\n\nprint(\"=\" * 70)\nprint(\"T6 M1 \\u2014 Multi-Objective Vector Scores\")\nprint(\"=\" * 70)" }, { "cell_type": "markdown", @@ -87,10 +71,41 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "a0000006", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- ObjectiveConfig defaults ---\n", + " mode=scalar, weights={}, minimize=frozenset()\n", + "\n", + "--- ObjectiveConfig: weighted mode ---\n", + " mode=weighted\n", + " weights={'accuracy': 0.8, 'latency_s': 0.2}\n", + " minimize=frozenset({'latency_s'})\n", + "\n", + "--- ObjectiveConfig: Pareto mode ---\n", + " mode=pareto, tie_break=weighted, seed=42\n", + "\n", + "--- ObjectiveConfig: set auto-converts to frozenset ---\n", + " type(minimize)=frozenset (auto-converted from set)\n", + "\n", + "--- Validation: negative weight ---\n", + " Caught: Weight for 'a' must be non-negative, got -0.5\n", + "\n", + "--- Validation: bad mode ---\n", + " Caught: mode must be 'scalar', 'weighted', or 'pareto', got 'unknown'\n", + "\n", + "--- Frozen (immutable) ---\n", + " Caught: cannot assign to field 'mode'\n", + "\n", + "ObjectiveConfig validation: all checks passed.\n" + ] + } + ], "source": [ "from opto.trainer.objectives import (\n", " ObjectiveConfig, normalize_score, apply_minimize,\n", @@ -157,10 +172,30 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "a0000008", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- Training path: get_feedback() -> (float, str) ---\n", + " score=1.0 (type=float)\n", + " feedback='Expected '4', got '4'. 
Correct!'\n", + "\n", + "--- Selection path: get_score_dict() -> Dict[str, float] ---\n", + " score_dict={'accuracy': 1.0, 'brevity': 1.0}\n", + "\n", + "--- metric() still returns float (backward compat) ---\n", + " metric()=1.0 (type=float)\n", + "\n", + "--- Base Guide without get_score_dict override wraps scalar ---\n", + " get_score_dict()={'score': 0.75}\n", + " (wrapped as {{'score': 0.75}} automatically)\n" + ] + } + ], "source": [ "from opto.trainer.guide import Guide\n", "\n", @@ -228,10 +263,32 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "id": "a0000010", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- evaluate_vector() ---\n", + "Evaluating 3 examples (vector) (Running sequentially).\n", + " Example 0: {'accuracy': 1.0, 'brevity': 1.0}\n", + " Example 1: {'accuracy': 1.0, 'brevity': 1.0}\n", + " Example 2: {'accuracy': 1.0, 'brevity': 1.0}\n", + "\n", + "--- aggregate_vector_scores() ---\n", + " Aggregated (per-metric mean): {'accuracy': 1.0, 'brevity': 1.0}\n", + "\n", + "--- Wrong answer agent ---\n", + "Evaluating 3 examples (vector) (Running sequentially).\n", + " Example 0: {'accuracy': 0.0, 'brevity': 0.25}\n", + " Example 1: {'accuracy': 0.0, 'brevity': 0.25}\n", + " Example 2: {'accuracy': 0.0, 'brevity': 0.25}\n", + " Aggregated: {'accuracy': 0.0, 'brevity': 0.25}\n" + ] + } + ], "source": [ "from opto import trace\n", "from opto.trainer.evaluators import evaluate_vector, aggregate_vector_scores\n", @@ -282,10 +339,43 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "a0000012", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Candidates:\n", + " prompt_A: {'accuracy': 0.95, 'latency_s': 0.2}\n", + " prompt_B: {'accuracy': 0.7, 'latency_s': 0.03}\n", + " prompt_C: {'accuracy': 0.88, 'latency_s': 0.08}\n", + " prompt_D: {'accuracy': 0.6, 'latency_s': 0.02}\n", + "\n", + "--- select_best(config=None) [scalar, backward-compat] ---\n", + " Winner: prompt_A (index 0)\n", + "\n", + "--- select_best(weighted, accuracy=0.8) ---\n", + " Winner: prompt_A (index 0)\n", + "\n", + "--- select_best(weighted, latency_s=0.8) ---\n", + " Winner: prompt_B (index 1)\n", + "\n", + "--- select_best(pareto, tie_break=weighted) ---\n", + " Pareto ranks: [0, 0, 0, 0]\n", + " Front (rank 0): ['prompt_A', 'prompt_B', 'prompt_C', 'prompt_D']\n", + " Winner (after tie-break): prompt_C (index 2)\n", + "\n", + "--- Determinism: 10 runs with same config ---\n", + " Results: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]\n", + " All identical: True\n", + "\n", + "--- select_top_k(pareto, k=2) ---\n", + " Top 2: ['prompt_C', 'prompt_A']\n" + ] + } + ], "source": [ "# Candidates: (score_dict, payload) tuples\n", "candidates = [\n", @@ -361,10 +451,55 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "id": "a0000014", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "TRAINING: Scalar baseline (objective_config=None)\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating 
proposals (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.0\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:2: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.0\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 2\n", + "[Step 2] Instantaneous train score: 0.0\n", + "[Step 2] Average train score: 0.0\n", + "[Step 2] \u001b[91mParameter: str:2: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.0\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 0.0\n", + "[Step 3] Average train score: 0.0\n", + "[Step 3] \u001b[91mParameter: str:2: You are a helpful assistant.\u001b[0m\n", + "\n", + "Scalar training scores: [np.float64(0.0), np.float64(0.0), np.float64(0.0)]\n", + "current_score: 0.0\n", + "current_score_dict: None\n", + "(current_score_dict is None because scalar mode does not use vector path)\n" + ] + } + ], "source": [ "from opto.utils.llm import DummyLLM\n", "from opto.optimizers import OptoPrimeV2\n", @@ -443,10 +578,61 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "id": "a0000016", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "TRAINING: Weighted mode (objective_config.mode='weighted')\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:9: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. 
Iteration: 2\n", + "[Step 2] Instantaneous train score: 0.0\n", + "[Step 2] Average train score: 0.0\n", + "[Step 2] \u001b[91mParameter: str:9: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 0.0\n", + "[Step 3] Average train score: 0.0\n", + "[Step 3] \u001b[91mParameter: str:9: You are a helpful assistant.\u001b[0m\n", + "\n", + "Weighted training scores: [np.float64(0.0), np.float64(0.0), np.float64(0.0)]\n", + "current_score (float): 0.00819672131147541\n", + "current_score_dict: {'accuracy': 0.0, 'brevity': 0.01639344262295082}\n", + "(current_score_dict stores the vector score selected by weighted mode)\n" + ] + } + ], "source": [ "print(\"=\" * 70)\n", "print(\"TRAINING: Weighted mode (objective_config.mode='weighted')\")\n", @@ -490,10 +676,101 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "id": "a0000018", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "TRAINING: Pareto mode (objective_config.mode='pareto')\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:16: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. 
Iteration: 2\n", + "[Step 2] Instantaneous train score: 0.0\n", + "[Step 2] Average train score: 0.0\n", + "[Step 2] \u001b[91mParameter: str:16: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 0.0\n", + "[Step 3] Average train score: 0.0\n", + "[Step 3] \u001b[91mParameter: str:16: You are a helpful assistant.\u001b[0m\n", + "\n", + "Pareto training scores: [np.float64(0.0), np.float64(0.0), np.float64(0.0)]\n", + "current_score (float): 0.00819672131147541\n", + "current_score_dict: {'accuracy': 0.0, 'brevity': 0.01639344262295082}\n", + "\n", + "--- Determinism: re-run with same seed ---\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:23: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 2\n", + "[Step 2] Instantaneous train score: 0.0\n", + "[Step 2] Average train score: 0.0\n", + "[Step 2] \u001b[91mParameter: str:23: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. 
Iteration: 3\n", + "[Step 3] Instantaneous train score: 0.0\n", + "[Step 3] Average train score: 0.0\n", + "[Step 3] \u001b[91mParameter: str:23: You are a helpful assistant.\u001b[0m\n", + "Run 1 current_score_dict: {'accuracy': 0.0, 'brevity': 0.01639344262295082}\n", + "Run 2 current_score_dict: {'accuracy': 0.0, 'brevity': 0.01639344262295082}\n", + "Deterministic: True\n" + ] + } + ], "source": [ "print(\"=\" * 70)\n", "print(\"TRAINING: Pareto mode (objective_config.mode='pareto')\")\n", @@ -559,10 +836,34 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "id": "a0000020", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "PART A COMPLETE — StubLLM Section\n", + "======================================================================\n", + "\n", + "Verified:\n", + " ✓ ObjectiveConfig creation, validation, and immutability\n", + " ✓ MultiMetricGuide: get_feedback() -> (float, str) for training loop\n", + " ✓ MultiMetricGuide: get_score_dict() -> Dict[str, float] for selection path\n", + " ✓ evaluate_vector() returns List[Dict[str, float]]\n", + " ✓ aggregate_vector_scores() computes per-metric means\n", + " ✓ select_best(): scalar, weighted, Pareto modes all work\n", + " ✓ BasicSearch training: scalar baseline (objective_config=None)\n", + " ✓ BasicSearch training: weighted mode with vector score selection\n", + " ✓ BasicSearch training: Pareto mode with deterministic tie-break\n", + " ✓ current_score stays float, current_score_dict stores vector\n", + "\n" + ] + } + ], "source": [ "print(\"\\n\" + \"=\" * 70)\n", "print(\"PART A COMPLETE — StubLLM Section\")\n", @@ -601,10 +902,19 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "id": "a0000022", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "API key loaded from environment variable.\n", + "CustomLLM configured for OpenRouter (google/gemini-2.0-flash-001).\n" + ] + } + ], "source": [ "import os\n", "\n", @@ -649,10 +959,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "id": "a0000023", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- Smoke test: real LLM call ---\n", + " Response: 4\n", + "\n", + " LLM connection verified.\n" + ] + } + ], "source": [ "# Skip this cell if no API key\n", "if not api_key:\n", @@ -673,10 +994,74 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "id": "a0000024", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "REAL LLM TRAINING: Weighted mode with multi-metric guide\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 0] \u001b[92mValidation 
score/accuracy: 1.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 0) (Running sequentially).\n", + "\u001b[92mUpdate accepted: Current score 0.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:30: You are a helpful assistant. Your task is to calculate the answer to the question. You should respond with the numerical answer only.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 1) (Running sequentially).\n", + "\u001b[91mUpdate rejected: Current score 1.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 2\n", + "[Step 2] Instantaneous train score: 1.0\n", + "[Step 2] Average train score: 0.5\n", + "[Step 2] \u001b[91mParameter: str:30: You are a helpful assistant. Your task is to calculate the answer to the question. You should respond with the numerical answer only.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 2) (Running sequentially).\n", + "\u001b[91mUpdate rejected: Current score 1.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 1.0\n", + "[Step 3] Average train score: 0.6666666666666666\n", + "[Step 3] \u001b[91mParameter: str:30: You are a helpful assistant. Your task is to calculate the answer to the question. You should respond with the numerical answer only.\u001b[0m\n", + "\n", + "Real LLM training scores: [np.float64(0.0), np.float64(1.0), np.float64(1.0)]\n", + "current_score (float): 0.75\n", + "current_score_dict: {'accuracy': 1.0, 'brevity': 0.5}\n", + "\n", + "Final system prompt: You are a helpful assistant. Your task is to calculate the answer to the question. 
You should respond with the numerical answer only.\n" + ] + } + ], "source": [ "# Real LLM training with weighted multi-objective selection\n", "if not api_key:\n", @@ -722,10 +1107,75 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "id": "a0000025", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "REAL LLM TRAINING: Pareto mode for comparison\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 0) (Running sequentially).\n", + "\u001b[92mUpdate accepted: Current score 0.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:37: You are a helpful assistant. Your task is to answer math questions. You should only provide the numerical answer without any explanation or problem description.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 1) (Running sequentially).\n", + "\u001b[91mUpdate rejected: Current score 1.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 2\n", + "[Step 2] Instantaneous train score: 1.0\n", + "[Step 2] Average train score: 0.5\n", + "[Step 2] \u001b[91mParameter: str:37: You are a helpful assistant. Your task is to answer math questions. 
You should only provide the numerical answer without any explanation or problem description.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.8333333333333333\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.6666666666666666\u001b[0m\n", + "Checking improvement (iteration 2) (Running sequentially).\n", + "\u001b[91mUpdate rejected: Current score 1.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 1.0\n", + "[Step 3] Average train score: 0.6666666666666666\n", + "[Step 3] \u001b[91mParameter: str:37: You are a helpful assistant. Your task is to answer math questions. You should only provide the numerical answer without any explanation or problem description.\u001b[0m\n", + "\n", + "Pareto training scores: [np.float64(0.0), np.float64(1.0), np.float64(1.0)]\n", + "current_score_dict: {'accuracy': 1.0, 'brevity': 0.6666666666666666}\n", + "\n", + "--- Comparison ---\n", + "Weighted mode final: {'accuracy': 1.0, 'brevity': 0.5}\n", + "Pareto mode final: {'accuracy': 1.0, 'brevity': 0.6666666666666666}\n" + ] + } + ], "source": [ "# Real LLM: Pareto mode comparison\n", "if not api_key:\n", @@ -768,10 +1218,36 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "id": "a0000026", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "M1 NOTEBOOK COMPLETE\n", + "======================================================================\n", + "\n", + "Deliverables verified:\n", + " ✓ Part A (StubLLM): All cells run without API keys\n", + " - ObjectiveConfig creation + validation\n", + " - MultiMetricGuide with get_score_dict()\n", + " - evaluate_vector() + aggregate_vector_scores()\n", + " - BasicSearch: scalar, weighted, and Pareto modes\n", + " - Backward compatibility (objective_config=None)\n", + " - Deterministic tie-break verification\n", + "\n", + " ✓ Part B (Real LLM): Trained with actual model via OpenRouter\n", + " - Weighted and Pareto mode with real LLM proposals\n", + " - Multi-metric selection (accuracy + brevity)\n", + " - current_score_dict populated with real scores\n", + "\n" + ] + } + ], "source": [ "print(\"\\n\" + \"=\" * 70)\n", "print(\"M1 NOTEBOOK COMPLETE\")\n", @@ -796,13 +1272,21 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", "name": "python", - "version": "3.11.0" + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" } }, "nbformat": 4,