From 95069214f6b09063969546ca21decf3d3c07c0b2 Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Mon, 9 Feb 2026 16:10:38 -0400 Subject: [PATCH 1/5] T6 M0: Technical plan + analysis notebook for multi-objective vector scores --- docs/T6_technical_plan.md | 837 +++++++++++++++++++++ examples/notebooks/t6_m0_analysis.ipynb | 950 ++++++++++++++++++++++++ 2 files changed, 1787 insertions(+) create mode 100644 docs/T6_technical_plan.md create mode 100644 examples/notebooks/t6_m0_analysis.ipynb diff --git a/docs/T6_technical_plan.md b/docs/T6_technical_plan.md new file mode 100644 index 00000000..e37c0c8c --- /dev/null +++ b/docs/T6_technical_plan.md @@ -0,0 +1,837 @@ +# T6 Technical Plan — Multi-Objective Vector Scores for Trainer Selection + +**Version:** 1.0 (Refined) +**Author:** Carlos Rodriguez +**Date:** February 9, 2025 +**Status:** M0 Deliverable — Analysis + Architecture + Interface Spec + +**Target repos / branches:** +- **Primary (implementation + PR):** [`AgentOpt/OpenTrace@experimental`](https://github.com/AgentOpt/OpenTrace/tree/experimental) +- **Benchmark integration (M3):** [`AgentOpt/Trace-Bench`](https://github.com/AgentOpt/Trace-Bench) + +--- + +## Table of Contents + +1. [Executive Summary](#1-executive-summary) +2. [Goals, Non-Goals, Success Criteria](#2-goals-non-goals-success-criteria) +3. [Current Code Reality (Baseline)](#3-current-code-reality-baseline) +4. [Proposed Architecture (Minimal Delta)](#4-proposed-architecture-minimal-delta) +5. [Public API & Data Contracts](#5-public-api--data-contracts) +6. [Module Modifications (Files to Create / Modify)](#6-module-modifications) +7. [Edge Cases & Defensive Design](#7-edge-cases--defensive-design) +8. [Milestones & Validation Gates](#8-milestones--validation-gates) +9. [Test Plan](#9-test-plan) +10. [Risks & Mitigation](#10-risks--mitigation) +11. [Design Decisions (Resolved)](#11-design-decisions-resolved) +12. [Appendix: Code Touchpoints](#12-appendix-code-touchpoints) + +--- + +## 1. Executive Summary + +Today, trainer selection in Trace is driven by a **single scalar score**. Guides return `Tuple[float, str]` via `get_feedback()`, evaluators produce `np.array` of floats, and trainers (`BasicSearchAlgorithm`, `BeamsearchAlgorithm`) select candidates via scalar comparison (`max(candidates, key=lambda x: x[0])` and `sorted(..., key=lambda x: x[0])` respectively). This blocks trainer-side search from exploiting multiple metrics like `{accuracy, latency_ms, cost}`. + +### What this plan adds + +| Component | Change | +|-----------|--------| +| **Score contract** | `Dict[str, float]` returned by guides (optional), with backward-compatible scalar fallback | +| **ObjectiveConfig** | Frozen dataclass defining selection mode: `scalar` (default), `weighted`, or `pareto` | +| **objectives.py** (new) | All multi-objective logic isolated in pure, testable functions | +| **Evaluators** | Vector-score aggregation helpers (`evaluate_vector`, `aggregate_vector_scores`) | +| **BasicSearchAlgorithm** | Selection via `select_best(candidates, objective_config)` | +| **BeamsearchAlgorithm** | Selection via `select_top_k(candidates, objective_config, k)` | +| **PrioritySearch** (optional) | Scalarize heap priority via ObjectiveConfig; store dict for logging | +| **Benchmarks** (M3) | 3 simple benchmarks integrated into Trace-Bench | + +### Guiding principles + +- **Backward compatibility is non-negotiable.** `mode="scalar"` (the default) preserves identical behavior. 
+- **Isolate complexity.** All multi-objective logic lives in `objectives.py` — pure functions, easy to test. +- **Minimal churn.** Trainers gain an optional `objective_config` parameter; existing call sites are untouched. +- **Determinism.** Fixed `seed` → deterministic selection, especially Pareto tie-breaks. + +--- + +## 2. Goals, Non-Goals, Success Criteria + +### 2.1 Goals + +| ID | Goal | Acceptance Signal | +|----|------|-------------------| +| G1 | **Backward compatibility** | Existing scalar-score guides/trainers produce identical results when `objective_config` is `None` or `mode="scalar"` | +| G2 | **Vector score support** | Guide returns `{"accuracy": 1.0, "latency_ms": 120.0}` and trainers select candidates using weighted or Pareto mode | +| G3 | **Determinism** | Fixed `seed` → identical selection across runs (tested in CI) | +| G4 | **Actionability** | Every milestone: Colab notebook + pytest coverage (M1+) | +| G5 | **Benchmarks** | 3 benchmarks defined, integrated into Trace-Bench, runnable from notebooks | + +### 2.2 Non-goals (explicit) + +- No multi-objective UCB (MO-UCB) — too risky for v1 scope. +- No Pareto archive / non-dominated set management inside PrioritySearch. +- No changes to optimizer internals or new telemetry infrastructure. +- No modification to `get_feedback()` return signature (we use a helper instead). + +### 2.3 Crisp success criteria + +All of the following must be true: + +1. Scalar-only trainers still work and produce same results by default. +2. Multi-objective guide dict works end-to-end for BasicSearch + Beamsearch. +3. Deterministic behavior with fixed seed (tests + notebook). +4. Each milestone delivers a runnable Colab notebook. +5. From M1 onward, new functions have pytest tests and CI is green. +6. M3: three benchmarks exist, run, and Trace-Bench integration works. + +--- + +## 3. Current Code Reality (Baseline) + +### 3.1 Guide — scalar score contract + +```python +# opto/trainer/guide.py + +class Guide: + def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[float, str]: + raise NotImplementedError + + def metric(self, query, response, reference=None, **kwargs) -> float: + return self.get_feedback(query, response, reference)[0] # extracts scalar +``` + +**Implication:** `metric()` always returns `float`. Multi-metric feedback is not usable for selection. + +### 3.2 Evaluators — scalar arrays + +```python +# opto/trainer/evaluators.py + +def evaluate(agent, guide, inputs, infos, ...) -> np.ndarray: + # Calls guide.metric() per example → float + # Returns np.array of shape (N,) or (N, num_samples) +``` + +**Implication:** All scores are numeric scalars aggregated via `np.mean()`. + +### 3.3 BasicSearchAlgorithm — scalar max selection + +```python +# opto/trainer/algorithms/basic_algorithms.py :: BasicSearchAlgorithm.optimizer_step() + +def validate(): + scores = evaluate(self.agent, self.validate_guide, ...) + return np.mean(scores) if all([s is not None for s in scores]) else -np.inf + +# Selection: +candidates.append((score, update_dict)) # score is float +best_score, best_update = max(candidates, key=lambda x: x[0]) # scalar max +``` + +**Insertion point:** Replace `max(candidates, ...)` with `select_best(candidates, objective_config)`. 
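To make the planned swap concrete before any code lands, here is a minimal, self-contained sketch of the target call site. The `ObjectiveConfig` below is trimmed to just `mode` and `weights` (the full dataclass in §5.2 carries more fields), and the function bodies are illustrative stand-ins for the M1 `objectives.py` implementation rather than the final code:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple, Union

ScoreLike = Union[float, Dict[str, float]]

@dataclass(frozen=True)
class ObjectiveConfig:                       # trimmed stand-in for the full §5.2 dataclass
    mode: str = "scalar"                     # "scalar" | "weighted"
    weights: Dict[str, float] = field(default_factory=dict)

def weighted_scalarize(score: Dict[str, float], weights: Dict[str, float]) -> float:
    if not weights:                          # empty weights -> equal weight 1.0 per metric
        weights = {k: 1.0 for k in score}
    return sum(w * score.get(m, float("-inf")) for m, w in weights.items())

def select_best(candidates: List[Tuple[ScoreLike, Any]],
                config: Optional[ObjectiveConfig] = None) -> int:
    def as_scalar(score: ScoreLike) -> float:
        if config is None or config.mode == "scalar":
            # Backward-compatible path: floats pass through, dicts collapse to their mean.
            return sum(score.values()) / len(score) if isinstance(score, dict) else float(score)
        # Weighted path: scalar scores are wrapped as {"score": value} before scalarization.
        return weighted_scalarize(score if isinstance(score, dict) else {"score": float(score)},
                                  config.weights)
    return max(range(len(candidates)), key=lambda i: as_scalar(candidates[i][0]))

# Today (scalar): best_score, best_update = max(candidates, key=lambda x: x[0])
scalar_candidates = [(0.72, "proposal_A"), (0.85, "proposal_B"), (0.78, "current_params")]
assert select_best(scalar_candidates) == 1   # same winner as the scalar max above

# Target (dict scores, weighted mode); latency is pre-negated, i.e. already "higher is better"
vector_candidates = [
    ({"accuracy": 0.90, "latency_ms": -120.0}, "proposal_A"),
    ({"accuracy": 0.85, "latency_ms": -60.0}, "proposal_B"),
]
config = ObjectiveConfig(mode="weighted", weights={"accuracy": 1.0, "latency_ms": 0.005})
best_score, best_update = vector_candidates[select_best(vector_candidates, config)]
print(best_update)                           # proposal_B: trades a little accuracy for lower latency
```

With `config=None` the selection is exactly the existing scalar max, which is the backward-compatibility guarantee stated in §4.3.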
+ +### 3.4 BeamsearchAlgorithm — scalar sort selection + +```python +# opto/trainer/algorithms/beamsearch_algorithm.py :: BeamsearchAlgorithm.select() + +scored_candidates.append((validation_score, candidate_params)) # float +sorted_candidates = sorted(scored_candidates, key=lambda x: x[0], reverse=True) +selected_candidates = sorted_candidates[:beam_width] # take top-k by scalar +``` + +**Insertion point:** Replace scalar sort with `select_top_k(scored_candidates, objective_config, k=beam_width)`. + +### 3.5 Shared patterns across both trainers + +| Pattern | BasicSearch | Beamsearch | +|---------|-------------|------------| +| Validate | `np.mean(scores)` → float | `np.mean(validation_scores)` → float | +| Store | `(score, update_dict)` | `(validation_score, candidate_params)` | +| Select | `max(candidates, key=λ x: x[0])` | `sorted(candidates, key=λ x: x[0])[:k]` | +| Fallback | `-np.inf` | `-np.inf` | + +Both converge to the same abstraction: **given a list of `(score, params)` pairs, select the best or top-k.** This is exactly what `objectives.py` will provide. + +### 3.6 Existing infrastructure we leverage + +- **Logger abstraction:** `BaseLogger` with `log(name, value, step)` — can log each metric in a vector score. +- **StubLLM / DummyLLM:** Wraps deterministic callables — usable for CI and no-keys notebooks. +- **`batch_run` / `async_run`:** Parallelism utilities already in place. + +--- + +## 4. Proposed Architecture (Minimal Delta) + +### 4.1 Core idea + +Isolate all multi-objective logic into one new module (`opto/trainer/objectives.py`) containing **pure functions**: + +``` +normalize_score() → scalar ↔ dict conversion +apply_minimize() → flip signs for minimize metrics +weighted_scalarize()→ dict → float via weighted sum +pareto_rank() → dominance ranking + tie-break +select_best() → given candidates + config, return best index +select_top_k() → given candidates + config, return top-k indices +``` + +Trainers call these functions instead of inline `max()` / `sorted()`. When `objective_config` is `None`, the functions fall through to scalar comparison — **identical to current behavior**. + +### 4.2 Data flow (target) + +``` +Guide.get_feedback() + │ + ├── returns (float, str) ← existing path, unchanged + └── returns (Dict[str,float], str) ← new path (via get_score_dict helper) + │ + ▼ +Evaluator.evaluate_vector() + │ + ├── per-example: List[Dict[str, float]] + └── aggregated: Dict[str, float] (mean per metric) + │ + ▼ +Trainer selection (objectives.py) + │ + ├── mode="scalar" → max(mean_scores) ← unchanged + ├── mode="weighted" → max(weighted_scalarize()) ← new + └── mode="pareto" → pareto_rank() + tie-break ← new +``` + +### 4.3 Backward compatibility guarantee + +The entire vector-score path is **opt-in**: + +1. If `objective_config` is `None` → existing scalar path, no new code executed. +2. If guide returns `float` and `objective_config` is provided → `normalize_score()` wraps it as `{"score": float}`, weights default to `{"score": 1.0}`. +3. If guide returns `Dict[str, float]` and `objective_config` is `None` → `mean(values)` used as scalar fallback, preserving scalar selection. + +--- + +## 5. Public API & Data Contracts + +### 5.1 Score types + +```python +from typing import Union, Dict + +ScalarScore = float +VectorScore = Dict[str, float] # JSON-serializable, all values finite +ScoreLike = Union[int, float, bool, Dict[str, float]] +``` + +**Contract:** +- "Higher is better" by default for all metrics. 
+- Metrics to minimize are declared in `ObjectiveConfig.minimize` (semantics: negate internally). +- All dict values must be finite floats. `NaN` / `±inf` in a dict raises `ValueError`. +- `int` and `bool` scalar scores are accepted and converted to `float` (e.g., `LLMJudge` returns `int` 0/1, test guides return `bool`). + +### 5.2 ObjectiveConfig + +```python +from dataclasses import dataclass, field +from typing import Literal, Optional, Dict, Tuple + +@dataclass(frozen=True) +class ObjectiveConfig: + """Configuration for multi-objective candidate selection. + + Attributes: + mode: Selection strategy. + - "scalar": Use existing scalar comparison (default, backward-compatible). + - "weighted": Scalarize via weighted sum, then select max. + - "pareto": Pareto dominance ranking with configurable tie-break. + weights: Per-metric weights for weighted scalarization. + Missing metrics use missing_value. Metrics not present in the weights dict + are ignored (not included in the weighted sum). + If empty dict in weighted mode, all present metrics get equal weight 1.0. + minimize: Frozenset of metric names where lower is better (users can pass set; auto-converted). + These are negated internally before comparison ("higher-is-better" normalization). + missing_value: Score assigned to missing metrics in a candidate's score dict. + Default: float('-inf') (effectively disqualifies candidates missing required metrics). + pareto_metrics: Subset of metrics to use for Pareto dominance. + If None, all metrics present across candidates are used. + tie_break: Strategy for breaking ties among Pareto-equivalent candidates. + - "weighted": Fall back to weighted scalarization among tied candidates. + - "lexicographic": Sort by metrics in alphabetical order. + - "random_seeded": Seeded random shuffle. + seed: Random seed for deterministic tie-breaking. + """ + mode: Literal["scalar", "weighted", "pareto"] = "scalar" + weights: Dict[str, float] = field(default_factory=dict) + minimize: frozenset = field(default_factory=frozenset) + missing_value: float = float("-inf") + pareto_metrics: Optional[Tuple[str, ...]] = None + tie_break: Literal["weighted", "lexicographic", "random_seeded"] = "weighted" + seed: int = 0 + + def __post_init__(self): + # Convert set → frozenset for true immutability + hashability + if isinstance(self.minimize, set): + object.__setattr__(self, 'minimize', frozenset(self.minimize)) + # Validate weights are non-negative + for k, v in self.weights.items(): + if v < 0: + raise ValueError(f"Weight for '{k}' must be non-negative, got {v}") + # Validate pareto_metrics + if self.pareto_metrics is not None and len(self.pareto_metrics) == 0: + raise ValueError("pareto_metrics must be None (auto) or non-empty tuple") +``` + +**Validation rules (enforced in `__post_init__`):** +- `minimize` is stored as `frozenset` for true immutability (users can pass `set` for convenience; it's auto-converted). +- `mode="weighted"` with empty `weights` → auto-assign equal weight 1.0 to all encountered metrics. +- `mode="pareto"` with `pareto_metrics=None` → use union of all metric keys across candidates. +- `mode="pareto"` with `pareto_metrics=()` → `ValueError`. +- All weight values must be non-negative. +- `minimize` metric names must be valid strings (warning if not found in any candidate). + +### 5.3 Guide helper method + +```python +# Added to Guide base class (non-breaking) + +class Guide: + # ... existing methods unchanged ... 
+ + def get_score_dict(self, query: str, response: str, reference=None, **kwargs) -> Dict[str, float]: + """Return the evaluation score as a dictionary. + + Wraps get_feedback() for backward compatibility: + - If get_feedback returns (float, str): returns {"score": float} + - If get_feedback returns (dict, str): returns dict directly + + Subclasses returning multi-metric scores should override get_feedback() + to return (Dict[str, float], str) instead of (float, str). + """ + score, _ = self.get_feedback(query, response, reference, **kwargs) + if isinstance(score, dict): + return score + return {"score": float(score)} + + def metric(self, query: str, response: str, reference=None, **kwargs) -> float: + """Always returns float. For dict scores, returns mean of values as scalar fallback. + + This ensures evaluate() and the training loop (which call metric()) remain + completely safe. Dict scores only flow through get_score_dict() → evaluate_vector(). + """ + score, _ = self.get_feedback(query, response, reference, **kwargs) + if isinstance(score, dict): + return float(np.mean(list(score.values()))) + return float(score) +``` + +**Why this approach:** +- `get_score_dict()` is a new method — zero risk of breaking existing subclasses. +- `metric()` always returns `float` — the existing `evaluate()` function (which calls `guide.metric()` and passes results to `np.array()`) and the training loop (which calls `np.mean(scores)`) are completely unaffected. +- Dict scores are only accessible via `get_score_dict()` → `evaluate_vector()`, keeping the two data paths cleanly separated. + +### 5.4 Evaluator additions + +```python +# Added to opto/trainer/evaluators.py + +def evaluate_vector(agent, guide, inputs, infos, min_score=None, + num_samples=1, num_threads=None, description=None + ) -> list: + """Like evaluate(), but returns List[ScoreLike] (float or dict per example). + + Uses guide.get_score_dict() to obtain dict scores per example. + When guide returns scalar, get_score_dict() wraps it as {"score": float}. + + When num_samples > 1: for each example, collects num_samples score dicts, + computes per-key mean across the samples, and returns one aggregated dict + per example. Final output is always List[Dict[str, float]] of length N. + """ + ... + +def aggregate_vector_scores(scores: list) -> Union[float, Dict[str, float]]: + """Aggregate per-example scores into a single summary score. + + - If all scores are float: returns np.mean (existing behavior). + - If all scores are dict: returns per-metric mean dict. + - Mixed float/dict: normalizes all to dict via normalize_score(), then averages. + + Args: + scores: List of float or Dict[str, float] values. + + Returns: + float (if all scalar) or Dict[str, float] (if any dicts present). + """ + ... +``` + +### 5.5 objectives.py — complete function signatures + +```python +# opto/trainer/objectives.py (NEW FILE) + +from typing import Union, Dict, List, Set, Optional, Tuple, Literal +from dataclasses import dataclass, field + +# --- ObjectiveConfig defined here (see §5.2) --- + +# --- Score type aliases --- +ScalarScore = float +VectorScore = Dict[str, float] +ScoreLike = Union[float, Dict[str, float]] + +# --- Pure utility functions --- + +def normalize_score(score: ScoreLike) -> Dict[str, float]: + """Convert any score to dict form. + + - int/float/bool → {"score": float(value)} + - Dict[str, float] → returned as-is (validated: all values finite) + + Handles int (LLMJudge returns 0/1) and bool (test guides) via isinstance(score, (int, float, bool)). 
+ + Raises: + TypeError: if score is not int, float, bool, or dict + ValueError: if dict contains non-finite values or is empty + """ + ... + +def apply_minimize(score_dict: Dict[str, float], + minimize: Set[str]) -> Dict[str, float]: + """Negate values for minimize metrics (higher-is-better normalization). + + Returns a new dict with minimize metrics negated. + Metrics not in minimize set are unchanged. + """ + ... + +def weighted_scalarize(score_dict: Dict[str, float], + weights: Dict[str, float], + missing_value: float = float("-inf")) -> float: + """Compute weighted sum of score dict. + + For each metric in weights: + - If present in score_dict: weight * value + - If missing: weight * missing_value + + Metrics in score_dict but NOT in weights are ignored. + If weights is empty, all metrics get equal weight 1.0. + + Returns: + Weighted scalar score. + """ + ... + +def dominates(a: Dict[str, float], b: Dict[str, float], + metrics: Optional[Tuple[str, ...]] = None) -> bool: + """Check if candidate 'a' Pareto-dominates candidate 'b'. + + a dominates b iff: + - a[m] >= b[m] for all metrics m, AND + - a[m] > b[m] for at least one metric m + + Both dicts must already be in "higher-is-better" form (post apply_minimize). + Missing metrics are treated as missing_value (caller should handle before call). + + Args: + a, b: Score dicts (higher-is-better normalized). + metrics: Subset of metrics to compare. If None, use union of keys. + """ + ... + +def pareto_rank(candidates: List[Dict[str, float]], + metrics: Optional[Tuple[str, ...]] = None) -> List[int]: + """Assign Pareto rank to each candidate (0 = non-dominated front). + + Uses standard non-dominated sorting. + + Args: + candidates: List of score dicts (higher-is-better normalized). + metrics: Subset of metrics for dominance. If None, use all present. + + Returns: + List of integer ranks (same length as candidates). Rank 0 = Pareto front. + """ + ... + +def select_best(candidates: List[Tuple[ScoreLike, any]], + objective_config: Optional['ObjectiveConfig'] = None) -> int: + """Select the single best candidate index. + + Args: + candidates: List of (score, payload) tuples. + objective_config: Selection config. If None, uses scalar max (backward-compatible). + + Returns: + Index of best candidate. + + Behavior by mode: + - scalar/None: max(score) where score is float (or mean of dict values). + - weighted: max(weighted_scalarize(normalize(score), config.weights)). + - pareto: rank candidates, tie-break among rank-0 front, return winner. + + Call-site transformation (BasicSearch): + # Current: + best_score, best_update = max(candidates, key=lambda x: x[0]) + # Target: + best_idx = select_best(candidates, objective_config) + best_score, best_update = candidates[best_idx] + """ + ... + +def select_top_k(candidates: List[Tuple[ScoreLike, any]], + objective_config: Optional['ObjectiveConfig'] = None, + k: int = 1) -> List[int]: + """Select the top-k candidate indices. + + Same logic as select_best, but returns k indices. + + For pareto mode: returns rank-0 front (up to k). If front < k, + includes rank-1 candidates by tie-break order, etc. + + Deterministic ordering guaranteed with fixed seed. + """ + ... +``` + +--- + +## 6. 
Module Modifications + +### 6.1 Files to CREATE + +| File | Contents | Milestone | +|------|----------|-----------| +| `opto/trainer/objectives.py` | `ObjectiveConfig`, `normalize_score`, `apply_minimize`, `weighted_scalarize`, `dominates`, `pareto_rank`, `select_best`, `select_top_k` | M1 | +| `tests/test_objectives.py` | Unit tests for all functions in objectives.py | M1 | +| `tests/test_evaluators_vector.py` | Tests for evaluate_vector + aggregate_vector_scores | M1 | +| `tests/test_trainers_multiobjective.py` | Integration tests for BasicSearch + Beamsearch with ObjectiveConfig | M2 | +| `examples/notebooks/t6_m0_analysis.ipynb` | M0 analysis notebook | M0 | +| `examples/notebooks/t6_m1_vector_scores.ipynb` | M1 demo notebook | M1 | +| `examples/notebooks/t6_m2_trainers.ipynb` | M2 demo notebook | M2 | +| `examples/notebooks/t6_m3_benchmarks.ipynb` | M3 benchmark notebook | M3 | +| `docs/T6_technical_plan.md` | This document | M0 | +| `docs/multi_objective_scores.md` | User-facing documentation | M4 | + +### 6.2 Files to MODIFY + +| File | Change | Milestone | +|------|--------|-----------| +| `opto/trainer/guide.py` | Add `get_score_dict()` method to `Guide` base class. Update `metric()` to collapse dict scores to `float` via `mean(values)` (return type stays `float`). | M1 | +| `opto/trainer/evaluators.py` | Add `evaluate_vector()` and `aggregate_vector_scores()`. Existing `evaluate()` unchanged. | M1 | +| `opto/trainer/algorithms/basic_algorithms.py` | Add `objective_config` param to `BasicSearchAlgorithm.train()`. Replace `max(candidates, ...)` with `select_best()` in `optimizer_step()`. | M1 (minimal) / M2 (robust) | +| `opto/trainer/algorithms/beamsearch_algorithm.py` | Add `objective_config` param to `BeamsearchAlgorithm.train()`. Replace scalar sort in `select()` with `select_top_k()`. | M2 | +| `opto/features/priority_search/priority_search.py` | (Optional) Add `objective_config` param. Scalarize heap key via weighted mode. Store dict for logging. Pareto falls back to weighted. | M2 | + +### 6.3 Files NOT modified + +- `opto/trace/` — no changes to trace primitives. +- `opto/optimizers/` — optimizers are upstream of selection; they produce candidates, not rank them. +- Existing tests — no modifications; they validate backward compatibility by continuing to pass. + +--- + +## 7. 
Edge Cases & Defensive Design + +### 7.1 Score validation + +| Case | Behavior | +|------|----------| +| `score = 0.85` (float) | `normalize_score()` → `{"score": 0.85}` | +| `score = 1` (int) | `normalize_score()` → `{"score": 1.0}` (LLMJudge returns int 0/1) | +| `score = True` (bool) | `normalize_score()` → `{"score": 1.0}` (test guides return bool) | +| `score = {"accuracy": 0.9, "latency_ms": 120.0}` | Returned as-is after validation | +| `score = {}` (empty dict) | `ValueError("Score dict must not be empty")` | +| `score = {"accuracy": float('nan')}` | `ValueError("Score dict contains non-finite value")` | +| `score = {"accuracy": float('inf')}` | `ValueError("Score dict contains non-finite value")` | +| `score = "text"` (wrong type) | `TypeError("Score must be int, float, bool, or Dict[str, float]")` | + +### 7.2 Missing metrics across candidates + +| Case | Behavior | +|------|----------| +| Candidate A has `{accuracy, latency}`, B has `{accuracy}` | B gets `latency = missing_value` (default `-inf`) | +| `weights = {"accuracy": 0.7, "latency": 0.3}`, candidate missing `latency` | Weighted sum uses `0.3 * missing_value` | +| All candidates missing a weighted metric | Warning logged; metric still contributes `weight * missing_value` | + +### 7.3 Mixed scalar/dict batches + +| Case | Behavior | +|------|----------| +| All scores are `float` (or `int`/`bool`) | `aggregate_vector_scores()` returns `float` via `np.mean()` (existing behavior) | +| All scores are `dict` with same keys | `aggregate_vector_scores()` returns per-metric mean `Dict[str, float]` | +| Mixed `float` and `dict` in same batch | `ValueError("All scores in a batch must be the same type (all float or all dict)")` | + +A mixed batch most likely indicates a bug in the guide implementation (e.g., returning `float` on some inputs and `dict` on others). Failing loudly prevents silent incorrect aggregation. + +### 7.4 Single-metric dict + +| Case | Behavior | +|------|----------| +| Guide returns `{"accuracy": 0.9}` with `mode="weighted"` | Weighted sum = `weight * 0.9` (trivially correct) | +| Guide returns `{"accuracy": 0.9}` with `mode="pareto"` | Pareto degenerates to scalar max (single dimension — no tradeoffs). Warning logged. | + +### 7.5 Tie-breaking + +| Case | Behavior | +|------|----------| +| Two candidates with identical weighted score | Deterministic: lower original index wins (stable sort) | +| Pareto front with 3 equivalent candidates, `tie_break="weighted"` | Fall back to weighted scalarization among the 3; select max | +| Pareto front with 3 equivalent candidates, `tie_break="lexicographic"` | Sort by metric names alphabetically, compare values in order | +| Pareto front with 3 equivalent candidates, `tie_break="random_seeded"` | Seeded shuffle with `config.seed`; same seed → same order always | + +### 7.7 Training loop safety + +The training loop has a **separate data path** from evaluation/selection. In `standard_optimization_step()` (basic_algorithms.py:46) and `standard_forward()` (sampler.py:130): + +```python +score, feedback = guide(x, target.data, info) +``` + +This `score` flows into `MinibatchAlgorithm.update()` where `np.mean(scores)` is computed (basic_algorithms.py:511). 
**This path must always receive `float`.** + +| Constraint | Enforcement | +|-----------|-------------| +| `guide.__call__()` / `get_feedback()` return type is **NOT widened** | No changes to `get_feedback()` signature; it still returns `Tuple[float, str]` | +| Training loop always receives scalar `score` | `metric()` always returns `float` (collapses dict via `mean(values)` if needed) | +| Dict scores flow through a separate path | `get_score_dict()` → `evaluate_vector()` → `select_best()` / `select_top_k()` | +| A multi-objective guide must return `(float, str)` from `get_feedback()` for the training loop | The float is a collapsed scalar summary; the full dict is extracted via `get_score_dict()` during selection | + +**Two data paths (by design):** +``` +Training loop: guide() → score (float) → np.mean(scores) ← UNCHANGED +Selection path: get_score_dict() → evaluate_vector() → objectives.py ← NEW +``` + +### 7.6 ObjectiveConfig validation + +| Case | Behavior | +|------|----------| +| `mode="weighted"`, `weights={}` | Auto-assign equal weight 1.0 to all metrics encountered at selection time | +| `mode="pareto"`, `pareto_metrics=()` (empty tuple) | `ValueError("pareto_metrics must be None (auto) or non-empty tuple")` | +| `weights={"accuracy": -0.5}` (negative weight) | `ValueError("All weights must be non-negative")` | +| `minimize={"unknown_metric"}` | Warning logged at selection time if metric never appears; no error (tolerant) | + +--- + +## 8. Milestones & Validation Gates + +### Milestone 0 — Analysis + technical plan + interface spec + +**Deliverables:** +- `docs/T6_technical_plan.md` — this document, finalized +- `notebooks/t6_m0_analysis.ipynb` — Colab-ready notebook + +**Notebook demonstrates:** +- Current Guide score contract (`get_feedback` → `Tuple[float, str]`, `metric` → `float`) +- Where scalar selection happens in BasicSearch (`max(candidates, ...)`) and Beamsearch (`sorted(...)[:k]`) +- Planned behavior prototype: deterministic toy guide returning dict metrics, showing weighted vs Pareto selection on dummy candidates + +**SMART validation:** +- Plan includes final API signatures and precise file list (create/modify) ✓ +- Notebook runs without API keys ✓ +- Notebook prints: current score contract, selection touchpoints, planned selection outputs ✓ + +--- + +### Milestone 1 — ObjectiveConfig + utilities + evaluator support + BasicSearch minimal + +**Deliverables:** +- `opto/trainer/objectives.py` (new) +- `opto/trainer/guide.py` (add `get_score_dict`) +- `opto/trainer/evaluators.py` (add `evaluate_vector`, `aggregate_vector_scores`) +- `opto/trainer/algorithms/basic_algorithms.py` (BasicSearch: accept/use ObjectiveConfig) +- `tests/test_objectives.py`, `tests/test_evaluators_vector.py` +- `notebooks/t6_m1_vector_scores.ipynb` + +**Notebook demonstrates:** +- StubLLM mode: BasicSearchAlgorithm on small candidate set (5-10) with deterministic dummy guide returning dict metrics +- Shows: (a) scalar baseline, (b) weighted mode, (c) Pareto mode, (d) deterministic tie-break under fixed seed +- Real LLM mode (optional): tiny dataset (≤5 items) producing ≥2 metrics + +**SMART validation:** +- `pytest -q` passes (all new functions covered) +- Notebook runs in Colab: weighted selection result changes when weights change +- Pareto returns tradeoffs and is deterministic under fixed seed +- Scalar path produces identical results to pre-change behavior + +--- + +### Milestone 2 — Trainer upgrades (Beamsearch + robust BasicSearch) + +**Deliverables:** +- 
`opto/trainer/algorithms/beamsearch_algorithm.py` (accept ObjectiveConfig, vector selection) +- Expanded BasicSearch tests (edge cases, missing metrics, tie-break policies) +- Optional: minimal PrioritySearch support (weighted scalarization for heap, dict stored for logging) +- `tests/test_trainers_multiobjective.py` +- `notebooks/t6_m2_trainers.ipynb` + +**Notebook demonstrates:** +- BasicSearch + Beamsearch in: scalar mode (baseline), weighted mode, Pareto mode +- StubLLM + real LLM sections + +**SMART validation:** +- `pytest -q` green +- Integration test confirms: weighted vs Pareto select different candidates where expected +- Scalar-only example produces same final best score when `objective_config=None` +- Deterministic tie-break is stable across runs + +--- + +### Milestone 3 — Benchmarks (Trace-Bench integration) + +**Deliverables:** +- PR to Trace-Bench: benchmark configs/tasks + notebook +- 3 benchmarks: + 1. **Accuracy vs latency** (toy QA dataset) + 2. **Accuracy vs response length** (penalize verbosity) + 3. **Accuracy vs tool calls** (penalize excessive tool usage) +- `notebooks/t6_m3_benchmarks.ipynb` + +**SMART validation:** +- Notebook outputs per-benchmark table: weighted-mode best candidate metrics + Pareto-mode set of tradeoffs +- Benchmarks run in StubLLM mode (fast/deterministic) and real LLM mode (small sample) +- Trace-Bench run completes without private datasets +- `pytest -q` green (smoke tests for benchmark integration) + +--- + +### Milestone 4 — Documentation + polished notebooks + +**Deliverables:** +- `docs/multi_objective_scores.md` — user-facing documentation +- README update with pointers to docs and notebooks +- Polished "How-to" notebook: installs from GitHub, runs BasicSearch weighted + Pareto, prints metric tradeoffs + +**SMART validation:** +- Fresh Colab runtime runs how-to notebook without manual patching +- CI green, no behavioral changes beyond documentation/polish + +--- + +## 9. 
Test Plan + +### 9.1 Unit tests — `tests/test_objectives.py` (M1) + +| Test | Validates | +|------|-----------| +| `test_normalize_score_from_float` | `0.85` → `{"score": 0.85}` | +| `test_normalize_score_from_dict` | `{"a": 1.0, "b": 2.0}` → same dict | +| `test_normalize_score_empty_dict_raises` | `{}` → `ValueError` | +| `test_normalize_score_nan_raises` | `{"a": float('nan')}` → `ValueError` | +| `test_normalize_score_wrong_type_raises` | `"text"` → `TypeError` | +| `test_apply_minimize` | `{"acc": 0.9, "lat": 100}` with `minimize={"lat"}` → `{"acc": 0.9, "lat": -100}` | +| `test_apply_minimize_empty_set` | No metrics negated | +| `test_weighted_scalarize_basic` | `{"a": 0.8, "b": 0.2}` with `weights={"a": 0.7, "b": 0.3}` → `0.7*0.8 + 0.3*0.2` | +| `test_weighted_scalarize_missing_metric` | Missing metric uses `missing_value` | +| `test_weighted_scalarize_empty_weights` | Equal weight 1.0 for all metrics | +| `test_dominates_true` | A dominates B (all ≥, at least one >) | +| `test_dominates_false_equal` | A == B → does not dominate | +| `test_dominates_false_tradeoff` | A better on one, B better on another | +| `test_pareto_rank_simple` | 3 candidates with clear rank 0, 1, 2 | +| `test_pareto_rank_all_nondominated` | All candidates rank 0 | +| `test_select_best_scalar_mode` | Falls back to scalar max | +| `test_select_best_weighted_mode` | Returns highest weighted score | +| `test_select_best_pareto_mode` | Returns Pareto-optimal by tie-break | +| `test_select_best_none_config` | `objective_config=None` → scalar max (backward compat) | +| `test_select_top_k_weighted` | Returns k highest weighted scores | +| `test_select_top_k_pareto` | Returns k from Pareto front + spillover | +| `test_deterministic_tie_break_seeded` | Same seed → same result across 100 runs | +| `test_deterministic_tie_break_different_seeds` | Different seeds → potentially different result | + +### 9.2 Unit tests — `tests/test_evaluators_vector.py` (M1) + +| Test | Validates | +|------|-----------| +| `test_aggregate_vector_scores_all_scalar` | `[0.8, 0.9, 0.7]` → `np.mean` (backward compat) | +| `test_aggregate_vector_scores_all_dict` | Per-metric mean computed correctly | +| `test_aggregate_vector_scores_mixed` | Scalars normalized to dict, then averaged | +| `test_evaluate_vector_returns_correct_types` | Returns list of ScoreLike matching guide output | + +### 9.3 Integration tests — `tests/test_trainers_multiobjective.py` (M2) + +| Test | Validates | +|------|-----------| +| `test_basicsearch_scalar_unchanged` | Default behavior identical to pre-change | +| `test_basicsearch_weighted_selects_expected` | Weighted mode picks correct candidate | +| `test_basicsearch_pareto_selects_expected` | Pareto mode picks different candidate than weighted | +| `test_beamsearch_scalar_unchanged` | Default behavior identical | +| `test_beamsearch_weighted_selects_top_k` | Weighted mode picks correct top-k | +| `test_beamsearch_pareto_selects_front` | Pareto mode returns non-dominated front | +| `test_deterministic_across_runs` | Fixed seed → same selections in 5 repeated runs | + +### 9.4 Notebook validation (human / Trace team) + +Each notebook contains: +- **StubLLM (no keys) section:** deterministic dummy guide, runs quickly +- **Real LLM section (optional):** small N (5-20 examples), prints cost/latency caveats, requires API key + +--- + +## 10. 
Risks & Mitigation

| Risk | Severity | Mitigation |
|------|----------|------------|
| **R1: Missing metrics across candidates** | Medium | `missing_value` in ObjectiveConfig (default `-inf`). Enforce metric presence for configured weights (or warn). |
| **R2: Pareto nondeterminism** | High | Deterministic ordering via stable sort + explicit tie-break rules. Seeded randomness only when requested. |
| **R3: Multi-thread eval ordering** | Medium | Tests run with `num_threads=1` to guarantee stability. Document thread-safety considerations. |
| **R4: Breaking Guide subclasses** | High | Use `get_score_dict()` helper — never change `get_feedback()` signature. `metric()` keeps returning `float` (dict scores collapse to their mean), so existing callers are unaffected. |
| **R5: Performance regression** | Low | `objectives.py` functions are O(n²) for Pareto ranking on n candidates, but n is typically ≤20 (num_proposals). No concern at this scale. |
| **R6: Mixed scalar/dict in same batch** | Medium | `aggregate_vector_scores()` rejects mixed batches with `ValueError`. A mixed batch indicates a bug in the guide. |
| **R7: Training loop receives dict score** | High | `guide.__call__()` / `get_feedback()` return type is NOT widened. `metric()` always returns `float`. Dict scores only flow through `get_score_dict()` → `evaluate_vector()`. See §7.7. |

---

## 11. Design Decisions (Resolved)

### D1: Where to implement scalar→dict normalization?

**Decision: Option A — `Guide.get_score_dict()` helper + `objectives.normalize_score()`**

- `get_score_dict()` on Guide provides a clean entry point for subclasses.
- `normalize_score()` in objectives.py is the canonical utility (pure function, testable).
- Avoids widening `get_feedback()` return type (higher churn, breaks typing).

### D2: Pareto selection definition

**Decision: Option A — Standard dominance on aggregated metrics, return single best by tie-break.**

- `select_best()` returns one winner. `select_top_k()` returns k winners.
- Trainers don't need to manage a "front" — they just get indices.
- Beamsearch naturally uses `select_top_k(k=beam_width)`.

### D3: PrioritySearch scope

**Decision: Minimal (in-scope).**

- Scalarize heap priority via `weighted_scalarize()`.
- Store full `score_dict` on each candidate for logging.
- `mode="pareto"` falls back to weighted with documented warning.
- Pareto archive is out-of-scope for v1.

---

## 12. 
Appendix: Code Touchpoints + +### OpenTrace / experimental + +| File | URL | +|------|-----| +| Guide base | [guide.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/guide.py) | +| Evaluators | [evaluators.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/evaluators.py) | +| BasicSearch | [basic_algorithms.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/algorithms/basic_algorithms.py) | +| Beamsearch | [beamsearch_algorithm.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/algorithms/beamsearch_algorithm.py) | +| PrioritySearch | [priority_search.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/features/priority_search/priority_search.py) | + +### Trace-Bench + +| File | URL | +|------|-----| +| Repo | [Trace-Bench](https://github.com/AgentOpt/Trace-Bench) | + +### Selection logic summary (current → target) + +| Trainer | Current Code | Target Code | +|---------|-------------|-------------| +| BasicSearch | `max(candidates, key=lambda x: x[0])` | `select_best(candidates, objective_config)` | +| Beamsearch | `sorted(candidates, key=lambda x: x[0], reverse=True)[:k]` | `select_top_k(candidates, objective_config, k)` | +| PrioritySearch | scalar heap key | `weighted_scalarize(score_dict, config)` for heap key | diff --git a/examples/notebooks/t6_m0_analysis.ipynb b/examples/notebooks/t6_m0_analysis.ipynb new file mode 100644 index 00000000..90eefcad --- /dev/null +++ b/examples/notebooks/t6_m0_analysis.ipynb @@ -0,0 +1,950 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "275808ea", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nT6 Milestone 0 — Analysis Notebook\\n\\nThis notebook is the M0 deliverable for the T6 Multi-Objective Vector Scores project.\\nIt demonstrates:\\n 1. Current baseline behavior (Guide score contract, evaluator aggregation, trainer selection)\\n 2. Exact code touchpoints and signatures in the OpenTrace codebase\\n 3. Planned behavior prototype: Pareto front vs weighted selection on deterministic toy candidates\\n\\nRuns end-to-end WITHOUT API keys.\\n'" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\"\"\n", + "T6 Milestone 0 — Analysis Notebook\n", + "\n", + "This notebook is the M0 deliverable for the T6 Multi-Objective Vector Scores project.\n", + "It demonstrates:\n", + " 1. Current baseline behavior (Guide score contract, evaluator aggregation, trainer selection)\n", + " 2. Exact code touchpoints and signatures in the OpenTrace codebase\n", + " 3. Planned behavior prototype: Pareto front vs weighted selection on deterministic toy candidates\n", + "\n", + "Runs end-to-end WITHOUT API keys.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "b1a58d26", + "metadata": {}, + "source": [ + "# T6 Multi-Objective Vector Scores — M0 Analysis\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)\n", + "\n", + "**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n", + "\n", + "This notebook demonstrates:\n", + "1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n", + "2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n", + "3. 
**Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n", + "\n", + "**No API keys required.** All examples use deterministic dummy data.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "a252270b", + "metadata": {}, + "source": [ + "## How to Validate This Milestone\n", + "\n", + "After running all cells, confirm:\n", + "- [ ] Current Guide score contract is printed (`get_feedback → Tuple[float, str]`, `metric → float`)\n", + "- [ ] Scalar selection points in BasicSearch and Beamsearch are identified\n", + "- [ ] Weighted selection produces different results when weights change\n", + "- [ ] Pareto selection returns non-dominated candidates (tradeoff set)\n", + "- [ ] Deterministic tie-break produces identical results across repeated runs with same seed" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "067cd49e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "T6 M0 Analysis — Multi-Objective Vector Scores\n", + "======================================================================\n" + ] + } + ], + "source": [ + "# Setup — no external dependencies beyond numpy\n", + "import numpy as np\n", + "from typing import Dict, List, Tuple, Optional, Set, Union, Literal\n", + "from dataclasses import dataclass, field\n", + "import json\n", + "\n", + "print(\"=\" * 70)\n", + "print(\"T6 M0 Analysis — Multi-Objective Vector Scores\")\n", + "print(\"=\" * 70)" + ] + }, + { + "cell_type": "markdown", + "id": "54b6022f", + "metadata": {}, + "source": [ + "---\n", + "## Part 1: Current Baseline Behavior\n", + "\n", + "### 1.1 Guide Score Contract\n", + "\n", + "The `Guide` base class defines the current score interface:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "2ab12cbf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "PART 1: CURRENT BASELINE BEHAVIOR\n", + "======================================================================\n", + "\n", + "=== Current Guide Score Contract ===\n", + "\n", + "class Guide:\n", + " def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[float, str]:\n", + " raise NotImplementedError\n", + "\n", + " def metric(self, query, response, reference=None, **kwargs) -> float:\n", + " return self.get_feedback(query, response, reference)[0] # extracts scalar\n", + "\n", + "Key observations:\n", + " • get_feedback() returns Tuple[float, str] — a SCALAR score + feedback string\n", + " • metric() returns float — just extracts the first element\n", + " • LLMJudge (subclass) returns binary 0/1 scores\n", + " • No mechanism to return Dict[str, float] for multiple metrics\n", + "\n", + "Example — get_feedback(): score=1.0 (type=float), feedback='Correct!'\n", + "Example — metric(): 1.0 (type=float)\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"PART 1: CURRENT BASELINE BEHAVIOR\")\n", + "print(\"=\" * 70)\n", + "\n", + "print(\"\"\"\n", + "=== Current Guide Score Contract ===\n", + "\n", + "class Guide:\n", + " def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[float, str]:\n", + " raise NotImplementedError\n", + "\n", + " def metric(self, query, response, reference=None, **kwargs) -> float:\n", + " return self.get_feedback(query, response, 
reference)[0] # extracts scalar\n", + "\n", + "Key observations:\n", + " • get_feedback() returns Tuple[float, str] — a SCALAR score + feedback string\n", + " • metric() returns float — just extracts the first element\n", + " • LLMJudge (subclass) returns binary 0/1 scores\n", + " • No mechanism to return Dict[str, float] for multiple metrics\n", + "\"\"\")\n", + "\n", + "# Simulate current behavior\n", + "class CurrentGuide:\n", + " \"\"\"Simulates the current Guide behavior — scalar scores only.\"\"\"\n", + " def get_feedback(self, query, response, reference=None) -> Tuple[float, str]:\n", + " score = 1.0 if response == reference else 0.0\n", + " feedback = \"Correct!\" if score == 1.0 else f\"Expected '{reference}', got '{response}'\"\n", + " return score, feedback\n", + "\n", + " def metric(self, query, response, reference=None) -> float:\n", + " return self.get_feedback(query, response, reference)[0]\n", + "\n", + "guide = CurrentGuide()\n", + "score, feedback = guide.get_feedback(\"What is 2+2?\", \"4\", \"4\")\n", + "print(f\"Example — get_feedback(): score={score} (type={type(score).__name__}), feedback='{feedback}'\")\n", + "print(f\"Example — metric(): {guide.metric('What is 2+2?', '4', '4')} (type={type(guide.metric('What is 2+2?', '4', '4')).__name__})\")" + ] + }, + { + "cell_type": "markdown", + "id": "fcbb5663", + "metadata": {}, + "source": [ + "### 1.2 Evaluator Aggregation" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "55bf7801", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== Current Evaluator Behavior ===\n", + "\n", + "def evaluate(agent, guide, inputs, infos, ...) -> np.ndarray:\n", + " # For each input: calls guide.metric(input, agent(input), info) → float\n", + " # Returns np.array of shape (N,) or (N, num_samples)\n", + " # Aggregated via np.mean(scores)\n", + "\n", + "Key observations:\n", + " • All scores are numeric scalars\n", + " • Aggregation: np.mean() over all examples\n", + " • No support for Dict[str, float] scores\n", + "\n", + "Example — evaluate() returns: [0.9 0.85 0.95 0.7 0.88] (shape=(5,), dtype=float64)\n", + "Example — np.mean(scores): 0.8560 (single scalar used for selection)\n" + ] + } + ], + "source": [ + "print(\"\"\"\n", + "=== Current Evaluator Behavior ===\n", + "\n", + "def evaluate(agent, guide, inputs, infos, ...) 
-> np.ndarray:\n", + " # For each input: calls guide.metric(input, agent(input), info) → float\n", + " # Returns np.array of shape (N,) or (N, num_samples)\n", + " # Aggregated via np.mean(scores)\n", + "\n", + "Key observations:\n", + " • All scores are numeric scalars\n", + " • Aggregation: np.mean() over all examples\n", + " • No support for Dict[str, float] scores\n", + "\"\"\")\n", + "\n", + "# Simulate current evaluator\n", + "scores_array = np.array([0.9, 0.85, 0.95, 0.7, 0.88])\n", + "mean_score = np.mean(scores_array)\n", + "print(f\"Example — evaluate() returns: {scores_array} (shape={scores_array.shape}, dtype={scores_array.dtype})\")\n", + "print(f\"Example — np.mean(scores): {mean_score:.4f} (single scalar used for selection)\")" + ] + }, + { + "cell_type": "markdown", + "id": "7ab684f0", + "metadata": {}, + "source": [ + "### 1.3 Selection Points in Trainers" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "b8b0032f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== BasicSearchAlgorithm — Selection Logic ===\n", + "\n", + "File: opto/trainer/algorithms/basic_algorithms.py\n", + "Method: BasicSearchAlgorithm.optimizer_step()\n", + "\n", + " def validate():\n", + " scores = evaluate(self.agent, self.validate_guide, ...)\n", + " return np.mean(scores) if all([s is not None for s in scores]) else -np.inf\n", + " ^^^^^^^^^^^^^^^^\n", + " Returns: single float\n", + "\n", + " candidates.append((score, update_dict)) # score is float\n", + " candidates.append((self.current_score, backup_dict)) # include current\n", + "\n", + " best_score, best_update = max(candidates, key=lambda x: x[0])\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " SELECTION: scalar max — single metric only\n", + "\n", + ">>> This is the PRIMARY insertion point for multi-objective selection. <<<\n", + "\n", + "BasicSearch candidates: [(0.72, 'proposal_A'), (0.85, 'proposal_B'), (0.78, 'proposal_C'), (0.85, 'current_params')]\n", + "Selected (scalar max): score=0.85, params='proposal_B'\n", + "Note: Tie between proposal_B and current_params — max() picks first occurrence (proposal_B)\n" + ] + } + ], + "source": [ + "print(\"\"\"\n", + "=== BasicSearchAlgorithm — Selection Logic ===\n", + "\n", + "File: opto/trainer/algorithms/basic_algorithms.py\n", + "Method: BasicSearchAlgorithm.optimizer_step()\n", + "\n", + " def validate():\n", + " scores = evaluate(self.agent, self.validate_guide, ...)\n", + " return np.mean(scores) if all([s is not None for s in scores]) else -np.inf\n", + " ^^^^^^^^^^^^^^^^\n", + " Returns: single float\n", + "\n", + " candidates.append((score, update_dict)) # score is float\n", + " candidates.append((self.current_score, backup_dict)) # include current\n", + "\n", + " best_score, best_update = max(candidates, key=lambda x: x[0])\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " SELECTION: scalar max — single metric only\n", + "\n", + ">>> This is the PRIMARY insertion point for multi-objective selection. 
<<<\n", + "\"\"\")\n", + "\n", + "# Simulate current BasicSearch selection\n", + "candidates_basic = [\n", + " (0.72, \"proposal_A\"),\n", + " (0.85, \"proposal_B\"),\n", + " (0.78, \"proposal_C\"),\n", + " (0.85, \"current_params\"), # tie with proposal_B\n", + "]\n", + "best_score, best_update = max(candidates_basic, key=lambda x: x[0])\n", + "print(f\"BasicSearch candidates: {[(s, name) for s, name in candidates_basic]}\")\n", + "print(f\"Selected (scalar max): score={best_score}, params='{best_update}'\")\n", + "print(f\"Note: Tie between proposal_B and current_params — max() picks first occurrence (proposal_B)\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "8db5aa87", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== BeamsearchAlgorithm — Selection Logic ===\n", + "\n", + "File: opto/trainer/algorithms/beamsearch_algorithm.py\n", + "Method: BeamsearchAlgorithm.select()\n", + "\n", + " scored_candidates.append((validation_score, candidate_params)) # float\n", + "\n", + " sorted_candidates = sorted(scored_candidates, key=lambda x: x[0], reverse=True)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " SELECTION: scalar sort descending\n", + "\n", + " selected_candidates = sorted_candidates[:beam_width] # take top-k\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " Top-k by scalar score only\n", + "\n", + ">>> This is the SECONDARY insertion point for multi-objective selection. <<<\n", + "\n", + "Beamsearch candidates: [(0.72, 'candidate_1'), (0.91, 'candidate_2'), (0.85, 'candidate_3'), (0.91, 'candidate_4'), (0.78, 'candidate_5')]\n", + "Selected (top-3 by scalar): [(0.91, 'candidate_2'), (0.91, 'candidate_4'), (0.85, 'candidate_3')]\n", + "Note: Tie between candidate_2 and candidate_4 — sorted() preserves input order (stable sort)\n" + ] + } + ], + "source": [ + "print(\"\"\"\n", + "=== BeamsearchAlgorithm — Selection Logic ===\n", + "\n", + "File: opto/trainer/algorithms/beamsearch_algorithm.py\n", + "Method: BeamsearchAlgorithm.select()\n", + "\n", + " scored_candidates.append((validation_score, candidate_params)) # float\n", + "\n", + " sorted_candidates = sorted(scored_candidates, key=lambda x: x[0], reverse=True)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " SELECTION: scalar sort descending\n", + "\n", + " selected_candidates = sorted_candidates[:beam_width] # take top-k\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " Top-k by scalar score only\n", + "\n", + ">>> This is the SECONDARY insertion point for multi-objective selection. 
<<<\n", + "\"\"\")\n", + "\n", + "# Simulate current Beamsearch selection\n", + "candidates_beam = [\n", + " (0.72, \"candidate_1\"),\n", + " (0.91, \"candidate_2\"),\n", + " (0.85, \"candidate_3\"),\n", + " (0.91, \"candidate_4\"), # tie with candidate_2\n", + " (0.78, \"candidate_5\"),\n", + "]\n", + "beam_width = 3\n", + "sorted_candidates = sorted(candidates_beam, key=lambda x: x[0], reverse=True)\n", + "selected = sorted_candidates[:beam_width]\n", + "print(f\"Beamsearch candidates: {[(s, name) for s, name in candidates_beam]}\")\n", + "print(f\"Selected (top-{beam_width} by scalar): {[(s, name) for s, name in selected]}\")\n", + "print(f\"Note: Tie between candidate_2 and candidate_4 — sorted() preserves input order (stable sort)\")" + ] + }, + { + "cell_type": "markdown", + "id": "7119b4a4", + "metadata": {}, + "source": [ + "### 1.4 Summary: What's Missing\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fbf9e98b", + "metadata": {}, + "outputs": [], + "source": "print(\"\"\"\n=== Summary: Current Limitations ===\n\n1. Guide.metric() → float only (and stays float BY DESIGN)\n metric() will NOT be widened to return dicts.\n Dict scores flow through the NEW get_score_dict() path instead.\n\n2. evaluate() → np.array of floats\n Cannot aggregate per-metric means across examples.\n New evaluate_vector() will handle dict aggregation separately.\n\n3. BasicSearch: max(candidates, key=scalar)\n Cannot do weighted multi-metric selection or Pareto ranking\n\n4. Beamsearch: sorted(candidates, key=scalar)[:k]\n Cannot select top-k by Pareto dominance\n\n5. No ObjectiveConfig\n No way to declare minimize metrics, weights, or selection mode\n\n>>> All of the above will be addressed in M1-M2 without breaking existing behavior. <<<\n>>> Training loop (guide.__call__ → float) is NEVER modified. 
<<<\n\"\"\")" + }, + { + "cell_type": "markdown", + "id": "8e97b2fd", + "metadata": {}, + "source": [ + "---\n", + "## Part 2: Planned Behavior — Prototype\n", + "\n", + "The following cells implement the **planned multi-objective selection** as pure functions.\n", + "This is a standalone prototype (no OpenTrace dependency) demonstrating the exact behavior\n", + "that `opto/trainer/objectives.py` will provide.\n", + "\n", + "### 2.1 ObjectiveConfig" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bad5944d", + "metadata": {}, + "outputs": [], + "source": "print(\"\\n\" + \"=\" * 70)\nprint(\"PART 2: PLANNED BEHAVIOR — PROTOTYPE\")\nprint(\"=\" * 70)\n\n@dataclass(frozen=True)\nclass ObjectiveConfig:\n \"\"\"Configuration for multi-objective candidate selection.\"\"\"\n mode: str = \"scalar\" # \"scalar\", \"weighted\", \"pareto\"\n weights: Dict[str, float] = field(default_factory=dict)\n minimize: frozenset = field(default_factory=frozenset)\n missing_value: float = float(\"-inf\")\n pareto_metrics: Optional[Tuple[str, ...]] = None\n tie_break: str = \"weighted\" # \"weighted\", \"lexicographic\", \"random_seeded\"\n seed: int = 0\n\n def __post_init__(self):\n # Convert set → frozenset for true immutability + hashability\n if isinstance(self.minimize, set):\n object.__setattr__(self, 'minimize', frozenset(self.minimize))\n # Validate weights are non-negative\n for k, v in self.weights.items():\n if v < 0:\n raise ValueError(f\"Weight for '{k}' must be non-negative, got {v}\")\n # Validate pareto_metrics\n if self.pareto_metrics is not None and len(self.pareto_metrics) == 0:\n raise ValueError(\"pareto_metrics must be None (auto) or non-empty tuple\")\n\nprint(\"ObjectiveConfig defined with modes: scalar | weighted | pareto\")\nprint(f\"Default config: {ObjectiveConfig()}\")\n\n# Verify set → frozenset auto-conversion\nconfig_with_set = ObjectiveConfig(minimize={\"latency_s\"})\nprint(f\"minimize=set auto-converts: type={type(config_with_set.minimize).__name__}, value={config_with_set.minimize}\")" + }, + { + "cell_type": "markdown", + "id": "478f806d", + "metadata": {}, + "source": [ + "### 2.2 Core Utility Functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ed7c83c", + "metadata": {}, + "outputs": [], + "source": "# --- Score type aliases ---\nScoreLike = Union[int, float, bool, Dict[str, float]]\n\n\ndef normalize_score(score: ScoreLike) -> Dict[str, float]:\n \"\"\"Convert any score to dict form.\n \n - int/float/bool → {\"score\": float(value)}\n - Dict[str, float] → returned as-is (validated)\n \n Handles int (LLMJudge returns 0/1) and bool (test guides) explicitly.\n \"\"\"\n if isinstance(score, (bool, int, float)):\n # bool check must come before int since bool is subclass of int\n val = float(score)\n if not np.isfinite(val):\n raise ValueError(f\"Score must be finite, got {score}\")\n return {\"score\": val}\n elif isinstance(score, dict):\n if len(score) == 0:\n raise ValueError(\"Score dict must not be empty\")\n for k, v in score.items():\n if not isinstance(v, (int, float)) or not np.isfinite(float(v)):\n raise ValueError(f\"Score dict value for '{k}' must be finite float, got {v}\")\n return {k: float(v) for k, v in score.items()}\n else:\n raise TypeError(f\"Score must be int, float, bool, or Dict[str, float], got {type(score).__name__}\")\n\n\ndef apply_minimize(score_dict: Dict[str, float], minimize: set) -> Dict[str, float]:\n \"\"\"Negate values for minimize metrics (higher-is-better normalization).\"\"\"\n return 
{\n k: -v if k in minimize else v\n for k, v in score_dict.items()\n }\n\n\ndef weighted_scalarize(score_dict: Dict[str, float], weights: Dict[str, float],\n missing_value: float = float(\"-inf\")) -> float:\n \"\"\"Compute weighted sum. Empty weights → equal weight 1.0.\"\"\"\n if not weights:\n weights = {k: 1.0 for k in score_dict}\n total = 0.0\n for metric, weight in weights.items():\n value = score_dict.get(metric, missing_value)\n total += weight * value\n return total\n\n\ndef dominates(a: Dict[str, float], b: Dict[str, float],\n metrics: Optional[Tuple[str, ...]] = None) -> bool:\n \"\"\"Check if candidate 'a' Pareto-dominates candidate 'b'.\n \n a dominates b iff:\n - a[m] >= b[m] for ALL metrics m, AND\n - a[m] > b[m] for AT LEAST ONE metric m\n \"\"\"\n if metrics is None:\n metrics = tuple(sorted(set(a.keys()) | set(b.keys())))\n \n at_least_one_better = False\n for m in metrics:\n va = a.get(m, float(\"-inf\"))\n vb = b.get(m, float(\"-inf\"))\n if va < vb:\n return False # a is worse on this metric\n if va > vb:\n at_least_one_better = True\n return at_least_one_better\n\n\ndef pareto_rank(candidates: List[Dict[str, float]],\n metrics: Optional[Tuple[str, ...]] = None) -> List[int]:\n \"\"\"Assign Pareto rank (0 = non-dominated front).\"\"\"\n n = len(candidates)\n ranks = [0] * n\n assigned = [False] * n\n current_rank = 0\n\n remaining = set(range(n))\n while remaining:\n # Find non-dominated set among remaining\n front = []\n for i in remaining:\n dominated = False\n for j in remaining:\n if i != j and dominates(candidates[j], candidates[i], metrics):\n dominated = True\n break\n if not dominated:\n front.append(i)\n\n for i in front:\n ranks[i] = current_rank\n remaining.remove(i)\n current_rank += 1\n\n return ranks\n\n\ndef select_best(candidates: List[Tuple[ScoreLike, any]],\n config: Optional[ObjectiveConfig] = None) -> int:\n \"\"\"Select the single best candidate index.\"\"\"\n if config is None or config.mode == \"scalar\":\n # Backward-compatible: scalar max\n scores = []\n for score, _ in candidates:\n if isinstance(score, dict):\n scores.append(np.mean(list(score.values())))\n else:\n scores.append(float(score))\n return int(np.argmax(scores))\n\n # Normalize all scores to dict\n score_dicts = [normalize_score(s) for s, _ in candidates]\n\n # Apply minimize\n score_dicts = [apply_minimize(sd, config.minimize) for sd in score_dicts]\n\n if config.mode == \"weighted\":\n weighted_scores = [weighted_scalarize(sd, config.weights, config.missing_value) for sd in score_dicts]\n return int(np.argmax(weighted_scores))\n\n elif config.mode == \"pareto\":\n ranks = pareto_rank(score_dicts, config.pareto_metrics)\n # Get indices of rank-0 (Pareto front)\n front_indices = [i for i, r in enumerate(ranks) if r == 0]\n\n if len(front_indices) == 1:\n return front_indices[0]\n\n # Tie-break among front\n if config.tie_break == \"weighted\":\n front_scores = [weighted_scalarize(score_dicts[i], config.weights, config.missing_value)\n for i in front_indices]\n return front_indices[int(np.argmax(front_scores))]\n elif config.tie_break == \"lexicographic\":\n metrics = sorted(score_dicts[front_indices[0]].keys())\n def lex_key(idx):\n return tuple(score_dicts[idx].get(m, config.missing_value) for m in metrics)\n return max(front_indices, key=lex_key)\n elif config.tie_break == \"random_seeded\":\n rng = np.random.RandomState(config.seed)\n return front_indices[rng.randint(len(front_indices))]\n\n raise ValueError(f\"Unknown mode: {config.mode}\")\n\n\ndef select_top_k(candidates: 
List[Tuple[ScoreLike, any]],\n config: Optional[ObjectiveConfig] = None,\n k: int = 1) -> List[int]:\n \"\"\"Select the top-k candidate indices.\"\"\"\n if config is None or config.mode == \"scalar\":\n scores = []\n for score, _ in candidates:\n if isinstance(score, dict):\n scores.append(np.mean(list(score.values())))\n else:\n scores.append(float(score))\n return list(np.argsort(scores)[::-1][:k])\n\n score_dicts = [normalize_score(s) for s, _ in candidates]\n score_dicts = [apply_minimize(sd, config.minimize) for sd in score_dicts]\n\n if config.mode == \"weighted\":\n weighted_scores = [weighted_scalarize(sd, config.weights, config.missing_value) for sd in score_dicts]\n return list(np.argsort(weighted_scores)[::-1][:k])\n\n elif config.mode == \"pareto\":\n ranks = pareto_rank(score_dicts, config.pareto_metrics)\n # Collect by rank, then tie-break within each rank\n result = []\n max_rank = max(ranks)\n for rank in range(max_rank + 1):\n rank_indices = [i for i, r in enumerate(ranks) if r == rank]\n # Sort within rank by tie-break\n if config.tie_break == \"weighted\":\n rank_indices.sort(\n key=lambda i: weighted_scalarize(score_dicts[i], config.weights, config.missing_value),\n reverse=True\n )\n elif config.tie_break == \"lexicographic\":\n metrics = sorted(score_dicts[rank_indices[0]].keys()) if rank_indices else []\n rank_indices.sort(\n key=lambda i: tuple(score_dicts[i].get(m, config.missing_value) for m in metrics),\n reverse=True\n )\n elif config.tie_break == \"random_seeded\":\n rng = np.random.RandomState(config.seed + rank)\n rng.shuffle(rank_indices)\n result.extend(rank_indices)\n if len(result) >= k:\n break\n return result[:k]\n\n raise ValueError(f\"Unknown mode: {config.mode}\")\n\n\nprint(\"Core utility functions defined:\")\nprint(\" \\u2022 normalize_score() — handles float, int, bool, and dict\")\nprint(\" \\u2022 apply_minimize()\")\nprint(\" \\u2022 weighted_scalarize()\")\nprint(\" \\u2022 dominates()\")\nprint(\" \\u2022 pareto_rank()\")\nprint(\" \\u2022 select_best()\")\nprint(\" \\u2022 select_top_k()\")" + }, + { + "cell_type": "markdown", + "id": "6233d2c7", + "metadata": {}, + "source": [ + "### 2.3 Validation: normalize_score()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25003e79", + "metadata": {}, + "outputs": [], + "source": "print(\"\\n--- normalize_score() examples ---\")\nprint(f\" normalize_score(0.85) = {normalize_score(0.85)}\")\nprint(f\" normalize_score({{'acc': 0.9, 'lat': 50}}) = {normalize_score({'acc': 0.9, 'lat': 50})}\")\n\n# int and bool edge cases (LLMJudge returns int 0/1, test guides return bool)\nprint(f\"\\n --- int / bool edge cases ---\")\nprint(f\" normalize_score(1) = {normalize_score(1)} # LLMJudge returns int 0/1\")\nprint(f\" normalize_score(0) = {normalize_score(0)} # LLMJudge incorrect → int 0\")\nprint(f\" normalize_score(True) = {normalize_score(True)} # test guide correct → bool\")\nprint(f\" normalize_score(False) = {normalize_score(False)} # test guide incorrect → bool\")\n\n# Error edge cases\nprint(f\"\\n --- Error edge cases ---\")\ntry:\n normalize_score({})\nexcept ValueError as e:\n print(f\" normalize_score({{}}) → ValueError: {e}\")\n\ntry:\n normalize_score(\"bad\")\nexcept TypeError as e:\n print(f\" normalize_score('bad') → TypeError: {e}\")" + }, + { + "cell_type": "markdown", + "id": "a5c0fef1", + "metadata": {}, + "source": [ + "### 2.4 Validation: apply_minimize() + weighted_scalarize()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "b9e31bec", + 
"metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--- apply_minimize() examples ---\n", + " Original: {'accuracy': 0.9, 'latency_ms': 120.0, 'cost': 0.05}\n", + " Minimize: {'latency_ms', 'cost'}\n", + " Result: {'accuracy': 0.9, 'latency_ms': -120.0, 'cost': -0.05}\n", + " (latency and cost negated → higher-is-better)\n", + "\n", + "--- weighted_scalarize() examples ---\n", + " Score (normalized): {'accuracy': 0.9, 'latency_ms': -120.0, 'cost': -0.05}\n", + " Weights: {'accuracy': 0.6, 'latency_ms': 0.3, 'cost': 0.1}\n", + " Weighted sum: -35.4650\n", + " = 0.6*0.9 + 0.3*(-120.0) + 0.1*(-0.05) = -35.4650\n" + ] + } + ], + "source": [ + "print(\"\\n--- apply_minimize() examples ---\")\n", + "score = {\"accuracy\": 0.9, \"latency_ms\": 120.0, \"cost\": 0.05}\n", + "minimized = apply_minimize(score, minimize={\"latency_ms\", \"cost\"})\n", + "print(f\" Original: {score}\")\n", + "print(f\" Minimize: {{'latency_ms', 'cost'}}\")\n", + "print(f\" Result: {minimized}\")\n", + "print(f\" (latency and cost negated → higher-is-better)\")\n", + "\n", + "print(\"\\n--- weighted_scalarize() examples ---\")\n", + "weights = {\"accuracy\": 0.6, \"latency_ms\": 0.3, \"cost\": 0.1}\n", + "ws = weighted_scalarize(minimized, weights)\n", + "print(f\" Score (normalized): {minimized}\")\n", + "print(f\" Weights: {weights}\")\n", + "print(f\" Weighted sum: {ws:.4f}\")\n", + "print(f\" = 0.6*0.9 + 0.3*(-120.0) + 0.1*(-0.05) = {0.6*0.9 + 0.3*(-120.0) + 0.1*(-0.05):.4f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f1725c01", + "metadata": {}, + "source": [ + "### 2.5 Demonstration: Weighted vs Pareto Selection\n", + "\n", + "We create 6 candidates with realistic multi-metric scores to show how weighted and Pareto selection differ." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "d3023945", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "DEMONSTRATION: WEIGHTED vs PARETO SELECTION\n", + "======================================================================\n", + "\n", + "Candidate Scores:\n", + " Name Accuracy Latency (s)\n", + " --------------- ---------- ------------\n", + " candidate_A 0.95 0.200\n", + " candidate_B 0.70 0.030\n", + " candidate_C 0.88 0.080\n", + " candidate_D 0.92 0.150\n", + " candidate_E 0.60 0.020\n", + " candidate_F 0.85 0.085\n", + "\n", + "--- Mode: SCALAR (baseline) ---\n", + " Selection: mean of dict values → max\n", + " Winner: candidate_A (index 0)\n", + " Score: {'accuracy': 0.95, 'latency_s': 0.2}\n", + " Note: This is the CURRENT behavior — treats multi-metric as mean scalar.\n", + "\n", + "--- Mode: WEIGHTED (accuracy-heavy) ---\n", + " Weights: accuracy=0.8, latency_s=0.2 (minimized)\n", + " Winner: candidate_A (index 0)\n", + " Weighted score: 0.7200\n", + "\n", + "--- Mode: WEIGHTED (latency-heavy) ---\n", + " Weights: accuracy=0.2, latency_s=0.8 (minimized)\n", + " Winner: candidate_B (index 1)\n", + " Weighted score: 0.1160\n", + "\n", + " >>> Changing weights changes the winner!\n", + " >>> Accuracy-heavy → candidate_A, Latency-heavy → candidate_B\n", + "\n", + "--- Mode: PARETO ---\n", + "\n", + " Pareto Ranking (after minimize normalization):\n", + " Name Accuracy Neg Latency Pareto Rank\n", + " --------------- ---------- ------------ ------------\n", + " candidate_A 0.95 -0.200 0\n", + " candidate_B 0.70 -0.030 0\n", + " candidate_C 0.88 -0.080 0\n", + " candidate_D 0.92 -0.150 0\n", + " candidate_E 0.60 -0.020 0\n", + " candidate_F 0.85 -0.085 1\n", + "\n", + " Pareto Front (Rank 0): ['candidate_A', 'candidate_B', 'candidate_C', 'candidate_D', 'candidate_E']\n", + " These candidates represent TRADEOFFS — none is dominated by another.\n", + "\n", + " After tie-break (weighted, weights={acc: 0.5, lat: 0.5}):\n", + " Winner: candidate_C (index 2)\n", + "\n", + "--- Mode: PARETO (top-k for Beamsearch, k=3) ---\n", + " Selected top-3:\n", + " #1: candidate_C (Pareto rank 0, scores: {'accuracy': 0.88, 'latency_s': 0.08})\n", + " #2: candidate_D (Pareto rank 0, scores: {'accuracy': 0.92, 'latency_s': 0.15})\n", + " #3: candidate_A (Pareto rank 0, scores: {'accuracy': 0.95, 'latency_s': 0.2})\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"DEMONSTRATION: WEIGHTED vs PARETO SELECTION\")\n", + "print(\"=\" * 70)\n", + "\n", + "# 6 candidates with accuracy (higher=better) and latency_s (lower=better)\n", + "# Using latency_s (seconds, 0-1 scale) so metrics are comparable and weight changes matter\n", + "candidates = [\n", + " ({\"accuracy\": 0.95, \"latency_s\": 0.200}, \"candidate_A\"), # high accuracy, high latency\n", + " ({\"accuracy\": 0.70, \"latency_s\": 0.030}, \"candidate_B\"), # low accuracy, low latency\n", + " ({\"accuracy\": 0.88, \"latency_s\": 0.080}, \"candidate_C\"), # balanced\n", + " ({\"accuracy\": 0.92, \"latency_s\": 0.150}, \"candidate_D\"), # good accuracy, moderate latency\n", + " ({\"accuracy\": 0.60, \"latency_s\": 0.020}, \"candidate_E\"), # lowest accuracy, fastest\n", + " ({\"accuracy\": 0.85, \"latency_s\": 0.085}, \"candidate_F\"), # similar to C\n", + "]\n", + "\n", + "print(\"\\nCandidate Scores:\")\n", + "print(f\" {'Name':<15} {'Accuracy':>10} {'Latency 
(s)':>12}\")\n", + "print(f\" {'-'*15} {'-'*10} {'-'*12}\")\n", + "for score, name in candidates:\n", + " print(f\" {name:<15} {score['accuracy']:>10.2f} {score['latency_s']:>12.3f}\")\n", + "\n", + "# --- Scalar mode (baseline) ---\n", + "print(\"\\n--- Mode: SCALAR (baseline) ---\")\n", + "config_scalar = ObjectiveConfig(mode=\"scalar\")\n", + "best_idx = select_best(candidates, config_scalar)\n", + "print(f\" Selection: mean of dict values → max\")\n", + "print(f\" Winner: {candidates[best_idx][1]} (index {best_idx})\")\n", + "print(f\" Score: {candidates[best_idx][0]}\")\n", + "print(f\" Note: This is the CURRENT behavior — treats multi-metric as mean scalar.\")\n", + "\n", + "# --- Weighted mode: accuracy-heavy ---\n", + "print(\"\\n--- Mode: WEIGHTED (accuracy-heavy) ---\")\n", + "config_weighted_acc = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.8, \"latency_s\": 0.2},\n", + " minimize=frozenset({\"latency_s\"})\n", + ")\n", + "best_idx = select_best(candidates, config_weighted_acc)\n", + "print(f\" Weights: accuracy=0.8, latency_s=0.2 (minimized)\")\n", + "print(f\" Winner: {candidates[best_idx][1]} (index {best_idx})\")\n", + "score_dict = apply_minimize(candidates[best_idx][0], config_weighted_acc.minimize)\n", + "ws = weighted_scalarize(score_dict, config_weighted_acc.weights)\n", + "print(f\" Weighted score: {ws:.4f}\")\n", + "\n", + "# --- Weighted mode: latency-heavy ---\n", + "print(\"\\n--- Mode: WEIGHTED (latency-heavy) ---\")\n", + "config_weighted_lat = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.2, \"latency_s\": 0.8},\n", + " minimize=frozenset({\"latency_s\"})\n", + ")\n", + "best_idx_lat = select_best(candidates, config_weighted_lat)\n", + "print(f\" Weights: accuracy=0.2, latency_s=0.8 (minimized)\")\n", + "print(f\" Winner: {candidates[best_idx_lat][1]} (index {best_idx_lat})\")\n", + "score_dict_lat = apply_minimize(candidates[best_idx_lat][0], config_weighted_lat.minimize)\n", + "ws_lat = weighted_scalarize(score_dict_lat, config_weighted_lat.weights)\n", + "print(f\" Weighted score: {ws_lat:.4f}\")\n", + "\n", + "print(f\"\\n >>> Changing weights changes the winner!\")\n", + "print(f\" >>> Accuracy-heavy → {candidates[best_idx][1]}, Latency-heavy → {candidates[best_idx_lat][1]}\")\n", + "\n", + "# --- Pareto mode ---\n", + "print(\"\\n--- Mode: PARETO ---\")\n", + "config_pareto = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5}, # used for tie-breaking\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"weighted\",\n", + " seed=42\n", + ")\n", + "\n", + "# Show full Pareto ranking\n", + "score_dicts_norm = [apply_minimize(normalize_score(s), config_pareto.minimize) for s, _ in candidates]\n", + "ranks = pareto_rank(score_dicts_norm)\n", + "\n", + "print(f\"\\n Pareto Ranking (after minimize normalization):\")\n", + "print(f\" {'Name':<15} {'Accuracy':>10} {'Neg Latency':>12} {'Pareto Rank':>12}\")\n", + "print(f\" {'-'*15} {'-'*10} {'-'*12} {'-'*12}\")\n", + "for i, ((score, name), rank) in enumerate(zip(candidates, ranks)):\n", + " nd = score_dicts_norm[i]\n", + " print(f\" {name:<15} {nd['accuracy']:>10.2f} {nd['latency_s']:>12.3f} {rank:>12}\")\n", + "\n", + "front_indices = [i for i, r in enumerate(ranks) if r == 0]\n", + "print(f\"\\n Pareto Front (Rank 0): {[candidates[i][1] for i in front_indices]}\")\n", + "print(f\" These candidates represent TRADEOFFS — none is dominated by another.\")\n", + "\n", + "best_idx_pareto = 
select_best(candidates, config_pareto)\n", + "print(f\"\\n After tie-break (weighted, weights={{acc: 0.5, lat: 0.5}}):\")\n", + "print(f\" Winner: {candidates[best_idx_pareto][1]} (index {best_idx_pareto})\")\n", + "\n", + "# --- Top-k selection (Beamsearch simulation) ---\n", + "print(\"\\n--- Mode: PARETO (top-k for Beamsearch, k=3) ---\")\n", + "top_k_indices = select_top_k(candidates, config_pareto, k=3)\n", + "print(f\" Selected top-3:\")\n", + "for rank_pos, idx in enumerate(top_k_indices):\n", + " r = ranks[idx]\n", + " print(f\" #{rank_pos+1}: {candidates[idx][1]} (Pareto rank {r}, scores: {candidates[idx][0]})\")" + ] + }, + { + "cell_type": "markdown", + "id": "c1bdf524", + "metadata": {}, + "source": [ + "### 2.6 Deterministic Tie-Break Validation" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "dc6ea71d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "DETERMINISTIC TIE-BREAK VALIDATION\n", + "======================================================================\n", + "\n", + "--- Repeated runs with seed=42 ---\n", + " 10 runs with seed=42: indices = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]\n", + " All identical: True ✓\n", + "\n", + "--- Different seeds with random_seeded tie-break ---\n", + " seed= 0: winner = candidate_E (index 4)\n", + " seed= 1: winner = candidate_D (index 3)\n", + " seed= 2: winner = candidate_A (index 0)\n", + " seed=42: winner = candidate_D (index 3)\n", + " seed=99: winner = candidate_B (index 1)\n", + "\n", + "--- Determinism check for random_seeded (seed=42, 10 runs) ---\n", + " 10 runs: indices = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]\n", + " All identical: True ✓\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"DETERMINISTIC TIE-BREAK VALIDATION\")\n", + "print(\"=\" * 70)\n", + "\n", + "# Run selection 10 times with same seed — must produce identical results\n", + "print(\"\\n--- Repeated runs with seed=42 ---\")\n", + "results = []\n", + "for run in range(10):\n", + " config = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"weighted\",\n", + " seed=42\n", + " )\n", + " idx = select_best(candidates, config)\n", + " results.append(idx)\n", + "\n", + "all_same = len(set(results)) == 1\n", + "print(f\" 10 runs with seed=42: indices = {results}\")\n", + "print(f\" All identical: {all_same} ✓\" if all_same else f\" NOT identical: FAIL ✗\")\n", + "\n", + "# Different seed should potentially give different tie-break (if random_seeded)\n", + "print(\"\\n--- Different seeds with random_seeded tie-break ---\")\n", + "for seed in [0, 1, 2, 42, 99]:\n", + " config = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"random_seeded\",\n", + " seed=seed\n", + " )\n", + " idx = select_best(candidates, config)\n", + " print(f\" seed={seed:>2}: winner = {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Verify same seed is deterministic for random_seeded too\n", + "print(\"\\n--- Determinism check for random_seeded (seed=42, 10 runs) ---\")\n", + "results_random = []\n", + "for _ in range(10):\n", + " config = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " 
tie_break=\"random_seeded\",\n", + " seed=42\n", + " )\n", + " idx = select_best(candidates, config)\n", + " results_random.append(idx)\n", + "all_same_random = len(set(results_random)) == 1\n", + "print(f\" 10 runs: indices = {results_random}\")\n", + "print(f\" All identical: {all_same_random} ✓\" if all_same_random else f\" NOT identical: FAIL ✗\")" + ] + }, + { + "cell_type": "markdown", + "id": "3cc966d2", + "metadata": {}, + "source": [ + "### 2.7 Edge Cases" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4545dc3", + "metadata": {}, + "outputs": [], + "source": "print(\"\\n\" + \"=\" * 70)\nprint(\"EDGE CASES\")\nprint(\"=\" * 70)\n\n# Single-metric dict\nprint(\"\\n--- Single-metric dict with Pareto mode ---\")\nsingle_metric_candidates = [\n ({\"accuracy\": 0.9}, \"A\"),\n ({\"accuracy\": 0.8}, \"B\"),\n ({\"accuracy\": 0.95}, \"C\"),\n]\nconfig_single = ObjectiveConfig(mode=\"pareto\", tie_break=\"weighted\")\nbest = select_best(single_metric_candidates, config_single)\nprint(f\" Candidates: {[s for s, _ in single_metric_candidates]}\")\nprint(f\" Winner: {single_metric_candidates[best][1]} (index {best})\")\nprint(f\" Note: Pareto with 1 metric degenerates to scalar max — expected behavior.\")\n\n# Mixed float and dict\nprint(\"\\n--- Backward compat: float scores with ObjectiveConfig ---\")\nfloat_candidates = [\n (0.85, \"A\"),\n (0.92, \"B\"),\n (0.78, \"C\"),\n]\nconfig_float = ObjectiveConfig(mode=\"weighted\", weights={\"score\": 1.0})\nbest_float = select_best(float_candidates, config_float)\nprint(f\" Float candidates: {[s for s, _ in float_candidates]}\")\nprint(f\" Winner: {float_candidates[best_float][1]} (score={float_candidates[best_float][0]})\")\nprint(f\" Note: Floats normalized to {{'score': val}} — backward-compatible.\")\n\n# None config (pure backward compatibility)\nprint(\"\\n--- None config (current behavior) ---\")\nbest_none = select_best(float_candidates, None)\nprint(f\" config=None → scalar max → {float_candidates[best_none][1]} (score={float_candidates[best_none][0]})\")\nprint(f\" Identical to current max(candidates, key=lambda x: x[0])\")\n\n# Negative weight validation\nprint(\"\\n--- Negative weight validation ---\")\ntry:\n ObjectiveConfig(weights={\"accuracy\": 0.8, \"latency_s\": -0.2})\nexcept ValueError as e:\n print(f\" ObjectiveConfig(weights={{..., 'latency_s': -0.2}}) → ValueError: {e}\")\n print(f\" Note: Use minimize={{'latency_s'}} instead of negative weights.\")\n\n# Empty pareto_metrics validation\nprint(\"\\n--- Empty pareto_metrics validation ---\")\ntry:\n ObjectiveConfig(pareto_metrics=())\nexcept ValueError as e:\n print(f\" ObjectiveConfig(pareto_metrics=()) → ValueError: {e}\")\n print(f\" Note: Use None (auto-detect) or a non-empty tuple of metric names.\")" + }, + { + "cell_type": "markdown", + "id": "b510fdc9", + "metadata": {}, + "source": [ + "### 2.8 Visual Summary: Selection Behavior Comparison" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "c9abcad1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "SELECTION BEHAVIOR COMPARISON\n", + "======================================================================\n", + "\n", + " Mode Winner Reasoning\n", + " ------------------------- --------------- --------------------------------------------------\n", + " scalar (baseline) candidate_A mean of dict values → max\n", + " weighted (acc=0.8) 
candidate_A weighted sum with {'accuracy': 0.8, 'latency_s': 0.2}\n", + " weighted (lat=0.8) candidate_B weighted sum with {'accuracy': 0.2, 'latency_s': 0.8}\n", + " pareto (tie=weighted) candidate_C rank-0 front, tie-break=weighted\n", + "\n", + " >>> Different modes select different candidates from the SAME pool.\n", + " >>> This is exactly the behavior objectives.py will provide to trainers.\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"SELECTION BEHAVIOR COMPARISON\")\n", + "print(\"=\" * 70)\n", + "\n", + "print(f\"\\n {'Mode':<25} {'Winner':<15} {'Reasoning'}\")\n", + "print(f\" {'-'*25} {'-'*15} {'-'*50}\")\n", + "\n", + "modes = [\n", + " (\"scalar (baseline)\", config_scalar),\n", + " (\"weighted (acc=0.8)\", config_weighted_acc),\n", + " (\"weighted (lat=0.8)\", config_weighted_lat),\n", + " (\"pareto (tie=weighted)\", config_pareto),\n", + "]\n", + "\n", + "for mode_name, config in modes:\n", + " idx = select_best(candidates, config)\n", + " name = candidates[idx][1]\n", + " if config.mode == \"scalar\":\n", + " reason = \"mean of dict values → max\"\n", + " elif config.mode == \"weighted\":\n", + " reason = f\"weighted sum with {dict(config.weights)}\"\n", + " elif config.mode == \"pareto\":\n", + " reason = f\"rank-0 front, tie-break={config.tie_break}\"\n", + " print(f\" {mode_name:<25} {name:<15} {reason}\")\n", + "\n", + "print(f\"\\n >>> Different modes select different candidates from the SAME pool.\")\n", + "print(f\" >>> This is exactly the behavior objectives.py will provide to trainers.\")" + ] + }, + { + "cell_type": "markdown", + "id": "3f1ed487", + "metadata": {}, + "source": "---\n## Part 3: Architecture Summary\n\n### Two separate data paths (by design)\n\nThe training loop and selection path are **intentionally separate**. `guide.__call__()` / `get_feedback()` return type is NOT widened — the training loop always receives `float`.\n\n```\nTRAINING LOOP (unchanged):\n guide(x, target.data, info) → (float, str)\n │\n └── score (float) → np.mean(scores) → optimizer backward\n Always float. Never dict. Training loop is completely safe.\n\nSELECTION PATH (new):\n guide.get_score_dict(query, response, reference) → Dict[str, float]\n │\n ▼\n evaluate_vector() → List[Dict[str, float]] (one dict per example)\n │\n ▼\n aggregate_vector_scores() → Dict[str, float] (mean per metric)\n │\n ▼\n objectives.py (select_best / select_top_k)\n │\n ├── mode=\"scalar\" → max(mean_scores) ← unchanged\n ├── mode=\"weighted\" → max(weighted_scalarize()) ← new\n └── mode=\"pareto\" → pareto_rank() + tie-break ← new\n```\n\n**Key safety invariant:** `metric()` always returns `float`. If a guide's `get_feedback()` returns a dict as the score, `metric()` collapses it via `mean(values)`. 
Dict scores are only accessible through `get_score_dict()`.\n\n### Files to create/modify\n\n| Action | File | Milestone |\n|--------|------|-----------|\n| CREATE | `opto/trainer/objectives.py` | M1 |\n| MODIFY | `opto/trainer/guide.py` — add `get_score_dict()`, update `metric()` to collapse dicts to float | M1 |\n| MODIFY | `opto/trainer/evaluators.py` — add `evaluate_vector()`, `aggregate_vector_scores()` | M1 |\n| MODIFY | `basic_algorithms.py` | M1-M2 |\n| MODIFY | `beamsearch_algorithm.py` | M2 |\n| OPTIONAL | `priority_search.py` | M2 |" + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3e97bc57", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "M0 ANALYSIS COMPLETE\n", + "======================================================================\n", + "\n", + "Deliverables verified:\n", + " ✓ Current Guide score contract documented (Tuple[float, str])\n", + " ✓ Scalar selection points identified (BasicSearch max, Beamsearch sorted[:k])\n", + " ✓ Weighted selection produces different results with different weights\n", + " ✓ Pareto selection returns non-dominated tradeoff set\n", + " ✓ Deterministic tie-break verified (same seed → same result, 10 runs)\n", + " ✓ Edge cases validated (empty dict, single metric, float compat, None config)\n", + " ✓ Architecture summary with file list and data flow\n", + "\n", + "See docs/T6_technical_plan.md for the complete refined technical plan.\n", + "\n" + ] + } + ], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"M0 ANALYSIS COMPLETE\")\n", + "print(\"=\" * 70)\n", + "print(\"\"\"\n", + "Deliverables verified:\n", + " ✓ Current Guide score contract documented (Tuple[float, str])\n", + " ✓ Scalar selection points identified (BasicSearch max, Beamsearch sorted[:k])\n", + " ✓ Weighted selection produces different results with different weights\n", + " ✓ Pareto selection returns non-dominated tradeoff set\n", + " ✓ Deterministic tie-break verified (same seed → same result, 10 runs)\n", + " ✓ Edge cases validated (empty dict, single metric, float compat, None config)\n", + " ✓ Architecture summary with file list and data flow\n", + "\n", + "See docs/T6_technical_plan.md for the complete refined technical plan.\n", + "\"\"\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file From 3b2a0b29c0b3ba9cea64675340626f74b4f3c4c7 Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Tue, 10 Feb 2026 17:20:55 -0400 Subject: [PATCH 2/5] T6 M0: Apply Xavier's review fixes (paths, dates, motivation, real LLM required) --- examples/notebooks/t6_m0_analysis.ipynb | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-) diff --git a/examples/notebooks/t6_m0_analysis.ipynb b/examples/notebooks/t6_m0_analysis.ipynb index 90eefcad..2549d76a 100644 --- a/examples/notebooks/t6_m0_analysis.ipynb +++ b/examples/notebooks/t6_m0_analysis.ipynb @@ -35,22 +35,7 @@ "cell_type": "markdown", "id": "b1a58d26", "metadata": {}, - "source": [ - "# T6 
Multi-Objective Vector Scores — M0 Analysis\n", - "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)\n", - "\n", - "**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n", - "\n", - "This notebook demonstrates:\n", - "1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n", - "2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n", - "3. **Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n", - "\n", - "**No API keys required.** All examples use deterministic dummy data.\n", - "\n", - "---" - ] + "source": "# T6 Multi-Objective Vector Scores — M0 Analysis\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m0_analysis.ipynb)\n\n**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n\nThis notebook demonstrates:\n1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n3. **Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n\n**Motivation (why score-as-dict):** adding extra metrics into the *feedback dict/text* can help optimizers (OptoPrime/OPRO), but trainers typically only use the scalar score for ranking/UCB and ignore additional feedback structure. To enable Pareto/weighted multi-objective selection at the trainer level, we need vector score (score-as-dict) with backward-compatible scalar reduction.\n\n**No API keys required for M0.** All examples use deterministic dummy data. (From M1 onward, milestone notebooks must validate both StubLLM and real LLM modes.)\n\n---" }, { "cell_type": "markdown", @@ -405,7 +390,7 @@ "id": "fbf9e98b", "metadata": {}, "outputs": [], - "source": "print(\"\"\"\n=== Summary: Current Limitations ===\n\n1. Guide.metric() → float only (and stays float BY DESIGN)\n metric() will NOT be widened to return dicts.\n Dict scores flow through the NEW get_score_dict() path instead.\n\n2. evaluate() → np.array of floats\n Cannot aggregate per-metric means across examples.\n New evaluate_vector() will handle dict aggregation separately.\n\n3. BasicSearch: max(candidates, key=scalar)\n Cannot do weighted multi-metric selection or Pareto ranking\n\n4. Beamsearch: sorted(candidates, key=scalar)[:k]\n Cannot select top-k by Pareto dominance\n\n5. No ObjectiveConfig\n No way to declare minimize metrics, weights, or selection mode\n\n>>> All of the above will be addressed in M1-M2 without breaking existing behavior. <<<\n>>> Training loop (guide.__call__ → float) is NEVER modified. <<<\n\"\"\")" + "source": "print(\"\"\"\n=== Summary: Current Limitations ===\n\n0. Extra metrics in feedback are not usable by trainers today.\n Trainers typically rank/UCB using only the scalar score, and do not inspect feedback structure.\n\n1. Guide.metric() → float only (and stays float BY DESIGN)\n metric() will NOT be widened to return dicts.\n Dict scores flow through the NEW get_score_dict() path instead.\n\n2. evaluate() → np.array of floats\n Cannot aggregate per-metric means across examples.\n New evaluate_vector() will handle dict aggregation separately.\n\n3. 
BasicSearch: max(candidates, key=scalar)\n Cannot do weighted multi-metric selection or Pareto ranking\n\n4. Beamsearch: sorted(candidates, key=scalar)[:k]\n Cannot select top-k by Pareto dominance\n\n5. No ObjectiveConfig\n No way to declare minimize metrics, weights, or selection mode\n\n>>> All of the above will be addressed in M1-M2 without breaking existing behavior. <<<\n>>> Training loop (guide.__call__ → float) is NEVER modified. <<<\n\"\"\")" }, { "cell_type": "markdown", From 249bde6e187b6fc211a344fd328f6bcd35f92ff7 Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Tue, 10 Feb 2026 17:22:59 -0400 Subject: [PATCH 3/5] T6 M0: Apply Xavier's review fixes to technical plan --- docs/T6_technical_plan.md | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/docs/T6_technical_plan.md b/docs/T6_technical_plan.md index e37c0c8c..87f3e764 100644 --- a/docs/T6_technical_plan.md +++ b/docs/T6_technical_plan.md @@ -2,7 +2,7 @@ **Version:** 1.0 (Refined) **Author:** Carlos Rodriguez -**Date:** February 9, 2025 +**Date:** February 9, 2026 **Status:** M0 Deliverable — Analysis + Architecture + Interface Spec **Target repos / branches:** @@ -32,6 +32,9 @@ Today, trainer selection in Trace is driven by a **single scalar score**. Guides return `Tuple[float, str]` via `get_feedback()`, evaluators produce `np.array` of floats, and trainers (`BasicSearchAlgorithm`, `BeamsearchAlgorithm`) select candidates via scalar comparison (`max(candidates, key=lambda x: x[0])` and `sorted(..., key=lambda x: x[0])` respectively). This blocks trainer-side search from exploiting multiple metrics like `{accuracy, latency_ms, cost}`. +**Motivation note (from team discussion):** +Putting multiple metrics into the *feedback dict/text* is useful for optimizers (OptoPrime/OPRO), but trainers (BasicSearch/UCB/PrioritySearch/GEPA) typically only inspect the **scalar score** for ranking/UCB and ignore additional feedback structure. Therefore, enabling **vector score / score-as-dict** (with backward-compatible scalar reduction) is required for multi-objective trainer selection. + ### What this plan adds | Component | Change | @@ -516,7 +519,7 @@ def select_top_k(candidates: List[Tuple[ScoreLike, any]], | File | Change | Milestone | |------|--------|-----------| -| `opto/trainer/guide.py` | Add `get_score_dict()` method to `Guide` base class. Update `metric()` to collapse dict scores to `float` via `mean(values)` (return type stays `float`). | M1 | +| `opto/trainer/guide.py` | Add `get_score_dict()` method to `Guide` base class. Keep training loop scalar-safe (`metric()` returns `float`). Dict/vector scores are accessed via `get_score_dict()` for trainer-side selection. | M1 | | `opto/trainer/evaluators.py` | Add `evaluate_vector()` and `aggregate_vector_scores()`. Existing `evaluate()` unchanged. | M1 | | `opto/trainer/algorithms/basic_algorithms.py` | Add `objective_config` param to `BasicSearchAlgorithm.train()`. Replace `max(candidates, ...)` with `select_best()` in `optimizer_step()`. | M1 (minimal) / M2 (robust) | | `opto/trainer/algorithms/beamsearch_algorithm.py` | Add `objective_config` param to `BeamsearchAlgorithm.train()`. Replace scalar sort in `select()` with `select_top_k()`. 
| M2 | @@ -592,7 +595,7 @@ This `score` flows into `MinibatchAlgorithm.update()` where `np.mean(scores)` is | Constraint | Enforcement | |-----------|-------------| | `guide.__call__()` / `get_feedback()` return type is **NOT widened** | No changes to `get_feedback()` signature; it still returns `Tuple[float, str]` | -| Training loop always receives scalar `score` | `metric()` always returns `float` (collapses dict via `mean(values)` if needed) | +| Training loop always receives scalar `score` | `metric()` always returns `float`. Vector/dict scores are not used by the training loop and are accessed via `get_score_dict()` for trainer-side selection. | | Dict scores flow through a separate path | `get_score_dict()` → `evaluate_vector()` → `select_best()` / `select_top_k()` | | A multi-objective guide must return `(float, str)` from `get_feedback()` for the training loop | The float is a collapsed scalar summary; the full dict is extracted via `get_score_dict()` during selection | @@ -619,7 +622,7 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← **Deliverables:** - `docs/T6_technical_plan.md` — this document, finalized -- `notebooks/t6_m0_analysis.ipynb` — Colab-ready notebook +- `examples/notebooks/t6_m0_analysis.ipynb` — Colab-ready notebook **Notebook demonstrates:** - Current Guide score contract (`get_feedback` → `Tuple[float, str]`, `metric` → `float`) @@ -641,12 +644,12 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← - `opto/trainer/evaluators.py` (add `evaluate_vector`, `aggregate_vector_scores`) - `opto/trainer/algorithms/basic_algorithms.py` (BasicSearch: accept/use ObjectiveConfig) - `tests/test_objectives.py`, `tests/test_evaluators_vector.py` -- `notebooks/t6_m1_vector_scores.ipynb` +- `examples/notebooks/t6_m1_vector_scores.ipynb` **Notebook demonstrates:** - StubLLM mode: BasicSearchAlgorithm on small candidate set (5-10) with deterministic dummy guide returning dict metrics - Shows: (a) scalar baseline, (b) weighted mode, (c) Pareto mode, (d) deterministic tie-break under fixed seed -- Real LLM mode (optional): tiny dataset (≤5 items) producing ≥2 metrics +- Real LLM mode (required): tiny dataset (≤5 items) producing ≥2 metrics **SMART validation:** - `pytest -q` passes (all new functions covered) @@ -663,7 +666,7 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← - Expanded BasicSearch tests (edge cases, missing metrics, tie-break policies) - Optional: minimal PrioritySearch support (weighted scalarization for heap, dict stored for logging) - `tests/test_trainers_multiobjective.py` -- `notebooks/t6_m2_trainers.ipynb` +- `examples/notebooks/t6_m2_trainers.ipynb` **Notebook demonstrates:** - BasicSearch + Beamsearch in: scalar mode (baseline), weighted mode, Pareto mode @@ -681,11 +684,18 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← **Deliverables:** - PR to Trace-Bench: benchmark configs/tasks + notebook + - **Trace-Bench touchpoints (update `main` if default branch differs):** + - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/trainers_benchmark.py + - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/trainers_benchmark_tasks_validation.py + - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/benchmark_tasks/index.json + - https://github.com/AgentOpt/Trace-Bench/tree/main/LLM4AD/benchmark_tasks + - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/llm4ad_loader.py + - 
https://github.com/AgentOpt/Trace-Bench/blob/main/tests/test_lite_optimize_llm4ad.py - 3 benchmarks: 1. **Accuracy vs latency** (toy QA dataset) 2. **Accuracy vs response length** (penalize verbosity) 3. **Accuracy vs tool calls** (penalize excessive tool usage) -- `notebooks/t6_m3_benchmarks.ipynb` +- Trace-Bench notebook: `notebooks/t6_multiobjective_benchmarks.ipynb` (in Trace-Bench repo) **SMART validation:** - Notebook outputs per-benchmark table: weighted-mode best candidate metrics + Pareto-mode set of tradeoffs @@ -763,7 +773,7 @@ Selection path: get_score_dict() → evaluate_vector() → objectives.py ← Each notebook contains: - **StubLLM (no keys) section:** deterministic dummy guide, runs quickly -- **Real LLM section (optional):** small N (5-20 examples), prints cost/latency caveats, requires API key +- **Real LLM section (required):** small N (5-20 examples), prints cost/latency caveats, requires API key --- From 2213a191bc2ba06274df1e657ba9a2e434e308ce Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Thu, 12 Feb 2026 11:50:07 -0400 Subject: [PATCH 4/5] =?UTF-8?q?T6=20M1:=20Multi-objective=20vector=20score?= =?UTF-8?q?s=20=E2=80=94=20ObjectiveConfig,=20objectives.py,=20evaluate=5F?= =?UTF-8?q?vector,=20BasicSearch=20integration,=2059=20tests?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- examples/notebooks/t6_m1_vector_scores.ipynb | 810 +++++++++++++++++++ opto/trainer/algorithms/basic_algorithms.py | 95 ++- opto/trainer/evaluators.py | 74 +- opto/trainer/guide.py | 16 + opto/trainer/objectives.py | 312 +++++++ tests/unit_tests/test_evaluators_vector.py | 154 ++++ tests/unit_tests/test_objectives.py | 383 +++++++++ 7 files changed, 1823 insertions(+), 21 deletions(-) create mode 100644 examples/notebooks/t6_m1_vector_scores.ipynb create mode 100644 opto/trainer/objectives.py create mode 100644 tests/unit_tests/test_evaluators_vector.py create mode 100644 tests/unit_tests/test_objectives.py diff --git a/examples/notebooks/t6_m1_vector_scores.ipynb b/examples/notebooks/t6_m1_vector_scores.ipynb new file mode 100644 index 00000000..637322d0 --- /dev/null +++ b/examples/notebooks/t6_m1_vector_scores.ipynb @@ -0,0 +1,810 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "a0000001", + "metadata": {}, + "outputs": [], + "source": [ + "\"\"\"\n", + "T6 Milestone 1 — Multi-Objective Vector Scores\n", + "\n", + "This notebook is the M1 deliverable for the T6 Multi-Objective Vector Scores project.\n", + "It demonstrates:\n", + " 1. ObjectiveConfig creation and validation\n", + " 2. MultiMetricGuide with get_score_dict()\n", + " 3. evaluate_vector() + aggregate_vector_scores()\n", + " 4. Full BasicSearchAlgorithm.train() with DummyLLM + objective_config\n", + " 5. Scalar baseline comparison (backward compat)\n", + " 6. 
Pareto mode demo + deterministic tiebreak\n", + "\n", + "Part A runs end-to-end WITHOUT API keys (StubLLM / DummyLLM).\n", + "Part B requires an OpenRouter API key (Colab secrets or environment variable).\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "a0000002", + "metadata": {}, + "source": [ + "# T6 Multi-Objective Vector Scores — M1 Implementation\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m1_vector_scores.ipynb)\n", + "\n", + "**Milestone 1 Deliverable** — Core multi-objective infrastructure\n", + "\n", + "This notebook demonstrates the M1 implementation:\n", + "1. **ObjectiveConfig**: Frozen dataclass for multi-objective selection configuration\n", + "2. **Vector score path**: `get_score_dict()` → `evaluate_vector()` → `aggregate_vector_scores()` → `select_best()`\n", + "3. **BasicSearch integration**: Training with `objective_config` parameter (weighted + Pareto modes)\n", + "4. **Backward compatibility**: `objective_config=None` produces identical behavior to baseline\n", + "\n", + "**Part A (StubLLM):** No API keys required. Uses `DummyLLM` for deterministic end-to-end training.\n", + "\n", + "**Part B (Real LLM):** Requires `OPENROUTER_API_KEY` via Colab secrets or env var. Uses `google/gemini-2.0-flash-001`.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "a0000003", + "metadata": {}, + "source": [ + "## How to Validate This Milestone\n", + "\n", + "After running all cells, confirm:\n", + "- [ ] ObjectiveConfig creation and validation work correctly\n", + "- [ ] MultiMetricGuide returns `Dict[str, float]` from `get_score_dict()`\n", + "- [ ] `evaluate_vector()` returns `List[Dict[str, float]]`\n", + "- [ ] `aggregate_vector_scores()` computes per-metric means\n", + "- [ ] BasicSearch with `objective_config=None` (scalar) trains successfully\n", + "- [ ] BasicSearch with weighted `objective_config` selects differently than scalar\n", + "- [ ] Pareto mode produces deterministic results with same seed\n", + "- [ ] Real LLM section (Part B) trains with actual model + multi-metric guide" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000004", + "metadata": {}, + "outputs": [], + "source": "import sys, os\n\n# Ensure OpenTrace root is on the path (needed when running from examples/notebooks/)\n_repo_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))\nif os.path.isdir(os.path.join(_repo_root, 'opto')):\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n# Also handle running directly from the repo root\nif os.path.isdir(os.path.join(os.getcwd(), 'opto')):\n if os.getcwd() not in sys.path:\n sys.path.insert(0, os.getcwd())\n\nimport numpy as np\nfrom typing import Dict, Tuple, Optional\n\nprint(\"=\" * 70)\nprint(\"T6 M1 \\u2014 Multi-Objective Vector Scores\")\nprint(\"=\" * 70)" + }, + { + "cell_type": "markdown", + "id": "a0000005", + "metadata": {}, + "source": [ + "---\n", + "## Part A: StubLLM (No API Key Required)\n", + "\n", + "### A.1 ObjectiveConfig Creation & Validation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000006", + "metadata": {}, + "outputs": [], + "source": [ + "from opto.trainer.objectives import (\n", + " ObjectiveConfig, normalize_score, apply_minimize,\n", + " weighted_scalarize, dominates, pareto_rank, select_best, select_top_k,\n", + ")\n", + "\n", + "print(\"--- ObjectiveConfig defaults 
---\")\n", + "config_default = ObjectiveConfig()\n", + "print(f\" mode={config_default.mode}, weights={config_default.weights}, \"\n", + " f\"minimize={config_default.minimize}\")\n", + "\n", + "print(\"\\n--- ObjectiveConfig: weighted mode ---\")\n", + "config_weighted = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.8, \"latency_s\": 0.2},\n", + " minimize=frozenset({\"latency_s\"}),\n", + ")\n", + "print(f\" mode={config_weighted.mode}\")\n", + "print(f\" weights={config_weighted.weights}\")\n", + "print(f\" minimize={config_weighted.minimize}\")\n", + "\n", + "print(\"\\n--- ObjectiveConfig: Pareto mode ---\")\n", + "config_pareto = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"weighted\",\n", + " seed=42,\n", + ")\n", + "print(f\" mode={config_pareto.mode}, tie_break={config_pareto.tie_break}, seed={config_pareto.seed}\")\n", + "\n", + "print(\"\\n--- ObjectiveConfig: set auto-converts to frozenset ---\")\n", + "config_set = ObjectiveConfig(minimize={\"lat\"})\n", + "print(f\" type(minimize)={type(config_set.minimize).__name__} (auto-converted from set)\")\n", + "\n", + "print(\"\\n--- Validation: negative weight ---\")\n", + "try:\n", + " ObjectiveConfig(weights={\"a\": -0.5})\n", + "except ValueError as e:\n", + " print(f\" Caught: {e}\")\n", + "\n", + "print(\"\\n--- Validation: bad mode ---\")\n", + "try:\n", + " ObjectiveConfig(mode=\"unknown\")\n", + "except ValueError as e:\n", + " print(f\" Caught: {e}\")\n", + "\n", + "print(\"\\n--- Frozen (immutable) ---\")\n", + "try:\n", + " config_default.mode = \"weighted\"\n", + "except AttributeError as e:\n", + " print(f\" Caught: {e}\")\n", + "\n", + "print(\"\\nObjectiveConfig validation: all checks passed.\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000007", + "metadata": {}, + "source": [ + "### A.2 MultiMetricGuide with `get_score_dict()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000008", + "metadata": {}, + "outputs": [], + "source": [ + "from opto.trainer.guide import Guide\n", + "\n", + "\n", + "class MultiMetricGuide(Guide):\n", + " \"\"\"Guide that returns multi-metric score dicts.\n", + "\n", + " Evaluates accuracy (exact match) and brevity (inverse length difference).\n", + " The training loop still calls get_feedback() -> (float, str).\n", + " The selection path calls get_score_dict() -> Dict[str, float].\n", + " \"\"\"\n", + "\n", + " def get_feedback(self, query, response, reference=None, **kwargs):\n", + " accuracy = 1.0 if str(response).strip().lower() == str(reference).strip().lower() else 0.0\n", + " len_diff = abs(len(str(response)) - len(str(reference)))\n", + " brevity = 1.0 / (1.0 + len_diff)\n", + " feedback = f\"Expected '{reference}', got '{response}'. \"\n", + " if accuracy < 1.0:\n", + " feedback += \"Incorrect. 
Please provide the exact expected answer.\"\n", + " else:\n", + " feedback += \"Correct!\"\n", + " # Training loop gets scalar (accuracy) + feedback string\n", + " return accuracy, feedback\n", + "\n", + " def get_score_dict(self, query, response, reference=None, **kwargs):\n", + " accuracy = 1.0 if str(response).strip().lower() == str(reference).strip().lower() else 0.0\n", + " len_diff = abs(len(str(response)) - len(str(reference)))\n", + " brevity = 1.0 / (1.0 + len_diff)\n", + " return {\"accuracy\": accuracy, \"brevity\": brevity}\n", + "\n", + "\n", + "# Demonstrate both paths\n", + "guide = MultiMetricGuide()\n", + "\n", + "print(\"--- Training path: get_feedback() -> (float, str) ---\")\n", + "score, feedback = guide.get_feedback(\"Q: 2+2\", \"4\", \"4\")\n", + "print(f\" score={score} (type={type(score).__name__})\")\n", + "print(f\" feedback='{feedback}'\")\n", + "\n", + "print(\"\\n--- Selection path: get_score_dict() -> Dict[str, float] ---\")\n", + "sd = guide.get_score_dict(\"Q: 2+2\", \"4\", \"4\")\n", + "print(f\" score_dict={sd}\")\n", + "\n", + "print(\"\\n--- metric() still returns float (backward compat) ---\")\n", + "m = guide.metric(\"Q: 2+2\", \"4\", \"4\")\n", + "print(f\" metric()={m} (type={type(m).__name__})\")\n", + "\n", + "print(\"\\n--- Base Guide without get_score_dict override wraps scalar ---\")\n", + "class ScalarOnlyGuide(Guide):\n", + " def get_feedback(self, query, response, reference=None, **kwargs):\n", + " return 0.75, \"some feedback\"\n", + "\n", + "fallback = ScalarOnlyGuide()\n", + "print(f\" get_score_dict()={fallback.get_score_dict('q', 'r', 'ref')}\")\n", + "print(\" (wrapped as {{'score': 0.75}} automatically)\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000009", + "metadata": {}, + "source": [ + "### A.3 `evaluate_vector()` + `aggregate_vector_scores()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000010", + "metadata": {}, + "outputs": [], + "source": [ + "from opto import trace\n", + "from opto.trainer.evaluators import evaluate_vector, aggregate_vector_scores\n", + "\n", + "\n", + "@trace.model\n", + "class StubAgent:\n", + " \"\"\"Agent with a trainable string parameter. 
Returns it directly.\"\"\"\n", + " def __init__(self, answer):\n", + " self.answer = trace.node(answer, trainable=True)\n", + "\n", + " def forward(self, x):\n", + " return self.answer\n", + "\n", + "\n", + "agent = StubAgent(\"4\")\n", + "guide = MultiMetricGuide()\n", + "\n", + "inputs = [\"What is 2+2?\", \"What is 3+1?\", \"What is 5-1?\"]\n", + "infos = [\"4\", \"4\", \"4\" ] # all expect \"4\"\n", + "\n", + "print(\"--- evaluate_vector() ---\")\n", + "score_dicts = evaluate_vector(agent, guide, inputs, infos, num_threads=1)\n", + "for i, sd in enumerate(score_dicts):\n", + " print(f\" Example {i}: {sd}\")\n", + "\n", + "print(\"\\n--- aggregate_vector_scores() ---\")\n", + "agg = aggregate_vector_scores(score_dicts)\n", + "print(f\" Aggregated (per-metric mean): {agg}\")\n", + "\n", + "# Now test with wrong answer\n", + "agent_wrong = StubAgent(\"five\")\n", + "print(\"\\n--- Wrong answer agent ---\")\n", + "score_dicts_wrong = evaluate_vector(agent_wrong, guide, inputs, infos, num_threads=1)\n", + "for i, sd in enumerate(score_dicts_wrong):\n", + " print(f\" Example {i}: {sd}\")\n", + "agg_wrong = aggregate_vector_scores(score_dicts_wrong)\n", + "print(f\" Aggregated: {agg_wrong}\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000011", + "metadata": {}, + "source": [ + "### A.4 Selection with `select_best()` and `select_top_k()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000012", + "metadata": {}, + "outputs": [], + "source": [ + "# Candidates: (score_dict, payload) tuples\n", + "candidates = [\n", + " ({\"accuracy\": 0.95, \"latency_s\": 0.200}, \"prompt_A\"),\n", + " ({\"accuracy\": 0.70, \"latency_s\": 0.030}, \"prompt_B\"),\n", + " ({\"accuracy\": 0.88, \"latency_s\": 0.080}, \"prompt_C\"),\n", + " ({\"accuracy\": 0.60, \"latency_s\": 0.020}, \"prompt_D\"),\n", + "]\n", + "\n", + "print(\"Candidates:\")\n", + "for s, name in candidates:\n", + " print(f\" {name}: {s}\")\n", + "\n", + "# Scalar mode (backward-compat)\n", + "print(\"\\n--- select_best(config=None) [scalar, backward-compat] ---\")\n", + "idx = select_best(candidates, None)\n", + "print(f\" Winner: {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Weighted: accuracy-heavy\n", + "print(\"\\n--- select_best(weighted, accuracy=0.8) ---\")\n", + "config_acc = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.8, \"latency_s\": 0.2},\n", + " minimize=frozenset({\"latency_s\"}),\n", + ")\n", + "idx = select_best(candidates, config_acc)\n", + "print(f\" Winner: {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Weighted: latency-heavy\n", + "print(\"\\n--- select_best(weighted, latency_s=0.8) ---\")\n", + "config_lat = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.2, \"latency_s\": 0.8},\n", + " minimize=frozenset({\"latency_s\"}),\n", + ")\n", + "idx = select_best(candidates, config_lat)\n", + "print(f\" Winner: {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Pareto mode\n", + "print(\"\\n--- select_best(pareto, tie_break=weighted) ---\")\n", + "config_par = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"latency_s\": 0.5},\n", + " minimize=frozenset({\"latency_s\"}),\n", + " tie_break=\"weighted\",\n", + ")\n", + "score_dicts_norm = [apply_minimize(normalize_score(s), config_par.minimize) for s, _ in candidates]\n", + "ranks = pareto_rank(score_dicts_norm)\n", + "print(f\" Pareto ranks: {ranks}\")\n", + "print(f\" Front (rank 0): {[candidates[i][1] for i, r in enumerate(ranks) if 
r == 0]}\")\n", + "idx = select_best(candidates, config_par)\n", + "print(f\" Winner (after tie-break): {candidates[idx][1]} (index {idx})\")\n", + "\n", + "# Deterministic check\n", + "print(\"\\n--- Determinism: 10 runs with same config ---\")\n", + "results = [select_best(candidates, config_par) for _ in range(10)]\n", + "print(f\" Results: {results}\")\n", + "print(f\" All identical: {len(set(results)) == 1}\")\n", + "\n", + "# Top-k\n", + "print(\"\\n--- select_top_k(pareto, k=2) ---\")\n", + "top2 = select_top_k(candidates, config_par, k=2)\n", + "print(f\" Top 2: {[candidates[i][1] for i in top2]}\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000013", + "metadata": {}, + "source": [ + "### A.5 Full Training: BasicSearch with DummyLLM (scalar baseline)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000014", + "metadata": {}, + "outputs": [], + "source": [ + "from opto.utils.llm import DummyLLM\n", + "from opto.optimizers import OptoPrimeV2\n", + "from opto.trainer.algorithms.basic_algorithms import BasicSearchAlgorithm\n", + "\n", + "# --- Dataset: simple Q&A ---\n", + "dataset = dict(\n", + " inputs=[\"What is 2+2?\", \"What is 3+1?\", \"What is 10-6?\"],\n", + " infos= [\"4\", \"4\", \"4\" ],\n", + ")\n", + "\n", + "# --- DummyLLM: always proposes the same system prompt ---\n", + "def stub_llm_fn(*args, **kwargs):\n", + " \"\"\"Deterministic LLM stub: always returns a fixed response.\"\"\"\n", + " return \"You are a math assistant. Always answer with just the number.\"\n", + "\n", + "dummy_llm = DummyLLM(stub_llm_fn)\n", + "\n", + "# --- Agent ---\n", + "@trace.model\n", + "class MathAgent:\n", + " def __init__(self, llm):\n", + " self.system_prompt = trace.node(\n", + " \"You are a helpful assistant.\", trainable=True\n", + " )\n", + " self.llm = llm\n", + "\n", + " @trace.bundle()\n", + " def call_llm(self, system_prompt, question):\n", + " resp = self.llm(\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": question},\n", + " ]\n", + " )\n", + " return resp.choices[0].message.content\n", + "\n", + " def forward(self, x):\n", + " return self.call_llm(self.system_prompt, x)\n", + "\n", + "# --- Scalar baseline (objective_config=None) ---\n", + "print(\"=\" * 70)\n", + "print(\"TRAINING: Scalar baseline (objective_config=None)\")\n", + "print(\"=\" * 70)\n", + "\n", + "agent_scalar = MathAgent(dummy_llm)\n", + "optimizer_scalar = OptoPrimeV2(agent_scalar.parameters(), llm=dummy_llm)\n", + "trainer_scalar = BasicSearchAlgorithm(\n", + " agent=agent_scalar, optimizer=optimizer_scalar\n", + ")\n", + "\n", + "guide_scalar = MultiMetricGuide()\n", + "scores_scalar, test_score_scalar = trainer_scalar.train(\n", + " guide=guide_scalar,\n", + " train_dataset=dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=None, # scalar baseline\n", + ")\n", + "\n", + "print(f\"\\nScalar training scores: {scores_scalar}\")\n", + "print(f\"current_score: {trainer_scalar.current_score}\")\n", + "print(f\"current_score_dict: {trainer_scalar.current_score_dict}\")\n", + "print(\"(current_score_dict is None because scalar mode does not use vector path)\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000015", + "metadata": {}, + "source": [ + "### A.6 Full Training: BasicSearch with DummyLLM (weighted mode)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000016", + "metadata": {}, + "outputs": [], 
+ "source": [ + "print(\"=\" * 70)\n", + "print(\"TRAINING: Weighted mode (objective_config.mode='weighted')\")\n", + "print(\"=\" * 70)\n", + "\n", + "config_weighted_train = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.7, \"brevity\": 0.3},\n", + ")\n", + "\n", + "agent_weighted = MathAgent(dummy_llm)\n", + "optimizer_weighted = OptoPrimeV2(agent_weighted.parameters(), llm=dummy_llm)\n", + "trainer_weighted = BasicSearchAlgorithm(\n", + " agent=agent_weighted, optimizer=optimizer_weighted\n", + ")\n", + "\n", + "guide_weighted = MultiMetricGuide()\n", + "scores_weighted, test_score_weighted = trainer_weighted.train(\n", + " guide=guide_weighted,\n", + " train_dataset=dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=config_weighted_train,\n", + ")\n", + "\n", + "print(f\"\\nWeighted training scores: {scores_weighted}\")\n", + "print(f\"current_score (float): {trainer_weighted.current_score}\")\n", + "print(f\"current_score_dict: {trainer_weighted.current_score_dict}\")\n", + "print(\"(current_score_dict stores the vector score selected by weighted mode)\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000017", + "metadata": {}, + "source": [ + "### A.7 Full Training: BasicSearch with DummyLLM (Pareto mode)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000018", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"=\" * 70)\n", + "print(\"TRAINING: Pareto mode (objective_config.mode='pareto')\")\n", + "print(\"=\" * 70)\n", + "\n", + "config_pareto_train = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"brevity\": 0.5},\n", + " tie_break=\"weighted\",\n", + " seed=42,\n", + ")\n", + "\n", + "agent_pareto = MathAgent(dummy_llm)\n", + "optimizer_pareto = OptoPrimeV2(agent_pareto.parameters(), llm=dummy_llm)\n", + "trainer_pareto = BasicSearchAlgorithm(\n", + " agent=agent_pareto, optimizer=optimizer_pareto\n", + ")\n", + "\n", + "guide_pareto = MultiMetricGuide()\n", + "scores_pareto, test_score_pareto = trainer_pareto.train(\n", + " guide=guide_pareto,\n", + " train_dataset=dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=config_pareto_train,\n", + ")\n", + "\n", + "print(f\"\\nPareto training scores: {scores_pareto}\")\n", + "print(f\"current_score (float): {trainer_pareto.current_score}\")\n", + "print(f\"current_score_dict: {trainer_pareto.current_score_dict}\")\n", + "\n", + "# Verify determinism: run again with same seed\n", + "print(\"\\n--- Determinism: re-run with same seed ---\")\n", + "agent_pareto2 = MathAgent(dummy_llm)\n", + "optimizer_pareto2 = OptoPrimeV2(agent_pareto2.parameters(), llm=dummy_llm)\n", + "trainer_pareto2 = BasicSearchAlgorithm(\n", + " agent=agent_pareto2, optimizer=optimizer_pareto2\n", + ")\n", + "scores_pareto2, _ = trainer_pareto2.train(\n", + " guide=MultiMetricGuide(),\n", + " train_dataset=dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=config_pareto_train,\n", + ")\n", + "print(f\"Run 1 current_score_dict: {trainer_pareto.current_score_dict}\")\n", + "print(f\"Run 2 current_score_dict: {trainer_pareto2.current_score_dict}\")\n", + "match = trainer_pareto.current_score_dict == trainer_pareto2.current_score_dict\n", + "print(f\"Deterministic: {match}\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000019", + 
"metadata": {}, + "source": [ + "### A.8 Summary: StubLLM Section" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000020", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"PART A COMPLETE — StubLLM Section\")\n", + "print(\"=\" * 70)\n", + "print(\"\"\"\n", + "Verified:\n", + " ✓ ObjectiveConfig creation, validation, and immutability\n", + " ✓ MultiMetricGuide: get_feedback() -> (float, str) for training loop\n", + " ✓ MultiMetricGuide: get_score_dict() -> Dict[str, float] for selection path\n", + " ✓ evaluate_vector() returns List[Dict[str, float]]\n", + " ✓ aggregate_vector_scores() computes per-metric means\n", + " ✓ select_best(): scalar, weighted, Pareto modes all work\n", + " ✓ BasicSearch training: scalar baseline (objective_config=None)\n", + " ✓ BasicSearch training: weighted mode with vector score selection\n", + " ✓ BasicSearch training: Pareto mode with deterministic tie-break\n", + " ✓ current_score stays float, current_score_dict stores vector\n", + "\"\"\")" + ] + }, + { + "cell_type": "markdown", + "id": "a0000021", + "metadata": {}, + "source": [ + "---\n", + "## Part B: Real LLM (API Key Required)\n", + "\n", + "This section trains a real LLM agent using `CustomLLM` with OpenRouter.\n", + "\n", + "**Requirements:**\n", + "- **Colab:** Set `OPENROUTER_API_KEY` in Colab Secrets (key icon in sidebar)\n", + "- **Local:** `export OPENROUTER_API_KEY=sk-or-v1-...` in your shell, or set in `.env`\n", + "\n", + "Uses model `google/gemini-2.0-flash-001` via OpenRouter (very cheap, fast)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000022", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# Try Colab secrets first, then environment variable\n", + "api_key = None\n", + "try:\n", + " from google.colab import userdata\n", + " api_key = userdata.get('OPENROUTER_API_KEY')\n", + " print(\"API key loaded from Colab secrets.\")\n", + "except (ImportError, Exception):\n", + " pass\n", + "\n", + "if not api_key:\n", + " api_key = os.environ.get('OPENROUTER_API_KEY')\n", + " if api_key:\n", + " print(\"API key loaded from environment variable.\")\n", + "\n", + "if not api_key:\n", + " # Try loading from .env file in project root\n", + " env_path = os.path.join(os.getcwd(), '.env')\n", + " if not os.path.exists(env_path):\n", + " env_path = os.path.join(os.path.dirname(os.getcwd()), '.env')\n", + " if os.path.exists(env_path):\n", + " with open(env_path) as f:\n", + " for line in f:\n", + " line = line.strip()\n", + " if line.startswith('OPENROUTER_API_KEY='):\n", + " api_key = line.split('=', 1)[1].strip()\n", + " print(f\"API key loaded from {env_path}\")\n", + " break\n", + "\n", + "if not api_key:\n", + " print(\"WARNING: No OPENROUTER_API_KEY found. Part B cells will be skipped.\")\n", + " print(\"Set it via: Colab Secrets, env var, or .env file.\")\n", + "else:\n", + " # Configure CustomLLM environment\n", + " os.environ['TRACE_CUSTOMLLM_URL'] = 'https://openrouter.ai/api/v1'\n", + " os.environ['TRACE_CUSTOMLLM_API_KEY'] = api_key\n", + " os.environ['TRACE_CUSTOMLLM_MODEL'] = 'google/gemini-2.0-flash-001'\n", + " print(\"CustomLLM configured for OpenRouter (google/gemini-2.0-flash-001).\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000023", + "metadata": {}, + "outputs": [], + "source": [ + "# Skip this cell if no API key\n", + "if not api_key:\n", + " print(\"Skipping: no API key. 
Set OPENROUTER_API_KEY to run Part B.\")\n", + "else:\n", + " from opto.utils.llm import CustomLLM\n", + "\n", + " real_llm = CustomLLM(model='google/gemini-2.0-flash-001')\n", + "\n", + " # Quick smoke test\n", + " print(\"--- Smoke test: real LLM call ---\")\n", + " resp = real_llm(messages=[\n", + " {\"role\": \"user\", \"content\": \"What is 2+2? Answer with just the number.\"}\n", + " ])\n", + " print(f\" Response: {resp.choices[0].message.content}\")\n", + " print(\" LLM connection verified.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000024", + "metadata": {}, + "outputs": [], + "source": [ + "# Real LLM training with weighted multi-objective selection\n", + "if not api_key:\n", + " print(\"Skipping: no API key.\")\n", + "else:\n", + " print(\"=\" * 70)\n", + " print(\"REAL LLM TRAINING: Weighted mode with multi-metric guide\")\n", + " print(\"=\" * 70)\n", + "\n", + " # Small dataset to keep costs low\n", + " real_dataset = dict(\n", + " inputs=[\"What is 7+3?\", \"What is 15-9?\", \"What is 4*3?\"],\n", + " infos= [\"10\", \"6\", \"12\" ],\n", + " )\n", + "\n", + " real_config = ObjectiveConfig(\n", + " mode=\"weighted\",\n", + " weights={\"accuracy\": 0.7, \"brevity\": 0.3},\n", + " )\n", + "\n", + " real_agent = MathAgent(real_llm)\n", + " real_optimizer = OptoPrimeV2(real_agent.parameters(), llm=real_llm)\n", + " real_trainer = BasicSearchAlgorithm(\n", + " agent=real_agent, optimizer=real_optimizer\n", + " )\n", + "\n", + " real_guide = MultiMetricGuide()\n", + " real_scores, real_test = real_trainer.train(\n", + " guide=real_guide,\n", + " train_dataset=real_dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=real_config,\n", + " )\n", + "\n", + " print(f\"\\nReal LLM training scores: {real_scores}\")\n", + " print(f\"current_score (float): {real_trainer.current_score}\")\n", + " print(f\"current_score_dict: {real_trainer.current_score_dict}\")\n", + " print(f\"\\nFinal system prompt: {real_agent.system_prompt.data}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0000025", + "metadata": {}, + "outputs": [], + "source": [ + "# Real LLM: Pareto mode comparison\n", + "if not api_key:\n", + " print(\"Skipping: no API key.\")\n", + "else:\n", + " print(\"=\" * 70)\n", + " print(\"REAL LLM TRAINING: Pareto mode for comparison\")\n", + " print(\"=\" * 70)\n", + "\n", + " pareto_config = ObjectiveConfig(\n", + " mode=\"pareto\",\n", + " weights={\"accuracy\": 0.5, \"brevity\": 0.5},\n", + " tie_break=\"weighted\",\n", + " seed=42,\n", + " )\n", + "\n", + " pareto_agent = MathAgent(real_llm)\n", + " pareto_optimizer = OptoPrimeV2(pareto_agent.parameters(), llm=real_llm)\n", + " pareto_trainer = BasicSearchAlgorithm(\n", + " agent=pareto_agent, optimizer=pareto_optimizer\n", + " )\n", + "\n", + " pareto_scores, _ = pareto_trainer.train(\n", + " guide=MultiMetricGuide(),\n", + " train_dataset=real_dataset,\n", + " num_proposals=2,\n", + " num_epochs=1,\n", + " batch_size=1,\n", + " num_threads=1,\n", + " objective_config=pareto_config,\n", + " )\n", + "\n", + " print(f\"\\nPareto training scores: {pareto_scores}\")\n", + " print(f\"current_score_dict: {pareto_trainer.current_score_dict}\")\n", + "\n", + " print(\"\\n--- Comparison ---\")\n", + " print(f\"Weighted mode final: {real_trainer.current_score_dict}\")\n", + " print(f\"Pareto mode final: {pareto_trainer.current_score_dict}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"id": "a0000026", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n\" + \"=\" * 70)\n", + "print(\"M1 NOTEBOOK COMPLETE\")\n", + "print(\"=\" * 70)\n", + "print(\"\"\"\n", + "Deliverables verified:\n", + " ✓ Part A (StubLLM): All cells run without API keys\n", + " - ObjectiveConfig creation + validation\n", + " - MultiMetricGuide with get_score_dict()\n", + " - evaluate_vector() + aggregate_vector_scores()\n", + " - BasicSearch: scalar, weighted, and Pareto modes\n", + " - Backward compatibility (objective_config=None)\n", + " - Deterministic tie-break verification\n", + "\n", + " ✓ Part B (Real LLM): Trained with actual model via OpenRouter\n", + " - Weighted and Pareto mode with real LLM proposals\n", + " - Multi-metric selection (accuracy + brevity)\n", + " - current_score_dict populated with real scores\n", + "\"\"\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/opto/trainer/algorithms/basic_algorithms.py b/opto/trainer/algorithms/basic_algorithms.py index 691b14a8..50ea7842 100644 --- a/opto/trainer/algorithms/basic_algorithms.py +++ b/opto/trainer/algorithms/basic_algorithms.py @@ -6,7 +6,8 @@ from opto.trainer.loader import DataLoader from opto.trainer.utils import batch_run, async_run from opto.optimizers.utils import print_color -from opto.trainer.evaluators import evaluate +from opto.trainer.evaluators import evaluate, evaluate_vector, aggregate_vector_scores +from opto.trainer.objectives import ObjectiveConfig, select_best def standard_optimization_step(agent, x, guide, info, min_score=0): @@ -533,6 +534,7 @@ def train(self, validate_dataset = None, # dataset of (x, info) pairs to evaluate the agent for candidate selection validate_guide = None, # to provide scores for the validation set num_proposals = 4, # number of proposals to get from the optimizer + objective_config = None, # optional ObjectiveConfig for multi-objective selection num_epochs = 1, # number of training epochs batch_size = 1, # batch size for updating the agent test_dataset = None, # dataset of (x, info) pairs to evaluate the agent @@ -549,6 +551,8 @@ def train(self, self.validate_guide = validate_guide or guide self.min_score = min_score self.current_score = None + self.objective_config = objective_config + self.current_score_dict = None # stores vector score when using multi-objective return super().train(guide, train_dataset, num_epochs=num_epochs, batch_size=batch_size, test_dataset=test_dataset, test_frequency=test_frequency, log_frequency=log_frequency, @@ -571,6 +575,21 @@ def validate(): description="Validating proposals") return np.mean(scores) if all([s is not None for s in scores]) else -np.inf + def validate_vector(): + """ Validate and return aggregated vector score dict. 
""" + score_dicts = evaluate_vector(self.agent, + self.validate_guide, + self.validate_dataset['inputs'], + self.validate_dataset['infos'], + min_score=self.min_score, + num_threads=num_threads, + description="Validating proposals (vector)") + return aggregate_vector_scores(score_dicts) + + # Determine whether to use vector scoring for selection + use_vector = (self.objective_config is not None + and self.objective_config.mode != "scalar") + # TODO perhaps we can ask for multiple updates in one query or use different temperatures in different queries # Generate different proposals step_kwargs = dict(bypassing=True, verbose='output' if verbose else False) # we don't print the inner full message @@ -582,25 +601,57 @@ def validate(): kwargs_list=[step_kwargs] * self.num_proposals, max_workers=num_threads, description=f"Generating {self.num_proposals} proposals") # async step + # Validate the proposals candidates = [] backup_dict = {p: copy.deepcopy(p.data) for p in self.agent.parameters()} # backup the current value - for update_dict in update_dicts: - if len(update_dict) == 0: - continue - self.optimizer.update(update_dict) # set the agent with update_dict - score = validate() # check the score on the validation set - candidates.append((score, update_dict)) - self.optimizer.update(backup_dict) # restore the backup - - # Include the current parameter as a candidate - if self.current_score is None: - self.current_score = validate() - candidates.append((self.current_score, backup_dict)) - - # Find the candidate with the best score - best_score, best_update = max(candidates, key=lambda x: x[0]) - self.current_score = best_score + + if use_vector: + # Vector path: collect (score_dict, update_dict) for multi-objective selection + vector_candidates = [] + for update_dict in update_dicts: + if len(update_dict) == 0: + continue + self.optimizer.update(update_dict) + score_dict = validate_vector() + scalar_score = float(np.mean(list(score_dict.values()))) + candidates.append((scalar_score, update_dict)) + vector_candidates.append((score_dict, update_dict)) + self.optimizer.update(backup_dict) + + # Include current parameters as a candidate + if self.current_score_dict is None: + self.current_score_dict = validate_vector() + if self.current_score is None: + self.current_score = float(np.mean(list(self.current_score_dict.values()))) + candidates.append((self.current_score, backup_dict)) + vector_candidates.append((self.current_score_dict, backup_dict)) + + # Select best via multi-objective config + best_idx = select_best(vector_candidates, self.objective_config) + best_score_dict = vector_candidates[best_idx][0] + best_update = vector_candidates[best_idx][1] + best_score = float(np.mean(list(best_score_dict.values()))) + self.current_score = best_score + self.current_score_dict = best_score_dict + else: + # Scalar path: unchanged from original behavior + for update_dict in update_dicts: + if len(update_dict) == 0: + continue + self.optimizer.update(update_dict) # set the agent with update_dict + score = validate() # check the score on the validation set + candidates.append((score, update_dict)) + self.optimizer.update(backup_dict) # restore the backup + + # Include the current parameter as a candidate + if self.current_score is None: + self.current_score = validate() + candidates.append((self.current_score, backup_dict)) + + # Find the candidate with the best score + best_score, best_update = max(candidates, key=lambda x: x[0]) + self.current_score = best_score if verbose: print_color(f"Best score: 
{best_score} out of scores {[c[0] for c in candidates]}", 'green') @@ -609,5 +660,11 @@ def validate(): # Make the best update self.optimizer.update(best_update) - # Logging - self.logger.log('Validation score', best_score, self.n_iters, color='green') \ No newline at end of file + # Logging — always log scalar for backward compatibility + self.logger.log('Validation score', best_score, self.n_iters, color='green') + + # Log individual vector metrics if available + if use_vector and isinstance(best_score_dict, dict): + for metric_name, metric_value in best_score_dict.items(): + self.logger.log(f'Validation score/{metric_name}', metric_value, + self.n_iters, color='green') \ No newline at end of file diff --git a/opto/trainer/evaluators.py b/opto/trainer/evaluators.py index d1e99c8e..d1271fe8 100644 --- a/opto/trainer/evaluators.py +++ b/opto/trainer/evaluators.py @@ -39,6 +39,76 @@ def _evaluate(agent, guide, i): scores = np.array(scores) if num_samples > 1: # scores will be of length N * num_samples - # Reshape scores into an array of shape (N, num_samples) + # Reshape scores into an array of shape (N, num_samples) scores = scores.reshape(N, num_samples) - return scores \ No newline at end of file + return scores + + +def evaluate_vector(agent, guide, inputs, infos, min_score=None, + num_threads=None, description=None): + """Evaluate the agent and return per-example score dicts. + + Like evaluate(), but calls guide.get_score_dict() instead of + guide.metric(), returning a list of Dict[str, float]. + + Args: + agent: The agent to evaluate + guide: The guide (must have get_score_dict method) + inputs: List of inputs to evaluate on + infos: List of additional information for each input + min_score: Fallback on exception. Dict or float (wrapped as + {"score": val}). None -> {"score": -inf}. + num_threads: Maximum threads for parallel evaluation + description: Progress bar description + + Returns: + List[Dict[str, float]] of length len(inputs) + """ + assert len(inputs) == len(infos), "Inputs and infos must have the same length" + N = len(inputs) + eval_description = description or f"Evaluating {N} examples (vector)" + + if min_score is None: + _fallback = {"score": float("-inf")} + elif isinstance(min_score, dict): + _fallback = min_score + else: + _fallback = {"score": float(min_score)} + + @batch_run(max_workers=num_threads, description=eval_description) + def _evaluate_vector(agent, guide, i): + try: + output = agent(inputs[i]).data + score_dict = guide.get_score_dict(inputs[i], output, infos[i]) + except ExecutionError: + score_dict = copy.copy(_fallback) + return score_dict + + indices = list(range(N)) + return _evaluate_vector(agent, guide, indices) + + +def aggregate_vector_scores(score_dicts): + """Compute the per-metric mean across a list of score dicts. + + Args: + score_dicts: List[Dict[str, float]] + + Returns: + Dict[str, float] with the mean value for each metric key. + Empty dict if input is empty. 
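+
+    Example (illustrative, per this implementation):
+
+        aggregate_vector_scores([{"acc": 1.0}, {"acc": 0.0, "lat": 0.5}])
+        -> {"acc": 0.5, "lat": 0.5}
+        ("lat" is averaged only over the dicts that report it.)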
+ """ + if not score_dicts: + return {} + + all_keys = set() + for sd in score_dicts: + all_keys.update(sd.keys()) + + result = {} + for key in sorted(all_keys): + values = [sd[key] for sd in score_dicts + if key in sd and sd[key] is not None] + if values: + result[key] = float(np.mean(values)) + return result \ No newline at end of file diff --git a/opto/trainer/guide.py b/opto/trainer/guide.py index 19b6d3b2..4906c831 100644 --- a/opto/trainer/guide.py +++ b/opto/trainer/guide.py @@ -47,6 +47,22 @@ def metric(self, query: str, response: str, reference: Optional[str] = None, **k """ Exact match metric """ return self.get_feedback(query, response, reference)[0] + def get_score_dict(self, query: str, response: str, reference: Optional[str] = None, **kwargs) -> Dict[str, float]: + """Return the evaluation score as a dictionary. + + Default implementation wraps the scalar from get_feedback() as + {"score": float_value}. Subclasses returning multi-metric scores + should override this method to return e.g. + {"accuracy": 0.9, "fluency": 0.8, "latency_s": 0.05}. + + If get_feedback() returns a dict as its first element, that dict + is returned directly (with values cast to float). + """ + score = self.get_feedback(query, response, reference, **kwargs)[0] + if isinstance(score, dict): + return {k: float(v) for k, v in score.items()} + return {"score": float(score)} + def copy(self): """ Create a copy of the guide instance. diff --git a/opto/trainer/objectives.py b/opto/trainer/objectives.py new file mode 100644 index 00000000..3c21ca67 --- /dev/null +++ b/opto/trainer/objectives.py @@ -0,0 +1,312 @@ +"""Multi-objective configuration and selection utilities. + +Provides ObjectiveConfig and pure functions for multi-objective candidate +selection: weighted scalarization, Pareto ranking, and backward-compatible +scalar max. + +All functions are pure (no side effects) and depend only on numpy, typing, +and dataclasses. No imports from opto.trainer to avoid circular dependencies. +""" +from dataclasses import dataclass, field +from typing import Any, Dict, List, Optional, Tuple, Union +import numpy as np + + +# --- Type aliases --- +ScalarScore = float +VectorScore = Dict[str, float] +ScoreLike = Union[int, float, bool, Dict[str, float]] + + +@dataclass(frozen=True) +class ObjectiveConfig: + """Immutable configuration for multi-objective candidate selection. + + Attributes: + mode: Selection strategy. + - "scalar": existing scalar comparison (default, backward-compatible). + - "weighted": scalarize via weighted sum, then select max. + - "pareto": Pareto dominance ranking with configurable tie-break. + weights: Per-metric weights for weighted scalarization. + Missing metrics use missing_value. Metrics not in weights are ignored. + Empty dict in weighted mode -> equal weight 1.0 for all metrics. + minimize: Frozenset of metric names where lower is better. + These are negated internally ("higher-is-better" normalization). + Users can pass a set; it is auto-converted to frozenset. + missing_value: Score assigned to missing metrics (default: -inf). + pareto_metrics: Subset of metrics for Pareto dominance. + None -> use all metrics present across candidates. + tie_break: Strategy for Pareto-equivalent candidates. + - "weighted": fall back to weighted scalarization. + - "lexicographic": sort by metric names alphabetically. + - "random_seeded": seeded random shuffle. + seed: Random seed for deterministic tie-breaking. 
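+
+    Example (illustrative):
+
+        ObjectiveConfig(
+            mode="weighted",
+            weights={"accuracy": 0.7, "latency_s": 0.3},
+            minimize={"latency_s"},  # plain set is auto-converted to frozenset
+        )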
+ """ + mode: str = "scalar" + weights: Dict[str, float] = field(default_factory=dict) + minimize: frozenset = field(default_factory=frozenset) + missing_value: float = float("-inf") + pareto_metrics: Optional[Tuple[str, ...]] = None + tie_break: str = "weighted" + seed: int = 0 + + def __post_init__(self): + if isinstance(self.minimize, set): + object.__setattr__(self, 'minimize', frozenset(self.minimize)) + if self.mode not in ("scalar", "weighted", "pareto"): + raise ValueError( + f"mode must be 'scalar', 'weighted', or 'pareto', got '{self.mode}'" + ) + if self.tie_break not in ("weighted", "lexicographic", "random_seeded"): + raise ValueError( + f"tie_break must be 'weighted', 'lexicographic', or " + f"'random_seeded', got '{self.tie_break}'" + ) + for k, v in self.weights.items(): + if v < 0: + raise ValueError(f"Weight for '{k}' must be non-negative, got {v}") + if self.pareto_metrics is not None and len(self.pareto_metrics) == 0: + raise ValueError( + "pareto_metrics must be None (auto) or non-empty tuple" + ) + + +# --------------------------------------------------------------------------- +# Pure utility functions +# --------------------------------------------------------------------------- + +def normalize_score(score: ScoreLike) -> Dict[str, float]: + """Convert any score to dict form. + + - bool/int/float -> {"score": float(value)} + - Dict[str, float] -> returned as-is (validated: all values finite) + + Raises: + TypeError: if score is not int, float, bool, or dict. + ValueError: if dict is empty or contains non-finite values. + """ + if isinstance(score, bool): + return {"score": float(score)} + if isinstance(score, (int, float)): + val = float(score) + if not np.isfinite(val): + raise ValueError(f"Score must be finite, got {score}") + return {"score": val} + if isinstance(score, dict): + if len(score) == 0: + raise ValueError("Score dict must not be empty") + for k, v in score.items(): + if not isinstance(v, (int, float)) or not np.isfinite(float(v)): + raise ValueError( + f"Score dict value for '{k}' must be finite float, got {v}" + ) + return {k: float(v) for k, v in score.items()} + raise TypeError( + f"Score must be int, float, bool, or Dict[str, float], " + f"got {type(score).__name__}" + ) + + +def apply_minimize(score_dict: Dict[str, float], + minimize: frozenset) -> Dict[str, float]: + """Negate values for minimize metrics (higher-is-better normalization). + + Returns a new dict; metrics not in *minimize* are unchanged. + """ + return {k: -v if k in minimize else v for k, v in score_dict.items()} + + +def weighted_scalarize(score_dict: Dict[str, float], + weights: Dict[str, float], + missing_value: float = float("-inf")) -> float: + """Compute weighted sum of score dict. + + If *weights* is empty, all present metrics get equal weight 1.0. + Metrics in *score_dict* but NOT in *weights* are ignored. + """ + if not weights: + weights = {k: 1.0 for k in score_dict} + total = 0.0 + for metric, weight in weights.items(): + value = score_dict.get(metric, missing_value) + total += weight * value + return total + + +def dominates(a: Dict[str, float], b: Dict[str, float], + metrics: Optional[Tuple[str, ...]] = None) -> bool: + """Check if candidate *a* Pareto-dominates candidate *b*. + + a dominates b iff: + - a[m] >= b[m] for ALL metrics m, AND + - a[m] > b[m] for AT LEAST ONE metric m + + Both dicts must be in "higher-is-better" form (post apply_minimize). 
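+
+    Example (illustrative):
+
+        dominates({"acc": 0.9, "f1": 0.8}, {"acc": 0.7, "f1": 0.8})  # True
+        dominates({"acc": 0.9, "f1": 0.5}, {"acc": 0.7, "f1": 0.8})  # False (trade-off on f1)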
+ """ + if metrics is None: + metrics = tuple(sorted(set(a.keys()) | set(b.keys()))) + at_least_one_better = False + for m in metrics: + va = a.get(m, float("-inf")) + vb = b.get(m, float("-inf")) + if va < vb: + return False + if va > vb: + at_least_one_better = True + return at_least_one_better + + +def pareto_rank(candidates: List[Dict[str, float]], + metrics: Optional[Tuple[str, ...]] = None) -> List[int]: + """Assign Pareto rank to each candidate (0 = non-dominated front). + + Uses standard non-dominated sorting. + """ + n = len(candidates) + ranks = [0] * n + remaining = set(range(n)) + current_rank = 0 + + while remaining: + front = [] + for i in remaining: + dominated = False + for j in remaining: + if i != j and dominates(candidates[j], candidates[i], metrics): + dominated = True + break + if not dominated: + front.append(i) + for i in front: + ranks[i] = current_rank + remaining.remove(i) + current_rank += 1 + + return ranks + + +def select_best(candidates: List[Tuple[ScoreLike, Any]], + config: Optional[ObjectiveConfig] = None) -> int: + """Select index of the single best candidate. + + Args: + candidates: List of (score, payload) tuples. + config: Selection config. None -> scalar max (backward-compatible). + + Returns: + Index of the best candidate. + """ + if config is None or config.mode == "scalar": + scores = [] + for score, _ in candidates: + if isinstance(score, dict): + scores.append(np.mean(list(score.values()))) + else: + scores.append(float(score)) + return int(np.argmax(scores)) + + score_dicts = [normalize_score(s) for s, _ in candidates] + score_dicts = [apply_minimize(sd, config.minimize) for sd in score_dicts] + + if config.mode == "weighted": + weighted = [ + weighted_scalarize(sd, config.weights, config.missing_value) + for sd in score_dicts + ] + return int(np.argmax(weighted)) + + if config.mode == "pareto": + ranks = pareto_rank(score_dicts, config.pareto_metrics) + front_indices = [i for i, r in enumerate(ranks) if r == 0] + + if len(front_indices) == 1: + return front_indices[0] + + # Tie-break among front + if config.tie_break == "weighted": + front_scores = [ + weighted_scalarize( + score_dicts[i], config.weights, config.missing_value + ) + for i in front_indices + ] + return front_indices[int(np.argmax(front_scores))] + + if config.tie_break == "lexicographic": + metrics = sorted(score_dicts[front_indices[0]].keys()) + + def lex_key(idx): + return tuple( + score_dicts[idx].get(m, config.missing_value) for m in metrics + ) + + return max(front_indices, key=lex_key) + + if config.tie_break == "random_seeded": + rng = np.random.RandomState(config.seed) + return front_indices[rng.randint(len(front_indices))] + + raise ValueError(f"Unknown mode: {config.mode}") + + +def select_top_k(candidates: List[Tuple[ScoreLike, Any]], + config: Optional[ObjectiveConfig] = None, + k: int = 1) -> List[int]: + """Select the top-k candidate indices. + + Same logic as select_best but returns *k* indices. + For Pareto mode: rank-0 front first (up to k), then rank-1, etc. 
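+
+    Example (illustrative, scalar path):
+
+        select_top_k([(0.5, "A"), (0.9, "B"), (0.7, "C")], None, k=2)
+        -> indices [1, 2] ("B" first, then "C")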
+ """ + if config is None or config.mode == "scalar": + scores = [] + for score, _ in candidates: + if isinstance(score, dict): + scores.append(np.mean(list(score.values()))) + else: + scores.append(float(score)) + return list(np.argsort(scores)[::-1][:k]) + + score_dicts = [normalize_score(s) for s, _ in candidates] + score_dicts = [apply_minimize(sd, config.minimize) for sd in score_dicts] + + if config.mode == "weighted": + weighted = [ + weighted_scalarize(sd, config.weights, config.missing_value) + for sd in score_dicts + ] + return list(np.argsort(weighted)[::-1][:k]) + + if config.mode == "pareto": + ranks = pareto_rank(score_dicts, config.pareto_metrics) + result: List[int] = [] + max_rank = max(ranks) + for rank in range(max_rank + 1): + rank_indices = [i for i, r in enumerate(ranks) if r == rank] + if config.tie_break == "weighted": + rank_indices.sort( + key=lambda i: weighted_scalarize( + score_dicts[i], config.weights, config.missing_value + ), + reverse=True, + ) + elif config.tie_break == "lexicographic": + metrics = ( + sorted(score_dicts[rank_indices[0]].keys()) + if rank_indices else [] + ) + rank_indices.sort( + key=lambda i: tuple( + score_dicts[i].get(m, config.missing_value) + for m in metrics + ), + reverse=True, + ) + elif config.tie_break == "random_seeded": + rng = np.random.RandomState(config.seed + rank) + rng.shuffle(rank_indices) + result.extend(rank_indices) + if len(result) >= k: + break + return result[:k] + + raise ValueError(f"Unknown mode: {config.mode}") diff --git a/tests/unit_tests/test_evaluators_vector.py b/tests/unit_tests/test_evaluators_vector.py new file mode 100644 index 00000000..61cfa1f1 --- /dev/null +++ b/tests/unit_tests/test_evaluators_vector.py @@ -0,0 +1,154 @@ +"""Tests for evaluate_vector and aggregate_vector_scores in opto.trainer.evaluators.""" +import pytest +import numpy as np +from opto import trace +from opto.trainer.evaluators import evaluate_vector, aggregate_vector_scores +from opto.trainer.guide import Guide + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +@trace.model +class SimpleAgent: + """Deterministic agent: returns input + param.""" + def __init__(self, param): + self.param = trace.node(param, trainable=True) + + def forward(self, x): + return x + self.param + + +class MultiMetricGuide(Guide): + """Guide returning multi-metric score dict.""" + def __init__(self, target): + super().__init__() + self.target = target + + def get_feedback(self, query, response, reference=None, **kwargs): + accuracy = float(response == self.target) + brevity = 1.0 / max(abs(response - self.target) + 1, 1) + feedback = f"response={response}, target={self.target}" + return accuracy, feedback + + def get_score_dict(self, query, response, reference=None, **kwargs): + accuracy = float(response == self.target) + brevity = 1.0 / max(abs(response - self.target) + 1, 1) + return {"accuracy": accuracy, "brevity": brevity} + + +class ScalarGuide(Guide): + """Guide using only scalar get_feedback (no get_score_dict override).""" + def __init__(self, target): + super().__init__() + self.target = target + + def get_feedback(self, query, response, reference=None, **kwargs): + score = float(response == self.target) + feedback = f"response={response}" + return score, feedback + + +# --------------------------------------------------------------------------- +# evaluate_vector +# 
--------------------------------------------------------------------------- + +def test_evaluate_vector_basic(): + """evaluate_vector returns a list of dicts with correct metric values.""" + agent = SimpleAgent(10) + guide = MultiMetricGuide(target=11) + inputs = [1, 2, 3] + infos = [None, None, None] + + results = evaluate_vector(agent, guide, inputs, infos, num_threads=1) + + assert len(results) == 3 + assert isinstance(results[0], dict) + # input=1 + param=10 = 11 == target=11 -> accuracy=1.0, brevity=1.0 + assert results[0]["accuracy"] == 1.0 + assert results[0]["brevity"] == 1.0 + # input=2 + param=10 = 12 != target=11 -> accuracy=0.0 + assert results[1]["accuracy"] == 0.0 + assert results[1]["brevity"] == pytest.approx(0.5) # 1/(|12-11|+1) = 0.5 + # input=3 + param=10 = 13 != target=11 -> accuracy=0.0 + assert results[2]["accuracy"] == 0.0 + assert results[2]["brevity"] == pytest.approx(1.0 / 3.0) # 1/(|13-11|+1) + + +def test_evaluate_vector_all_keys_present(): + """Every result dict contains the same set of metric keys.""" + agent = SimpleAgent(5) + guide = MultiMetricGuide(target=10) + inputs = [1, 2, 3, 4, 5] + infos = [None] * 5 + + results = evaluate_vector(agent, guide, inputs, infos, num_threads=1) + + expected_keys = {"accuracy", "brevity"} + for rd in results: + assert set(rd.keys()) == expected_keys + + +def test_evaluate_vector_scalar_guide_fallback(): + """Guide without get_score_dict override returns {"score": float}.""" + agent = SimpleAgent(10) + guide = ScalarGuide(target=11) + inputs = [1, 2] + infos = [None, None] + + results = evaluate_vector(agent, guide, inputs, infos, num_threads=1) + + assert len(results) == 2 + # input=1 + param=10 = 11 == target=11 -> score=1.0 + assert results[0] == {"score": 1.0} + # input=2 + param=10 = 12 != target=11 -> score=0.0 + assert results[1] == {"score": 0.0} + + +def test_evaluate_vector_empty_inputs(): + """Empty inputs produce empty results.""" + agent = SimpleAgent(0) + guide = MultiMetricGuide(target=0) + + results = evaluate_vector(agent, guide, [], [], num_threads=1) + assert results == [] + + +# --------------------------------------------------------------------------- +# aggregate_vector_scores +# --------------------------------------------------------------------------- + +def test_aggregate_basic(): + """Per-metric mean is computed correctly.""" + score_dicts = [ + {"accuracy": 1.0, "brevity": 0.5}, + {"accuracy": 0.0, "brevity": 1.0}, + ] + agg = aggregate_vector_scores(score_dicts) + assert agg["accuracy"] == pytest.approx(0.5) + assert agg["brevity"] == pytest.approx(0.75) + + +def test_aggregate_empty(): + """Empty input returns empty dict.""" + assert aggregate_vector_scores([]) == {} + + +def test_aggregate_single(): + """Single dict returns the same values.""" + score_dicts = [{"a": 0.42, "b": 0.99}] + agg = aggregate_vector_scores(score_dicts) + assert agg == {"a": pytest.approx(0.42), "b": pytest.approx(0.99)} + + +def test_aggregate_missing_keys(): + """Handles dicts with partially overlapping keys.""" + score_dicts = [ + {"accuracy": 1.0}, + {"accuracy": 0.0, "brevity": 0.8}, + ] + agg = aggregate_vector_scores(score_dicts) + assert agg["accuracy"] == pytest.approx(0.5) + # brevity only present in one dict + assert agg["brevity"] == pytest.approx(0.8) diff --git a/tests/unit_tests/test_objectives.py b/tests/unit_tests/test_objectives.py new file mode 100644 index 00000000..04fbccc2 --- /dev/null +++ b/tests/unit_tests/test_objectives.py @@ -0,0 +1,383 @@ +"""Tests for opto.trainer.objectives — 
ObjectiveConfig and selection utilities.""" +import pytest +import numpy as np +from opto.trainer.objectives import ( + ObjectiveConfig, normalize_score, apply_minimize, weighted_scalarize, + dominates, pareto_rank, select_best, select_top_k, +) + + +# --------------------------------------------------------------------------- +# normalize_score +# --------------------------------------------------------------------------- + +def test_normalize_score_float(): + assert normalize_score(0.85) == {"score": 0.85} + + +def test_normalize_score_zero(): + assert normalize_score(0.0) == {"score": 0.0} + + +def test_normalize_score_int(): + assert normalize_score(1) == {"score": 1.0} + + +def test_normalize_score_int_zero(): + assert normalize_score(0) == {"score": 0.0} + + +def test_normalize_score_bool_true(): + assert normalize_score(True) == {"score": 1.0} + + +def test_normalize_score_bool_false(): + assert normalize_score(False) == {"score": 0.0} + + +def test_normalize_score_dict(): + result = normalize_score({"acc": 0.9, "lat": 50.0}) + assert result == {"acc": 0.9, "lat": 50.0} + + +def test_normalize_score_dict_with_int_values(): + result = normalize_score({"acc": 1, "lat": 0}) + assert result == {"acc": 1.0, "lat": 0.0} + + +def test_normalize_score_empty_dict_raises(): + with pytest.raises(ValueError, match="must not be empty"): + normalize_score({}) + + +def test_normalize_score_nan_raises(): + with pytest.raises(ValueError, match="finite"): + normalize_score({"a": float("nan")}) + + +def test_normalize_score_inf_raises(): + with pytest.raises(ValueError, match="finite"): + normalize_score(float("inf")) + + +def test_normalize_score_neg_inf_raises(): + with pytest.raises(ValueError, match="finite"): + normalize_score(float("-inf")) + + +def test_normalize_score_string_raises(): + with pytest.raises(TypeError, match="str"): + normalize_score("bad") + + +def test_normalize_score_none_raises(): + with pytest.raises(TypeError): + normalize_score(None) + + +# --------------------------------------------------------------------------- +# apply_minimize +# --------------------------------------------------------------------------- + +def test_apply_minimize_negates(): + result = apply_minimize({"acc": 0.9, "lat": 100.0}, frozenset({"lat"})) + assert result == {"acc": 0.9, "lat": -100.0} + + +def test_apply_minimize_empty_set(): + result = apply_minimize({"acc": 0.9, "lat": 100.0}, frozenset()) + assert result == {"acc": 0.9, "lat": 100.0} + + +def test_apply_minimize_all(): + result = apply_minimize({"a": 1.0, "b": 2.0}, frozenset({"a", "b"})) + assert result == {"a": -1.0, "b": -2.0} + + +# --------------------------------------------------------------------------- +# weighted_scalarize +# --------------------------------------------------------------------------- + +def test_weighted_scalarize_basic(): + result = weighted_scalarize({"a": 0.8, "b": 0.2}, {"a": 0.7, "b": 0.3}) + assert result == pytest.approx(0.7 * 0.8 + 0.3 * 0.2) + + +def test_weighted_scalarize_empty_weights(): + result = weighted_scalarize({"a": 1.0, "b": 2.0}, {}) + assert result == pytest.approx(3.0) # equal weight 1.0 each + + +def test_weighted_scalarize_missing_metric(): + result = weighted_scalarize({"a": 1.0}, {"a": 0.5, "b": 0.5}, missing_value=0.0) + assert result == pytest.approx(0.5) # 0.5*1.0 + 0.5*0.0 + + +def test_weighted_scalarize_ignores_extra_metrics(): + result = weighted_scalarize({"a": 1.0, "b": 2.0, "c": 99.0}, {"a": 1.0}) + assert result == pytest.approx(1.0) # only "a" is weighted + + +# 
--------------------------------------------------------------------------- +# dominates +# --------------------------------------------------------------------------- + +def test_dominates_yes(): + assert dominates({"a": 2.0, "b": 2.0}, {"a": 1.0, "b": 1.0}) is True + + +def test_dominates_yes_one_equal(): + assert dominates({"a": 2.0, "b": 1.0}, {"a": 1.0, "b": 1.0}) is True + + +def test_dominates_no_equal(): + assert dominates({"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0}) is False + + +def test_dominates_no_tradeoff(): + assert dominates({"a": 2.0, "b": 0.5}, {"a": 1.0, "b": 1.0}) is False + + +def test_dominates_with_metric_subset(): + assert dominates({"a": 2.0, "b": 0.5}, {"a": 1.0, "b": 1.0}, + metrics=("a",)) is True + + +# --------------------------------------------------------------------------- +# pareto_rank +# --------------------------------------------------------------------------- + +def test_pareto_rank_clear_hierarchy(): + candidates = [ + {"a": 3.0, "b": 3.0}, # dominates everything -> rank 0 + {"a": 2.0, "b": 2.0}, # dominated by [0] -> rank 1 + {"a": 1.0, "b": 1.0}, # dominated by [0],[1] -> rank 2 + ] + ranks = pareto_rank(candidates) + assert ranks == [0, 1, 2] + + +def test_pareto_rank_all_nondominated(): + candidates = [ + {"a": 3.0, "b": 1.0}, + {"a": 1.0, "b": 3.0}, + {"a": 2.0, "b": 2.0}, + ] + ranks = pareto_rank(candidates) + # All are tradeoffs — none dominates another + assert ranks == [0, 0, 0] + + +def test_pareto_rank_mixed(): + candidates = [ + {"a": 3.0, "b": 1.0}, # front 0 + {"a": 1.0, "b": 3.0}, # front 0 + {"a": 0.5, "b": 0.5}, # dominated by both -> rank 1 + ] + ranks = pareto_rank(candidates) + assert ranks[0] == 0 + assert ranks[1] == 0 + assert ranks[2] == 1 + + +# --------------------------------------------------------------------------- +# select_best +# --------------------------------------------------------------------------- + +def test_select_best_none_config(): + candidates = [(0.5, "A"), (0.9, "B"), (0.7, "C")] + assert select_best(candidates, None) == 1 + + +def test_select_best_scalar_mode(): + config = ObjectiveConfig(mode="scalar") + candidates = [(0.5, "A"), (0.9, "B"), (0.7, "C")] + assert select_best(candidates, config) == 1 + + +def test_select_best_scalar_with_dict_scores(): + """Scalar mode with dict scores uses mean of values.""" + config = ObjectiveConfig(mode="scalar") + candidates = [ + ({"a": 0.5, "b": 0.3}, "X"), # mean = 0.4 + ({"a": 0.8, "b": 0.6}, "Y"), # mean = 0.7 + ] + assert select_best(candidates, config) == 1 + + +def test_select_best_weighted(): + config = ObjectiveConfig( + mode="weighted", + weights={"accuracy": 0.8, "latency_s": 0.2}, + minimize=frozenset({"latency_s"}), + ) + candidates = [ + ({"accuracy": 0.95, "latency_s": 0.200}, "A"), # 0.8*0.95 + 0.2*(-0.2) = 0.72 + ({"accuracy": 0.70, "latency_s": 0.030}, "B"), # 0.8*0.70 + 0.2*(-0.03) = 0.554 + ] + assert select_best(candidates, config) == 0 + + +def test_select_best_weighted_latency_heavy(): + config = ObjectiveConfig( + mode="weighted", + weights={"accuracy": 0.2, "latency_s": 0.8}, + minimize=frozenset({"latency_s"}), + ) + candidates = [ + ({"accuracy": 0.95, "latency_s": 0.200}, "A"), # 0.2*0.95 + 0.8*(-0.2) = 0.03 + ({"accuracy": 0.70, "latency_s": 0.030}, "B"), # 0.2*0.70 + 0.8*(-0.03) = 0.116 + ] + assert select_best(candidates, config) == 1 + + +def test_select_best_pareto_tiebreak_weighted(): + config = ObjectiveConfig( + mode="pareto", + weights={"a": 0.5, "b": 0.5}, + tie_break="weighted", + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, 
"X"), # front 0, weighted = 0.5 + ({"a": 0.1, "b": 0.9}, "Y"), # front 0, weighted = 0.5 + ({"a": 0.6, "b": 0.6}, "Z"), # front 0, weighted = 0.6 -> winner + ] + assert select_best(candidates, config) == 2 + + +def test_select_best_pareto_deterministic(): + config = ObjectiveConfig( + mode="pareto", + weights={"a": 0.5, "b": 0.5}, + tie_break="weighted", + seed=42, + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, "X"), + ({"a": 0.1, "b": 0.9}, "Y"), + ] + results = [select_best(candidates, config) for _ in range(10)] + assert len(set(results)) == 1 # same result every time + + +def test_select_best_pareto_random_seeded_deterministic(): + config = ObjectiveConfig( + mode="pareto", + tie_break="random_seeded", + seed=42, + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, "X"), + ({"a": 0.1, "b": 0.9}, "Y"), + ] + results = [select_best(candidates, config) for _ in range(20)] + assert len(set(results)) == 1 + + +def test_select_best_pareto_different_seeds_may_differ(): + results = set() + for seed in range(50): + config = ObjectiveConfig( + mode="pareto", + tie_break="random_seeded", + seed=seed, + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, "X"), + ({"a": 0.1, "b": 0.9}, "Y"), + ] + results.add(select_best(candidates, config)) + # With 50 different seeds across 2 candidates, we expect both to appear + assert len(results) == 2 + + +# --------------------------------------------------------------------------- +# select_top_k +# --------------------------------------------------------------------------- + +def test_select_top_k_scalar_none_config(): + candidates = [(0.5, "A"), (0.9, "B"), (0.7, "C")] + indices = select_top_k(candidates, None, k=2) + assert len(indices) == 2 + assert indices[0] == 1 # B is best + assert indices[1] == 2 # C is second + + +@pytest.mark.parametrize("k", [1, 2, 3]) +def test_select_top_k_scalar_k(k): + candidates = [(0.5, "A"), (0.9, "B"), (0.7, "C")] + indices = select_top_k(candidates, None, k=k) + assert len(indices) == k + assert indices[0] == 1 # B always best + + +def test_select_top_k_weighted(): + config = ObjectiveConfig( + mode="weighted", + weights={"a": 1.0, "b": 1.0}, + ) + candidates = [ + ({"a": 0.5, "b": 0.5}, "X"), # weighted = 1.0 + ({"a": 0.9, "b": 0.1}, "Y"), # weighted = 1.0 + ({"a": 0.8, "b": 0.8}, "Z"), # weighted = 1.6 + ] + indices = select_top_k(candidates, config, k=2) + assert indices[0] == 2 # Z is best + + +def test_select_top_k_pareto(): + config = ObjectiveConfig( + mode="pareto", + weights={"a": 0.5, "b": 0.5}, + tie_break="weighted", + ) + candidates = [ + ({"a": 0.9, "b": 0.1}, "X"), # front 0 + ({"a": 0.1, "b": 0.9}, "Y"), # front 0 + ({"a": 0.05, "b": 0.05}, "Z"), # front 1 (dominated) + ] + indices = select_top_k(candidates, config, k=2) + assert set(indices) == {0, 1} # both front-0 candidates + + +# --------------------------------------------------------------------------- +# ObjectiveConfig validation +# --------------------------------------------------------------------------- + +def test_config_default(): + config = ObjectiveConfig() + assert config.mode == "scalar" + assert config.weights == {} + assert config.minimize == frozenset() + + +def test_config_set_to_frozenset(): + config = ObjectiveConfig(minimize={"lat"}) + assert isinstance(config.minimize, frozenset) + assert "lat" in config.minimize + + +def test_config_negative_weight_raises(): + with pytest.raises(ValueError, match="non-negative"): + ObjectiveConfig(weights={"a": -1.0}) + + +def test_config_bad_mode_raises(): + with pytest.raises(ValueError, match="mode"): + 
ObjectiveConfig(mode="unknown") + + +def test_config_bad_tie_break_raises(): + with pytest.raises(ValueError, match="tie_break"): + ObjectiveConfig(tie_break="bad") + + +def test_config_empty_pareto_metrics_raises(): + with pytest.raises(ValueError, match="non-empty"): + ObjectiveConfig(pareto_metrics=()) + + +def test_config_frozen(): + config = ObjectiveConfig() + with pytest.raises(AttributeError): + config.mode = "weighted" From 45901029613159d83f18819ff6062541d316d734 Mon Sep 17 00:00:00 2001 From: Jose Carlos Rodriguez Date: Thu, 12 Feb 2026 12:18:05 -0400 Subject: [PATCH 5/5] T6 M1: Fix Colab install cell for Python 3.12 compatibility --- examples/notebooks/t6_m1_vector_scores.ipynb | 576 +++++++++++++++++-- 1 file changed, 530 insertions(+), 46 deletions(-) diff --git a/examples/notebooks/t6_m1_vector_scores.ipynb b/examples/notebooks/t6_m1_vector_scores.ipynb index 637322d0..6363aee2 100644 --- a/examples/notebooks/t6_m1_vector_scores.ipynb +++ b/examples/notebooks/t6_m1_vector_scores.ipynb @@ -6,23 +6,7 @@ "id": "a0000001", "metadata": {}, "outputs": [], - "source": [ - "\"\"\"\n", - "T6 Milestone 1 — Multi-Objective Vector Scores\n", - "\n", - "This notebook is the M1 deliverable for the T6 Multi-Objective Vector Scores project.\n", - "It demonstrates:\n", - " 1. ObjectiveConfig creation and validation\n", - " 2. MultiMetricGuide with get_score_dict()\n", - " 3. evaluate_vector() + aggregate_vector_scores()\n", - " 4. Full BasicSearchAlgorithm.train() with DummyLLM + objective_config\n", - " 5. Scalar baseline comparison (backward compat)\n", - " 6. Pareto mode demo + deterministic tiebreak\n", - "\n", - "Part A runs end-to-end WITHOUT API keys (StubLLM / DummyLLM).\n", - "Part B requires an OpenRouter API key (Colab secrets or environment variable).\n", - "\"\"\"" - ] + "source": "!git clone https://github.com/carlosrod723/OpenTrace.git Trace\n%cd Trace\n!git checkout t6-multi-objective-m0\n!sed -i 's/python_requires=\">=3.13\"/python_requires=\">=3.12\"/' setup.py\n!pip install -e ." 
}, { "cell_type": "markdown", @@ -72,7 +56,7 @@ "id": "a0000004", "metadata": {}, "outputs": [], - "source": "import sys, os\n\n# Ensure OpenTrace root is on the path (needed when running from examples/notebooks/)\n_repo_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))\nif os.path.isdir(os.path.join(_repo_root, 'opto')):\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n# Also handle running directly from the repo root\nif os.path.isdir(os.path.join(os.getcwd(), 'opto')):\n if os.getcwd() not in sys.path:\n sys.path.insert(0, os.getcwd())\n\nimport numpy as np\nfrom typing import Dict, Tuple, Optional\n\nprint(\"=\" * 70)\nprint(\"T6 M1 \\u2014 Multi-Objective Vector Scores\")\nprint(\"=\" * 70)" + "source": "import numpy as np\nfrom typing import Dict, Tuple, Optional\n\nprint(\"=\" * 70)\nprint(\"T6 M1 \\u2014 Multi-Objective Vector Scores\")\nprint(\"=\" * 70)" }, { "cell_type": "markdown", @@ -87,10 +71,41 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "a0000006", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- ObjectiveConfig defaults ---\n", + " mode=scalar, weights={}, minimize=frozenset()\n", + "\n", + "--- ObjectiveConfig: weighted mode ---\n", + " mode=weighted\n", + " weights={'accuracy': 0.8, 'latency_s': 0.2}\n", + " minimize=frozenset({'latency_s'})\n", + "\n", + "--- ObjectiveConfig: Pareto mode ---\n", + " mode=pareto, tie_break=weighted, seed=42\n", + "\n", + "--- ObjectiveConfig: set auto-converts to frozenset ---\n", + " type(minimize)=frozenset (auto-converted from set)\n", + "\n", + "--- Validation: negative weight ---\n", + " Caught: Weight for 'a' must be non-negative, got -0.5\n", + "\n", + "--- Validation: bad mode ---\n", + " Caught: mode must be 'scalar', 'weighted', or 'pareto', got 'unknown'\n", + "\n", + "--- Frozen (immutable) ---\n", + " Caught: cannot assign to field 'mode'\n", + "\n", + "ObjectiveConfig validation: all checks passed.\n" + ] + } + ], "source": [ "from opto.trainer.objectives import (\n", " ObjectiveConfig, normalize_score, apply_minimize,\n", @@ -157,10 +172,30 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "a0000008", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- Training path: get_feedback() -> (float, str) ---\n", + " score=1.0 (type=float)\n", + " feedback='Expected '4', got '4'. 
Correct!'\n", + "\n", + "--- Selection path: get_score_dict() -> Dict[str, float] ---\n", + " score_dict={'accuracy': 1.0, 'brevity': 1.0}\n", + "\n", + "--- metric() still returns float (backward compat) ---\n", + " metric()=1.0 (type=float)\n", + "\n", + "--- Base Guide without get_score_dict override wraps scalar ---\n", + " get_score_dict()={'score': 0.75}\n", + " (wrapped as {{'score': 0.75}} automatically)\n" + ] + } + ], "source": [ "from opto.trainer.guide import Guide\n", "\n", @@ -228,10 +263,32 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "id": "a0000010", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- evaluate_vector() ---\n", + "Evaluating 3 examples (vector) (Running sequentially).\n", + " Example 0: {'accuracy': 1.0, 'brevity': 1.0}\n", + " Example 1: {'accuracy': 1.0, 'brevity': 1.0}\n", + " Example 2: {'accuracy': 1.0, 'brevity': 1.0}\n", + "\n", + "--- aggregate_vector_scores() ---\n", + " Aggregated (per-metric mean): {'accuracy': 1.0, 'brevity': 1.0}\n", + "\n", + "--- Wrong answer agent ---\n", + "Evaluating 3 examples (vector) (Running sequentially).\n", + " Example 0: {'accuracy': 0.0, 'brevity': 0.25}\n", + " Example 1: {'accuracy': 0.0, 'brevity': 0.25}\n", + " Example 2: {'accuracy': 0.0, 'brevity': 0.25}\n", + " Aggregated: {'accuracy': 0.0, 'brevity': 0.25}\n" + ] + } + ], "source": [ "from opto import trace\n", "from opto.trainer.evaluators import evaluate_vector, aggregate_vector_scores\n", @@ -282,10 +339,43 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "a0000012", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Candidates:\n", + " prompt_A: {'accuracy': 0.95, 'latency_s': 0.2}\n", + " prompt_B: {'accuracy': 0.7, 'latency_s': 0.03}\n", + " prompt_C: {'accuracy': 0.88, 'latency_s': 0.08}\n", + " prompt_D: {'accuracy': 0.6, 'latency_s': 0.02}\n", + "\n", + "--- select_best(config=None) [scalar, backward-compat] ---\n", + " Winner: prompt_A (index 0)\n", + "\n", + "--- select_best(weighted, accuracy=0.8) ---\n", + " Winner: prompt_A (index 0)\n", + "\n", + "--- select_best(weighted, latency_s=0.8) ---\n", + " Winner: prompt_B (index 1)\n", + "\n", + "--- select_best(pareto, tie_break=weighted) ---\n", + " Pareto ranks: [0, 0, 0, 0]\n", + " Front (rank 0): ['prompt_A', 'prompt_B', 'prompt_C', 'prompt_D']\n", + " Winner (after tie-break): prompt_C (index 2)\n", + "\n", + "--- Determinism: 10 runs with same config ---\n", + " Results: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]\n", + " All identical: True\n", + "\n", + "--- select_top_k(pareto, k=2) ---\n", + " Top 2: ['prompt_C', 'prompt_A']\n" + ] + } + ], "source": [ "# Candidates: (score_dict, payload) tuples\n", "candidates = [\n", @@ -361,10 +451,55 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "id": "a0000014", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "TRAINING: Scalar baseline (objective_config=None)\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating 
proposals (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.0\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:2: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.0\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 2\n", + "[Step 2] Instantaneous train score: 0.0\n", + "[Step 2] Average train score: 0.0\n", + "[Step 2] \u001b[91mParameter: str:2: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.0\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 0.0\n", + "[Step 3] Average train score: 0.0\n", + "[Step 3] \u001b[91mParameter: str:2: You are a helpful assistant.\u001b[0m\n", + "\n", + "Scalar training scores: [np.float64(0.0), np.float64(0.0), np.float64(0.0)]\n", + "current_score: 0.0\n", + "current_score_dict: None\n", + "(current_score_dict is None because scalar mode does not use vector path)\n" + ] + } + ], "source": [ "from opto.utils.llm import DummyLLM\n", "from opto.optimizers import OptoPrimeV2\n", @@ -443,10 +578,61 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "id": "a0000016", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "TRAINING: Weighted mode (objective_config.mode='weighted')\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:9: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. 
Iteration: 2\n", + "[Step 2] Instantaneous train score: 0.0\n", + "[Step 2] Average train score: 0.0\n", + "[Step 2] \u001b[91mParameter: str:9: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 0.0\n", + "[Step 3] Average train score: 0.0\n", + "[Step 3] \u001b[91mParameter: str:9: You are a helpful assistant.\u001b[0m\n", + "\n", + "Weighted training scores: [np.float64(0.0), np.float64(0.0), np.float64(0.0)]\n", + "current_score (float): 0.00819672131147541\n", + "current_score_dict: {'accuracy': 0.0, 'brevity': 0.01639344262295082}\n", + "(current_score_dict stores the vector score selected by weighted mode)\n" + ] + } + ], "source": [ "print(\"=\" * 70)\n", "print(\"TRAINING: Weighted mode (objective_config.mode='weighted')\")\n", @@ -490,10 +676,101 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "id": "a0000018", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "TRAINING: Pareto mode (objective_config.mode='pareto')\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:16: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. 
Iteration: 2\n", + "[Step 2] Instantaneous train score: 0.0\n", + "[Step 2] Average train score: 0.0\n", + "[Step 2] \u001b[91mParameter: str:16: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 0.0\n", + "[Step 3] Average train score: 0.0\n", + "[Step 3] \u001b[91mParameter: str:16: You are a helpful assistant.\u001b[0m\n", + "\n", + "Pareto training scores: [np.float64(0.0), np.float64(0.0), np.float64(0.0)]\n", + "current_score (float): 0.00819672131147541\n", + "current_score_dict: {'accuracy': 0.0, 'brevity': 0.01639344262295082}\n", + "\n", + "--- Determinism: re-run with same seed ---\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:23: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. Iteration: 2\n", + "[Step 2] Instantaneous train score: 0.0\n", + "[Step 2] Average train score: 0.0\n", + "[Step 2] \u001b[91mParameter: str:23: You are a helpful assistant.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.00819672131147541\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 0.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.01639344262295082\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Epoch: 0. 
Iteration: 3\n", + "[Step 3] Instantaneous train score: 0.0\n", + "[Step 3] Average train score: 0.0\n", + "[Step 3] \u001b[91mParameter: str:23: You are a helpful assistant.\u001b[0m\n", + "Run 1 current_score_dict: {'accuracy': 0.0, 'brevity': 0.01639344262295082}\n", + "Run 2 current_score_dict: {'accuracy': 0.0, 'brevity': 0.01639344262295082}\n", + "Deterministic: True\n" + ] + } + ], "source": [ "print(\"=\" * 70)\n", "print(\"TRAINING: Pareto mode (objective_config.mode='pareto')\")\n", @@ -559,10 +836,34 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "id": "a0000020", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "PART A COMPLETE — StubLLM Section\n", + "======================================================================\n", + "\n", + "Verified:\n", + " ✓ ObjectiveConfig creation, validation, and immutability\n", + " ✓ MultiMetricGuide: get_feedback() -> (float, str) for training loop\n", + " ✓ MultiMetricGuide: get_score_dict() -> Dict[str, float] for selection path\n", + " ✓ evaluate_vector() returns List[Dict[str, float]]\n", + " ✓ aggregate_vector_scores() computes per-metric means\n", + " ✓ select_best(): scalar, weighted, Pareto modes all work\n", + " ✓ BasicSearch training: scalar baseline (objective_config=None)\n", + " ✓ BasicSearch training: weighted mode with vector score selection\n", + " ✓ BasicSearch training: Pareto mode with deterministic tie-break\n", + " ✓ current_score stays float, current_score_dict stores vector\n", + "\n" + ] + } + ], "source": [ "print(\"\\n\" + \"=\" * 70)\n", "print(\"PART A COMPLETE — StubLLM Section\")\n", @@ -601,10 +902,19 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "id": "a0000022", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "API key loaded from environment variable.\n", + "CustomLLM configured for OpenRouter (google/gemini-2.0-flash-001).\n" + ] + } + ], "source": [ "import os\n", "\n", @@ -649,10 +959,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "id": "a0000023", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- Smoke test: real LLM call ---\n", + " Response: 4\n", + "\n", + " LLM connection verified.\n" + ] + } + ], "source": [ "# Skip this cell if no API key\n", "if not api_key:\n", @@ -673,10 +994,74 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "id": "a0000024", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "REAL LLM TRAINING: Weighted mode with multi-metric guide\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 0] \u001b[92mValidation 
score/accuracy: 1.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 0) (Running sequentially).\n", + "\u001b[92mUpdate accepted: Current score 0.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:30: You are a helpful assistant. Your task is to calculate the answer to the question. You should respond with the numerical answer only.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 1) (Running sequentially).\n", + "\u001b[91mUpdate rejected: Current score 1.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 2\n", + "[Step 2] Instantaneous train score: 1.0\n", + "[Step 2] Average train score: 0.5\n", + "[Step 2] \u001b[91mParameter: str:30: You are a helpful assistant. Your task is to calculate the answer to the question. You should respond with the numerical answer only.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 2) (Running sequentially).\n", + "\u001b[91mUpdate rejected: Current score 1.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 1.0\n", + "[Step 3] Average train score: 0.6666666666666666\n", + "[Step 3] \u001b[91mParameter: str:30: You are a helpful assistant. Your task is to calculate the answer to the question. You should respond with the numerical answer only.\u001b[0m\n", + "\n", + "Real LLM training scores: [np.float64(0.0), np.float64(1.0), np.float64(1.0)]\n", + "current_score (float): 0.75\n", + "current_score_dict: {'accuracy': 1.0, 'brevity': 0.5}\n", + "\n", + "Final system prompt: You are a helpful assistant. Your task is to calculate the answer to the question. 
You should respond with the numerical answer only.\n" + ] + } + ], "source": [ "# Real LLM training with weighted multi-objective selection\n", "if not api_key:\n", @@ -722,10 +1107,75 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "id": "a0000025", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "======================================================================\n", + "REAL LLM TRAINING: Pareto mode for comparison\n", + "======================================================================\n", + "Evaluating agent (iteration 0) (Running sequentially).\n", + "[Step 0] \u001b[92mAverage test score: 0.0\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 0] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 0] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 0) (Running sequentially).\n", + "\u001b[92mUpdate accepted: Current score 0.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 1) (Running sequentially).\n", + "[Step 1] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 1\n", + "[Step 1] Instantaneous train score: 0.0\n", + "[Step 1] Average train score: 0.0\n", + "[Step 1] \u001b[91mParameter: str:37: You are a helpful assistant. Your task is to answer math questions. You should only provide the numerical answer without any explanation or problem description.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 1] \u001b[92mValidation score: 0.75\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 1] \u001b[92mValidation score/brevity: 0.5\u001b[0m\n", + "Checking improvement (iteration 1) (Running sequentially).\n", + "\u001b[91mUpdate rejected: Current score 1.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 2) (Running sequentially).\n", + "[Step 2] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 2\n", + "[Step 2] Instantaneous train score: 1.0\n", + "[Step 2] Average train score: 0.5\n", + "[Step 2] \u001b[91mParameter: str:37: You are a helpful assistant. Your task is to answer math questions. 
You should only provide the numerical answer without any explanation or problem description.\u001b[0m\n", + "Forward pass (batch size: 1) (Running sequentially).\n", + "Generating 2 proposals (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "Validating proposals (vector) (Running sequentially).\n", + "[Step 2] \u001b[92mValidation score: 0.8333333333333333\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/accuracy: 1.0\u001b[0m\n", + "[Step 2] \u001b[92mValidation score/brevity: 0.6666666666666666\u001b[0m\n", + "Checking improvement (iteration 2) (Running sequentially).\n", + "\u001b[91mUpdate rejected: Current score 1.0, New score 1.0\u001b[0m\n", + "Evaluating agent (iteration 3) (Running sequentially).\n", + "[Step 3] \u001b[92mAverage test score: 1.0\u001b[0m\n", + "Epoch: 0. Iteration: 3\n", + "[Step 3] Instantaneous train score: 1.0\n", + "[Step 3] Average train score: 0.6666666666666666\n", + "[Step 3] \u001b[91mParameter: str:37: You are a helpful assistant. Your task is to answer math questions. You should only provide the numerical answer without any explanation or problem description.\u001b[0m\n", + "\n", + "Pareto training scores: [np.float64(0.0), np.float64(1.0), np.float64(1.0)]\n", + "current_score_dict: {'accuracy': 1.0, 'brevity': 0.6666666666666666}\n", + "\n", + "--- Comparison ---\n", + "Weighted mode final: {'accuracy': 1.0, 'brevity': 0.5}\n", + "Pareto mode final: {'accuracy': 1.0, 'brevity': 0.6666666666666666}\n" + ] + } + ], "source": [ "# Real LLM: Pareto mode comparison\n", "if not api_key:\n", @@ -768,10 +1218,36 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "id": "a0000026", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "======================================================================\n", + "M1 NOTEBOOK COMPLETE\n", + "======================================================================\n", + "\n", + "Deliverables verified:\n", + " ✓ Part A (StubLLM): All cells run without API keys\n", + " - ObjectiveConfig creation + validation\n", + " - MultiMetricGuide with get_score_dict()\n", + " - evaluate_vector() + aggregate_vector_scores()\n", + " - BasicSearch: scalar, weighted, and Pareto modes\n", + " - Backward compatibility (objective_config=None)\n", + " - Deterministic tie-break verification\n", + "\n", + " ✓ Part B (Real LLM): Trained with actual model via OpenRouter\n", + " - Weighted and Pareto mode with real LLM proposals\n", + " - Multi-metric selection (accuracy + brevity)\n", + " - current_score_dict populated with real scores\n", + "\n" + ] + } + ], "source": [ "print(\"\\n\" + \"=\" * 70)\n", "print(\"M1 NOTEBOOK COMPLETE\")\n", @@ -796,13 +1272,21 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", "name": "python", - "version": "3.11.0" + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" } }, "nbformat": 4,