Enhancement: Synchronized Cohort-Based Evaluation #60

**Problem**
All evaluations currently run asynchronously: each model is scored immediately and on its own, so scores are not directly comparable across models or over time.

**Basis of issue**

1. Time-boxed evaluation rounds (e.g., quarterly cohorts)
2. Shared prompt batches per cohort
3. Cohort registration and locking mechanism
4. Split evaluation budget (e.g., 70% cohort / 30% immediate)
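
The four pieces above could be modeled roughly as follows. This is only a sketch under stated assumptions: the `Cohort` class, its field names, and `split_budget` are hypothetical and do not exist in the current codebase, and the 70/30 default simply mirrors the example in item 4.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Cohort:
    """One time-boxed evaluation round (hypothetical data model)."""
    name: str
    opens: date
    closes: date
    prompt_batch: tuple[str, ...]  # shared by every model in the cohort
    models: set[str] = field(default_factory=set)
    locked: bool = False

    def register(self, model_id: str) -> None:
        # Registration is only accepted until the cohort is locked.
        if self.locked:
            raise RuntimeError(f"cohort {self.name} is locked")
        self.models.add(model_id)

    def lock(self) -> None:
        # After locking, membership is frozen for the rest of the round.
        self.locked = True


def split_budget(total: int, cohort_share: float = 0.7) -> tuple[int, int]:
    """Split an evaluation budget into (cohort, immediate) parts."""
    if not 0.0 <= cohort_share <= 1.0:
        raise ValueError("cohort_share must be between 0 and 1")
    cohort_part = round(total * cohort_share)
    return cohort_part, total - cohort_part
```

Keeping the prompt batch on the cohort itself, rather than on each model run, is one way to make "same prompts for everyone" hold by construction.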

**Importance**

1. Enables fair, head-to-head comparisons
2. Reduces contamination surface
3. Aligns with the paper's benchmarking methodology

**Current implementation gap**

1. Immediate scoring only
2. No cohort coordination

**Implementation Checklist**

1. Cohort scheduler with fixed evaluation windows
2. Models in a cohort are evaluated on identical prompt sets
3. Cohort scores are stored and compared independently
4. Immediate scoring remains available but limited
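
A minimal sketch of checklist items 1 to 3, assuming quarterly windows as in the example above. The names `quarter_window`, `evaluate_cohort`, and `score_store` are hypothetical, and each model is represented here as a plain callable from prompt to output, which is an assumption rather than the project's actual interface.

```python
from datetime import date, timedelta


def quarter_window(day: date) -> tuple[str, date, date]:
    """Return (cohort name, first day, last day) of the quarter containing `day`."""
    q = (day.month - 1) // 3 + 1
    start = date(day.year, 3 * (q - 1) + 1, 1)
    if q == 4:
        end = date(day.year, 12, 31)
    else:
        # Last day of the quarter = first day of the next quarter minus one day.
        end = date(day.year, 3 * q + 1, 1) - timedelta(days=1)
    return f"{day.year}-Q{q}", start, end


def evaluate_cohort(models, prompts, score_fn):
    """Score every model on the same prompts; returns {model_id: mean score}."""
    if not prompts:
        raise ValueError("a cohort needs at least one prompt")
    results = {}
    for model_id, generate in models.items():
        scores = [score_fn(p, generate(p)) for p in prompts]
        results[model_id] = sum(scores) / len(scores)
    return results


# Scores are keyed by cohort name so that different rounds are never mixed.
score_store: dict[str, dict[str, float]] = {}
```

A scheduler would call `quarter_window` to decide which cohort a submission belongs to, run `evaluate_cohort` once the window closes, and write the result under that cohort's name in `score_store`. Immediate scoring could reuse `evaluate_cohort` with a single model and the smaller share of the budget.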
