
Enhancement: Synchronized Cohort-Based Evaluation #60

@sharmaanchita

Description


Problem
All evaluations currently run asynchronously on submission, so scores are not comparable across models or over time.

Proposed approach

  1. Time-boxed evaluation rounds (e.g., quarterly cohorts)
  2. Shared prompt batches per cohort
  3. Cohort registration and locking mechanism
  4. Split evaluation budget (e.g., 70% cohort / 30% immediate); see the sketch after this list
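
A minimal sketch of how the registration, locking, and budget-split pieces could fit together, assuming a Python codebase. `Cohort`, `CohortRegistry`, and the 70/30 defaults are illustrative names and values, not existing project APIs.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Illustrative sketch only; Cohort and CohortRegistry are hypothetical names,
# not existing project classes.

@dataclass
class Cohort:
    cohort_id: str                      # e.g. "2025-Q3"
    window_start: datetime              # evaluation window opens
    window_end: datetime                # evaluation window closes
    prompt_batch: list[str]             # shared prompts, identical for all members
    members: set[str] = field(default_factory=set)
    locked: bool = False                # no registrations once evaluation starts

class CohortRegistry:
    def __init__(self, cohort_budget: float = 0.70, immediate_budget: float = 0.30):
        # Split of the total evaluation budget between cohort and immediate runs.
        assert abs(cohort_budget + immediate_budget - 1.0) < 1e-9
        self.cohort_budget = cohort_budget
        self.immediate_budget = immediate_budget
        self.cohorts: dict[str, Cohort] = {}

    def register(self, cohort_id: str, model_id: str) -> None:
        cohort = self.cohorts[cohort_id]
        if cohort.locked:
            raise RuntimeError(f"cohort {cohort_id} is locked; registration closed")
        cohort.members.add(model_id)

    def lock(self, cohort_id: str) -> None:
        # Freeze membership (and implicitly the prompt batch) before the window opens.
        self.cohorts[cohort_id].locked = True
```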

Importance

  1. Enables fair, head-to-head comparisons
  2. Reduces contamination surface
  3. Aligns with paper’s benchmarking methodology

Current implementation gap

  1. Immediate scoring only
  2. No cohort coordination

Implementation Checklist

  1. Cohort scheduler with fixed evaluation windows (see the scheduler sketch after this checklist)
  2. Models in a cohort evaluated on identical prompt sets
  3. Cohort scores stored and compared independently
  4. Immediate scoring remains available but limited
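
For the scheduler itself, a rough sketch of one evaluation window is below. It builds on the hypothetical `CohortRegistry` above; `evaluate_model` and `score_store` are placeholder assumptions for whatever scoring hook and storage the project already uses, not existing interfaces.

```python
from datetime import datetime, timezone

# Hypothetical scheduler sketch building on the CohortRegistry sketch above;
# evaluate_model() and score_store are assumed placeholders, not existing code.

def run_cohort_window(registry, cohort_id, evaluate_model, score_store):
    cohort = registry.cohorts[cohort_id]
    now = datetime.now(timezone.utc)
    if not (cohort.window_start <= now <= cohort.window_end):
        return  # outside the fixed evaluation window; do nothing

    registry.lock(cohort_id)  # freeze membership and prompts for this round

    for model_id in sorted(cohort.members):
        # Every member is scored on the identical, frozen prompt batch.
        scores = [evaluate_model(model_id, prompt) for prompt in cohort.prompt_batch]
        # Scores are stored keyed by cohort so they are only compared within it.
        score_store.setdefault(cohort_id, {})[model_id] = sum(scores) / len(scores)
```

Keying stored scores by `cohort_id` keeps cohort results separate from immediate scores, which stay available under their own budget share.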
