Problem
All evaluations currently run asynchronously: each model is scored immediately on submission, with no shared evaluation window, so scores are not directly comparable across models or over time.
Proposed approach
- Time-boxed evaluation rounds (e.g., quarterly cohorts)
- Shared prompt batches per cohort
- Cohort registration and locking mechanism
- Split evaluation budget (e.g., 70% cohort / 30% immediate)
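The registration, locking, and budget-split ideas above could be sketched roughly as follows. This is a minimal illustration, not the project's actual API: the `Cohort` class, its fields, and `split_budget` are hypothetical names, and the 70/30 default simply mirrors the example split in the list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Cohort:
    """One time-boxed evaluation round (hypothetical model, not an existing API)."""
    name: str
    registration_closes: datetime
    prompt_batch_id: str  # every member is evaluated on this shared batch
    models: set[str] = field(default_factory=set)

    def is_locked(self, now: datetime) -> bool:
        # Membership is frozen once the registration deadline has passed.
        return now >= self.registration_closes

    def register(self, model_id: str, now: datetime) -> None:
        # Reject late registrations so the cohort stays fixed during evaluation.
        if self.is_locked(now):
            raise ValueError(f"cohort {self.name!r} is locked")
        self.models.add(model_id)


def split_budget(total: int, cohort_share: float = 0.7) -> tuple[int, int]:
    """Split an evaluation budget into (cohort, immediate) parts."""
    if not 0.0 <= cohort_share <= 1.0:
        raise ValueError("cohort_share must be between 0 and 1")
    cohort_part = round(total * cohort_share)
    return cohort_part, total - cohort_part


if __name__ == "__main__":
    cohort = Cohort(
        name="2025-Q1",
        registration_closes=datetime(2025, 1, 15, tzinfo=timezone.utc),
        prompt_batch_id="batch-q1",
    )
    cohort.register("model-a", now=datetime(2025, 1, 10, tzinfo=timezone.utc))
    print(sorted(cohort.models), split_budget(1000))
```

Passing `now` explicitly, rather than reading the clock inside the methods, keeps the locking rule easy to test and makes the deadline behavior deterministic.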
Importance
- Enables fair, head-to-head comparisons
- Reduces contamination surface
- Aligns with the paper’s benchmarking methodology
Current implementation gap
- Immediate scoring only
- No cohort coordination
Implementation Checklist
- Cohort scheduler with fixed evaluation windows
- Models in a cohort evaluated on identical prompt sets
- Cohort scores stored and compared independently
- Immediate scoring remains available but is limited to the non-cohort share of the evaluation budget (e.g., 30%)
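As a rough sketch of how the checklist items might fit together, the snippet below computes a fixed quarterly window and scores every cohort member on one frozen prompt set, keeping results keyed by cohort. The function names, the `score_fn` callback, and the quarterly cadence are assumptions made for illustration; the real scheduler and scoring interface may look quite different.

```python
from datetime import date
from typing import Callable, Iterable


def quarterly_window(year: int, quarter: int) -> tuple[date, date]:
    """Return the half-open [start, end) date range of a quarterly window."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    start = date(year, 3 * (quarter - 1) + 1, 1)
    # Q4 ends on January 1 of the following year.
    end = date(year + 1, 1, 1) if quarter == 4 else date(year, 3 * quarter + 1, 1)
    return start, end


def evaluate_cohort(
    models: Iterable[str],
    prompts: Iterable[str],
    score_fn: Callable[[str, str], float],  # hypothetical: (model_id, prompt) -> score
) -> dict[str, float]:
    """Score every model on the same prompt set; returns mean score per model."""
    frozen_prompts = tuple(prompts)  # freeze once so all models see identical prompts
    if not frozen_prompts:
        raise ValueError("prompt set must not be empty")
    return {
        model_id: sum(score_fn(model_id, p) for p in frozen_prompts) / len(frozen_prompts)
        for model_id in models
    }


if __name__ == "__main__":
    # Cohort scores are stored per cohort, so different rounds are never mixed.
    results_by_cohort: dict[str, dict[str, float]] = {}
    results_by_cohort["2025-Q1"] = evaluate_cohort(
        models=["model-a", "model-b"],
        prompts=["prompt-1", "prompt-2"],
        score_fn=lambda model_id, prompt: float(len(model_id + prompt) % 5),
    )
    print(quarterly_window(2025, 1), results_by_cohort)
```

Keying stored results by cohort name means comparisons are only ever made between models that saw the same prompts in the same window.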