While exploring PeerBench benchmarks (for example, Logical Puzzles – Polish), I noticed that most modern models achieve almost perfect scores on nearly all prompts.
Since many of these prompts are taken from well-known, published puzzle books, it is very likely that models have already seen similar data during training. Because of this, the leaderboard does not clearly reflect real reasoning differences between models.
Why this matters
PeerBench aims to provide fair and cheat-proof AI benchmarking. However, when benchmarks become saturated:
- Leaderboards lose their ability to differentiate models
- Small score differences may appear meaningful when they are not
- Users may misinterpret results as true performance gains
Without any indicator, it is difficult for users to judge the reliability of benchmark results.
Suggested Improvement
It would be helpful to add simple benchmark quality signals, such as:
- A warning when most models score near-perfect
- An indicator that prompts come from well-known or published sources
- A short explanation that results may reflect memorization rather than reasoning
This additional context would help users better understand leaderboard results. A rough sketch of how a saturation warning could be computed is included below.
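To make the first suggestion concrete, here is a minimal sketch of a saturation check on the leaderboard side. All names and thresholds (BenchmarkResult, NEAR_PERFECT, SATURATION_RATIO) are hypothetical and only illustrate the idea; they are not part of the PeerBench codebase.

```typescript
// Hypothetical sketch: flag a benchmark as "saturated" when most models
// score near the maximum. Thresholds are illustrative defaults only.

interface BenchmarkResult {
  model: string;
  score: number; // normalized to [0, 1]
}

const NEAR_PERFECT = 0.95;    // score at or above which a result counts as near-perfect
const SATURATION_RATIO = 0.8; // fraction of models needed to trigger the warning

function isSaturated(results: BenchmarkResult[]): boolean {
  if (results.length === 0) return false;
  const nearPerfect = results.filter(r => r.score >= NEAR_PERFECT).length;
  return nearPerfect / results.length >= SATURATION_RATIO;
}

// Example: attach a warning to the leaderboard view when the check fires.
function leaderboardWarnings(results: BenchmarkResult[]): string[] {
  const warnings: string[] = [];
  if (isSaturated(results)) {
    warnings.push(
      "Most models score near-perfect on this benchmark; results may reflect " +
      "memorization of published prompts rather than reasoning ability."
    );
  }
  return warnings;
}
```

The same pattern could carry the other signals (e.g. a flag set by curators when prompts come from published sources), with the warning strings rendered next to the benchmark name on the leaderboard.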
Expected Impact
- Better interpretation of benchmark results
- Increased trust in PeerBench evaluations
- Stronger alignment with the project’s goal of meaningful benchmarking