While exploring PeerBench benchmarks (for example, Logical Puzzles – Polish), I noticed that most modern models achieve almost perfect scores on nearly all prompts.
Since many of these prompts are taken from well-known, published puzzle books, it is very likely that models have already seen similar data during training. Because of this, the leaderboard does not clearly reflect real reasoning differences between models.
Why this matters
PeerBench aims to provide fair and cheat-proof AI benchmarking. However, when benchmarks become saturated:
- Leaderboards lose their ability to differentiate models
- Small score differences may appear meaningful when they are not
- Users may misinterpret results as true performance gains
Without any indicator, it is difficult for users to judge the reliability of benchmark results.
Suggested Improvement
It would be helpful to add simple benchmark quality signals, such as:
- A warning when most models score near-perfect
- An indicator that prompts come from well-known or published sources
- A short explanation that results may reflect memorization rather than reasoning
This additional context would help users better understand leaderboard results. A rough sketch of how a saturation warning could be computed is included below.
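To make the first suggestion concrete, here is a minimal sketch of a saturation check on the leaderboard side. All names and thresholds (BenchmarkResult, NEAR_PERFECT, SATURATION_RATIO) are hypothetical and only illustrate the idea; they are not part of the PeerBench codebase.

```typescript
// Hypothetical sketch: flag a benchmark as "saturated" when most models
// score near the maximum. Thresholds are illustrative defaults only.

interface BenchmarkResult {
  model: string;
  score: number; // normalized to [0, 1]
}

const NEAR_PERFECT = 0.95;    // score at or above which a result counts as near-perfect
const SATURATION_RATIO = 0.8; // fraction of models needed to trigger the warning

function isSaturated(results: BenchmarkResult[]): boolean {
  if (results.length === 0) return false;
  const nearPerfect = results.filter(r => r.score >= NEAR_PERFECT).length;
  return nearPerfect / results.length >= SATURATION_RATIO;
}

// Example: attach a warning to the leaderboard view when the check fires.
function leaderboardWarnings(results: BenchmarkResult[]): string[] {
  const warnings: string[] = [];
  if (isSaturated(results)) {
    warnings.push(
      "Most models score near-perfect on this benchmark; results may reflect " +
      "memorization of published prompts rather than reasoning ability."
    );
  }
  return warnings;
}
```

The same pattern could carry the other signals (e.g. a flag set by curators when prompts come from published sources), with the warning strings rendered next to the benchmark name on the leaderboard.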
Expected Impact
- Better interpretation of benchmark results
- Increased trust in PeerBench evaluations
- Stronger alignment with the project’s goal of meaningful benchmarking