Benchmark leaderboards become hard to interpret when most models score near-perfect #48

@Meghhanaa

Description

While exploring PeerBench benchmarks (for example, Logical Puzzles – Polish), I noticed that most modern models achieve almost perfect scores on nearly all prompts.

Since many of these prompts are taken from well-known, published puzzle books, it is very likely that models have already seen similar data during training. Because of this, the leaderboard does not clearly reflect real reasoning differences between models.

Why this matters

PeerBench aims to provide fair and cheat-proof AI benchmarking. However, when benchmarks become saturated:

  • Leaderboards lose their ability to differentiate models
  • Small score differences may appear meaningful when they are not
  • Users may misinterpret results as true performance gains

Without any indicator, it is difficult for users to judge the reliability of benchmark results.

Suggested Improvement

It would be helpful to add simple benchmark quality signals, such as:

  • A warning when most models score near-perfect (see the sketch after this list for one way this could be detected)
  • An indicator that prompts come from well-known or published sources
  • A short explanation that results may reflect memorization rather than reasoning

This additional context would help users better understand leaderboard results.
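
As a rough illustration of the first point, a saturation warning could compare each model's score against a "near-perfect" threshold and flag the benchmark when almost every model clears it. The following is only a minimal sketch in Python; the `scores` structure, the 0.95 accuracy threshold, and the 80% "most models" cutoff are assumptions for illustration, not PeerBench's actual data model or API.

```python
# Minimal sketch of a "benchmark saturation" warning.
# Assumptions (not PeerBench's real API): scores are per-model accuracies in [0, 1],
# and the thresholds below (0.95 "near-perfect", 80% of models) are illustrative.

def is_saturated(scores: dict[str, float],
                 near_perfect: float = 0.95,
                 share_required: float = 0.80) -> bool:
    """Return True when most models score near-perfect on a benchmark."""
    if not scores:
        return False
    near_perfect_count = sum(1 for s in scores.values() if s >= near_perfect)
    return near_perfect_count / len(scores) >= share_required


# Hypothetical example leaderboard data.
example_scores = {"model-a": 0.99, "model-b": 0.98, "model-c": 0.97, "model-d": 0.91}

if is_saturated(example_scores):
    print("Warning: this benchmark appears saturated; "
          "small score differences may not be meaningful.")
```

The exact thresholds would be a design decision for the maintainers; the point is only that such a check is cheap to compute from existing leaderboard scores and could drive the warning banner suggested above.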

Expected Impact

  • Better interpretation of benchmark results
  • Increased trust in PeerBench evaluations
  • Stronger alignment with the project’s goal of meaningful benchmarking
