| name | version | short_description |
|---|---|---|
| PeerBench | 0.1.0 | A proctored, community-governed AI benchmarking platform that eliminates test-set contamination through secret tests, sealed execution, and reputation-weighted validation. |
Tamper-proof AI evaluation: secret tests, sealed execution, and community-verified scores.
- TL;DR
- Join the Community
- Problem
- Solution
- Value
- Architecture
- Security & Audit
- Governance & Metrics
- Contributing
- Appendix: Design Notes
PeerBench is a proctored, live evaluation platform where AI models are tested against continuously renewed secret datasets, scored by reputation-weighted validators, and ranked on publicly auditable leaderboards. The system prevents benchmark gaming through cryptographic commitments, sealed execution environments, and economic penalties for misconduct. Think of it as the SAT/GRE proctoring paradigm applied to AI—rigorous, fair, and independently verifiable.
Get involved with PeerBench — whether you're a researcher, developer, or AI enthusiast:
| Resource | Link | Description |
|---|---|---|
| 🌐 Web App | peerbench.ai | Submit prompts, review tests, and explore leaderboards |
| 💻 GitHub | github.com/peerbench/peerBench | Source code, issues, and discussions |
| 💬 Discord | Join our Discord | Chat with the community, ask questions, get support |
| 📖 Onboarding Guide | Getting Started | Step-by-step instructions for new contributors |
Public AI benchmarks are fundamentally broken. Test sets inevitably leak into training corpora: research has found over 45% overlap on major QA benchmarks, and GPT-4 can infer masked MMLU answers 57% of the time, well above chance [Deng et al., 2024]. Model developers can cherry-pick favorable subsets, train directly on test data, or engineer heuristics that inflate scores without genuine capability gains. This "Wild West" of evaluation makes it nearly impossible to distinguish real progress from manufactured leaderboard positions, eroding scientific signal and public trust in AI advancement claims.
PeerBench implements a proctored exam paradigm for AI evaluation. Secret test items are held in a rolling reservoir, revealed only during sealed execution, and retired after use to prevent contamination. Contributors submit tests with cryptographic commitment hashes; reviewers validate quality before items enter the live pool. Model scores are computed as reputation-weighted averages, where test weights combine peer-reviewed quality scores and contributor track records. Misconduct triggers automatic collateral slashing, creating economic alignment with honest participation.
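To make this lifecycle concrete, here is a minimal sketch of a test item moving from commitment through review into the live reservoir and on to retirement. The class, field, and state names are illustrative assumptions, not the actual PeerBench schema.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TestStatus(Enum):
    """Lifecycle stages of a secret test item (illustrative, not the real schema)."""
    COMMITTED = auto()   # commitment hash published, content sealed
    IN_REVIEW = auto()   # reviewers rate quality from metadata and sampled items
    LIVE = auto()        # held in the secret reservoir, used for sealed runs
    RETIRED = auto()     # published in full for contamination audits


@dataclass
class TestItem:
    commit_hash: str                      # published commitment
    contributor_id: str
    quality: float = 0.0                  # peer-reviewed quality score
    status: TestStatus = TestStatus.COMMITTED

    def advance(self, new_status: TestStatus) -> None:
        """Enforce the one-way lifecycle: committed -> review -> live -> retired."""
        if new_status.value != self.status.value + 1:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
```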
For researchers, PeerBench provides contamination-resistant benchmarks that measure genuine generalization, enabling credible capability claims and reproducible comparisons. For decision-makers, it offers trustworthy signals for procurement, deployment, and risk assessment—analogous to how credit rating agencies provide reliable financial assessments. For the AI ecosystem, it establishes a neutral, community-governed standard that restores integrity to evaluation, accelerating authentic progress while exposing hype. The system is designed to be a complementary "certificate-grade" layer alongside open benchmarks, not a replacement.
Actors and Data Flow:
| Actor | Role |
|---|---|
| Contributors | Submit test items, scoring functions, and commitment hashes |
| Coordination Server | Orchestrates evaluation rounds, manages the reservoir, publishes signed scores |
| Live Reservoir | Holds active secret test items (size k); oldest/lowest-weight items are retired and published |
| Reviewers | Rate test quality; earn reputation through consensus correlation |
| Model Endpoints | AI models under evaluation |
| Public Leaderboard | Displays aggregated, peer-validated rankings with full audit trails |
The workflow proceeds: Contributors → Commitment → Review → Reservoir → Sealed Execution → Scoring → Leaderboard. Retired tests become public, enabling longitudinal contamination audits.
- Commitment Hashing: Contributors submit hash commitments before tests enter review. The hash is published; content remains sealed until execution or retirement (a minimal sketch follows this list).
- Sealed Execution: Model inference runs in isolated SDK environments with no network egress; prompts are never exposed to model owners before scoring.
- Partial Revelation for Review: Reviewers see only metadata and sampled items sufficient for quality assessment, not full test content.
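As a rough illustration of the commitment step, the sketch below uses a salted SHA-256 hash over canonicalized JSON. The canonicalization, salt handling, and hash choice are assumptions for the example; PeerBench's actual commitment format may differ.

```python
import hashlib
import json
import secrets


def commit(test_content: dict) -> tuple[str, str]:
    """Return (commitment_hash, salt). Only the hash is published; content stays sealed."""
    salt = secrets.token_hex(16)
    canonical = json.dumps(test_content, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((salt + canonical).encode("utf-8")).hexdigest()
    return digest, salt


def verify(test_content: dict, salt: str, published_hash: str) -> bool:
    """At execution or retirement, anyone can check the revealed content against the commitment."""
    canonical = json.dumps(test_content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((salt + canonical).encode("utf-8")).hexdigest() == published_hash
```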
| Leaderboard | Purpose |
|---|---|
| Contributor Leaderboard | Ranks test creators by cumulative quality and verification bonuses |
| Reviewer Leaderboard | Ranks validators by consensus alignment on quality assessments |
| Model Leaderboard | Ranks AI models by reputation-weighted evaluation scores |
Contributor Score — cumulative test quality plus verification bonuses:
ContributorScore(c) = Σ quality(T_i^(c)) + bonuses
Reviewer Score — Pearson correlation between individual ratings and consensus:
ReviewerScore(r) = Pearson({q(i)_r}, {q(i)})
Model Score — reputation-weighted average of per-test scores:
ModelScore(m) = (Σ_i w(T_i) × s_i(m)) / (Σ_i w(T_i))
Test Weight — combines peer-reviewed quality and contributor reputation:
w(T) = max{0, 0.7 × quality(T) + 0.3 × min(2, ρ_c / 100)}
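The formulas above map directly to code. This sketch is a literal reading of them, assuming quality scores, per-test model scores s_i(m), and contributor reputation ρ_c are available as plain numbers; function and parameter names are illustrative.

```python
from statistics import correlation  # Pearson correlation (Python 3.10+)


def contributor_score(test_qualities: list[float], bonuses: float = 0.0) -> float:
    """ContributorScore(c) = sum of quality(T_i) over c's tests, plus verification bonuses."""
    return sum(test_qualities) + bonuses


def reviewer_score(own_ratings: list[float], consensus_ratings: list[float]) -> float:
    """ReviewerScore(r) = Pearson correlation between r's ratings and the consensus."""
    return correlation(own_ratings, consensus_ratings)


def test_weight(quality: float, contributor_reputation: float) -> float:
    """w(T) = max(0, 0.7 * quality(T) + 0.3 * min(2, rho_c / 100))."""
    return max(0.0, 0.7 * quality + 0.3 * min(2.0, contributor_reputation / 100.0))


def model_score(weights: list[float], per_test_scores: list[float]) -> float:
    """ModelScore(m) = weighted average of per-test scores s_i(m) with weights w(T_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, per_test_scores)) / total_weight
```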
Note
If you want to contribute to this algorithm, please check this discussion: link.
We welcome code contributions via pull requests. To contribute:
- Fork the repository and create a feature branch
- Make your changes following the existing code style and conventions
- Test locally
- Open a PR with a clear description of the changes
For bug reports and feature requests, please open an issue.
Dataset contributions (prompts, scores, reviews) are made through the PeerBench web application:
- Visit peerbench.ai and create an account
- Follow the onboarding instructions: PeerBench Onboarding Guide
- Submit prompts, provide reviews, or leave feedback directly through the webapp interface
Your contributions build reputation and directly improve the quality of AI evaluation benchmarks.
PeerBench supports two evaluation scheduling paradigms with distinct trade-offs:
- Immediate Scoring: Models are evaluated on-demand as tests become available. Pros: Fast feedback loop, continuous ranking updates. Cons: Scores across time windows are not directly comparable; higher contamination risk if tests are reused.
- Synchronized Cohort Evaluation: All models in a cohort are evaluated against the same test batch in a fixed window. Pros: Fair head-to-head comparison, reduced contamination surface. Cons: Slower iteration, requires coordination overhead.
- Recommended Hybrid Approach: A portion of the reservoir (e.g., 70%) is reserved for synchronized quarterly cohorts, while 30% supports immediate scoring for rapid iteration. This balances fairness with agility and provides both comparable cohort rankings and continuous progress signals; a partitioning sketch follows this list.
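A minimal partitioning sketch for the hybrid approach, assuming a simple seeded random 70/30 split of reservoir item IDs; a production scheduler would also need to balance domains, item age, and weights.

```python
import random


def split_reservoir(test_ids: list[str], cohort_fraction: float = 0.7, seed: int = 0):
    """Reserve ~70% of the reservoir for synchronized cohorts, 30% for immediate scoring."""
    rng = random.Random(seed)
    shuffled = test_ids[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * cohort_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]  # (cohort_pool, immediate_pool)


cohort_pool, immediate_pool = split_reservoir([f"test-{i}" for i in range(10)])
```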
The live reservoir maintains a fixed capacity k of secret tests. When new high-quality tests are added, the system retires items based on: (1) lowest weight scores, or (2) oldest age. Retired tests are published with full content for transparency and enable longitudinal contamination detection. This "rolling renewal" ensures freshness while building a public archive for reproducibility research.
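A rolling-renewal sketch of the retirement policy described above, assuming each reservoir item records a weight and an insertion timestamp; when capacity k is exceeded, the lowest-weight item (oldest on ties) is retired and published. The dict layout is an assumption for the example.

```python
def add_with_retirement(reservoir: list[dict], new_item: dict, k: int) -> tuple[list[dict], list[dict]]:
    """Add new_item; if the reservoir exceeds capacity k, retire the weakest items.

    Each item is a dict with at least 'weight' (float) and 'added_at' (sortable timestamp).
    Returns (updated_reservoir, retired_items); retired items are published in full.
    """
    reservoir = reservoir + [new_item]
    retired: list[dict] = []
    while len(reservoir) > k:
        # Retire the lowest-weight item, breaking ties in favor of the oldest.
        victim = min(reservoir, key=lambda t: (t["weight"], t["added_at"]))
        reservoir.remove(victim)
        retired.append(victim)
    return reservoir, retired
```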
To enable quality review without compromising test secrecy, reviewers receive: (1) metadata (domain, capability tags, difficulty estimate), (2) a random sample of 1–3 items from multi-item tests, (3) the scoring function signature. Full content is never revealed until retirement, minimizing leakage vectors while maintaining review rigor.
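A sketch of what a reviewer-facing packet might contain under this partial-revelation rule, assuming multi-item tests are stored with their metadata and a scoring-function signature; all field names here are illustrative.

```python
import random


def build_review_packet(test: dict, seed: int = 0) -> dict:
    """Expose only metadata, a 1-3 item sample, and the scoring function's signature."""
    rng = random.Random(seed)
    items = test["items"]
    sample_size = min(len(items), rng.randint(1, 3))
    return {
        "metadata": {
            "domain": test["domain"],
            "capability_tags": test["capability_tags"],
            "difficulty_estimate": test["difficulty_estimate"],
        },
        "sampled_items": rng.sample(items, sample_size),
        "scoring_fn_signature": test["scoring_fn_signature"],
        # Full item list and answer keys are never included until retirement.
    }
```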
- Position Paper: Cheng, Z., et al. "Benchmarking is Broken — Don't Let AI be its Own Judge." NeurIPS 2025. arXiv:2510.07575
- Data Contamination Study: Deng, C., et al. "Investigating Data Contamination in Modern Benchmarks for Large Language Models." arXiv:2311.09783, 2024.
- Prototype: peerbench.ai
- Related Work: LiveBench, Dynabench, SEAL Leaderboards
© 2025 PeerBench.ai & Contributors. Licensed under MIT.
