statbelt is a strict experimental harness for reproducible, statistically aware
model evaluation in Python.
Status: Alpha (APIs may evolve).
Supported Python: 3.11+.
Install from PyPI:

```bash
pip install statbelt
```

For local development:

```bash
uv sync --all-groups
```

Quick start:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from statbelt import ExperimentalHarness

dataset = load_breast_cancer()
X, y = dataset.data, dataset.target

report = (
    ExperimentalHarness()
    .data(X, y)
    .task("binary_classification")
    .compare(
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=21)),
    )
    .metrics("accuracy", "roc_auc", "log_loss")
    .design(cv=5, cv_repeats=2, random_state=42)
    .inference(alpha=0.05, bootstrap_resamples=2000)
    .compare_inference(method="paired_bootstrap", alternative="two-sided")
    .multiplicity(method="holm", family="global")
    .practical_significance(accuracy=0.005, roc_auc=0.002, log_loss=0.01)
    .baseline("logreg")
    .guardrails(min_improvement={"accuracy": 0.002}, confidence=0.95)
    .fasten("statbelt.lock.json")
    .evaluate()
)

print(report.summary())
```

Sample output:

```text
Task: binary_classification
CV folds: 5
CV repeats: 2
Bootstrap resamples: 2000
Confidence interval: 95%

Model: logreg
  accuracy: 0.9754 (CI 0.9666, 0.9833)
  roc_auc: 0.9948 (CI 0.9913, 0.9976)
  log_loss: 0.0775 (CI 0.0603, 0.0975)

Model: rf
  accuracy: 0.9596 (CI 0.9526, 0.9666)
  roc_auc: 0.9903 (CI 0.9860, 0.9941)
  log_loss: 0.1478 (CI 0.1066, 0.2145)

Pairwise comparisons:
  logreg - rf [accuracy]: 0.0158 (CI 0.0070, 0.0255), p_adj=0.002999, reject, practical
  logreg - rf [roc_auc]: 0.0045 (CI 0.0002, 0.0098), p_adj=0.03898, reject, practical
  logreg - rf [log_loss]: -0.0703 (CI -0.1396, -0.0245), p_adj=0.002999, reject, practical

Guardrails: FAIL
  rf vs logreg [accuracy]: FAIL (min 0.0020, CI -0.0263, -0.0070)
```
When pairwise inference and guardrails are configured, summary() also includes
pairwise comparison lines and an overall guardrail pass/fail section.
Multiclass tasks use the same workflow:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from statbelt import ExperimentalHarness

dataset = load_iris()
X, y = dataset.data, dataset.target

report = (
    ExperimentalHarness()
    .data(X, y)
    .task("multiclass_classification")
    .compare(
        ("logreg", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    )
    .metrics(
        "accuracy",
        "precision",
        "recall",
        "f1",
        "roc_auc",
        "log_loss",
        "roc_auc_ovo_weighted",
    )
    .design(cv=5, random_state=42)
    .inference(alpha=0.05, bootstrap_resamples=1000)
    .fasten("statbelt.lock.json")
    .evaluate()
)
```

Features:

- `ExperimentalHarness` builder-style API for binary and multiclass classification comparisons.
- Deterministic repeated stratified k-fold evaluation with shared folds across models.
- Bootstrap confidence intervals over fold-level metrics.
- Pairwise model inference with paired bootstrap/permutation p-values.
- Multiple-comparison correction (`holm`, `bonferroni`, `fdr_bh`).
- Practical-significance thresholds and baseline guardrail checks.
- Machine-readable exports via `EvaluationReport.to_json()` and `.to_dataframe()`.
- Lock artifact output (`statbelt.lock.json`) with config and split indices.
- Strict staged workflow: configure -> `fasten()` -> `evaluate()`.
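The fold-level bootstrap in the feature list can be pictured as a plain percentile bootstrap over per-fold metric values. A minimal sketch of the idea (illustrative only, not statbelt's implementation; the fold scores below are made up):

```python
import random

def percentile_bootstrap_ci(fold_scores, resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval over fold-level metric values."""
    rng = random.Random(seed)
    n = len(fold_scores)
    means = []
    for _ in range(resamples):
        # Resample folds with replacement and record the mean metric.
        sample = [fold_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi

lo, hi = percentile_bootstrap_ci([0.96, 0.97, 0.95, 0.98, 0.97])
```

Resampling at the fold level keeps each resample a plausible set of CV results, which is what the per-model intervals in the sample output summarize.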
Use pairwise inference to compare models directly:

```python
.compare_inference(method="paired_bootstrap", alternative="two-sided")
```

Supported values:

- `method`: `paired_bootstrap`, `permutation`
- `alternative`: `two-sided`, `greater`, `less`

How to choose `alternative`:

- `two-sided`: use when you only care whether A and B differ.
- `greater`: use when your question is "is A better than B?"
- `less`: use when your question is "is A worse than B?"

How to choose `method`:

- `paired_bootstrap`: default practical choice for CI + p-value style comparison.
- `permutation`: exact paired-randomization style null test over fold deltas.

Interpretation details:

- Pairwise rows are always `model_a` vs `model_b`; A/B come from `compare(...)` order.
- `delta` in the report is the raw metric-space `model_a - model_b`.
- One-sided p-values are metric-direction normalized, so `greater`/`less` keep the same meaning across mixed metrics (for example, both `accuracy` and `log_loss`).
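To make the `permutation` option and the direction normalization concrete, here is a hypothetical sketch of an exact paired sign-flip test over fold deltas; statbelt's actual procedure may differ, and the delta values below are invented:

```python
import itertools

def signflip_p_value(deltas, alternative="greater", lower_is_better=False):
    """Exact paired sign-flip test over fold deltas (model_a - model_b).

    Direction normalization: for lower-is-better metrics (e.g. log_loss),
    negate the deltas so 'greater' always means 'model_a is better'.
    """
    if lower_is_better:
        deltas = [-d for d in deltas]
    observed = sum(deltas)
    count = total = 0
    # Enumerate all 2^n sign assignments (the paired-randomization null).
    for signs in itertools.product((1, -1), repeat=len(deltas)):
        stat = sum(s * d for s, d in zip(signs, deltas))
        if alternative == "greater":
            count += stat >= observed
        elif alternative == "less":
            count += stat <= observed
        else:  # two-sided
            count += abs(stat) >= abs(observed)
        total += 1
    return count / total

# Invented accuracy deltas across 5 folds: model_a ahead on every fold.
p = signflip_p_value([0.01, 0.02, 0.015, 0.01, 0.02], alternative="greater")
```

Negating deltas for lower-is-better metrics is what lets `greater` consistently mean "model_a is better", whether the metric is `accuracy` or `log_loss`.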
Quick example:

```python
report = (
    ExperimentalHarness()
    .data(X, y)
    .task("binary_classification")
    .compare(("candidate", candidate_model), ("baseline", baseline_model))
    .metrics("accuracy", "log_loss")
    .compare_inference(method="paired_bootstrap", alternative="greater")
    .fasten()
    .evaluate()
)
# Here, p-values answer: "is candidate better than baseline?"
```

Control multiple testing with:
```python
.multiplicity(method="holm", family="global")
```

Supported values:

- `method`: `holm`, `bonferroni`, `fdr_bh`
- `family`: `global`, `per_metric`
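The `holm` option corresponds to the classic step-down adjustment. A small self-contained sketch (not statbelt's code) of how adjusted p-values are produced:

```python
def holm_adjust(p_values):
    """Holm step-down adjustment of raw p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending by p-value
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Scale by the number of hypotheses still "in play", then
        # enforce monotonicity so adjusted p-values never decrease.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

adj = holm_adjust([0.001, 0.013, 0.04])
```

Holm controls the family-wise error rate with no independence assumptions, which makes it a safe default; `fdr_bh` trades that guarantee for more power by controlling the false discovery rate instead.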
Practical thresholds:

```python
.practical_significance(accuracy=0.005, log_loss=0.01)
```

Guardrails against a baseline:

```python
.baseline("logreg")
.guardrails(min_improvement={"accuracy": 0.002}, confidence=0.95)
```

Rules:

- Threshold values must be finite and non-negative.
- Guardrails require `baseline(...)`.
- Guardrail metrics must also be included in `.metrics(...)`.
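One plausible reading of the guardrail check, consistent with the sample output above: a metric passes only when the lower confidence bound on the candidate-vs-baseline delta clears its `min_improvement` threshold. A hypothetical sketch:

```python
import math

def guardrail_passes(ci_low, min_improvement):
    """Pass when the CI lower bound on (candidate - baseline) clears the threshold.

    Assumes a higher-is-better metric; min_improvement must be finite and
    non-negative, mirroring the rules above.
    """
    if not (math.isfinite(min_improvement) and min_improvement >= 0):
        raise ValueError("min_improvement must be finite and non-negative")
    return ci_low >= min_improvement

# The FAIL line in the sample output: CI lower bound -0.0263 is below min 0.0020.
ok = guardrail_passes(-0.0263, 0.002)
```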
`evaluate()` returns an `EvaluationReport` with:

- `models`: per-model metric intervals
- `pairwise`: pairwise deltas, CIs, raw/adjusted p-values, practical-significance flags
- `guardrails`: per-check pass/fail and aggregate `overall_pass`
- `splits` and `split_metadata`: deterministic split definitions

Export helpers:

```python
report.to_json("report.json")
report.to_dataframe(kind="models")
report.to_dataframe(kind="pairwise")
```

`fasten()` writes schema version 3 lockfiles, including:
- design: `cv`, `cv_repeats`, `random_state`
- inference config: `alpha`, `bootstrap_resamples`, `pairwise_inference`
- multiplicity config
- practical-significance and guardrail config
- split indices with repeat/fold metadata
Supported tasks:
- `binary_classification`
- `multiclass_classification`
Binary metrics:
- `accuracy`
- `precision`
- `recall`
- `f1`
- `roc_auc`
- `log_loss`
Multiclass metrics:
- `accuracy`
- `precision_macro`, `precision_weighted`, `precision_micro`
- `recall_macro`, `recall_weighted`, `recall_micro`
- `f1_macro`, `f1_weighted`, `f1_micro`
- `roc_auc_ovr_macro`, `roc_auc_ovr_weighted`
- `roc_auc_ovo_macro`, `roc_auc_ovo_weighted`
- `log_loss`
Task-aware metric aliases:
| Metric name | `binary_classification` | `multiclass_classification` |
|---|---|---|
| `precision` | binary precision | `precision_macro` |
| `recall` | binary recall | `recall_macro` |
| `f1` | binary F1 | `f1_macro` |
| `roc_auc` | binary ROC AUC | `roc_auc_ovr_macro` |
Validation is fail-fast. For example:
- `log_loss` requires `predict_proba`.
- binary `roc_auc` accepts `predict_proba` or `decision_function`.
- multiclass ROC AUC metrics require `predict_proba`.
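A sketch of what fail-fast capability validation can look like; the stand-in models, the `REQUIREMENTS` table, and the `validate_metric` helper are all hypothetical, not statbelt internals:

```python
class ProbaModel:
    """Stand-in estimator exposing predict_proba (hypothetical)."""
    def predict_proba(self, X): ...

class MarginModel:
    """Stand-in estimator exposing only decision_function (hypothetical)."""
    def decision_function(self, X): ...

# Hypothetical mirror of the rules above: each metric lists acceptable capabilities.
REQUIREMENTS = {
    "log_loss": ("predict_proba",),
    "roc_auc": ("predict_proba", "decision_function"),  # binary: either works
}

def validate_metric(model, metric):
    """Fail fast if the model offers none of the capabilities the metric needs."""
    needed = REQUIREMENTS[metric]
    if not any(hasattr(model, attr) for attr in needed):
        raise ValueError(f"{metric} requires one of: {', '.join(needed)}")

validate_metric(MarginModel(), "roc_auc")  # fine: decision_function suffices
validate_metric(ProbaModel(), "log_loss")  # fine: predict_proba present
```

Checking capabilities up front means a misconfigured metric fails before any cross-validation work is done.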
Development checks:

```bash
uv sync --all-groups
uv run ruff check .
uv run pytest
```

For release operations (tagging, TestPyPI gate, PyPI publish), see RELEASING.md.
Limitations:

- Classification tasks only (regression is not supported yet).
This project is licensed under the GNU Affero General Public License, version 3
or later (AGPL-3.0-or-later). See LICENSE.