Statistical Learning Theory in Lean 4: Empirical Processes from Scratch


Abstract

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements material missing from the current Lean mathematical library (Mathlib), including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares regression with a sharp rate. The project was carried out using a human–AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, resulting in approximately 30,000 lines of human-verified Lean 4 code produced over 500 hours of supervised development. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation for future developments in machine learning theory.

[Figure: Lean formalization of the localized empirical process framework. The blue part covers capacity control and the red part covers concentration; the colored zones indicate the major results from the corresponding chapters of Wainwright (2019, High-Dimensional Statistics) and Boucheron et al. (2013, Concentration Inequalities).]

Implementation Roadmap

How to Run

# get Mathlib4 cache 
lake exe cache get

# build the project
lake build
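
Once the build succeeds, individual results can be inspected from a scratch file. A minimal sketch, assuming the dudley theorem from the table below is exported by the module SLT.Dudley (the file SLT/Dudley.lean mentioned in the Acknowledgement; the exact import path may differ):

import SLT.Dudley

-- Ask Lean to print the full formal statement of
-- Dudley's entropy integral bound.
#check @dudley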

Major Results

Our formalization library SLT contains (but is not limited to) the following results:

Name                                      Reference
small_ball_prob                           Vershynin (2018), Exercise 2.2.10
coveringNumber_lt_top_of_totallyBounded   Vershynin (2018), Remark 4.2.3
isENet_of_maximal                         Vershynin (2018), Lemma 4.2.6
coveringNumber_euclideanBall_le           Vershynin (2018), Corollary 4.2.13
coveringNumber_l1Ball_le                  Daras et al. (2021), Theorem 2
subGaussian_finite_max_bound              Wainwright (2019), Exercise 2.12
dudley                                    Boucheron et al. (2013), Corollary 13.2
efronStein                                Boucheron et al. (2013), Theorem 3.1
gaussianPoincare                          Boucheron et al. (2013), Theorem 3.20
han_inequality                            Boucheron et al. (2013), Theorem 4.1
entropy_duality                           Boucheron et al. (2013), Theorem 4.13
entropy_duality_T                         Boucheron et al. (2013), Remark 4.4
entropy_subadditive                       Boucheron et al. (2013), Theorem 4.22
bernoulli_logSobolev                      Boucheron et al. (2013), Theorem 5.1
gaussian_logSobolev_W12_pi                Boucheron et al. (2013), Theorem 5.4
lipschitz_cgf_bound                       Boucheron et al. (2013), Theorem 5.5
gaussian_lipschitz_concentration          Boucheron et al. (2013), Theorem 5.6
one_step_discretization                   Wainwright (2019), Proposition 5.17
local_gaussian_complexity_bound           Wainwright (2019), (5.48), Gaussian case
master_error_bound                        Wainwright (2019), Theorem 13.5
gaussian_complexity_monotone              Wainwright (2019), Lemma 13.6
linear_minimax_rate_rank                  Wainwright (2019), Example 13.8
bad_event_probability_bound               Wainwright (2019), Lemma 13.12
l1BallImage_coveringNumber_le             Raskutti et al. (2011), Lemma 4, q = 1
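
For orientation, the headline result dudley formalizes Dudley's entropy integral bound. Schematically, for a separable sub-Gaussian process $(X_t)_{t \in T}$ on a totally bounded pseudometric space $(T, d)$ with diameter $\delta$, the bound reads (in the normalization of Boucheron et al. (2013), Corollary 13.2; the Lean statement may fix constants differently):

$$\mathbb{E}\Big[\sup_{t \in T} X_t\Big] \;\le\; 12 \int_0^{\delta/2} \sqrt{\log N(T, d, \varepsilon)}\, \mathrm{d}\varepsilon,$$

where $N(T, d, \varepsilon)$ denotes the $\varepsilon$-covering number of $(T, d)$.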

Dataset

We release a high-quality Lean 4 training dataset for LLM formal reasoning: 865 traced theorems, 18,669 tactic steps, and 300M tokens from our SLT library. Every proof is human-verified and non-LLM-synthetic, with full proof-state traces (state_before → tactic → state_after).

Dataset   Download
Novel     🤗 HuggingFace
Random    🤗 HuggingFace
Corpus    🤗 HuggingFace
  • Novel: Validation and test sets contain theorems that use premises not seen during training (harder evaluation).
  • Random: Theorems are split randomly.
  • Corpus: 3,021 premises across 470 files (SLT library + referenced Mathlib/Lean4 stdlib declarations). Used for retrieval-augmented proving.
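
As a toy illustration of how one traced tactic step is recorded (the field names follow the description above; the released schema may carry additional metadata):

-- A two-step Lean 4 proof; each tactic contributes one trace record.
-- The first record would read:
--   state_before: ⊢ ∀ (n : ℕ), n + 0 = n
--   tactic:       intro n
--   state_after:  n : ℕ ⊢ n + 0 = n
example : ∀ n : ℕ, n + 0 = n := by
  intro n
  rfl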

Recipe for Vibe Formalization

We provide a practical recipe for human–AI collaborative formalization in Lean 4, distilled from our experience producing ~30,000 lines of human-verified code with Claude Code (Claude-Opus-4.5). The full guide and example prompts live in vibe-recipe/.

Workflow at a glance. The process follows four iterative steps:

  1. Decompose proofs into small lemmas. We recommend ensuring that each formalization target can be finished within a single agent context window (~300 lines) without auto-compaction. Small units increase the agent's effective thinking budget and reduce the information loss caused by context compaction.

  2. Design high-quality prompts. Supply the agent with (a) signatures of possibly-needed project-local declarations obtained via the file outline MCP tool—never full file contents, which quickly fill the context window and cause hallucinations—and (b) a well-written mathematical proof to formalize. Mathlib declarations can be discovered on-the-fly through Lean search MCP tools. A worked example is given in vibe-recipe/EXAMPLE_INSTRUCTIONS.md.

  3. Clean compiler warnings. Instruct the agent to eliminate all warnings, explicitly directing it to remove unused variables rather than masking them with _ (a harmful preference of Claude-Opus-4.5).

  4. Clean unused have statements. Use the custom #check_unused_have metaprogram (vibe-recipe/UnusedHaveDetector.lean) to detect and remove dead have bindings from proof terms; a toy example of such a binding is sketched below this list. Repeat Steps 3–4 until both are clean.
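
A minimal sketch of the dead have bindings that Step 4 removes (a contrived example, not taken from the library):

-- `hDead` never contributes to the final proof term, so a detector such
-- as the project's #check_unused_have metaprogram would flag it.
example (a b : ℕ) (h : a = b) : b = a := by
  have hDead : a + 0 = a := rfl  -- dead binding
  exact h.symm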

Human-in-the-loop intervention. A recurring failure mode we observe is that when the agent faces many simultaneous tactic errors in a long proof, it tends to abandon a largely correct proof structure in favor of drastic and often incorrect rewrites (e.g., "Let me simplify the approach…"). To counteract this, we explicitly instruct the agent to always fix the existing errors first. This forces incremental error resolution, which surfaces structurally diagnostic errors that expose the true root cause rather than triggering wholesale re-proofs.

References

  • Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
  • Daras, G., Dean, J., Jalal, A., & Dimakis, A. (2021). Intermediate Layer Optimization for Inverse Problems using Deep Generative Models. In Proceedings of the 38th International Conference on Machine Learning (Vol. 139, pp. 2421–2432). PMLR.
  • Raskutti, G., Wainwright, M. J., & Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. IEEE Transactions on Information Theory, 57(10), 6976–6994.
  • Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science (Vol. 47). Cambridge University Press.
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Vol. 48). Cambridge University Press.

Acknowledgement

  • SLT/SeparableSpaceSup.lean is sourced from lean-rademacher. We use separableSpaceSup_eq_real from it in SLT/Dudley.lean.
  • SLT/GaussianPoincare/LevyContinuity.lean is sourced from CLT. We use tendsto_iff_tendsto_charFun from Clt/Inversion.lean in SLT/GaussianPoincare/Limit.lean.
  • We use MCP tools from lean-lsp-mcp to let the agent interact with live LSP feedback and retrieve relevant information.

We greatly appreciate these remarkable repositories.
