Status: V0.5.0.5 — Phase 0 Complete, Phase 1 Starting
V4 Status: ✅ Graduated (GPT-2 parity at 17% fewer params)
Phase 0: ✅ COMPLETE (Base model characterization)
Updated: 2026-01-13
Repository: https://github.com/9to5ninja-projects/groundthink
License: MIT (see LICENSE)
⚠️ EXPERIMENTAL RESEARCH CODE — Not for production use. No warranties.

⚖️ ATTRIBUTION: This project builds on RWKV-6 (Peng et al., 2024) and Mamba-2 (Dao & Gu, 2024). Our contribution is the fusion architecture, training methodology, and validation framework. See ATTRIBUTION.md for full citations.
- 📖 About GroundThink — Project overview, status, and goals
- ⚖️ Attribution & Citations — Required reading for usage/citation
- 🚀 Getting Started — Installation and setup
- 🗺️ Documentation Map — Full documentation index
- 📊 V4 Graduation Summary — Phase 4.0 results
- 🔬 Phase 0 Findings — COMPLETE
- 🔮 V0.5 Roadmap — Twin Debate architecture plan (Phase 1 current)
Phase 0 Complete ✅ (2026-01-13):
- ✅ Pure RWKV-6 benchmarked (4M params) — AMPLIFIER (5.5× total variance)
- ✅ Pure Mamba-2 benchmarked (4M params) — AMPLIFIER at full model (2.0×), DAMPER at layer level
- ✅ GPT-1 baseline benchmarked (4M params) — AMPLIFIER (782× extreme)
- ✅ BlinkDL initialization confirmed architecture-agnostic (fixes saturation in all models)
- ✅ Comparative analysis complete — Fusion architecture decisions made
Key Discovery: All full models amplify variance, but SSMs are 142× more stable than attention-based models. RWKV amplifies per-layer, Mamba damps at layer level—complementary behavior confirmed!
Phase 1 Now Starting:
- Task 0.1: GRU Arbiter (stateful gating)
- Task 0.2: Mamba Residual Path (preserve damping)
- Task 0.3: Twin Debate Loss (pathway specialization)
- Task 0.4: 4M Pilot Run (target: Mamba >5% contribution)
See V0.5_ROADMAP.md and BASE_MODEL_CHARACTERIZATION.md for details.
| Model | Type | Variance Amplification | Key Insight |
|---|---|---|---|
| GPT-1 (4M) | Attention | 782× | Extreme amplification |
| RWKV-6 (4M) | SSM | 5.5× (1.28×/layer) | Amplifies, layer-level |
| Mamba-2 (4M) | SSM | 2.0× full / 0.005× layer | Damps at layer level! |
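The amplification figures above can be reproduced with forward hooks that compare output variance to input variance, both per layer and for the full model. A minimal sketch, using a toy stack of linear layers as a stand-in for the real models (the hypothetical `variance_amplification` helper is ours, not from the repo):

```python
import torch
import torch.nn as nn

def variance_amplification(model: nn.Module, x: torch.Tensor):
    """Return total and per-child-module output/input variance ratios."""
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            stats[name] = (out.var() / inputs[0].var()).item()
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_children()]
    with torch.no_grad():
        y = model(x)
    for h in handles:
        h.remove()
    total = (y.var() / x.var()).item()
    return total, stats

# Toy stand-in: four linear "blocks" at the d_model=128 width used here
model = nn.Sequential(*[nn.Linear(128, 128) for _ in range(4)])
total, per_layer = variance_amplification(model, torch.randn(2, 64, 128))
```

A model that damps at the layer level (ratios below 1) can still amplify overall once residual paths accumulate, which is exactly the Mamba-2 pattern in the table.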
- Layer-Level Fusion: Preserve Mamba's damping by fusing before residual aggregation
- BlinkDL Init: Apply to all components (embeddings: ±1e-4, projections: zero)
- Target Variance: 2–6× total (SSM range, not GPT-1's 782×)
- Open Question: How to add Mamba residuals without losing damping effect?
See BASE_MODEL_CHARACTERIZATION.md for full findings.
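The BlinkDL-style init above (tiny uniform embeddings, zeroed projections) can be sketched as a module-name sweep. This is our illustrative reading of the recipe, not the repo's implementation; the name-matching rule (`"proj"` in the module name) is an assumption:

```python
import torch.nn as nn

def blinkdl_init(model: nn.Module) -> None:
    """Embeddings: uniform in ±1e-4. Projection layers: zeroed weights."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            nn.init.uniform_(module.weight, -1e-4, 1e-4)
        elif isinstance(module, nn.Linear) and "proj" in name:
            nn.init.zeros_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

# Demo on a toy module (names are illustrative)
toy = nn.Module()
toy.emb = nn.Embedding(1000, 128)
toy.head_proj = nn.Linear(128, 1000)
blinkdl_init(toy)
```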
GroundThink is an experimental hybrid architecture combining:
- RWKV-6 (Peng et al., 2024) — recurrent-style, long-range memory
- Mamba-2 (Dao & Gu, 2024) — selective state-space model
- Gated Fusion (our contribution) — learnable pathway weighting
Our Contribution: The specific fusion mechanism, training methodology, and validation framework. We did not create RWKV-6 or Mamba-2 — we are exploring how to optimally combine them.
Both components run in parallel within each block, fused via learned gating. This design leverages RWKV's recurrent continuity and Mamba's selective reasoning in a single forward pass.
Key innovation: Learned α-gating enables context-dependent pathway weighting, allowing the model to dynamically choose between recurrent (RWKV) and selective (Mamba) processing modes.
┌─────────────────────────────────────┐
│ Input: [batch, seq, 128] │
├─────────────────────────────────────┤
│ │
│ Norm │
│ ├─→ RWKV-6 ──┐ │
│ └─→ Mamba-2 ─┤ │
│ ▼ │
│ Gated Fusion (learns α) │
│    output = α·rwkv + (1-α)·mamba    │
│ │ │
│ + SKIP ────────────→ │
│ │ │
│ ▼ │
│ RMSNorm + FFN │
│ │ │
│ + SKIP ────────────→ │
│ │ │
│ ▼ │
│ Output: [batch, seq, 128] │
│ │
└─────────────────────────────────────┘
See V4_DESIGN.md for detailed architecture diagrams and layer specifications.
| Metric | Result | Comparison |
|---|---|---|
| GPT-2 Parity | Loss ratio 1.008 | ✅ EQUIVALENT |
| Parameter Efficiency | 5.6M params | 17% fewer than GPT-2 (6.8M) |
| Dataset | WikiText-103 | 16K BPE tokenization |
| Long Context | 1.04× @ 512 tokens | Stable degradation |
| Throughput | 42.9K tok/s | 4.5× slower (kernel optimization needed) |
1. Mamba Paradox:
- Mamba receives 10× larger gradients than RWKV
- But contributes <0.3% to final state
- Architectural behavior, not a training bug
2. Attractor Zone:
- All gate initializations converge to 10-30% RWKV/Mamba ratio
- Optimizer finds same equilibrium regardless of starting bias
3. Architecture Validated:
- Hybrid fusion matches transformer performance at small scale
- Linear O(n) complexity maintained for both pathways
- Ready for V0.5 architectural improvements
See OBSERVATION_SYNTHESIS.md for detailed analysis.
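The gate-ratio and pathway-contribution numbers above (the 10-30% attractor zone, Mamba's <0.3% contribution) come from diagnostics of this general shape. A hedged sketch with synthetic tensors, not Phase 4 data — the helpers are hypothetical:

```python
import torch

def gate_ratio(alpha: torch.Tensor) -> float:
    """Mean fraction of the fused state routed through the RWKV pathway."""
    return alpha.mean().item()

def contribution(path_out: torch.Tensor, fused: torch.Tensor) -> float:
    """Relative L2 magnitude of one pathway's share of the fused output."""
    return (path_out.norm() / fused.norm()).item()

# Synthetic example: a gate sitting at α ≈ 0.8 (80/20 RWKV/Mamba split)
alpha = torch.full((2, 8, 16), 0.8)
rwkv_out = torch.randn(2, 8, 16)
mamba_out = torch.randn(2, 8, 16)
fused = alpha * rwkv_out + (1 - alpha) * mamba_out
mamba_share = contribution((1 - alpha) * mamba_out, fused)
```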
```bash
# Install dependencies (Python 3.10+, CUDA 12.1+)
pip install -r requirements.txt

# On Linux, install optional faster kernels
pip install causal-conv1d mamba-ssm
```

```bash
# 1. Setup
git clone https://github.com/9to5ninja-projects/groundthink.git
cd groundthink
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 2. Verify environment
python -m tests.test_phase0_complete

# 3. Run benchmark
python benchmark_variants.py
```

Essential reading:
- ONBOARDING.md — What are RWKV and Mamba? Why combine them?
- GETTING_STARTED.md — Clone, install, run first benchmark
- V0.5_ROADMAP.md — Current phase implementation plan
- V4_DESIGN.md — Architecture specification
Current status:
- HANDOFF.md — Agent handoff, current tasks
- BASE_MODEL_CHARACTERIZATION.md — Phase 0 findings
- CHANGELOG.md — Version history
Contributions follow a survival-of-the-fittest approach:
- Create a new variant (fork hybrid_v4_GF.py)
- Benchmark it against current winner (GF-MH)
- If it beats the winner, merge it
- Update README with new results
The only gate: must benchmark fairly (same dataset, same steps, same seeds).
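Fixing seeds is the core of "same seeds" above. A minimal sketch of the kind of setup a fair variant benchmark needs (the `fix_seeds` helper is illustrative, not a repo function):

```python
import random

import numpy as np
import torch

def fix_seeds(seed: int = 1337) -> None:
    """Pin every RNG so variant runs differ only in architecture."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade a little speed for run-to-run determinism on GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Call `fix_seeds()` once before building the model and the data loader, with the same seed for every variant being compared.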
MIT (see LICENSE)
See documentation in this order:
- Current Phase: V0.5_ROADMAP.md
- Architecture: V4_DESIGN.md
- Status: HANDOFF.md
- Phase 0 Findings: BASE_MODEL_CHARACTERIZATION.md
Last Updated: 2026-01-13 (Phase 0 Complete, Phase 1 Starting)