2 changes: 2 additions & 0 deletions README.md
@@ -6,9 +6,11 @@ Learning by implementing papers
- [GPT2, 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [ColBERT, 2020](https://arxiv.org/abs/2004.12832)
- [Qwen3, 2025](https://huggingface.co/Qwen/Qwen3-0.6B)
- [Flow Matching, 2023](https://arxiv.org/abs/2210.02747): For RL control tasks ([Tutorial](flow_matching_pendulum/TUTORIAL.md))

## Reinforcement Learning
- Reinforce
- Flow Matching for Control: Inverted Pendulum with behavior cloning ([flow_matching_pendulum/](flow_matching_pendulum/))

## Training
### Dataset Generation
151 changes: 151 additions & 0 deletions flow_matching_pendulum/QUICKSTART.md
@@ -0,0 +1,151 @@
# Flow Matching Quick Start Guide

## 🎯 What You Have

A complete, educational implementation of **Flow Matching** for learning control policies! This teaches you:
- How to use generative modeling for RL
- Flow matching theory and practice
- The foundations for humanoid locomotion control

## 🚀 Get Started in 5 Minutes

### 1. See the Expert in Action
```bash
cd flow_matching_pendulum
uv run python main.py demo
```
This creates `expert_trajectory.png` showing the energy-based swing-up controller.

### 2. Run a Quick Test
```bash
uv run python main.py test
```
Trains a tiny model to verify everything works (~2 minutes).

### 3. Train a Real Policy
```bash
uv run python main.py train --episodes 100 --epochs 50
```
Full training run (~10-15 minutes). Creates checkpoints in `checkpoints/`.

### 4. Evaluate and Visualize
```bash
uv run python eval.py --checkpoint checkpoints/best_model_*.eqx --visualize
```
Creates beautiful visualizations comparing your policy to the expert!

## 📊 What the Visualizations Show

- **expert_trajectory.png**: Expert controller behavior (swing-up + balance)
- **rollout_comparison.png**: Your policy vs expert side-by-side
- **flow_field.png**: The learned velocity field v_θ(z, t, state)
- **denoising_process.png**: How noise → action transformation works

## 🎓 Learning Path

1. **Read the Theory** (`README.md`): Understand flow matching basics
2. **Study the Code** (`flow_matching.py`): See implementation with detailed comments
3. **Follow the Tutorial** (`TUTORIAL.md`): Deep dive into theory and practice
4. **Experiment**: Try different hyperparameters, visualizations, etc.
5. **Scale Up**: Apply to humanoid locomotion!

## 🔧 Key Files

```
flow_matching_pendulum/
├── flow_matching.py # Core: VelocityNet, loss, sampling
├── pendulum_env.py # Environment + expert controller
├── train.py # Training loop
├── eval.py # Evaluation + visualizations
├── main.py # CLI interface
├── README.md # Project overview
├── TUTORIAL.md # Comprehensive tutorial
└── QUICKSTART.md # This file!
```

## 🎮 Common Commands

```bash
# Demo
uv run python main.py demo

# Quick test
uv run python main.py test

# Full training with custom params
uv run python main.py train \
--episodes 200 \
--epochs 100 \
--batch-size 256 \
--lr 0.0003 \
--hidden-dim 256

# Evaluate
uv run python eval.py \
--checkpoint checkpoints/best_model_*.eqx \
--num-episodes 20 \
--num-steps 20 \
--visualize

# Test individual modules
uv run python flow_matching.py # Test core implementation
uv run python pendulum_env.py # Test environment + expert
```

## 💡 Pro Tips

1. **Start small**: Use `main.py test` first to verify everything works
2. **More data helps**: Try 100-200 expert episodes for best results
3. **Sampling steps trade-off**: 10 steps is fast, 20 is better quality
4. **Watch the loss**: Should decrease from ~3.0 to ~1.0 or lower
5. **Compare with expert**: Your policy should get within 80-90% of expert performance

## 🐛 Troubleshooting

**Model not learning?**
- Increase epochs (try 100)
- Collect more expert data (try 200 episodes)
- Increase model size (hidden_dim=512, num_layers=4)

**Actions too noisy?**
- Use more sampling steps (20-50)
- Use Heun integration instead of Euler (better but slower)
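The Euler/Heun trade-off fits in a few lines. This is an illustrative integrator, not the repo's actual sampler; `v` stands in for the learned velocity field:

```python
def integrate_flow(v, z0, num_steps=20, method="heun"):
    """Integrate dz/dt = v(z, t) from t=0 to t=1, starting at noise z0."""
    z, dt = z0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v1 = v(z, t)
        if method == "euler":
            z = z + dt * v1                  # one slope per step
        else:                                # Heun: average start/end slopes
            v2 = v(z + dt * v1, t + dt)
            z = z + dt * 0.5 * (v1 + v2)
    return z

# Toy velocity field v(z, t) = 2t: the exact endpoint is z0 + 1.
v = lambda z, t: 2.0 * t
euler = integrate_flow(v, 0.0, method="euler")  # underestimates: ~0.95
heun = integrate_flow(v, 0.0, method="heun")    # exact on this field: ~1.0
```

Heun costs two velocity evaluations per step but follows curved flows much more closely, which is why it tends to produce less noisy actions at the same step count.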

**Want better performance?**
- The expert is energy-based + LQR, achieving roughly -500 to -200 return
- Flow matching should get close with enough data and training
- Try combining with RL fine-tuning for best results

## 🚀 Next: Humanoids!

Once you're comfortable with the pendulum:

1. **Read TUTORIAL.md Part 4**: Scaling to humanoids
2. **Choose environment**: dm_control humanoid, Isaac Gym, or MuJoCo
3. **Adapt the code**:
- Change state_dim (3 → 50-200)
- Change action_dim (1 → 20-30)
- Add velocity conditioning
4. **Collect expert data**: RL policy or motion capture
5. **Train**: Same algorithm, just bigger!

## 📚 Learning Resources

- **Flow Matching Paper**: https://arxiv.org/abs/2210.02747
- **Diffusion Policy**: Similar idea for robotics
- **Score Matching**: Related generative approach

## ✅ Success Criteria

You're ready for humanoids when you can:
- [ ] Explain flow matching in your own words
- [ ] Understand the VelocityNet architecture
- [ ] Train a policy that gets >80% of expert performance
- [ ] Create and interpret visualizations
- [ ] Modify hyperparameters and see effects

## 🎉 Have Fun!

Flow matching is powerful and elegant. Enjoy learning, experimenting, and scaling to complex robots!

Questions? Check TUTORIAL.md or open an issue on GitHub.
139 changes: 139 additions & 0 deletions flow_matching_pendulum/README.md
@@ -0,0 +1,139 @@
# Flow Matching for Inverted Pendulum Control

An educational implementation of **Flow Matching** applied to reinforcement learning control problems, starting with the classic inverted pendulum and designed to scale to humanoid locomotion.

## 🎯 What is Flow Matching?

Flow Matching is a generative modeling technique that learns to transform a simple source distribution (e.g., Gaussian noise) into a complex target distribution by learning a continuous-time flow.

### Key Concepts

**1. Continuous Normalizing Flows (CNFs)**
- Think of it as learning a vector field that "pushes" samples from noise to data
- At each point in space-time, the model predicts which direction to move
- Starting from random noise at t=0, we follow the flow to get a sample at t=1

**2. Flow Matching Objective**
Unlike diffusion models that add/remove noise, flow matching directly learns the velocity field:

```
Flow ODE: dx/dt = v_θ(x, t)
```

Where:
- `x` is the state (position in data space)
- `t` is time ∈ [0, 1]
- `v_θ` is our learned velocity field (neural network)

**3. Training with Optimal Transport**
We use the *conditional flow matching* objective:
```
L(θ) = E[||v_θ(x_t, t) - (x_1 - x_0)||²]
```

Where:
- `x_0 ~ N(0, I)` (source: Gaussian noise)
- `x_1 ~ p_data` (target: expert demonstrations)
- `x_t = (1-t)x_0 + t·x_1` (linear interpolation)
- The conditional target velocity is simply `x_1 - x_0`!

This is beautifully simple: the model learns to predict the direction from noise to data.
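The objective really is this simple in code. A minimal NumPy sketch (the repo itself appears to use JAX/Equinox; `v_theta` here is any callable standing in for the velocity network):

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_theta, x1):
    """Conditional flow matching loss over a batch of data samples x1."""
    n, d = x1.shape
    x0 = rng.standard_normal((n, d))     # x_0 ~ N(0, I): source noise
    t = rng.uniform(size=(n, 1))         # t ~ U[0, 1]
    xt = (1.0 - t) * x0 + t * x1         # x_t: linear interpolation path
    target = x1 - x0                     # optimal straight-line velocity
    pred = v_theta(xt, t)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

# Sanity check: with data pinned at 0, a zero predictor's loss is
# E||0 - (0 - x_0)||^2 = E||x_0||^2, about d (= 1) per sample.
loss = cfm_loss(lambda xt, t: np.zeros_like(xt), np.zeros((4096, 1)))
```

Training is just minimizing this loss by gradient descent on the network parameters; no ODE solver is needed at training time, only at sampling time.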

## 🤸 Application to RL: Why Flow Matching for Control?

Traditional RL methods (PPO, SAC, etc.) optimize for cumulative reward. Flow matching offers an alternative:

**Behavior Cloning as Generative Modeling**
- Expert demonstrations define a distribution over state-action pairs
- Flow matching learns to generate actions conditioned on states
- Benefits:
- Can model multimodal action distributions
- Smooth interpolation in action space
- Naturally handles continuous control
- Can be combined with diffusion-style iterative refinement

**For the Inverted Pendulum:**
- State: `[cos(θ), sin(θ), θ_dot]` (angle and angular velocity)
- Action: `[torque]` (continuous control)
- Goal: Learn to map states → actions from expert trajectories
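For intuition about where the demonstrations come from: the expert described in this repo is an energy-based swing-up with a balance mode near upright. The sketch below is illustrative only; the gains, switching threshold, and the PD stand-in for LQR are all hypothetical, not the values in the repository's `pendulum_env.py`:

```python
import numpy as np

def expert_torque(theta, theta_dot, m=1.0, l=1.0, g=9.81,
                  k_e=0.5, k_p=20.0, k_d=4.0, u_max=2.0):
    """Energy-based swing-up with PD balance; theta = 0 is upright."""
    E = 0.5 * m * l**2 * theta_dot**2 + m * g * l * np.cos(theta)
    E_up = m * g * l                  # energy of the upright rest state
    if abs(theta) < 0.3:              # near upright: stabilize
        u = -k_p * theta - k_d * theta_dot
    else:
        # pump energy toward E_up (a small kick is needed from exact rest,
        # since the torque vanishes when theta_dot = 0)
        u = k_e * (E_up - E) * theta_dot
    return float(np.clip(u, -u_max, u_max))
```

The energy term injects power `u * theta_dot = k_e * (E_up - E) * theta_dot**2 >= 0` whenever the pendulum is below the target energy, which is what makes the swing-up work.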

## 🏗️ Architecture Overview

```
State → [Encoder] → Embedding
Noise z_0 ────────────→ [Flow Network] → Refined Action
Time t ────────────────→ ↑
Predicts velocity v_θ(z_t, t, state)
```

### Flow Network Design
- **Input**: Current noisy action `z_t`, time `t`, state embedding
- **Output**: Velocity vector (direction to move in action space)
- **Architecture**: MLP with time and state conditioning
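The conditioning can be as simple as concatenation. This toy NumPy forward pass mirrors the shape of the design above; the repository's actual `VelocityNet` in `flow_matching.py` is likely an Equinox module and differs in detail:

```python
import numpy as np

class TinyVelocityNet:
    """Toy v_theta(z_t, t, state): concatenated inputs, two-layer MLP."""
    def __init__(self, state_dim=3, action_dim=1, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = action_dim + 1 + state_dim       # [z_t, t, state]
        self.W1 = rng.standard_normal((in_dim, hidden)) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, action_dim)) * 0.1
        self.b2 = np.zeros(action_dim)

    def __call__(self, z_t, t, state):
        x = np.concatenate([z_t, [t], state])     # time + state conditioning
        h = np.tanh(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2              # velocity in action space

net = TinyVelocityNet()
v_out = net(np.zeros(1), 0.5, np.array([1.0, 0.0, 0.0]))  # state [cos θ, sin θ, θ̇]
```

Because the state enters only as extra input features, swapping the 3-dimensional pendulum state for a high-dimensional humanoid state changes `state_dim` and nothing else about the architecture.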

## 📚 Scaling to Humanoids

This implementation is designed with humanoid locomotion in mind:

1. **State Representation**: Easy to extend from 3D pendulum state to high-dimensional humanoid state (joint angles, velocities, orientation, etc.)

2. **Action Dimension**: Pendulum has 1 action (torque), humanoids have ~20+ (joint torques). Flow matching scales naturally to high dimensions.

3. **Multimodal Behaviors**: Humanoids may have multiple valid gaits (walking, running, jumping). Flow matching can capture these modes.

4. **Velocity Tracking**: The pendulum teaches you to match reference velocities. For humanoids, you'll track desired walking velocities - same concept, higher dimensions!

## 🚀 Quick Start

```bash
# Install dependencies
uv sync

# Train a flow matching policy from expert data
uv run python main.py train --episodes 100 --epochs 50

# Evaluate the learned policy and generate visualizations
uv run python eval.py --checkpoint checkpoints/best_model_*.eqx --visualize
```

## 📖 Code Structure

- `flow_matching.py`: Core flow matching implementation (VelocityNet, loss, sampling)
- `pendulum_env.py`: Environment wrapper and expert controller
- `train.py`: Training loop with expert data collection
- `eval.py`: Evaluation and visualization
- `main.py`: CLI interface

## 🧠 Key Insights for Learning

1. **Flow Matching vs Diffusion**: Flow matching uses straight paths (OT), diffusion uses curved paths. Flow matching is often simpler to train and faster to sample.

2. **Conditional Generation**: We condition the flow on the state, making this a *conditional flow matching* problem.

3. **Trade-off**: More flow steps → better quality but slower. Start with 10-20 steps.

4. **Data Efficiency**: Flow matching can be data-efficient since it directly learns from demonstrations without rewards.

## 📊 What You'll Learn

- [x] Flow matching theory and implementation
- [x] Conditional generation for RL
- [x] Neural ODEs and continuous-time modeling
- [x] Behavior cloning with generative models
- [x] Foundation for scaling to complex robots

## 🔗 References

- [Flow Matching for Generative Modeling](https://arxiv.org/abs/2210.02747) (Lipman et al., 2023)
- [Diffusion Models for Reinforcement Learning](https://arxiv.org/abs/2205.09991) (Diffuser)
- [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) (DDPM)

---

**Next Steps**: Once you master this, you can apply the same techniques to humanoid locomotion using environments like `dm_control humanoid` or `Isaac Gym`!