2 changes: 2 additions & 0 deletions README.md
@@ -6,9 +6,11 @@ Learning by implementing papers
- [GPT2, 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [ColBERT, 2020](https://arxiv.org/abs/2004.12832)
- [Qwen3, 2025](https://huggingface.co/Qwen/Qwen3-0.6B)
- [Flow Matching, 2023](https://arxiv.org/abs/2210.02747): For RL control tasks ([Tutorial](flow_matching_pendulum/TUTORIAL.md))

## Reinforcement Learning
- Reinforce
- Flow Matching for Control: Inverted Pendulum with behavior cloning ([flow_matching_pendulum/](flow_matching_pendulum/))

## Training
### Dataset Generation
151 changes: 151 additions & 0 deletions flow_matching_pendulum/QUICKSTART.md
@@ -0,0 +1,151 @@
# Flow Matching Quick Start Guide

## 🎯 What You Have

A complete, educational implementation of **Flow Matching** for learning control policies! This teaches you:
- How to use generative modeling for RL
- Flow matching theory and practice
- The foundations for humanoid locomotion control

## 🚀 Get Started in 5 Minutes

### 1. See the Expert in Action
```bash
cd flow_matching_pendulum
uv run python main.py demo
```
This creates `expert_trajectory.png` showing the energy-based swing-up controller.

### 2. Run a Quick Test
```bash
uv run python main.py test
```
Trains a tiny model to verify everything works (~2 minutes).

### 3. Train a Real Policy
```bash
uv run python main.py train --episodes 100 --epochs 50
```
Full training run (~10-15 minutes). Creates checkpoints in `checkpoints/`.

### 4. Evaluate and Visualize
```bash
uv run python eval.py --checkpoint checkpoints/best_model_*.eqx --visualize
```
Creates beautiful visualizations comparing your policy to the expert!

## 📊 What the Visualizations Show

- **expert_trajectory.png**: Expert controller behavior (swing-up + balance)
- **rollout_comparison.png**: Your policy vs expert side-by-side
- **flow_field.png**: The learned velocity field v_θ(z, t, state)
- **denoising_process.png**: How noise → action transformation works

## 🎓 Learning Path

1. **Read the Theory** (`README.md`): Understand flow matching basics
2. **Study the Code** (`flow_matching.py`): See implementation with detailed comments
3. **Follow the Tutorial** (`TUTORIAL.md`): Deep dive into theory and practice
4. **Experiment**: Try different hyperparameters, visualizations, etc.
5. **Scale Up**: Apply to humanoid locomotion!

## 🔧 Key Files

```
flow_matching_pendulum/
├── flow_matching.py # Core: VelocityNet, loss, sampling
├── pendulum_env.py # Environment + expert controller
├── train.py # Training loop
├── eval.py # Evaluation + visualizations
├── main.py # CLI interface
├── README.md # Project overview
├── TUTORIAL.md # Comprehensive tutorial
└── QUICKSTART.md # This file!
```

## 🎮 Common Commands

```bash
# Demo
uv run python main.py demo

# Quick test
uv run python main.py test

# Full training with custom params
uv run python main.py train \
--episodes 200 \
--epochs 100 \
--batch-size 256 \
--lr 0.0003 \
--hidden-dim 256

# Evaluate
uv run python eval.py \
--checkpoint checkpoints/best_model_*.eqx \
--num-episodes 20 \
--num-steps 20 \
--visualize

# Test individual modules
uv run python flow_matching.py # Test core implementation
uv run python pendulum_env.py # Test environment + expert
```

## 💡 Pro Tips

1. **Start small**: Use `main.py test` first to verify everything works
2. **More data helps**: Try 100-200 expert episodes for best results
3. **Sampling steps trade-off**: 10 steps is fast, 20 is better quality
4. **Watch the loss**: Should decrease from ~3.0 to ~1.0 or lower
5. **Compare with expert**: Your policy should get within 80-90% of expert performance

## 🐛 Troubleshooting

**Model not learning?**
- Increase epochs (try 100)
- Collect more expert data (try 200 episodes)
- Increase model size (hidden_dim=512, num_layers=4)

**Actions too noisy?**
- Use more sampling steps (20-50)
- Use Heun integration instead of Euler (better but slower)
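The Euler/Heun trade-off fits in a few lines. This is an illustrative integrator, not the repo's actual sampler; `v` stands in for the learned velocity field:

```python
def integrate_flow(v, z0, num_steps=20, method="heun"):
    """Integrate dz/dt = v(z, t) from t=0 to t=1, starting at noise z0."""
    z, dt = z0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v1 = v(z, t)
        if method == "euler":
            z = z + dt * v1                  # one slope per step
        else:                                # Heun: average start/end slopes
            v2 = v(z + dt * v1, t + dt)
            z = z + dt * 0.5 * (v1 + v2)
    return z

# Toy velocity field v(z, t) = 2t: the exact endpoint is z0 + 1.
v = lambda z, t: 2.0 * t
euler = integrate_flow(v, 0.0, method="euler")  # underestimates: ~0.95
heun = integrate_flow(v, 0.0, method="heun")    # exact on this field: ~1.0
```

Heun costs two velocity evaluations per step but follows curved flows much more closely, which is why it tends to produce less noisy actions at the same step count.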

**Want better performance?**
- The expert is energy-based + LQR, achieving roughly -500 to -200 return
- Flow matching should get close with enough data and training
- Try combining with RL fine-tuning for best results

## 🚀 Next: Humanoids!

Once you're comfortable with the pendulum:

1. **Read TUTORIAL.md Part 4**: Scaling to humanoids
2. **Choose environment**: dm_control humanoid, Isaac Gym, or MuJoCo
3. **Adapt the code**:
- Change state_dim (3 → 50-200)
- Change action_dim (1 → 20-30)
- Add velocity conditioning
4. **Collect expert data**: RL policy or motion capture
5. **Train**: Same algorithm, just bigger!

## 📚 Learning Resources

- **Flow Matching Paper**: https://arxiv.org/abs/2210.02747
- **Diffusion Policy**: Similar idea for robotics
- **Score Matching**: Related generative approach

## ✅ Success Criteria

You're ready for humanoids when you can:
- [ ] Explain flow matching in your own words
- [ ] Understand the VelocityNet architecture
- [ ] Train a policy that gets >80% of expert performance
- [ ] Create and interpret visualizations
- [ ] Modify hyperparameters and see effects

## 🎉 Have Fun!

Flow matching is powerful and elegant. Enjoy learning, experimenting, and scaling to complex robots!

Questions? Check TUTORIAL.md or open an issue on GitHub.
139 changes: 139 additions & 0 deletions flow_matching_pendulum/README.md
@@ -0,0 +1,139 @@
# Flow Matching for Inverted Pendulum Control

An educational implementation of **Flow Matching** applied to reinforcement learning control problems, starting with the classic inverted pendulum and designed to scale to humanoid locomotion.

## 🎯 What is Flow Matching?

Flow Matching is a generative modeling technique that learns to transform a simple source distribution (e.g., Gaussian noise) into a complex target distribution by learning a continuous-time flow.

### Key Concepts

**1. Continuous Normalizing Flows (CNFs)**
- Think of it as learning a vector field that "pushes" samples from noise to data
- At each point in space-time, the model predicts which direction to move
- Starting from random noise at t=0, we follow the flow to get a sample at t=1

**2. Flow Matching Objective**
Unlike diffusion models that add/remove noise, flow matching directly learns the velocity field:

```
Flow ODE: dx/dt = v_θ(x, t)
```

Where:
- `x` is the state (position in data space)
- `t` is time ∈ [0, 1]
- `v_θ` is our learned velocity field (neural network)

**3. Training with Optimal Transport**
We use the *conditional flow matching* objective:
```
L(θ) = E[||v_θ(x_t, t) - (x_1 - x_0)||²]
```

Where:
- `x_0 ~ N(0, I)` (source: Gaussian noise)
- `x_1 ~ p_data` (target: expert demonstrations)
- `x_t = (1-t)x_0 + t·x_1` (linear interpolation)
- The conditional target velocity is simply `x_1 - x_0`!

This is beautifully simple: the model learns to predict the direction from noise to data.
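The objective really is this simple in code. A minimal NumPy sketch (the repo itself appears to use JAX/Equinox; `v_theta` here is any callable standing in for the velocity network):

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_theta, x1):
    """Conditional flow matching loss over a batch of data samples x1."""
    n, d = x1.shape
    x0 = rng.standard_normal((n, d))     # x_0 ~ N(0, I): source noise
    t = rng.uniform(size=(n, 1))         # t ~ U[0, 1]
    xt = (1.0 - t) * x0 + t * x1         # x_t: linear interpolation path
    target = x1 - x0                     # optimal straight-line velocity
    pred = v_theta(xt, t)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

# Sanity check: with data pinned at 0, a zero predictor's loss is
# E||0 - (0 - x_0)||^2 = E||x_0||^2, about d (= 1) per sample.
loss = cfm_loss(lambda xt, t: np.zeros_like(xt), np.zeros((4096, 1)))
```

Training is just minimizing this loss by gradient descent on the network parameters; no ODE solver is needed at training time, only at sampling time.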

## 🤸 Application to RL: Why Flow Matching for Control?

Traditional RL methods (PPO, SAC, etc.) optimize for cumulative reward. Flow matching offers an alternative:

**Behavior Cloning as Generative Modeling**
- Expert demonstrations define a distribution over state-action pairs
- Flow matching learns to generate actions conditioned on states
- Benefits:
- Can model multimodal action distributions
- Smooth interpolation in action space
- Naturally handles continuous control
- Can be combined with diffusion-style iterative refinement

**For the Inverted Pendulum:**
- State: `[cos(θ), sin(θ), θ_dot]` (angle and angular velocity)
- Action: `[torque]` (continuous control)
- Goal: Learn to map states → actions from expert trajectories
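For intuition about where the demonstrations come from: the expert described in this repo is an energy-based swing-up with a balance mode near upright. The sketch below is illustrative only; the gains, switching threshold, and the PD stand-in for LQR are all hypothetical, not the values in the repository's `pendulum_env.py`:

```python
import numpy as np

def expert_torque(theta, theta_dot, m=1.0, l=1.0, g=9.81,
                  k_e=0.5, k_p=20.0, k_d=4.0, u_max=2.0):
    """Energy-based swing-up with PD balance; theta = 0 is upright."""
    E = 0.5 * m * l**2 * theta_dot**2 + m * g * l * np.cos(theta)
    E_up = m * g * l                  # energy of the upright rest state
    if abs(theta) < 0.3:              # near upright: stabilize
        u = -k_p * theta - k_d * theta_dot
    else:
        # pump energy toward E_up (a small kick is needed from exact rest,
        # since the torque vanishes when theta_dot = 0)
        u = k_e * (E_up - E) * theta_dot
    return float(np.clip(u, -u_max, u_max))
```

The energy term injects power `u * theta_dot = k_e * (E_up - E) * theta_dot**2 >= 0` whenever the pendulum is below the target energy, which is what makes the swing-up work.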

## 🏗️ Architecture Overview

```
State → [Encoder] → Embedding
Noise z_0 ────────────→ [Flow Network] → Refined Action
Time t ────────────────→ ↑
Predicts velocity v_θ(z_t, t, state)
```

### Flow Network Design
- **Input**: Current noisy action `z_t`, time `t`, state embedding
- **Output**: Velocity vector (direction to move in action space)
- **Architecture**: MLP with time and state conditioning
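The conditioning can be as simple as concatenation. This toy NumPy forward pass mirrors the shape of the design above; the repository's actual `VelocityNet` in `flow_matching.py` is likely an Equinox module and differs in detail:

```python
import numpy as np

class TinyVelocityNet:
    """Toy v_theta(z_t, t, state): concatenated inputs, two-layer MLP."""
    def __init__(self, state_dim=3, action_dim=1, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = action_dim + 1 + state_dim       # [z_t, t, state]
        self.W1 = rng.standard_normal((in_dim, hidden)) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, action_dim)) * 0.1
        self.b2 = np.zeros(action_dim)

    def __call__(self, z_t, t, state):
        x = np.concatenate([z_t, [t], state])     # time + state conditioning
        h = np.tanh(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2              # velocity in action space

net = TinyVelocityNet()
v_out = net(np.zeros(1), 0.5, np.array([1.0, 0.0, 0.0]))  # state [cos θ, sin θ, θ̇]
```

Because the state enters only as extra input features, swapping the 3-dimensional pendulum state for a high-dimensional humanoid state changes `state_dim` and nothing else about the architecture.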

## 📚 Scaling to Humanoids

This implementation is designed with humanoid locomotion in mind:

1. **State Representation**: Easy to extend from 3D pendulum state to high-dimensional humanoid state (joint angles, velocities, orientation, etc.)

2. **Action Dimension**: Pendulum has 1 action (torque), humanoids have ~20+ (joint torques). Flow matching scales naturally to high dimensions.

3. **Multimodal Behaviors**: Humanoids may have multiple valid gaits (walking, running, jumping). Flow matching can capture these modes.

4. **Velocity Tracking**: The pendulum teaches you to match reference velocities. For humanoids, you'll track desired walking velocities - same concept, higher dimensions!

## 🚀 Quick Start

```bash
# Install dependencies
uv sync

# Train a flow matching policy from expert data
uv run python main.py train --episodes 100 --epochs 50

# Evaluate the learned policy and generate visualizations
uv run python eval.py --checkpoint checkpoints/best_model_*.eqx --visualize
```

## 📖 Code Structure

- `flow_matching.py`: Core flow matching implementation (VelocityNet, loss, sampling)
- `pendulum_env.py`: Environment wrapper and expert controller
- `train.py`: Training loop with expert data collection
- `eval.py`: Evaluation and visualization
- `main.py`: CLI interface

## 🧠 Key Insights for Learning

1. **Flow Matching vs Diffusion**: Flow matching uses straight paths (OT), diffusion uses curved paths. Flow matching is often simpler to train and faster to sample.

2. **Conditional Generation**: We condition the flow on the state, making this a *conditional flow matching* problem.

3. **Trade-off**: More flow steps → better quality but slower. Start with 10-20 steps.

4. **Data Efficiency**: Flow matching can be data-efficient since it directly learns from demonstrations without rewards.

## 📊 What You'll Learn

- [x] Flow matching theory and implementation
- [x] Conditional generation for RL
- [x] Neural ODEs and continuous-time modeling
- [x] Behavior cloning with generative models
- [x] Foundation for scaling to complex robots

## 🔗 References

- [Flow Matching for Generative Modeling](https://arxiv.org/abs/2210.02747) (Lipman et al., 2023)
- [Diffusion Models for Reinforcement Learning](https://arxiv.org/abs/2205.09991) (Diffuser)
- [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) (DDPM)

---

**Next Steps**: Once you master this, you can apply the same techniques to humanoid locomotion using environments like `dm_control humanoid` or `Isaac Gym`!