A visual comparison of on-policy First-Visit Monte Carlo Control and off-policy Q-Learning in a 10×10 GridWorld environment with randomly placed walls.
This project implements and compares on-policy First-Visit Monte Carlo Control and off-policy Q-Learning in a 10×10 GridWorld with random walls. Through reward curves, policy visualizations, and a side-by-side animation, we demonstrate how Q-Learning’s bootstrapping and off-policy nature enable faster convergence and greater robustness compared to Monte Carlo’s sample-inefficient, on-policy approach. The animation includes real-time metrics—episode count, epsilon decay, and cumulative success rates—to make reinforcement learning concepts intuitively clear.
This project demonstrates a fundamental RL principle:
"Off-policy methods like Q-Learning are more sample-efficient than on-policy Monte Carlo in deterministic environments with sparse rewards."
- Q-Learning converges faster and achieves higher rewards due to bootstrapping and off-policy learning
- Monte Carlo requires careful exploration tuning (`end_e=0.02`) to avoid local minima
- Final policies evaluated with controlled exploration (`eps=0.1`) to test robustness
- Success rates and epsilon decay visualized in real time
- 10×10 grid with 15% randomly placed walls
- Start: `(0, 0)` (top-left)
- Goal: `(9, 9)` (bottom-right)
- Rewards: `-1` per step, `+10` for reaching the goal
- Agent stays in place when hitting walls/boundaries
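The dynamics above are simple enough to sketch in a few lines. The snippet below is an illustrative Python sketch under those rules, not the repository's actual implementation; the class name `GridWorld`, the wall-sampling loop, and the `step` signature are assumptions.

```python
import numpy as np

class GridWorld:
    """Illustrative 10x10 grid: -1 per step, +10 at the goal, walls block movement."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=10, wall_frac=0.15, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.start, self.goal = (0, 0), (size - 1, size - 1)
        self.walls = set()
        # Sample ~15% of cells as walls (a real implementation should also
        # verify the goal stays reachable from the start)
        while len(self.walls) < int(wall_frac * size * size):
            cell = (int(rng.integers(size)), int(rng.integers(size)))
            if cell not in (self.start, self.goal):
                self.walls.add(cell)
        self.state = self.start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.state
        nxt = (r + dr, c + dc)
        # Agent stays in place when the move hits a wall or leaves the grid
        if nxt in self.walls or not (0 <= nxt[0] < self.size and 0 <= nxt[1] < self.size):
            nxt = self.state
        self.state = nxt
        done = nxt == self.goal
        return nxt, (10.0 if done else -1.0), done
```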
| Method | Type | Episodes | Epsilon Schedule | Key Parameters |
|---|---|---|---|---|
| Monte Carlo | On-policy | 50,000 | 0.9 → 0.02 | γ=0.8 |
| Q-Learning | Off-policy | 50,000 | 0.9 → 0.02 | γ=0.8, α=0.1 |
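For reference, the two update rules being compared can be written as short functions. This is a hedged sketch using the hyperparameters from the table (γ=0.8, α=0.1); `Q` is assumed to be a dict-like table of per-state action-value arrays, and the function names are illustrative, not taken from the repo.

```python
import numpy as np

GAMMA, ALPHA = 0.8, 0.1  # discount shared by both methods; step size for Q-Learning

def q_learning_update(Q, s, a, r, s_next, done):
    """Off-policy TD(0): bootstrap from the greedy value of the next state."""
    target = r + (0.0 if done else GAMMA * np.max(Q[s_next]))
    Q[s][a] += ALPHA * (target - Q[s][a])

def first_visit_mc_update(Q, returns, episode):
    """On-policy first-visit MC: average full discounted returns after the episode."""
    first_seen = {}
    for t, (s, a, _) in enumerate(episode):
        first_seen.setdefault((s, a), t)
    G = 0.0
    for t in reversed(range(len(episode))):   # accumulate the return backwards
        s, a, r = episode[t]
        G = GAMMA * G + r
        if first_seen[(s, a)] == t:           # only the first visit contributes
            returns[(s, a)].append(G)
            Q[s][a] = np.mean(returns[(s, a)])
```

The key contrast: Q-Learning updates after every step from a bootstrapped target, while Monte Carlo must wait for the full episode return, which is why it is more sensitive to long exploratory episodes.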
- `rewards_plot.png`: Moving-average reward curves
- `policies_comparison.png`: Greedy policy arrows for both methods
- `mc_vs_ql_comparison.mp4`: 10-second side-by-side animation showing:
  - Left: Monte Carlo policy
  - Right: Q-Learning policy
  - Real-time metrics: episode count, epsilon, success rates, hyperparameters
Monte Carlo is on-policy and only updates after full episodes. If the policy becomes too greedy too early (e.g., ε=0.01), it gets stuck in suboptimal loops and never discovers better paths. A small but persistent exploration (ε=0.02) allows it to escape local minima.
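As an illustration, a decay from 0.9 to 0.02 with a persistent exploration floor could look like the following (the linear shape and the helper names are assumptions; the repo may use a different schedule):

```python
import numpy as np

def epsilon_by_episode(ep, n_episodes=50_000, start_e=0.9, end_e=0.02):
    """Assumed linear decay from start_e to end_e; never drops below end_e."""
    frac = min(1.0, ep / n_episodes)
    return start_e + frac * (end_e - start_e)

def epsilon_greedy(Q, s, eps, rng, n_actions=4):
    """A small persistent exploration rate lets MC keep escaping suboptimal loops."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))
```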
SARSA (on-policy TD) would be a great middle ground! It’s more sample-efficient than MC but still on-policy. We focused on MC vs QL to contrast on-policy vs off-policy extremes. SARSA would likely sit between them in performance.
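For comparison, the SARSA update differs from Q-Learning only in the bootstrap target: it uses the action the policy actually takes next rather than the greedy maximum. A sketch with the same γ and α as above (not code from this repo):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.8):
    """On-policy TD(0): bootstrap from the action the policy actually takes next."""
    target = r + (0.0 if done else gamma * Q[s_next][a_next])
    Q[s][a] += alpha * (target - Q[s][a])
```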
Wall layout matters: dense or maze-like walls make exploration harder, amplifying MC's struggles. We used 15% random walls for a balanced challenge. Results are reproducible thanks to fixed random seeds.
ε=0.0 shows optimal paths but hides policy robustness. With ε=0.1, we test how well each policy handles real-world perturbations — QL typically maintains high success rates, while MC degrades more.
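A hedged sketch of that evaluation loop, reusing the illustrative `GridWorld` and `epsilon_greedy` helpers from above (names and defaults are assumptions):

```python
import numpy as np

def success_rate(env, Q, eps=0.1, n_episodes=500, max_steps=200, seed=0):
    """Fraction of eps-greedy evaluation episodes that reach the goal."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        for _ in range(max_steps):
            a = epsilon_greedy(Q, s, eps, rng)
            s, _, done = env.step(a)
            if done:
                successes += 1
                break
    return successes / n_episodes
```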
Tabular methods (like ours) do not scale to large or continuous state spaces. Next step: replace Q-tables with function approximation (e.g., neural networks → DQN) or linear tile coding.
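As a pointer in that direction, a linear (e.g., tile-coded) action-value approximator replaces the table lookup with a dot product over state features. This is a generic sketch, not part of the current codebase:

```python
import numpy as np

class LinearQ:
    """Q(s, a) ≈ w[a] · φ(s): one weight vector per action over state features."""
    def __init__(self, n_features, n_actions, alpha=0.1):
        self.w = np.zeros((n_actions, n_features))
        self.alpha = alpha

    def value(self, phi, a):
        return float(self.w[a] @ phi)

    def update(self, phi, a, target):
        # Semi-gradient step toward the (bootstrapped or Monte Carlo) target
        self.w[a] += self.alpha * (target - self.value(phi, a)) * phi
```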
- Add SARSA and Expected SARSA for on-policy TD comparison
- Implement Double Q-Learning to reduce overestimation bias
- Compare multi-step TD methods (e.g., n-step TD and TD(λ))
- Measure sample efficiency: Episodes needed to reach 90% success
- Analyze path efficiency: Compare path length vs. optimal (Manhattan + wall detours)
- Test robustness across wall densities (5% → 30%)
- Build an interactive Jupyter widget to adjust α, γ, and ε in real time
- Create 3D Q-value surface plots over the grid
- Highlight failure cases (e.g., MC stuck in loops)
- Scale to larger grids (20×20, 50×50) with procedural walls
- Add stochastic dynamics (e.g., 10% action slip probability)
- Introduce partial observability (agent sees only nearby cells)
- Replace tabular Q with:
- Linear approximation (tile coding)
- Neural networks (DQN)
- Benchmark sample efficiency in large grids
- Run 10+ random seeds, plot mean ± std deviation
- Perform significance testing (e.g., t-test) on final rewards
- Conduct hyperparameter sweeps for α, γ, and ε schedules
- Frame as robot navigation in obstacle-rich environments
- Extend to multi-agent GridWorld (cooperative/competitive)
- Explore transfer learning: train on one layout, test on another



