
🧠 Reinforcement Learning: Monte Carlo vs Q-Learning in GridWorld

A visual comparison of on-policy First-Visit Monte Carlo Control and off-policy Q-Learning in a stochastic 10×10 GridWorld environment with random walls.

📝 Description

This project implements and compares on-policy First-Visit Monte Carlo Control and off-policy Q-Learning in a 10×10 GridWorld with random walls. Through reward curves, policy visualizations, and a side-by-side animation, we demonstrate how Q-Learning’s bootstrapping and off-policy nature enable faster convergence and greater robustness compared to Monte Carlo’s sample-inefficient, on-policy approach. The animation includes real-time metrics—episode count, epsilon decay, and cumulative success rates—to make reinforcement learning concepts intuitively clear.

📚 Why This Matters

This project demonstrates a fundamental RL principle:

"Off-policy methods like Q-Learning are more sample-efficient than on-policy Monte Carlo in deterministic environments with sparse rewards."

🎯 Key Insights

  • Q-Learning converges faster and achieves higher rewards due to bootstrapping and off-policy learning
  • Monte Carlo requires careful exploration tuning (end_e=0.02) to avoid local minima
  • Final policies evaluated with controlled exploration (eps=0.1) to test robustness
  • Success rates and epsilon decay visualized in real-time

🌐 Environment

  • 10×10 grid with 15% randomly placed walls
  • Start: (0, 0) (top-left)
  • Goal: (9, 9) (bottom-right)
  • Rewards: -1 per step, +10 for reaching goal
  • Agent stays in place when hitting walls/boundaries (see the transition sketch below)
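These rules fully determine the transition function. The sketch below is an illustrative reconstruction, assuming walls are stored as a set of (row, col) tuples and actions are indexed 0–3; the actual names and wall-sampling scheme in the repository may differ.

```python
import numpy as np

SIZE, GOAL = 10, (9, 9)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

# Fixed seed -> reproducible layout; 15 walls ~ 15% of the 10x10 grid (assumed sampling scheme).
rng = np.random.default_rng(0)
candidates = [(i, j) for i in range(SIZE) for j in range(SIZE) if (i, j) not in {(0, 0), GOAL}]
walls = {candidates[k] for k in rng.choice(len(candidates), size=15, replace=False)}

def step(state, action):
    """One transition: -1 per step, +10 on reaching the goal, and the agent
    stays in place if the move would hit a wall or leave the grid."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE) or nxt in walls:
        nxt = state                      # blocked move: stay put
    if nxt == GOAL:
        return nxt, 10.0, True
    return nxt, -1.0, False
```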

⚙️ Algorithms

| Method      | Type       | Episodes | Epsilon Schedule | Key Parameters |
|-------------|------------|----------|------------------|----------------|
| Monte Carlo | On-policy  | 50,000   | 0.9 → 0.02       | γ=0.8          |
| Q-Learning  | Off-policy | 50,000   | 0.9 → 0.02       | γ=0.8, α=0.1   |
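For concreteness, here is how the table's hyperparameters plug into the two update rules. This is a generic tabular sketch (with Q stored as a dict of per-state action-value arrays), not a copy of the repository's code.

```python
import numpy as np
from collections import defaultdict

GAMMA, ALPHA = 0.8, 0.1                  # gamma and alpha from the table above
Q = defaultdict(lambda: np.zeros(4))     # action values: state -> array of 4 action values
counts = defaultdict(int)                # first-visit counts for the MC incremental mean

# Off-policy Q-Learning: bootstrap from the greedy value of the next state,
# regardless of which action the behaviour policy actually takes next.
def q_learning_update(s, a, r, s_next, done):
    target = r if done else r + GAMMA * np.max(Q[s_next])
    Q[s][a] += ALPHA * (target - Q[s][a])

# On-policy First-Visit Monte Carlo: wait for the episode to finish, then update
# each (state, action) pair with the full return from its first occurrence.
def mc_first_visit_update(episode):
    """episode: list of (state, action, reward) tuples from one complete rollout."""
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + GAMMA * G
        if not any((s, a) == (s2, a2) for s2, a2, _ in episode[:t]):   # first visit only
            counts[(s, a)] += 1
            Q[s][a] += (G - Q[s][a]) / counts[(s, a)]                  # incremental mean of returns
```

The contrast is visible directly in the code: Q-Learning updates after every step using the max over Q[s_next], while Monte Carlo cannot update anything until the episode terminates.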

📊 Outputs

  • rewards_plot.png: Moving average reward curves (computed as sketched after this list)
  • policies_comparison.png: Greedy policy arrows for both methods
  • mc_vs_ql_comparison.mp4: 10-second side-by-side animation showing:
    • Left: Monte Carlo policy
    • Right: Q-Learning policy
    • Real-time metrics: episode count, epsilon, success rates, hyperparameters
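The reward curves in rewards_plot.png are moving averages of per-episode returns. A minimal sketch of that smoothing (the window size here is an assumption, not the value used in the repository):

```python
import numpy as np
import matplotlib.pyplot as plt

def moving_average(returns, window=500):
    """Trailing moving average over per-episode returns; window=500 is illustrative."""
    kernel = np.ones(window) / window
    return np.convolve(returns, kernel, mode="valid")

# mc_returns and ql_returns would be the per-episode returns logged during training:
# plt.plot(moving_average(mc_returns), label="Monte Carlo")
# plt.plot(moving_average(ql_returns), label="Q-Learning")
# plt.xlabel("Episode"); plt.ylabel("Smoothed return"); plt.legend()
# plt.savefig("rewards_plot.png")
```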


❓ Frequently Asked Questions

Why does Monte Carlo need a higher final epsilon (0.02) to converge?

Monte Carlo is on-policy and only updates after full episodes. If the policy becomes too greedy too early (e.g., ε=0.01), it gets stuck in suboptimal loops and never discovers better paths. A small but persistent exploration (ε=0.02) allows it to escape local minima.
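The README only specifies the endpoints of the schedule (0.9 → 0.02). A simple linear decay that respects those endpoints — the exact decay shape used in the repository is an assumption — would be:

```python
def epsilon_schedule(episode, total_episodes=50_000, start_e=0.9, end_e=0.02):
    """Linearly anneal epsilon from start_e to end_e, then hold it there
    so some exploration persists for the rest of training."""
    frac = min(episode / total_episodes, 1.0)
    return start_e + frac * (end_e - start_e)
```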

Why not use SARSA instead of Monte Carlo?

SARSA (on-policy TD) would be a great middle ground! It’s more sample-efficient than MC but still on-policy. We focused on MC vs QL to contrast on-policy vs off-policy extremes. SARSA would likely sit between them in performance.

Does the wall configuration affect results?

Yes! Dense or maze-like walls make exploration harder, amplifying MC’s struggles. We used 15% random walls for a balanced challenge. Results are reproducible due to fixed random seeds.

Why simulate with ε=0.1 instead of ε=0.0?

ε=0.0 shows optimal paths but hides policy robustness. With ε=0.1, we test how well each policy handles real-world perturbations — QL typically maintains high success rates, while MC degrades more.
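A sketch of that evaluation loop, assuming a step function like the one in the environment sketch above and a cap of 200 steps per episode (both assumptions):

```python
import numpy as np

def success_rate(Q, env_step, eps=0.1, episodes=1_000, max_steps=200, seed=0):
    """Roll out the learned policy with a fixed evaluation epsilon and report
    the fraction of episodes that reach the goal within max_steps."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(episodes):
        s, done = (0, 0), False
        for _ in range(max_steps):
            a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
            s, _, done = env_step(s, a)   # env_step(state, action) -> (next_state, reward, done)
            if done:
                successes += 1
                break
    return successes / episodes

# e.g. success_rate(Q, step, eps=0.1) vs success_rate(Q, step, eps=0.0) exposes the robustness gap
```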

Can this scale to larger grids or continuous spaces?

Tabular methods (like ours) don't scale to large or continuous state spaces. Next step: replace Q-tables with function approximation (e.g., neural networks → DQN) or linear tile coding.

🔮 Future Work

🧪 Algorithm Extensions

  • Add SARSA and Expected SARSA for on-policy TD comparison
  • Implement Double Q-Learning to reduce overestimation bias
  • Compare multi-step methods (n-step TD, TD(λ) with eligibility traces)

📈 Deeper Analysis

  • Measure sample efficiency: Episodes needed to reach 90% success
  • Analyze path efficiency: Compare path length vs. optimal (Manhattan + wall detours)
  • Test robustness across wall densities (5% → 30%)

🎥 Enhanced Visualizations

  • Build an interactive Jupyter widget to adjust α, γ, ε in real-time
  • Create 3D Q-value surface plots over the grid
  • Highlight failure cases (e.g., MC stuck in loops)

🤖 Environment Scaling

  • Scale to larger grids (20×20, 50×50) with procedural walls
  • Add stochastic dynamics (e.g., 10% action slip probability)
  • Introduce partial observability (agent sees only nearby cells)

🧠 Function Approximation

  • Replace tabular Q with:
    • Linear approximation (tile coding)
    • Neural networks (DQN)
  • Benchmark sample efficiency in large grids

📊 Statistical Rigor

  • Run 10+ random seeds, plot mean ± std deviation
  • Perform significance testing (e.g., t-test) on final rewards
  • Conduct hyperparameter sweeps for α, γ, and ε schedules

🌐 Real-World Applications

  • Frame as robot navigation in obstacle-rich environments
  • Extend to multi-agent GridWorld (cooperative/competitive)
  • Explore transfer learning: train on one layout, test on another
