A visual comparison of on-policy First-Visit Monte Carlo Control and off-policy Q-Learning in a 10×10 GridWorld environment with randomly placed walls.
This project implements and compares on-policy First-Visit Monte Carlo Control and off-policy Q-Learning in a 10×10 GridWorld with random walls. Through reward curves, policy visualizations, and a side-by-side animation, we demonstrate how Q-Learning’s bootstrapping and off-policy nature enable faster convergence and greater robustness compared to Monte Carlo’s sample-inefficient, on-policy approach. The animation includes real-time metrics—episode count, epsilon decay, and cumulative success rates—to make reinforcement learning concepts intuitively clear.
This project demonstrates a fundamental RL principle:
"Off-policy methods like Q-Learning are more sample-efficient than on-policy Monte Carlo in deterministic environments with sparse rewards."
- Q-Learning converges faster and achieves higher rewards due to bootstrapping and off-policy learning
- Monte Carlo requires careful exploration tuning (`end_e=0.02`) to avoid local minima
- Final policies evaluated with controlled exploration (`eps=0.1`) to test robustness
- Success rates and epsilon decay visualized in real time
- 10×10 grid with 15% randomly placed walls
- Start: `(0, 0)` (top-left)
- Goal: `(9, 9)` (bottom-right)
- Rewards: `-1` per step, `+10` for reaching the goal
- Agent stays in place when hitting walls/boundaries
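The dynamics above are simple enough to sketch in a few lines. The snippet below is an illustrative Python sketch under those rules, not the repository's actual implementation; the class name `GridWorld`, the wall-sampling loop, and the `step` signature are assumptions.

```python
import numpy as np

class GridWorld:
    """Illustrative 10x10 grid: -1 per step, +10 at the goal, walls block movement."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=10, wall_frac=0.15, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.start, self.goal = (0, 0), (size - 1, size - 1)
        self.walls = set()
        # Sample ~15% of cells as walls (a real implementation should also
        # verify the goal stays reachable from the start)
        while len(self.walls) < int(wall_frac * size * size):
            cell = (int(rng.integers(size)), int(rng.integers(size)))
            if cell not in (self.start, self.goal):
                self.walls.add(cell)
        self.state = self.start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.state
        nxt = (r + dr, c + dc)
        # Agent stays in place when the move hits a wall or leaves the grid
        if nxt in self.walls or not (0 <= nxt[0] < self.size and 0 <= nxt[1] < self.size):
            nxt = self.state
        self.state = nxt
        done = nxt == self.goal
        return nxt, (10.0 if done else -1.0), done
```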
| Method | Type | Episodes | Epsilon Schedule | Key Parameters |
|---|---|---|---|---|
| Monte Carlo | On-policy | 50,000 | 0.9 → 0.02 | γ=0.8 |
| Q-Learning | Off-policy | 50,000 | 0.9 → 0.02 | γ=0.8, α=0.1 |
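For reference, the two update rules being compared can be written as short functions. This is a hedged sketch using the hyperparameters from the table (γ=0.8, α=0.1); `Q` is assumed to be a dict-like table of per-state action-value arrays, and the function names are illustrative, not taken from the repo.

```python
import numpy as np

GAMMA, ALPHA = 0.8, 0.1  # discount shared by both methods; step size for Q-Learning

def q_learning_update(Q, s, a, r, s_next, done):
    """Off-policy TD(0): bootstrap from the greedy value of the next state."""
    target = r + (0.0 if done else GAMMA * np.max(Q[s_next]))
    Q[s][a] += ALPHA * (target - Q[s][a])

def first_visit_mc_update(Q, returns, episode):
    """On-policy first-visit MC: average full discounted returns after the episode."""
    first_seen = {}
    for t, (s, a, _) in enumerate(episode):
        first_seen.setdefault((s, a), t)
    G = 0.0
    for t in reversed(range(len(episode))):   # accumulate the return backwards
        s, a, r = episode[t]
        G = GAMMA * G + r
        if first_seen[(s, a)] == t:           # only the first visit contributes
            returns[(s, a)].append(G)
            Q[s][a] = np.mean(returns[(s, a)])
```

The key contrast: Q-Learning updates after every step from a bootstrapped target, while Monte Carlo must wait for the full episode return, which is why it is more sensitive to long exploratory episodes.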
- `rewards_plot.png`: Moving-average reward curves
- `policies_comparison.png`: Greedy policy arrows for both methods
- `mc_vs_ql_comparison.mp4`: 10-second side-by-side animation showing:
  - Left: Monte Carlo policy
  - Right: Q-Learning policy
  - Real-time metrics: episode count, epsilon, success rates, hyperparameters
Monte Carlo is on-policy and only updates after full episodes. If the policy becomes too greedy too early (e.g., ε=0.01), it gets stuck in suboptimal loops and never discovers better paths. A small but persistent exploration (ε=0.02) allows it to escape local minima.
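As an illustration, a decay from 0.9 to 0.02 with a persistent exploration floor could look like the following (the linear shape and the helper names are assumptions; the repo may use a different schedule):

```python
import numpy as np

def epsilon_by_episode(ep, n_episodes=50_000, start_e=0.9, end_e=0.02):
    """Assumed linear decay from start_e to end_e; never drops below end_e."""
    frac = min(1.0, ep / n_episodes)
    return start_e + frac * (end_e - start_e)

def epsilon_greedy(Q, s, eps, rng, n_actions=4):
    """A small persistent exploration rate lets MC keep escaping suboptimal loops."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))
```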
SARSA (on-policy TD) would be a great middle ground! It’s more sample-efficient than MC but still on-policy. We focused on MC vs QL to contrast on-policy vs off-policy extremes. SARSA would likely sit between them in performance.
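For comparison, the SARSA update differs from Q-Learning only in the bootstrap target: it uses the action the policy actually takes next rather than the greedy maximum. A sketch with the same γ and α as above (not code from this repo):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.8):
    """On-policy TD(0): bootstrap from the action the policy actually takes next."""
    target = r + (0.0 if done else gamma * Q[s_next][a_next])
    Q[s][a] += alpha * (target - Q[s][a])
```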
Wall layout matters: dense or maze-like walls make exploration harder, amplifying MC's struggles. We used 15% random walls for a balanced challenge. Results are reproducible thanks to fixed random seeds.
ε=0.0 shows optimal paths but hides policy robustness. With ε=0.1, we test how well each policy handles real-world perturbations — QL typically maintains high success rates, while MC degrades more.
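A hedged sketch of that evaluation loop, reusing the illustrative `GridWorld` and `epsilon_greedy` helpers from above (names and defaults are assumptions):

```python
import numpy as np

def success_rate(env, Q, eps=0.1, n_episodes=500, max_steps=200, seed=0):
    """Fraction of eps-greedy evaluation episodes that reach the goal."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        for _ in range(max_steps):
            a = epsilon_greedy(Q, s, eps, rng)
            s, _, done = env.step(a)
            if done:
                successes += 1
                break
    return successes / n_episodes
```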
Tabular methods (like ours) do not scale to large or continuous state spaces. Next step: replace Q-tables with function approximation (e.g., neural networks → DQN) or linear tile coding.
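As a pointer in that direction, a linear (e.g., tile-coded) action-value approximator replaces the table lookup with a dot product over state features. This is a generic sketch, not part of the current codebase:

```python
import numpy as np

class LinearQ:
    """Q(s, a) ≈ w[a] · φ(s): one weight vector per action over state features."""
    def __init__(self, n_features, n_actions, alpha=0.1):
        self.w = np.zeros((n_actions, n_features))
        self.alpha = alpha

    def value(self, phi, a):
        return float(self.w[a] @ phi)

    def update(self, phi, a, target):
        # Semi-gradient step toward the (bootstrapped or Monte Carlo) target
        self.w[a] += self.alpha * (target - self.value(phi, a)) * phi
```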
- Add SARSA and Expected SARSA for on-policy TD comparison
- Implement Double Q-Learning to reduce overestimation bias
- Compare multi-step TD methods (e.g., n-step TD and TD(λ))
- Measure sample efficiency: Episodes needed to reach 90% success
- Analyze path efficiency: Compare path length vs. optimal (Manhattan + wall detours)
- Test robustness across wall densities (5% → 30%)
- Build an interactive Jupyter widget to adjust α, γ, and ε in real time
- Create 3D Q-value surface plots over the grid
- Highlight failure cases (e.g., MC stuck in loops)
- Scale to larger grids (20×20, 50×50) with procedural walls
- Add stochastic dynamics (e.g., 10% action slip probability)
- Introduce partial observability (agent sees only nearby cells)
- Replace tabular Q with:
- Linear approximation (tile coding)
- Neural networks (DQN)
- Benchmark sample efficiency in large grids
- Run 10+ random seeds, plot mean ± std deviation
- Perform significance testing (e.g., t-test) on final rewards
- Conduct hyperparameter sweeps for α, γ, and ε schedules
- Frame as robot navigation in obstacle-rich environments
- Extend to multi-agent GridWorld (cooperative/competitive)
- Explore transfer learning: train on one layout, test on another



