Monte Carlo Control Demo

In dynamic programming (Lecture 06), we solved MDPs using policy iteration and value iteration — both of which require full knowledge of the transition model \(P(s' \mid s, a)\). In practice, we rarely have this model. Monte Carlo (MC) Control learns a policy purely from sampled episodes, with no model required.

From DP to Monte Carlo: What Changes?

|  | Dynamic Programming | Monte Carlo Control |
| --- | --- | --- |
| Model access | Requires \(P(s' \mid s, a)\) | Model-free — learns from experience |
| Update target | Expected value over all transitions | Sample return from a single episode |
| When updates happen | Sweep over all states (no interaction needed) | After completing an episode |
| What we learn | \(V(s)\) or \(Q(s,a)\) via Bellman equations | \(Q(s,a)\) via averaging observed returns |
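The "update target" row is the crux of the distinction. A minimal sketch with made-up numbers (the toy transition model `P`, reward table `R`, and value vector `V` below are hypothetical, not FrozenLake) contrasts the two targets:

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
P = np.array([[[0.8, 0.2],   # P[s, a, s']: transition probabilities
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],    # R[s, a]: expected immediate reward
              [0.0, 2.0]])
gamma = 0.9
V = np.array([1.0, 2.0])     # current value estimates

# DP update target: expectation over ALL next states -- needs the model P.
dp_target = R[0, 0] + gamma * P[0, 0] @ V

# MC update target: a single sampled return from one episode -- no model.
# Suppose one episode from (s=0, a=0) yielded rewards 1.0, 0.0, 2.0:
G = 1.0 + gamma * 0.0 + gamma**2 * 2.0

print(dp_target, G)
```

MC pays for its model-freedom with variance: `G` is a noisy sample whose expectation (over many episodes) matches what DP computes exactly in one step.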

Why Q Instead of V?

In DP, we could extract a policy from \(V(s)\) because we had access to \(P(s' \mid s, a)\) — we could compute which action leads to higher-value next states. Without a model, we cannot do this. So MC Control learns \(Q(s, a)\) directly, and the policy simply picks \(\arg\max_a Q(s, a)\).
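A minimal sketch of this point, using a hypothetical 3-state, 2-action Q-table: once \(Q\) is known, extracting the greedy policy is a single argmax per state, with no model in sight.

```python
import numpy as np

# A hypothetical Q-table for illustration (3 states, 2 actions).
Q = np.array([[0.1, 0.7],
              [0.5, 0.2],
              [0.0, 0.0]])

# Greedy policy extraction: pick the best action in each state's row.
greedy = Q.argmax(axis=1)   # -> array([1, 0, 0]); ties go to the first action
```

Doing the same from \(V\) alone would require evaluating \(\sum_{s'} P(s' \mid s, a)\,V(s')\) for each action, which is exactly the model access we are assuming away.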

Exploration: Epsilon-Greedy

A greedy policy that always picks \(\arg\max_a Q(s,a)\) may never try actions that currently look suboptimal, and can therefore miss better strategies. Epsilon-greedy fixes this: with probability \(\epsilon\), pick a random action (explore); otherwise, pick the greedy action (exploit). We decay \(\epsilon\) over time so the agent explores broadly early on and gradually commits to its learned policy.

Key Concepts in the Code

  • Return computation: After an episode ends, we compute the discounted return \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\) for each time step by working backwards through the rewards.
  • Incremental mean update: Instead of storing all returns and averaging, we use the incremental formula: \(Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)}(G - Q(s,a))\), where \(N(s,a)\) counts how many times we’ve visited \((s, a)\).
  • Epsilon decay: Epsilon decreases linearly from 1.0 (fully random) to 0.01 over training, balancing exploration and exploitation.
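The first two ideas are easy to sanity-check in isolation. The snippet below (with hypothetical reward and return sequences) computes returns via the backward pass and confirms that the incremental update reproduces the plain batch average:

```python
import numpy as np

gamma = 0.98
rewards = [0.0, 0.0, 1.0]   # hypothetical episode: reward only at the end

# Backward pass: G_t = r_t + gamma * G_{t+1}, one multiply-add per step.
returns, G = np.zeros(len(rewards)), 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    returns[t] = G
print(returns)   # discounted return at each time step t = 0, 1, 2

# Incremental mean: after all samples, q equals their batch mean exactly.
samples = [0.5, 1.0, 0.0, 0.75]
q, n = 0.0, 0
for g in samples:
    n += 1
    q += (g - q) / n
print(q, np.mean(samples))
```

The backward pass costs one multiply-add per step instead of re-summing the tail for every \(t\), and the incremental mean needs only \(Q\) and the count \(N\), never the full list of past returns.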
# %% [markdown]
# # Monte Carlo Control on FrozenLake
#
# In dynamic programming (policy/value iteration), we had access to the full
# transition model `env.unwrapped.P`. In practice, we rarely know the environment
# dynamics. Monte Carlo Control learns a policy by **sampling episodes** and
# using the observed returns to estimate Q-values — no model needed.
#
# The key idea: run episodes with the current policy, compute returns from the
# collected rewards, and update Q(s,a) toward the observed returns.

# %%
import numpy as np
from tqdm import tqdm
import gymnasium as gym

env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=True)
n_state = env.observation_space.n
n_action = env.action_space.n
print("# of actions:", n_action)
print("# of states:", n_state)

# %% [markdown]
# ## Step 1: Collect an Episode
#
# Unlike DP methods that sweep over all states using the model, MC methods
# must **run a full episode** to collect (state, action, reward) trajectories.
# We then compute returns from these trajectories.

# %%


def run_episode(env, policy):
    """Run one episode and return the trajectory (states, actions, rewards)."""
    obs = env.reset()[0]
    states, actions, rewards = [], [], []
    while True:
        states.append(obs)
        action = int(policy(obs))
        actions.append(action)
        obs, reward, terminated, truncated, _ = env.step(action)
        rewards.append(reward)
        if terminated or truncated:
            break
    return states, actions, rewards


# %% [markdown]
# ## Step 2: Policy with Epsilon-Greedy Exploration
#
# MC Control uses **epsilon-greedy** exploration: with probability epsilon, pick
# a random action; otherwise, pick the greedy action from Q.
#
# Why do we need Q instead of V here? In DP, we could extract a policy from V
# because we had access to the transition model P(s'|s,a). Without a model,
# we can't compute which action leads to higher-value states — so we must
# learn Q(s,a) directly and pick the action with the highest Q.

# %%


class EpsilonGreedyPolicy:
    def __init__(self, Q, N, eps, gamma=0.98):
        self.Q = Q          # Q-table: Q[state, action]
        self.N = N          # Visit count: N[state, action]
        self.eps = eps       # Exploration rate
        self.gamma = gamma   # Discount factor

    def update(self, states, actions, rewards):
        """Update Q-values using returns computed from the episode."""
        # Compute returns G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        # We compute this backwards for efficiency
        returns = np.zeros(len(rewards))
        G = 0
        for i in reversed(range(len(rewards))):
            G = rewards[i] + self.gamma * G
            returns[i] = G

        # Update Q using incremental mean: Q = Q + (G - Q) / N
        for state, action, G in zip(states, actions, returns):
            self.N[state, action] += 1
            self.Q[state, action] += (G - self.Q[state, action]) / self.N[state, action]

    def __call__(self, state):
        """Select action using epsilon-greedy."""
        if np.random.rand() < self.eps:
            return np.random.choice(self.Q.shape[1])  # random action (explore)
        return np.argmax(self.Q[state, :])            # greedy action (exploit)


# %% [markdown]
# ## Step 3: Training Loop
#
# We repeatedly: (1) collect an episode, (2) update Q from returns, and
# (3) decay epsilon so the policy gradually shifts from exploration to exploitation.

# %%
n_episodes = 20000
Q = np.zeros((n_state, n_action))
N = np.zeros_like(Q)
policy = EpsilonGreedyPolicy(Q, N, eps=1.0)

for i in tqdm(range(n_episodes)):
    states, actions, rewards = run_episode(env, policy)
    policy.update(states, actions, rewards)
    # Linear decay: epsilon goes from 1.0 to 0.01 over training
    policy.eps = max(0.01, policy.eps - 1.0 / n_episodes)

# %% [markdown]
# ## Evaluate the Learned Policy
#
# Set epsilon to 0 (fully greedy) and run 100 episodes to measure performance.

# %%
policy.eps = 0.0
scores = [sum(run_episode(env, policy)[2]) for _ in range(100)]
print("Final score: {:.2f}".format(np.mean(scores)))