Part 1: Understand Policy Iteration

Read through the demo code below to understand how policy iteration works.

Policy Iteration Demo
# %% [markdown]
# In this lab, we will implement policy iteration and value iteration to solve the FrozenLake problem.
#
# Let's first get ourselves familiarized with the environment.

# %%
import numpy as np
import gymnasium as gym

env = gym.make('FrozenLake-v1', render_mode="human", map_name="4x4", is_slippery=True)  # or you can try '8x8'
env.reset()
n_state = env.unwrapped.observation_space.n
n_action = env.unwrapped.action_space.n
print("# of actions", n_action)
print("# of states", n_state)
P = env.unwrapped.P
# At state 14, apply action 2
for prob, next_state, reward, done in P[14][2]:
    print("Taking action 2 in state 14: with probability %3.2g we transition to state %d and receive reward %i" % (
        prob, next_state, reward))

# %% [markdown]
# First, let's write utility functions we will use to run experiments.
# %% [markdown]
# Given a certain policy, how can we compute the value function for each state?

# %%


def compute_policy_v(env, policy, gamma=1.0):
    # Iterative policy evaluation: compute V(s) for the given policy by
    # sweeping the Bellman expectation equation until V converges
    eps = 1e-10
    V = np.zeros(n_state)
    while True:
        prev_V = np.copy(V)
        for state in range(n_state):
            action = policy[state]
            V[state] = 0
            for prob, next_state, reward, done in P[state][action]:
                V[state] += prob * (reward + gamma * prev_V[next_state])
        if (np.sum((np.fabs(prev_V - V))) <= eps):
            break
    return V


# %% [markdown]
# Let's test a random policy

# %%
random_policy = [np.random.choice(n_action) for _ in range(n_state)]
print(random_policy)
rand_v = compute_policy_v(env, random_policy)
print(np.round(rand_v, 5))

# %% [markdown]
# Given a value function, we need to extract the best (greedy) policy from it.

# %%


def extract_policy(V, gamma=1.0):
    policy = np.zeros(n_state, dtype=int)
    Q = np.zeros((n_state, n_action))
    for state in range(n_state):
        for action in range(n_action):
            for prob, next_state, reward, done in P[state][action]:
                Q[state, action] += prob * (reward + gamma*V[next_state])
        policy[state] = np.argmax(Q[state])
    return policy


# %%
print(extract_policy(rand_v))

# %% [markdown]
# Now let's start with a random policy, compute its value function, and then extract a new policy. Doing this repeatedly will improve the policy.

# %%


def run_episode(env, policy, gamma=1.0, render=False):
    """ Runs an episode and return the total reward """
    obs, _ = env.reset()
    total_reward = 0
    step_idx = 0
    while True:
        if render:
            env.render()
        obs, reward, done, truncated, _ = env.step(int(policy[obs]))
        # accumulate the discounted return across steps
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1
        if done or truncated:
            break
    return total_reward


# %%
policy = random_policy
max_iterations = 200000
gamma = 1.0
for i in range(max_iterations):
    policy_v = compute_policy_v(env, policy)
    new_policy = extract_policy(policy_v)
    score = run_episode(env, new_policy, gamma, False)
    print("iteration %i: score %f" % (i, score))
    if (np.all(policy == new_policy)):
        print('Policy-Iteration converged at step %d.' % (i+1))
        break
    policy = new_policy

scores = [run_episode(env, policy, gamma, False) for _ in range(100)]
print("Final score:", np.mean(scores))

Policy iteration has two distinct phases that alternate until convergence:

  1. Policy Evaluation (compute_policy_v): Given a fixed policy, compute the value function \(V^\pi(s)\) for every state. This uses the Bellman expectation equation:
\[V^\pi(s) = \sum_{s', r} P(s', r \mid s, \pi(s)) [ r + \gamma V^\pi(s') ]\]

Notice that compute_policy_v runs an inner while loop that iterates until \(V\) converges — this is because computing the exact value function for a fixed policy requires solving a system of equations iteratively.

  2. Policy Improvement (extract_policy): Given \(V^\pi\), derive a better policy by acting greedily:
\[\pi'(s) = \arg\max_a \sum_{s', r} P(s', r \mid s, a) [ r + \gamma V^\pi(s') ]\]

The algorithm alternates between these two phases. When the policy no longer changes, it has converged to the optimal policy.

Part 2: Implement Value Iteration

Your task is to implement the value_iteration function. Value iteration takes a different approach from policy iteration — instead of fully evaluating a fixed policy, it directly computes the optimal value function using the Bellman optimality equation:

\[V^*(s) = \max_a \sum_{s', r} P(s', r \mid s, a) [ r + \gamma V(s') ]\]

Key Difference from Policy Iteration

  • Equation — policy iteration uses the Bellman expectation equation; value iteration uses the Bellman optimality equation.
  • Per iteration — policy iteration performs a full policy evaluation (an inner loop until convergence) followed by policy improvement; value iteration performs a single sweep: compute Q(s, a) for all actions, then set V(s) = max_a Q(s, a).
  • Structure — policy iteration has two nested loops (outer: policy updates; inner: value convergence); value iteration has one loop (iterate until V converges).
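To make the "single sweep" concrete, here is a minimal sketch of one Bellman optimality backup. Note that P_toy is a made-up two-state, two-action MDP used only for illustration (it is not FrozenLake's dynamics); it follows the same (prob, next_state, reward, done) tuple format as env.unwrapped.P. Value iteration simply repeats this sweep until V stops changing.

```python
import numpy as np

# Hypothetical two-state, two-action MDP for illustration only,
# in the same P[state][action] format as env.unwrapped.P.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.5, 0, 0.0, False), (0.5, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

# One Bellman optimality sweep: for every state, compute Q(s, a)
# for ALL actions, then back up V(s) = max_a Q(s, a).
V = np.zeros(n_states)
prev_V = np.copy(V)
for s in range(n_states):
    Q_s = [sum(p * (r + gamma * prev_V[s2]) for p, s2, r, _ in P_toy[s][a])
           for a in range(n_actions)]
    V[s] = max(Q_s)

print(V)  # after one sweep: V[0] = 0.5, V[1] = 0.0
```

Contrast this with compute_policy_v in the demo, which backs up only the single action pi(s) chosen by the fixed policy.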

Note: The Role of V and Q in Each Method

In policy iteration, the two phases use different quantities:

  • Policy evaluation (compute_policy_v) computes Q implicitly, but stores it as V. Look at this line in the demo: V[state] += prob * (reward + gamma * prev_V[next_state]). This summation is exactly the Q-value formula \(Q^\pi(s, \pi(s))\) — but since a fixed policy has only one action per state, \(V^\pi(s) = Q^\pi(s, \pi(s))\), so we store the result directly into V without needing a separate Q table.
  • Policy improvement (extract_policy) then computes Q(s, a) for all actions explicitly — it needs to compare across actions to find \(\arg\max_a Q(s, a)\) for the new policy.

In value iteration, Q must be computed for all actions at every iteration because the update rule \(V(s) = \max_a Q(s, a)\) requires knowing Q across all actions to take the max. There is no separate evaluation/improvement split — both happen in a single step.

In short: Policy evaluation can get away with only V because it follows one fixed action per state. Value iteration cannot — it must consider all actions at every step to find the optimal value.
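As a sanity check of the identity \(V^\pi(s) = Q^\pi(s, \pi(s))\), the sketch below evaluates a fixed policy on a hypothetical two-state MDP (P_toy and pi are made up for illustration; the tuple format matches env.unwrapped.P), then builds the full Q table and compares:

```python
import numpy as np

# Hypothetical two-state, two-action MDP in env.unwrapped.P's format.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.5, 0, 0.0, False), (0.5, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}
gamma, pi = 0.9, [1, 0]          # a fixed policy: one action per state

# Policy evaluation with only a V table (as in compute_policy_v):
# each sweep backs up just the action pi(s), so no Q table is needed.
V = np.zeros(2)
for _ in range(200):             # plenty of sweeps to converge on this tiny MDP
    V = np.array([sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P_toy[s][pi[s]])
                  for s in range(2)])

# Full Q table computed from the converged V (as in extract_policy).
Q = np.array([[sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P_toy[s][a])
               for a in range(2)] for s in range(2)])

# V^pi(s) coincides with Q^pi(s, pi(s)); Q for the other actions may differ.
print(np.allclose(V, Q[np.arange(2), pi]))  # prints True
```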

Implementation Hints

  1. To turn off the rendering, just remove the render_mode="human" argument when creating the environment.
  2. You do not need separate compute_policy_v and extract_policy functions. A single value_iteration function with one loop is sufficient.
  3. In each iteration, for each state, compute \(Q(s, a)\) for all actions (not just the action from a fixed policy), then set \(V(s) = \max_a Q(s, a)\).
  4. Use the same convergence criterion as the demo: stop when the sum of absolute differences between consecutive value functions is below a small threshold (e.g., 1e-10).
  5. After convergence, extract the policy from \(Q\) using argmax.
  6. You can reuse the run_episode function from the demo to test your extracted policy.

Verify

If your implementation is correct, the following assertions should pass (assuming V is your converged value function with default gamma=1.0 and is_slippery=True):

assert all(abs(v - 0.82352941) < 1e-5 for v in V[0:5]), "Value function is not calculated correctly"
assert all(abs(v - 0) < 1e-5 for v in V[[5,7,11,12,15]]), "Value function is not calculated correctly"

Part 3: Questions

Answer the following questions in your report. Support your answers with observations from your implementation and experiments.

  1. (5 pts) Why does policy iteration need an inner loop (the while loop inside compute_policy_v in the demo code) while value iteration does not? Can we get away without the inner loop in policy iteration? Why or why not?

  2. (10 pts) How many outer iterations (the for loop over max_iterations in the demo code) does policy iteration take to converge? How many iterations does value iteration take to converge? You need to run the code and count the iterations. Why is there such a difference in the number of iterations, and do fewer iterations necessarily mean faster overall?

  3. (10 pts) In the value iteration update \(V(s) = \max_a Q(s, a)\), we take the max over actions. In policy evaluation for policy iteration, we only consider one action \(\pi(s)\) per state. How does this difference relate to the Bellman expectation equation vs. the Bellman optimality equation?

  4. (5 pts) Both methods produce an optimal policy for FrozenLake. Run each policy for 100 episodes and report the average score. Why is the score significantly less than 1.0 even though the policy is optimal?

  5. (10 pts) Based on your understanding and experimental observation of the two methods, what are the pros and cons of value iteration compared to policy iteration?

Deliverables and Rubrics

Submit a PDF exported from your Jupyter notebook that includes your code and answers. Your submission should contain:

  • (60 pts) Working value iteration implementation that passes the assert verification above.

  • (40 pts) Answers to all five questions in Part 3.