Lab 02-1: Policy/Value Iteration
Part 1: Understand Policy Iteration
Read through the demo code below to understand how policy iteration works.
# %% [markdown]
# In this lab, we will implement policy iteration and value iteration to solve the Frozen lake problem.
#
# Let's first get ourselves familiarized with the environment.
# %%
import numpy as np
import gymnasium as gym
env = gym.make('FrozenLake-v1', render_mode="human", map_name="4x4", is_slippery=True) # or you can try '8x8'
env.reset()
n_state = env.unwrapped.observation_space.n
n_action = env.unwrapped.action_space.n
print("# of actions", n_action)
print("# of states", n_state)
P = env.unwrapped.P
# At state 14, apply action 2
for prob, next_state, reward, done in P[14][2]:
    print("If we apply action 2 in state 14, there is %3.2g probability of transitioning to state %d with reward %i" % (
        prob, next_state, reward))
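For reference, `P` maps each state to a dict of actions, and each action to a list of `(prob, next_state, reward, done)` tuples. The mock below uses the same format with made-up numbers (not FrozenLake's actual transitions) to show how the table can be queried:

```python
# Mock of gymnasium's transition table: P[state][action] -> [(prob, next_state, reward, done), ...]
# The values are invented for illustration; the real table lives at env.unwrapped.P.
P_mock = {
    14: {
        2: [(1/3, 14, 0.0, False),
            (1/3, 15, 1.0, True),
            (1/3, 10, 0.0, False)],
    },
}

# Expected immediate reward of taking action 2 in state 14
expected_reward = sum(prob * reward for prob, _, reward, _ in P_mock[14][2])
print(round(expected_reward, 4))  # 0.3333
```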
# %% [markdown]
# First, let's write utility functions we will use to run experiments.
# %% [markdown]
# Given a policy, how can we compute the value function for each state?
# %%
def compute_policy_v(env, policy, gamma=1.0):
    # Iteratively evaluate the fixed policy: compute V(s) for every state
    # with the Bellman expectation update, repeating until V converges.
    eps = 1e-10
    V = np.zeros(n_state)
    while True:
        prev_V = np.copy(V)
        for state in range(n_state):
            action = policy[state]
            V[state] = 0
            for prob, next_state, reward, done in P[state][action]:
                V[state] += prob * (reward + gamma * prev_V[next_state])
        if np.sum(np.fabs(prev_V - V)) <= eps:
            break
    return V
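As a sanity check on the evaluation loop above: for a fixed policy, the Bellman expectation equation is a linear system, so it can also be solved directly. A minimal sketch on an invented 2-state chain (`P_pi`, `r_pi`, and `gamma` below are all made-up numbers, not FrozenLake's):

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy pi (numbers invented):
# P_pi[s, s'] = transition probability, r_pi[s] = expected immediate reward.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
r_pi = np.array([0.0, 1.0])

# Closed form: V = (I - gamma * P_pi)^{-1} r_pi
V_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative evaluation, the same fixed-point scheme as compute_policy_v
V = np.zeros(2)
while True:
    V_new = r_pi + gamma * (P_pi @ V)
    if np.sum(np.fabs(V_new - V)) <= 1e-10:
        break
    V = V_new

print(np.allclose(V, V_direct, atol=1e-6))  # True
```

The inner loop in `compute_policy_v` is exactly this fixed-point iteration, just written per-state against the transition tuples instead of as a matrix product.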
# %% [markdown]
# Let's test a random policy
# %%
random_policy = [np.random.choice(n_action) for _ in range(n_state)]
print(random_policy)
rand_v = compute_policy_v(env, random_policy)
print(np.round(rand_v, 5))
# %% [markdown]
# Given a value function, we need to extract the best policy from it.
# %%
def extract_policy(V, gamma=1.0):
    # Act greedily with respect to V: compute Q(s, a) for every action,
    # then pick the argmax action in each state.
    policy = np.zeros(n_state)
    Q = np.zeros((n_state, n_action))
    for state in range(n_state):
        for action in range(n_action):
            for prob, next_state, reward, done in P[state][action]:
                Q[state, action] += prob * (reward + gamma * V[next_state])
        policy[state] = np.argmax(Q[state])
    return policy
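Once the Q table is filled in, the greedy step is just a row-wise argmax. A tiny example on an invented Q table (numbers are illustrative only):

```python
import numpy as np

# Hypothetical Q table for 3 states and 2 actions (numbers invented)
Q = np.array([[0.1, 0.5],
              [0.7, 0.2],
              [0.0, 0.0]])

# Greedy policy: best action per state.
# Note np.argmax breaks ties toward the lower index (state 2 here).
policy = np.argmax(Q, axis=1)
print(policy)  # [1 0 0]
```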
# %%
print(extract_policy(rand_v))
# %% [markdown]
# Now let's start with a random policy, compute its value function, and extract a new policy from it. Repeating this process improves the policy until it converges.
# %%
def run_episode(env, policy, gamma=1.0, render=False):
    """Run one episode under the given policy and return the discounted total reward."""
    obs, _ = env.reset()
    total_reward = 0
    step_idx = 0
    while True:
        if render:
            env.render()
        # gymnasium's step returns (obs, reward, terminated, truncated, info)
        obs, reward, terminated, truncated, _ = env.step(int(policy[obs]))
        # accumulate the discounted return from the start of the episode
        total_reward += (gamma ** step_idx) * reward
        step_idx += 1
        if terminated or truncated:
            break
    return total_reward
# %%
policy = random_policy
max_iterations = 200000
gamma = 1.0
for i in range(max_iterations):
    policy_v = compute_policy_v(env, policy)
    new_policy = extract_policy(policy_v)
    score = run_episode(env, new_policy, gamma, False)
    print("iteration %i: score %f" % (i, score))
    if np.all(policy == new_policy):
        print('Policy-Iteration converged at step %d.' % (i + 1))
        break
    policy = new_policy
scores = [run_episode(env, policy, gamma, False) for _ in range(100)]
print("Final score:", np.mean(scores))
Policy iteration has two distinct phases that alternate until convergence:

- Policy Evaluation (`compute_policy_v`): Given a fixed policy \(\pi\), compute the value function \(V^\pi(s)\) for every state. This uses the Bellman expectation equation:

  \(V^\pi(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]\)

  Notice that `compute_policy_v` runs an inner `while` loop that iterates until \(V\) converges: computing the exact value function for a fixed policy requires solving a system of linear equations, which we do iteratively.

- Policy Improvement (`extract_policy`): Given \(V^\pi\), derive a better policy by acting greedily:

  \(\pi'(s) = \arg\max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]\)

The algorithm alternates between these two phases. When the policy no longer changes, it has converged to the optimal policy.
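The alternation can be seen end-to-end on a tiny example. The sketch below runs both phases on an invented 2-state, 2-action MDP (not FrozenLake), using the same `P[state][action]` tuple format as the demo:

```python
import numpy as np

# Invented toy MDP in gymnasium's P format: action 1 always moves to state 1
# and pays reward 1; action 0 always moves to state 0 and pays nothing.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def evaluate(policy):
    # Policy evaluation: iterate the Bellman expectation update to convergence
    V = np.zeros(n_states)
    while True:
        prev = V.copy()
        for s in range(n_states):
            V[s] = sum(p * (r + gamma * prev[s2]) for p, s2, r, _ in P_toy[s][policy[s]])
        if np.sum(np.fabs(prev - V)) <= 1e-10:
            break
    return V

def improve(V):
    # Policy improvement: act greedily with respect to V
    new_policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P_toy[s][a])
             for a in range(n_actions)]
        new_policy[s] = int(np.argmax(q))
    return new_policy

policy = np.zeros(n_states, dtype=int)  # start from "always take action 0"
while True:
    new_policy = improve(evaluate(policy))
    if np.all(new_policy == policy):
        break
    policy = new_policy
print(policy)  # [1 1]
```

Starting from the all-zeros policy, one evaluation/improvement round already discovers action 1, and the next round confirms the policy is stable.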
Part 2: Implement Value Iteration
Your task is to implement the value_iteration function. Value iteration takes a different approach from policy iteration: instead of fully evaluating a fixed policy, it directly computes the optimal value function using the Bellman optimality equation:

\(V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]\)
Key Difference from Policy Iteration
| | Policy Iteration | Value Iteration |
|---|---|---|
| Equation | Bellman expectation equation | Bellman optimality equation |
| Per iteration | Full policy evaluation (inner loop until convergence) + policy improvement | Single sweep: compute \(Q(s,a)\) for all actions, set \(V(s) = \max_a Q(s,a)\) |
| Structure | Two nested loops (outer: policy updates; inner: value convergence) | One loop (iterate until \(V\) converges) |
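To make the "single sweep" concrete, here is the Bellman optimality update on an invented 2-state, 2-action MDP in the same tuple format (this is an illustration, not a solution for FrozenLake):

```python
import numpy as np

# Invented toy MDP: action 1 always moves to state 1 with reward 1,
# action 0 always moves to state 0 with reward 0.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
}
gamma = 0.9

V = np.zeros(2)
while True:
    prev = V.copy()
    for s in P_toy:
        # Bellman optimality update: V(s) = max_a sum_s' p * (r + gamma * V(s'))
        V[s] = max(sum(p * (r + gamma * prev[s2]) for p, s2, r, _ in P_toy[s][a])
                   for a in P_toy[s])
    if np.sum(np.fabs(prev - V)) <= 1e-10:
        break
print(np.round(V, 2))  # [10. 10.]
```

Note there is no fixed policy anywhere in the loop: each sweep maximizes over all actions directly.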
Note: The Role of V and Q in Each Method

In policy iteration, the two phases use different quantities:

- Policy evaluation (`compute_policy_v`) computes Q implicitly, but stores it as V. Look at this line in the demo: `V[state] += prob * (reward + gamma * prev_V[next_state])`. This summation is exactly the Q-value formula \(Q^\pi(s, \pi(s))\). Since a fixed policy has only one action per state, \(V^\pi(s) = Q^\pi(s, \pi(s))\), so we store the result directly into V without needing a separate Q table.
- Policy improvement (`extract_policy`) then computes \(Q(s, a)\) for all actions explicitly: it needs to compare across actions to find \(\arg\max_a Q(s, a)\) for the new policy.

In value iteration, Q must be computed for all actions at every iteration because the update rule \(V(s) = \max_a Q(s, a)\) requires knowing Q across all actions to take the max. There is no separate evaluation/improvement split; both happen in a single step.

In short: policy evaluation can get away with only V because it follows one fixed action per state. Value iteration cannot; it must consider all actions at every step to find the optimal value.
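A one-state numeric illustration of this difference (the Q values are invented): policy evaluation reads off the fixed action's entry, while value iteration takes the max over the row.

```python
import numpy as np

Q_s = np.array([0.2, 0.8, 0.5])  # hypothetical Q(s, a) for one state, three actions
pi_s = 0                         # a fixed policy that happens to pick action 0 here

V_eval = Q_s[pi_s]    # policy evaluation: V(s) = Q(s, pi(s))
V_opt = Q_s.max()     # value iteration:   V(s) = max_a Q(s, a)
print(V_eval, V_opt)  # 0.2 0.8
```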
Implementation Hints

- To turn off the rendering, just remove the `render_mode="human"` argument when creating the environment.
- You do not need separate `compute_policy_v` and `extract_policy` functions. A single `value_iteration` function with one loop is sufficient.
- In each iteration, for each state, compute \(Q(s, a)\) for all actions (not just the action from a fixed policy), then set \(V(s) = \max_a Q(s, a)\).
- Use the same convergence criterion as the demo: stop when the sum of absolute differences between consecutive value functions is below a small threshold (e.g., `1e-10`).
- After convergence, extract the policy from \(Q\) using `argmax`.
- You can reuse the `run_episode` function from the demo to test your extracted policy.
Verify
If your implementation is correct, the following assertions should pass (assuming `V` is your converged value function, with the default `gamma=1.0` and `is_slippery=True`):
assert all(abs(v - 0.82352941) < 1e-5 for v in V[0:5]), "Value function is not calculated correctly"
assert all(abs(v - 0) < 1e-5 for v in V[[5,7,11,12,15]]), "Value function is not calculated correctly"
Part 3: Questions

Answer the following questions in your report. Support your answers with observations from your implementation and experiments.

- (5 pts) Why does policy iteration need an inner loop (line 38 in demo code) while value iteration does not? Can we get away without the inner loop in policy iteration? Why or why not?
- (10 pts) How many outer iterations (line 106 in demo code) does policy iteration take to converge? How many iterations does value iteration take to converge? You need to run the code and count the iterations. Why is there such a difference in the number of iterations, and does fewer iterations necessarily mean faster overall?
- (10 pts) In the value iteration update \(V(s) = \max_a Q(s, a)\), we take the max over actions. In policy evaluation for policy iteration, we only consider one action \(\pi(s)\) per state. How does this difference relate to the Bellman expectation equation vs. the Bellman optimality equation?
- (5 pts) Both methods produce an optimal policy for FrozenLake. Run each policy for 100 episodes and report the average score. Why is the score significantly less than 1.0 even though the policy is optimal?
- (10 pts) Based on your understanding and experimental observation of the two methods, what are the pros and cons of value iteration compared to policy iteration?
Deliverables and Rubrics

Submit a PDF exported from your Jupyter notebook that includes your code and answers. Your submission should contain:

- (60 pts) A working value iteration implementation that passes the `assert` verification above.
- (40 pts) Answers to all five questions in Part 3.