Lecture 04: MRP Student Activity
In this activity, you will implement a simple Monte Carlo method to estimate the value function of a given Markov Reward Process (MRP). Follow the steps below to complete the activity.
Use the diagram provided below to understand the structure of the Student MRP.
(Figure: Student MRP transition diagram)
# %%
import numpy as np
# set random seed for reproducibility
np.random.seed(42)
# %% [markdown]
# The states are
# 0: Class 1
# 1: Class 2
# 2: Class 3
# 3: Pass
# 4: Pub
# 5: Facebook
# 6: Sleep
class Student_MRP(object):
    def __init__(self):
        # Transition matrix P[s, s']: probability of moving from state s to s'
        self.P = np.zeros((7, 7))
        self.P[0, 1] = 0.5   # Class 1 -> Class 2
        self.P[0, 5] = 0.5   # Class 1 -> Facebook
        self.P[1, 2] = 0.8   # Class 2 -> Class 3
        self.P[1, 6] = 0.2   # Class 2 -> Sleep
        self.P[2, 3] = 0.6   # Class 3 -> Pass
        self.P[2, 4] = 0.4   # Class 3 -> Pub
        self.P[3, 6] = 1.0   # Pass -> Sleep
        self.P[4, 0] = 0.2   # Pub -> Class 1
        self.P[4, 1] = 0.4   # Pub -> Class 2
        self.P[4, 2] = 0.4   # Pub -> Class 3
        self.P[5, 0] = 0.1   # Facebook -> Class 1
        self.P[5, 5] = 0.9   # Facebook -> Facebook
        self.P[6, 6] = 1.0   # Sleep is absorbing
        # Reward R[s] received in each state
        self.R = np.array([-2, -2, -2, 10, 1, -1, 0])

    def next(self, state):
        # Sample the next state according to row `state` of P
        next_state = np.random.choice(self.P.shape[0], p=self.P[state])
        return next_state
student = Student_MRP()
value_mc = np.zeros(7)
gamma = 1
n_trials = 100
n_max_steps = 50
# TODO: Part 1.1
# STEP 1: Write code to estimate the value function.
# The estimate comes from running many independent trials starting from each state
# and taking the mean of the total discounted reward as that state's value.
for s0 in range(7):
    reward_list = np.zeros(n_trials)
    for t in range(n_trials):
        # initialize things for each trial
        state = s0          # every rollout starts from the state being evaluated
        total_reward = 0.0
        for steps in range(n_max_steps):
            # TODO: calculate total_reward by rolling out
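            # A minimal sketch of one possible solution (an assumption about the
            # intended rollout, not the only valid one): collect the reward of
            # the current state, discounted by gamma**steps, then sample the
            # successor. This matches the Bellman check below, where the reward
            # is received in the current state before transitioning.
            total_reward += (gamma ** steps) * student.R[state]
            state = student.next(state)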
        reward_list[t] = total_reward
    value_mc[s0] = reward_list.mean()
assert np.isclose(value_mc[2], -2 + 0.6*value_mc[3] + 0.4*value_mc[4], atol=1e-1), "Bellman equation is not satisfied"
# TODO: Part 1.2
# The assert above manually verifies that the estimate satisfies the Bellman equation
# by checking whether v[2] is close to -2 + 0.6*v[3] + 0.4*v[4].
# If the assertion fails, try increasing n_trials (e.g., to 2000) and observe the change.
print(value_mc)
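# Optional extra check (not part of the original TODOs, just a sketch): the
# Bellman residual R + gamma * P @ v - v should shrink toward zero for every
# state as n_trials grows, apart from truncation error due to n_max_steps.
bellman_residual = student.R + gamma * student.P @ value_mc - value_mc
print("max |Bellman residual|:", np.abs(bellman_residual).max())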
# TODO: Part 2
# Since we know the final value function must satisfy the Bellman equation,
# we can use the equation itself to compute the value function. To do this, we
# simply update the value function with the Bellman equation iteratively:
# initialize it to all zeros, then apply the Bellman update for 100 iterations.
# You should see the value function converge to the one estimated by the
# Monte Carlo method.
n_iterations = 100
value_iter = np.zeros(7)
for n in range(n_iterations):
    old_value = np.copy(value_iter)  # back up the value function for a synchronous sweep
    for s0 in range(7):
        # TODO: update value_iter[s0] with the Bellman equation
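        # A minimal sketch of the update (one possible solution): the Bellman
        # equation v(s) = R(s) + gamma * sum_s' P(s, s') * v(s'), applied to
        # the backed-up values so the whole sweep is synchronous.
        value_iter[s0] = student.R[s0] + gamma * student.P[s0] @ old_value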
# The following code verifies that the value function computed from the Bellman
# equation is close to the one estimated by running trials.
assert np.allclose(value_iter, value_mc,
                   atol=1e-1), "Value function from iterative Bellman updates is not close to the Monte Carlo estimate"
# TODO: Part 3
# Does the assertion above hold? If not, how can you change the code to reduce the gap?
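# Optional follow-up sketch (an assumption, not part of the original assignment):
# because Sleep is absorbing with zero reward, the linear system v = R + gamma*P*v
# can be solved exactly on the six non-terminal states. (With gamma = 1 the full
# 7x7 matrix I - P is singular, so the Sleep row and column are dropped.)
# Comparing against this exact solution shows which of the two estimates is off.
idx = np.arange(6)                                   # all states except Sleep
A = np.eye(6) - gamma * student.P[np.ix_(idx, idx)]  # I - gamma * P, restricted
v_exact = np.linalg.solve(A, student.R[idx])
print("exact values:", np.append(v_exact, 0.0))      # v(Sleep) = 0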
Copy or download the code above to your WSL2 environment or local machine, then complete the TODO parts in the code.