Lecture 04: MRP Student Activity
In this activity, you will implement a simple Monte Carlo method to estimate the value function of a given Markov Reward Process (MRP). Follow the steps below to complete the activity.
Use the figure provided below to understand the MRP structure of the student activity.
[Figure: MRP Student Activity diagram]
# %%
import numpy as np
# %% [markdown]
# The states are
# 0: Class 1
# 1: Class 2
# 2: Class 3
# 3: Pass
# 4: Pub
# 5: Facebook
# 6: Sleep
# %%
class Student_MRP(object):
    def __init__(self):
        # Transition matrix: P[s, s'] is the probability of moving
        # from state s to state s'.
        self.P = np.zeros((7, 7))
        self.P[0, 1] = 0.5  # Class 1 -> Class 2
        self.P[0, 5] = 0.5  # Class 1 -> Facebook
        self.P[1, 2] = 0.8  # Class 2 -> Class 3
        self.P[1, 6] = 0.2  # Class 2 -> Sleep
        self.P[2, 3] = 0.6  # Class 3 -> Pass
        self.P[2, 4] = 0.4  # Class 3 -> Pub
        self.P[3, 6] = 1.0  # Pass -> Sleep
        self.P[4, 0] = 0.2  # Pub -> Class 1
        self.P[4, 1] = 0.4  # Pub -> Class 2
        self.P[4, 2] = 0.4  # Pub -> Class 3
        self.P[5, 0] = 0.1  # Facebook -> Class 1
        self.P[5, 5] = 0.9  # Facebook -> Facebook
        self.P[6, 6] = 1.0  # Sleep -> Sleep (absorbing terminal state)
        # Reward received in each state.
        self.R = np.array([-2, -2, -2, 10, 1, -1, 0])

    def next(self, state):
        # Sample the next state according to the current state's
        # transition probabilities.
        next_state = np.random.choice(self.P.shape[0], p=self.P[state])
        return next_state
student = Student_MRP()
value = np.zeros(7)  # value function estimate, one entry per state
gamma = 1            # discount factor
n_trials = 100       # number of Monte Carlo trials per start state
n_max_steps = 50     # maximum number of steps per trial
# TODO: Part 1.1
# STEP 1: Write code to estimate the value function.
# The estimate is obtained by running many trials starting from each state
# and taking the mean of the total discounted reward as that state's value.
for s0 in range(7):
    reward_list = np.zeros(n_trials)
    for t in range(n_trials):
        # initialize things for each trial
        total_reward = 0
        for steps in range(n_max_steps):
            # TODO: calculate total_reward by rolling out
            pass  # placeholder so the skeleton runs; replace with your code
        reward_list[t] = total_reward
    value[s0] = reward_list.mean()
# TODO: Part 1.2
# STEP 2: Once value is calculated, print it out.
# STEP 3: Manually verify that it satisfies the Bellman equation by checking
# whether v[2] is equal to -2 + 0.6*v[3] + 0.4*v[4].
# STEP 4: If it does not, increase n_trials (to 2000) and observe the change.
print(value)
# TODO: Part 2
# Since we know the final value function should and will satisfy the Bellman
# equation, we can use that equation to compute the value function directly:
# just update the value function iteratively based on the Bellman equation,
#   v(s) = R(s) + gamma * sum over s' of P(s, s') * v(s')
n_iterations = 100
value = np.zeros(7)
for n in range(n_iterations):
    old_value = np.copy(value)  # back up the value function
    for s0 in range(7):
        # TODO: update value[s0] with the Bellman equation
        pass  # placeholder so the skeleton runs; replace with your update
Copy or download the code above to your WSL2 or local machine and complete the TODO parts in the code.
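If you get stuck on Part 1.1, the cell below is a minimal sketch of one possible completion, not the official solution. It assumes the reward R[s] is collected on every step spent in state s, and it stops a rollout early once the absorbing Sleep state (state 6) is reached, since all further rewards are zero.

# %%
# One possible completion of Part 1.1 (a sketch, not the only valid answer).
for s0 in range(7):
    reward_list = np.zeros(n_trials)
    for t in range(n_trials):
        state = s0       # each trial starts from the state being evaluated
        total_reward = 0
        discount = 1.0   # running value of gamma**steps
        for steps in range(n_max_steps):
            total_reward += discount * student.R[state]
            discount *= gamma
            state = student.next(state)
            if state == 6:  # Sleep is absorbing with zero reward; stop early
                break
        reward_list[t] = total_reward
    value[s0] = reward_list.mean()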
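For Part 1.2, a minimal check (assuming value has been filled in by the loop above) is to print both sides of the Bellman equation at state 2:

# %%
# Compare both sides of the Bellman equation at state 2 (Class 3).
print(value)
print(value[2], -2 + 0.6 * value[3] + 0.4 * value[4])

With n_trials = 100 the two numbers may differ noticeably; at n_trials = 2000 they should agree more closely, since the Monte Carlo error shrinks as the number of trials grows.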
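For Part 2, the sketch below uses synchronous backups: every state is updated from the backed-up copy old_value. This is one reasonable reading of the TODO rather than the only correct implementation.

# %%
# One possible completion of Part 2 (a sketch; synchronous Bellman backups).
n_iterations = 100
value = np.zeros(7)
for n in range(n_iterations):
    old_value = np.copy(value)  # back up the value function
    for s0 in range(7):
        # v(s) = R(s) + gamma * sum over s' of P(s, s') * v(s')
        value[s0] = student.R[s0] + gamma * student.P[s0].dot(old_value)
print(value)

Note that even with gamma = 1 this converges here, because every trajectory eventually reaches the absorbing Sleep state, whose reward is zero.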