Lecture 04: MRP Student Activity
In this activity, you will implement a simple Monte Carlo method to estimate the value function of a given Markov Reward Process (MRP). Follow the steps below to complete the activity.
Use the diagram provided below to understand the structure of the Student MRP.
(Figure: Student MRP transition diagram)
# %%
import numpy as np
# set random seed for reproducibility
np.random.seed(42)
# %% [markdown]
# The states are
# 0: Class 1
# 1: Class 2
# 2: Class 3
# 3: Pass
# 4: Pub
# 5: Facebook
# 6: Sleep
class Student_MRP(object):
    def __init__(self):
        # Transition matrix P[s, s']: probability of moving from state s to s'
        self.P = np.zeros((7, 7))
        self.P[0, 1] = 0.5   # Class 1 -> Class 2
        self.P[0, 5] = 0.5   # Class 1 -> Facebook
        self.P[1, 2] = 0.8   # Class 2 -> Class 3
        self.P[1, 6] = 0.2   # Class 2 -> Sleep
        self.P[2, 3] = 0.6   # Class 3 -> Pass
        self.P[2, 4] = 0.4   # Class 3 -> Pub
        self.P[3, 6] = 1.0   # Pass -> Sleep
        self.P[4, 0] = 0.2   # Pub -> Class 1
        self.P[4, 1] = 0.4   # Pub -> Class 2
        self.P[4, 2] = 0.4   # Pub -> Class 3
        self.P[5, 0] = 0.1   # Facebook -> Class 1
        self.P[5, 5] = 0.9   # Facebook -> Facebook
        self.P[6, 6] = 1.0   # Sleep is absorbing
        # Reward R[s] received in each state
        self.R = np.array([-2, -2, -2, 10, 1, -1, 0])

    def next(self, state):
        # Sample the next state according to row `state` of P
        next_state = np.random.choice(self.P.shape[0], p=self.P[state])
        return next_state
student = Student_MRP()
value_mc = np.zeros(7)
gamma = 1
n_trials = 100
n_max_steps = 50
# TODO: Part 1.1
# STEP 1: Write code to estimate the value function.
# The estimate comes from running many independent trials starting from each state
# and taking the mean of the total discounted reward as that state's value.
for s0 in range(7):
    reward_list = np.zeros(n_trials)
    for t in range(n_trials):
        # initialize things for each trial
        state = s0          # every rollout starts from the state being evaluated
        total_reward = 0.0
        for steps in range(n_max_steps):
            # TODO: calculate total_reward by rolling out
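            # A minimal sketch of one possible solution (an assumption about the
            # intended rollout, not the only valid one): collect the reward of
            # the current state, discounted by gamma**steps, then sample the
            # successor. This matches the Bellman check below, where the reward
            # is received in the current state before transitioning.
            total_reward += (gamma ** steps) * student.R[state]
            state = student.next(state)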
        reward_list[t] = total_reward
    value_mc[s0] = reward_list.mean()
assert np.isclose(value_mc[2], -2 + 0.6*value_mc[3] + 0.4*value_mc[4], atol=1e-1), "Bellman equation is not satisfied"
# TODO: Part 1.2
# The assert above manually verifies that the estimate satisfies the Bellman equation
# by checking whether v[2] is close to -2 + 0.6*v[3] + 0.4*v[4].
# If the assertion fails, try increasing n_trials (e.g., to 2000) and observe the change.
print(value_mc)
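# Optional extra check (not part of the original TODOs, just a sketch): the
# Bellman residual R + gamma * P @ v - v should shrink toward zero for every
# state as n_trials grows, apart from truncation error due to n_max_steps.
bellman_residual = student.R + gamma * student.P @ value_mc - value_mc
print("max |Bellman residual|:", np.abs(bellman_residual).max())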
# TODO: Part 2
# Since we know the final value function must satisfy the Bellman equation,
# we can use the equation itself to compute the value function. To do this, we
# simply update the value function with the Bellman equation iteratively:
# initialize it to all zeros, then apply the Bellman update for 100 iterations.
# You should see the value function converge to the one estimated by the
# Monte Carlo method.
n_iterations = 100
value_iter = np.zeros(7)
for n in range(n_iterations):
    old_value = np.copy(value_iter)  # back up the value function for a synchronous sweep
    for s0 in range(7):
        # TODO: update value_iter[s0] with the Bellman equation
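        # A minimal sketch of the update (one possible solution): the Bellman
        # equation v(s) = R(s) + gamma * sum_s' P(s, s') * v(s'), applied to
        # the backed-up values so the whole sweep is synchronous.
        value_iter[s0] = student.R[s0] + gamma * student.P[s0] @ old_value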
# The following code verifies that the value function computed from the Bellman
# equation is close to the one estimated by running trials.
assert np.allclose(value_iter, value_mc,
                   atol=1e-1), "Value function from iterative Bellman updates is not close to the Monte Carlo estimate"
# TODO: Part 3
# Does the assertion above hold? If not, how can you change the code to reduce the gap?
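# Optional follow-up sketch (an assumption, not part of the original assignment):
# because Sleep is absorbing with zero reward, the linear system v = R + gamma*P*v
# can be solved exactly on the six non-terminal states. (With gamma = 1 the full
# 7x7 matrix I - P is singular, so the Sleep row and column are dropped.)
# Comparing against this exact solution shows which of the two estimates is off.
idx = np.arange(6)                                   # all states except Sleep
A = np.eye(6) - gamma * student.P[np.ix_(idx, idx)]  # I - gamma * P, restricted
v_exact = np.linalg.solve(A, student.R[idx])
print("exact values:", np.append(v_exact, 0.0))      # v(Sleep) = 0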
Copy or download the code above to your WSL2 environment or local machine, then complete the TODO parts in the code.