Lab 02-2: Sarsa
Starter Code
Check my demo code on MC control here.
Implement Sarsa
1-Step Sarsa
With the demo code above, you should be able to quickly adapt the code to
implement 1-step Sarsa.
For the learning rate alpha, 0.01 is usually a good setting.
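As a rough illustration of what the adapted policy might look like, here is a minimal sketch of a tabular epsilon-greedy policy with a 1-step Sarsa update. The class name `SarsaPolicy`, its constructor arguments, and the discount factor `gamma=0.99` are assumptions made for this example and are not taken from the demo code; only the call signatures `policy(obs)` and `policy.update(obs, act, reward, next_obs, next_act)` and the `eps` attribute follow this lab.

```python
import numpy as np


class SarsaPolicy:
    """Hypothetical tabular epsilon-greedy policy with a 1-step Sarsa update."""

    def __init__(self, n_states, n_actions, alpha=0.01, gamma=0.99, eps=1.0, seed=0):
        # gamma and seed are assumed values, not settings given in the lab.
        self.q = np.zeros((n_states, n_actions))  # Q-table, initialised to zero
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)

    def __call__(self, obs):
        # Explore with probability eps, otherwise act greedily on the Q-table.
        if self.rng.random() < self.eps:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(self.q[obs]))

    def update(self, obs, act, reward, next_obs, next_act):
        # The TD target bootstraps from the action chosen in the next state.
        # Terminal states never appear as `obs` here, so their Q-values stay 0.
        target = reward + self.gamma * self.q[next_obs, next_act]
        self.q[obs, act] += self.alpha * (target - self.q[obs, act])
```

With `alpha=0.5` and `gamma=0.9`, a single update from a zero Q-table on a transition with reward 1 moves `q[obs, act]` halfway to the target, i.e. to 0.5.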
Note that you should set eps to 0 when testing and restore its previous value for training.
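One possible way to make this switch hard to forget is a small context manager that sets `eps` to 0 for the duration of a test and then restores it. The helper name `greedy` is hypothetical; the sketch only assumes that the policy object exposes an `eps` attribute, as in the code of this lab.

```python
from contextlib import contextmanager


@contextmanager
def greedy(policy):
    """Temporarily set policy.eps to 0 and restore the old value afterwards."""
    old_eps = policy.eps
    policy.eps = 0.0
    try:
        yield policy
    finally:
        # Restored even if the test episode raises an exception.
        policy.eps = old_eps


# Possible usage, assuming env, policy, and run_episode from the demo code:
# with greedy(policy):
#     score = sum(run_episode(env, policy, False)[2])
```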
Unlike MC training, TD methods can learn from incomplete sequences, so training does not need to wait until an episode finishes. Therefore, the training iteration can be written as follows:
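import numpy as np
import pandas as pd
from tqdm import tqdm

# env, policy, and run_episode come from the demo code above.
reward_list = []
n_episodes = 20000
for i in tqdm(range(n_episodes)):
    obs = env.reset()
    act = policy(obs)
    while True:
        next_obs, reward, done, _ = env.step(act)
        next_act = policy(next_obs)
        # Update immediately from the single transition (S, A, R, S', A').
        policy.update(obs, act, reward, next_obs, next_act)
        if done:
            break
        # Reuse next_act so the action used in the update is the one taken next.
        obs, act = next_obs, next_act
    # Linearly decay epsilon, with a floor of 0.01.
    policy.eps = max(0.01, policy.eps - 1.0/n_episodes)
    # Test after each training episode: mean score over 2 runs.
    mean_reward = np.mean([sum(run_episode(env, policy, False)[2])
                           for _ in range(2)])
    reward_list.append(mean_reward)

# Final evaluation with a greedy policy.
policy.eps = 0.0
scores = [sum(run_episode(env, policy, False)[2]) for _ in range(100)]
print("Final score: {:.2f}".format(np.mean(scores)))

# Save the per-episode test results for plotting.
df = pd.DataFrame({'reward': reward_list})
df.to_csv("./SomeFolderForThisLab/SARSA-1.csv",
          index=False, header=True)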
Note that you need to change the file path SomeFolderForThisLab to match your setup.
To show the training process, you can use the following command to plot the learning curve.
python ./SomeFolder/plot.py ./SomeFolderForThisLab reward -s 100
The plot.py script is the one I provided in the Lab Submission Guideline. If you need to download it, here is the link again - plot.py. Note that -s 100 means smoothing the curve(s) with a moving average over 100 data points. A sample output is shown below:
n-Step Sarsa
Change the code of 1-step Sarsa to implement:
- 2-step Sarsa
- 5-step Sarsa
Important: Make sure the cases of reward=1 are included in the calculation.
Similarly, save the testing results from each episode, and plot the learning curves for all Sarsa variants.
HINT: It is handy to use the double-ended queue data type (deque) in Python. A quick tutorial can be found HERE.
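To make the hint more concrete, the sketch below shows one way a deque could hold the most recent transitions and produce an n-step return. It is an illustration under assumptions, not a reference solution: the function name `n_step_update`, the toy episode, and the values of `gamma` and `alpha` are all made up for the example. One reading of the note about reward=1 is that the transitions still in the buffer when an episode ends must also be updated, without bootstrapping, so that the final reward reaches them; the flush loop at the end does this.

```python
from collections import deque

import numpy as np


def n_step_update(q, buffer, gamma, alpha, bootstrap=None):
    """Update Q for the oldest transition in `buffer` with its n-step return.

    buffer    : deque of (obs, act, reward) tuples, oldest first
    bootstrap : (obs, act) pair to bootstrap from, or None at episode end
    """
    # Discounted sum of the rewards currently held in the buffer.
    g = sum(gamma ** k * r for k, (_, _, r) in enumerate(buffer))
    if bootstrap is not None:
        g += gamma ** len(buffer) * q[bootstrap]
    obs, act, _ = buffer.popleft()
    q[obs, act] += alpha * (g - q[obs, act])


# Toy usage on a hand-written episode (hypothetical data, not from the lab).
n, gamma, alpha = 2, 0.9, 0.5
q = np.zeros((4, 2))
episode = [(0, 1, 0.0), (1, 1, 0.0), (2, 1, 1.0)]  # reward=1 on the last step
buffer = deque()
for t, (obs, act, reward) in enumerate(episode):
    buffer.append((obs, act, reward))
    done = t == len(episode) - 1
    if len(buffer) == n and not done:
        # Bootstrap from the next state and the action actually taken there.
        next_obs, next_act, _ = episode[t + 1]
        n_step_update(q, buffer, gamma, alpha, bootstrap=(next_obs, next_act))
    if done:
        # Flush: every remaining transition still receives the final reward.
        while buffer:
            n_step_update(q, buffer, gamma, alpha)
```

In this toy run, the last transition is updated towards a return of 1 and the one before it towards 0.9, while the first transition was already updated earlier with a bootstrapped return of 0.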
Write Report
Questions to answer in the report:
- (15 pts) Show the learning curves described above.
- (5 pts) Compare Sarsa algorithms with different numbers of steps. What are the pros and cons of n-step Sarsa with a large n?
Deliverables and Rubrics
Overall, you need to complete the environment installation and be able to run the demo code. You need to submit:
- (80 pts) PDF (exported from Jupyter notebook) and Python code.
- (20 pts) Reasonable answers to the questions.