Lab 04: DDPG
Understand DDPG
Read the DDPG documentation from OpenAI Spinning Up HERE
Write Code
- Adapt the DQN code from the previous demo.
- Solve the MountainCarContinuous-v0 environment.
What you need to change:
- Add another neural net (and its target network) as the policy function, a.k.a. the actor. It takes `obs` as input and outputs an action, which in this case should be an `np.array`.
- Incorporate the range of actions `self.act_lim` into `act_net` (an action-selection sketch appears after the exploration bullet below).
- Change `q_net` to take `[state, act]` as input and output a single value as the Q value.
  - This includes changing all the places where you call `self.q_net`.
  - The way to combine `state` and `act` is like this:

```python
q_input = torch.cat([next_obs, self.act_lim * self.target_act_net(next_obs)], axis=1)
y = reward + self.gamma * (1 - done) * self.target_q_net(q_input).squeeze()
```
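For example, combining the stored states and actions for the current Q estimate and regressing it toward that target could look like this (a sketch, not the starter code; it assumes `obs`, `act`, `reward`, `next_obs`, `done` are FloatTensor batches sampled from the replay buffer):

```python
import torch
import torch.nn.functional as F

# Current estimate Q(s, a) for the batch of stored (state, action) pairs
q_pred = self.q_net(torch.cat([obs, act], axis=1)).squeeze()

# TD target using the target actor and target critic (no gradients needed here)
with torch.no_grad():
    q_input = torch.cat([next_obs, self.act_lim * self.target_act_net(next_obs)], axis=1)
    y = reward + self.gamma * (1 - done) * self.target_q_net(q_input).squeeze()

loss_q = F.mse_loss(q_pred, y)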
- Change the random (exploration) policy behavior.
  - Instead of epsilon-greedy, simply add a random noise signal to the action, like:

```python
act += self.noise * np.random.randn(n_action)
```

  where `self.noise` is a hyperparameter you can tweak.
  - Over the course of training, feel free to decay this noise to yield a less random "policy":
```python
if agent.noise > 0.005:
    agent.noise -= (1 / 200)
```
Note: Decay this noise after every trial/episode, not every step.
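Putting the action-range scaling and the exploration noise together, here is a minimal sketch of an action-selection method inside your agent class (attribute names beyond those mentioned in this handout, such as `self.n_action`, are assumptions):

```python
import numpy as np
import torch

def __call__(self, obs):
    # Deterministic actor output in [-1, 1], scaled to the environment's action range
    with torch.no_grad():
        act = self.act_lim * self.act_net(torch.FloatTensor(obs).unsqueeze(0))
    act = act.squeeze(0).numpy()
    # Gaussian exploration noise; self.noise is decayed once per episode
    act += self.noise * np.random.randn(self.n_action)
    # Optional: clip so the noisy action stays within bounds
    return np.clip(act, -self.act_lim, self.act_lim)
```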
- Change `act` to a `FloatTensor` in the `ReplayBuffer` (see the sketch below).
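For instance, if your DQN buffer converted actions to a `LongTensor` of discrete indices, the DDPG version might look like this (a sketch assuming a simple list-of-tuples buffer, not the starter code):

```python
import random
import numpy as np
import torch

class ReplayBuffer:
    def __init__(self, capacity=int(1e5)):
        self.capacity = capacity
        self.buffer = []

    def add(self, obs, act, reward, next_obs, done):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append((obs, act, reward, next_obs, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        obs, act, reward, next_obs, done = map(np.array, zip(*batch))
        # act is now a FloatTensor of continuous actions, not a LongTensor of indices
        return (torch.FloatTensor(obs), torch.FloatTensor(act),
                torch.FloatTensor(reward), torch.FloatTensor(next_obs),
                torch.FloatTensor(done))

    def __len__(self):
        return len(self.buffer)
```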
- Return two kinds of loss for inspection: `loss_q` and `loss_act` (a sketch follows this list).
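The actor loss is typically the negative mean of the critic's estimate for the actor's own actions, which is why it should come out negative once training goes well. Here is a sketch of the end of an `update()` method (the optimizer attribute names are assumptions, not part of the starter code):

```python
# Critic step: minimize the TD error computed above
self.q_optimizer.zero_grad()
loss_q.backward()
self.q_optimizer.step()

# Actor step: maximize Q(s, act_net(s)) by minimizing its negative
loss_act = -self.q_net(
    torch.cat([obs, self.act_lim * self.act_net(obs)], axis=1)).mean()
self.act_optimizer.zero_grad()
loss_act.backward()
self.act_optimizer.step()

return loss_q.item(), loss_act.item()
```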
Writing Report
Questions to answer:
- How does the setting of the noise scale `self.noise` impact training and performance?
  - Preferably, run multiple experiments with different `self.noise` settings and plot the results.
- (Open-ended question) What is the major downside of DDPG, according to your understanding/observation?
Deliverables and Rubrics
- (75 pts) PDF (exported from the Jupyter notebook) and Python code.
  - The report should include at least the reward curve over training iterations (i.e., episodes).
- (15 pts) The results show that your implementation achieved the desired performance.
  - If your code is correct, the mean reward should be above 80 after 300 episodes. (Check my plots below.)
- (10 pts) Reasonable answers to the questions.
Debugging Tips
- Check the losses: `loss_q` should be positive and `loss_act` should be negative (it is the negative of the critic's Q estimate). A positive `loss_act` usually indicates that sparse or unsuccessful trials have been used heavily for training.
Here are my neural nets:
```python
# Critic: takes the concatenated [state, action] and outputs a single Q value
self.q_net = nn.Sequential(
    nn.Linear(n_state + n_action, 400),
    nn.ReLU(),
    nn.Linear(400, 300),
    nn.ReLU(),
    nn.Linear(300, 1)
)

# Actor: maps a state to an action in [-1, 1], scaled by self.act_lim when used
self.act_net = nn.Sequential(
    nn.Linear(n_state, 400),
    nn.ReLU(),
    nn.Linear(400, 300),
    nn.ReLU(),
    nn.Linear(300, n_action),
    nn.Tanh()
)
```
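The target networks can be created as copies of these two nets. One common choice for keeping them updated (an assumption here, since the handout does not prescribe hard vs. soft updates) is Polyak averaging after every gradient step, e.g. with `tau = 0.005`:

```python
import copy
import torch

self.target_q_net = copy.deepcopy(self.q_net)
self.target_act_net = copy.deepcopy(self.act_net)

# After each update() call, move the targets slowly toward the online nets
with torch.no_grad():
    for net, target in [(self.q_net, self.target_q_net),
                        (self.act_net, self.target_act_net)]:
        for p, p_targ in zip(net.parameters(), target.parameters()):
            p_targ.mul_(1 - tau)
            p_targ.add_(tau * p)
```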
I initialize the exploration noise `self.noise` to 2.
Here is my main training loop. Feel free to copy it (change `agent` to your own policy):
```python
import gym
import numpy as np
from tqdm import tqdm

env = gym.make("MountainCarContinuous-v0")   # old Gym API: reset() -> obs, step() -> 4-tuple
# agent = ...                                # replace with your own DDPG agent

loss_q_list, loss_act_list, reward_list = [], [], []
update_freq = 10
n_step = 0
loss_q, loss_act = 0, 0

for i in tqdm(range(500)):
    obs, rew = env.reset(), 0
    while True:
        act = agent(obs)
        next_obs, reward, done, _ = env.step(act)
        rew += reward
        n_step += 1
        agent.replaybuff.add(obs, act, reward, next_obs, done)
        obs = next_obs
        # start updating once the buffer has enough samples, every update_freq steps
        if len(agent.replaybuff) > 1e3 and n_step % update_freq == 0:
            loss_q, loss_act = agent.update()
        if done:
            # if reward > 90:
            #     print("wow")
            break
    if i > 0 and i % 50 == 0:
        run_episode(env, agent, True)[2]   # render one episode with the current policy
        print("itr:({:>5d}) loss_q:{:>3.4f} loss_act:{:>3.4f} reward:{:>3.1f}".format(
            i, np.mean(loss_q_list[-50:]),
            np.mean(loss_act_list[-50:]),
            np.mean(reward_list[-50:])))
    # decay the exploration noise once per episode, not per step
    if agent.noise > 0.005:
        agent.noise -= (1 / 200)
    loss_q_list.append(loss_q)
    loss_act_list.append(loss_act)
    reward_list.append(rew)
```
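To produce the reward curve the report asks for, here is a minimal plotting sketch (matplotlib is an assumption; any plotting tool works):

```python
import matplotlib.pyplot as plt
import numpy as np

plt.plot(reward_list, alpha=0.4, label="episode reward")
# 20-episode moving average to smooth out the noisy per-episode rewards
smoothed = np.convolve(reward_list, np.ones(20) / 20, mode="valid")
plt.plot(range(19, 19 + len(smoothed)), smoothed, label="20-episode mean")
plt.xlabel("episode")
plt.ylabel("reward")
plt.legend()
plt.show()
```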
My training curves look like this: