In this lab, you need to implement SAC to solve the Pendulum-v0 task.

Understand SAC

Read the SAC document from OpenAI Spinning Up HERE.

Focus on understanding the twin Q networks and entropy regularization. Feel free to skip the action squashing part, since we have a similar element in PPO.

Start Code

Adapt code from both DDPG (because SAC is off-policy and uses a replay buffer) and PPO (because it uses a stochastic policy).

What to change

Stochastic Policy

  • Use the DDPG code as a base and migrate the related code from PPO to change the deterministic policy into a stochastic policy.

  • Keep the neural network structure similar to DDPG. Only change the last layer to output \(\mu\) and \(\sigma\) (remove the Tanh); a sketch follows this list.
  • Remove the target network for the actor (target_act_network) to be consistent with the original paper's setup.
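
For reference, here is a minimal sketch of what such an actor could look like, assuming the same 400/300 hidden sizes as the Q networks below; the class and attribute names (StochasticActor, mu_head, log_std_head) are placeholders rather than names from the provided code:

  import torch.nn as nn

  class StochasticActor(nn.Module):
      def __init__(self, n_state, n_action):
          super().__init__()
          self.body = nn.Sequential(
              nn.Linear(n_state, 400),
              nn.ReLU(),
              nn.Linear(400, 300),
              nn.ReLU(),
          )
          # Two output heads instead of the single Tanh output layer:
          # the mean and the log standard deviation of the Gaussian policy
          self.mu_head = nn.Linear(300, n_action)
          self.log_std_head = nn.Linear(300, n_action)

      def forward(self, obs):
          h = self.body(obs)
          mu = self.mu_head(h)
          std = self.log_std_head(h).exp()  # exp() keeps the std positive
          return mu, std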

Twin Q networks

  • Rename the original q_net as q1_net and target_q_net as target_q1_net.

  • Create another Q network as q2_net. Feel free to use deepcopy.

        self.q1_net = nn.Sequential(
            nn.Linear(n_state + n_action, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, 1)
        )
        self.q2_net = nn.Sequential(
            nn.Linear(n_state + n_action, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, 1)
        )
        # Target networks start as exact copies of the online networks
        # (requires `import copy` at the top of the file)
        self.q1_net.to(device)
        self.target_q1_net = copy.deepcopy(self.q1_net)
        self.target_q1_net.to(device)
        self.q2_net.to(device)
        self.target_q2_net = copy.deepcopy(self.q2_net)
        self.target_q2_net.to(device)

You also need to create a dedicated optimizer for q2_net.
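
For example, mirroring whatever optimizer the DDPG base uses for q1_net (Adam and the attribute names q1_optimizer, q2_optimizer, and q_lr here are assumptions, not names from the provided code):

  import torch.optim as optim

  self.q1_optimizer = optim.Adam(self.q1_net.parameters(), lr=self.q_lr)
  self.q2_optimizer = optim.Adam(self.q2_net.parameters(), lr=self.q_lr)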

  • When calculating the log-probability loss for the actor network, refer to the similar part in PPO. Carefully follow the pseudocode in the Spinning Up document to make sure the target networks are used in the correct places.

A new kind of Q

As one of the primary contributions of the paper, SAC adopts soft Q-learning, which incorporates entropy into the Q function. More mathematical details can be found in the paper or the OpenAI tutorial.
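
For reference, the TD target used below matches the Spinning Up pseudocode:

\[
y = r + \gamma (1 - d)\left(\min_{i=1,2} Q_{\mathrm{targ},i}(s', \tilde{a}') - \alpha \log \pi(\tilde{a}' \mid s')\right), \qquad \tilde{a}' \sim \pi(\cdot \mid s')
\]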

If you have the twin Q networks and the stochastic policy network set up, the TD target of the soft Q function can be calculated as

  # act_ is a fresh action sampled from the current policy at next_obs
  # (see the reparameterization section below)
  logprob_ = dist.log_prob(act_).squeeze()
  # Calculate the TD target y; the entropy term sits inside the discounted
  # part, as in the Spinning Up pseudocode
  q_input = torch.cat([next_obs, act_], dim=1)
  q1_targ = self.target_q1_net(q_input).squeeze()
  q2_targ = self.target_q2_net(q_input).squeeze()
  y = reward + self.gamma * (1 - done) * \
      (torch.min(q1_targ, q2_targ) - self.alpha * logprob_)

Note that logprob_ here needs to be calculated the same way we did in PPO. self.alpha is a hyperparameter called the temperature factor. Feel free to set it to 0.2 for now; you will need to tweak this parameter later.

Reparameterization trick for training act_net

Here is an implementation you can adapt to do the reparameterization in PyTorch:

  from torch.distributions import Normal

  # Note: the second argument of Normal is the standard deviation (scale)
  dist = Normal(mu, std)
  act_ = dist.rsample()
  logprob_ = dist.log_prob(act_).squeeze()

Here, mu and std are the outputs of act_net (the mean and standard deviation of the Gaussian policy). By using rsample() instead of sample(), act_ carries gradients back to act_net.

To calculate the loss of act_net, you can do

  logprob_ = dist.log_prob(act_).squeeze()
  loss_act = (self.alpha*logprob_ - y).mean()

where y is the minimum of the Q values from q1_net and q2_net (not the target networks).
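
Assuming obs is the current batch of observations and act_ is the reparameterized action from above, a minimal sketch of that minimum could look like:

  # Evaluate the fresh, reparameterized action with the current Q networks;
  # the target networks are only used for the TD target above
  q_input = torch.cat([obs, act_], dim=1)
  y = torch.min(self.q1_net(q_input).squeeze(),
                self.q2_net(q_input).squeeze())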

Write Report

  • Run the program with three self.alpha settings: 0.0, 0.2, and 0.5. Record the training logs from the three setups and plot the reward curves.
  • Write your observation/insight on the impact of self.alpha.
  • Given the several contributions of SAC (e.g., twin Q networks, soft Q-learning), which part do you think has the most significant impact on performance?

Deliverables and Rubrics

Overall, you need to complete the environment installation and be able to run the demo code. You need to submit:

  • (70 pts) A PDF exported from running your code in a Jupyter notebook, with the training rewards reported in the program output.
  • (15 pts) Performance: the average reward reaches above -300 within 500 trials.
  • (15 pts) Reasonable answers to the questions above (backed up with experiment results).

Debugging Tips

As a reference, here are my training curves compared with DDPG and PPO:

[Figure: reward curves of SAC compared with DDPG and PPO]