In this lab, you will implement Soft Actor-Critic (SAC) to solve Pendulum-v1.

Understand SAC

Read the OpenAI Spinning Up documentation on Soft Actor-Critic.

Focus on two ideas: the twin Q networks and entropy regularization. Feel free to skip the action-squashing (tanh) correction in the log-probability — we already clip actions explicitly as we did in PPO and DDPG.

Why SAC exists (context for this lab)

SAC is an off-policy actor–critic algorithm that combines three ingredients you have already seen separately:

  • From DDPG (Lab 04): a replay buffer and target Q networks that let the agent learn from off-policy transitions.
  • From A2C/PPO (Lab 05, Lecture 25): a stochastic Gaussian policy that explores by sampling, rather than by adding external noise to a deterministic policy.
  • New in SAC:
    • Twin Q networks — keep two independent critics and take the minimum of their estimates when building the TD target. This curbs the well-known overestimation bias of single-critic value learning (the same motivation as Double DQN / TD3).
    • Maximum-entropy objective — the agent is trained to maximize \(\sum_t \big( r_t + \alpha\, H(\pi(\cdot \mid s_t)) \big)\) rather than just \(\sum_t r_t\). The extra entropy term, weighted by a temperature \(\alpha\), directly rewards the policy for staying stochastic, giving robust exploration without hand-tuned noise schedules.

Starter Code

Adapt code from DDPG (for the replay buffer and off-policy update) and PPO (for the stochastic Gaussian policy).

What to change

1. Stochastic policy

  • Use the DDPG code as a base; port the stochastic actor from PPO.
  • Keep the actor network structure similar to DDPG. Change only the last layer to output \(2 \cdot n_\text{action}\) values — the first half becomes \(\mu\), the second half becomes \(\sigma\). Remove the final tanh on the raw output; apply tanh only to the \(\mu\) half (and scale by self.act_lim). The \(\sigma\) half must be made positive (e.g. abs() or softplus) and is often clamped to a small range such as [1e-3, 2] to avoid numerical blow-up.
  • Remove the target network for the actor (target_act_net). SAC’s actor is trained directly against the current (live) Q networks, so no actor-target is needed — this matches the original paper.

2. Twin Q networks

  • Rename the original q_netq1_net and target_q_nettarget_q1_net.
  • Create a second Q network q2_net with the same architecture but independent initialization. (Using copy.deepcopy(self.q1_net) would give identical weights, which defeats the purpose of having two critics. Build q2_net as a fresh nn.Sequential so it starts with different random weights. Only the target networks should be deep-copied from their live counterparts.)
self.q1_net = nn.Sequential(
    nn.Linear(n_state + n_action, 400),
    nn.ReLU(),
    nn.Linear(400, 300),
    nn.ReLU(),
    nn.Linear(300, 1)
)
self.q2_net = nn.Sequential(
    nn.Linear(n_state + n_action, 400),
    nn.ReLU(),
    nn.Linear(400, 300),
    nn.ReLU(),
    nn.Linear(300, 1)
)
self.q1_net.to(device)
self.target_q1_net = copy.deepcopy(self.q1_net)
self.target_q1_net.to(device)
self.q2_net.to(device)
self.target_q2_net = copy.deepcopy(self.q2_net)
self.target_q2_net.to(device)
  • Give q2_net its own Adam optimizer, just like q1_net.

3. The soft Q target

SAC’s headline contribution is soft Q-learning: the TD target carries an extra entropy bonus so that the learned value function reflects both reward-to-go and expected future entropy.

Given a transition \((s, a, r, s', d)\) from the replay buffer, sample a next action \(\tilde{a}' \sim \pi_\theta(\cdot \mid s')\) from the current stochastic policy, then form:

\[y = r + \gamma\,(1 - d)\, \Big( \min_{i=1,2} Q_{\bar\phi_i}(s', \tilde{a}') \;-\; \alpha \, \log \pi_\theta(\tilde{a}' \mid s') \Big)\]

The \(\min\) over the target critics controls overestimation; the \(-\alpha \log \pi\) term is the soft-Q entropy bonus. Note that the entropy term sits inside the \(\gamma(1-d)\) factor — at a terminal transition it drops out along with the bootstrap.

In code, computed inside torch.no_grad():

# Sample next action from the current policy at s'
output = self.act_net(next_obs)
mu = self.act_lim * torch.tanh(output[:, :n_action])
sigma = torch.abs(output[:, n_action:]).clamp(1e-3, 2)
dist = Normal(mu, sigma)
act_ = dist.sample()
logprob_ = dist.log_prob(act_).squeeze()

# Target Q values at (s', a')
q_input = torch.cat([next_obs, act_], axis=1)
q1_t = self.target_q1_net(q_input).squeeze()
q2_t = self.target_q2_net(q_input).squeeze()

# Soft-Q TD target
y = reward + self.gamma * (1 - done) * (torch.min(q1_t, q2_t) - self.alpha * logprob_)

Both q1_net and q2_net are then trained with MSE loss against the same target y.

self.alpha is the temperature: a hyper-parameter that trades off exploitation vs. exploration. Start with 0.2; you will vary it in the experiments below.

4. Reparameterization trick for the actor update

When training the actor, we want the gradient to flow through the sampled action back into \(\mu\) and \(\sigma\). A plain dist.sample() breaks this gradient (sampling is not differentiable). PyTorch provides the reparameterized sampler rsample(), which expresses the sample as \(a = \mu + \sigma \cdot \epsilon\) with \(\epsilon \sim \text{Normal}(0, 1)\) — this is differentiable in \(\mu\) and \(\sigma\).

dist = Normal(mu, sigma)
act_ = dist.rsample()              # differentiable sample
logprob_ = dist.log_prob(act_).squeeze()

The SAC actor loss is the negative of the soft Q value:

\[L_\text{act} = E_{s \sim D,\, \tilde{a} \sim \pi_\theta}\!\left[\, \alpha \, \log \pi_\theta(\tilde{a} \mid s) \;-\; \min_{i=1,2} Q_{\phi_i}(s, \tilde{a}) \,\right]\]

i.e. the policy is pushed toward actions with high Q and high entropy (low \(\log \pi\)). This is almost the same as DDPG but with min-double-Q and the entropy term for the stochastic policy. In code:

q_input = torch.cat([obs, act_], axis=1)
q_min = torch.min(self.q1_net(q_input), self.q2_net(q_input)).squeeze()
loss_act = (self.alpha * logprob_ - q_min).mean()

Note that the Q networks here are the live ones (not targets) and we do not wrap this in no_grad — gradients need to flow through q_min → act_ → (mu, sigma) → act_net.


Write-up

  • Run the program with three temperature settings: self.alpha = 0.0, 0.2, 0.5. Record the training log and plot the reward curves on the same axes.
  • Describe your observation of how self.alpha affects exploration and final return. What symptom do you see at alpha = 0.0? What symptom at alpha = 0.5?
  • Of SAC’s main contributions — twin Q networks, soft-Q learning / entropy bonus, and the reparameterized stochastic actor — which do you believe contributes most to the improvement over DDPG? Back your answer with an ablation or with a reasoned argument from the observed losses.

Deliverables and Rubrics

Points Requirement
70 pts PDF exported from the Jupyter notebook, including reward, loss_q, and loss_act curves and the final-score printout.
15 pts Performance: average reward reaches above −300 within 500 training episodes.
15 pts Reasonable answers to the write-up questions, supported by your experiment plots.

Debugging Tips

If SAC performs worse than A2C:

  • Check that both Q-networks are updating — print loss_q1 and loss_q2 separately; they should evolve independently.
  • Verify the entropy term is in the Q target: y = r + γ(1-d) * (min(Q1, Q2) − α · log π(ã′|s′)).
  • Ensure target networks are syncing correctly (either periodic hard copy or Polyak averaging — match your DDPG choice).
  • Confirm the actor update uses rsample(), not sample() — otherwise no gradient reaches the actor.
  • Make sure q2_net is a freshly-initialized network, not a deepcopy of q1_net. Two identical critics collapse back to a single critic and you lose the overestimation fix.
  • If \(\sigma\) collapses to zero (policy becomes deterministic and reward stalls), check that you are clamping \(\sigma\) to a positive floor and that alpha > 0.

Reference training curves

Below are my training curves comparing SAC against DDPG and PPO on Pendulum-v1:

reward