DDPG Debugging Guide
When implementing Deep Deterministic Policy Gradient (DDPG), you'll likely encounter issues. This guide helps you diagnose and fix them.
Debugging Workflow
Follow this systematic approach:
- Check Environment Setup → Ensure the gymnasium API is used correctly (a quick sanity-check sketch follows this list)
- Verify Network Architecture → Print shapes, ensure the forward pass works
- Inspect Training Loop → Add logging for losses, rewards, Q-values
- Analyze Learning Curves → Plot rewards, identify patterns
- Test Components in Isolation → Disable parts to find the culprit
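Steps 1 and 2 can be checked before any training with a short sanity script. The sketch below is a minimal example for MountainCarContinuous-v0 (the environment used throughout this guide); swap in your own environment ID if you are working on a different task.

import gymnasium as gym

env = gym.make("MountainCarContinuous-v0")
obs, info = env.reset(seed=0)              # gymnasium: reset() returns (obs, info)
print("Observation shape:", obs.shape)     # expected: (2,)
print("Action space:", env.action_space)   # expected: Box(-1.0, 1.0, (1,), float32)

# One random step to confirm the 5-tuple step API
action = env.action_space.sample()
obs, reward, done, truncated, info = env.step(action)
print("Step returns:", obs.shape, reward, done, truncated)

Once this runs, pass a batched observation through your actor and critic and print the output shapes before starting the training loop.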
Common Issues and Solutions
Issue 1: Reward Not Improving
Symptom: Reward stays constant (around -100) or barely improves over 200+ episodes
Possible Cause A: Learning Rates Too High
Check:
# Add to training loop:
print(f"Actor loss: {actor_loss.item():.4f}, Critic loss: {critic_loss.item():.4f}")
Expected: Losses should be < 100 and gradually decreasing. If losses exceed 1000 or become NaN, the learning rates are too high.
Solution:
# Typical good values:
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-4) # Slower for actor
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3) # Faster for critic
Why different learning rates? The actor learns the policy and must stay stable, while the critic learns value estimates and can safely update faster.
Possible Cause B: Incorrect Action Scaling
Check:
# Before env.step(), add:
print(f"Action range: [{action.min():.2f}, {action.max():.2f}]")
Expected: For MountainCarContinuous, actions should lie in [-1.0, 1.0]. If they fall outside this range, the actions are not scaled correctly.
Solution:
# In your actor network:
def forward(self, state):
    x = self.network(state)
    action = self.action_limit * torch.tanh(x)  # Squash to [-action_limit, action_limit]
    return action
# When creating actor:
actor = Actor(state_dim, action_dim, action_limit=env.action_space.high[0])
Possible Cause C: Target Network Not Updating
Check:
# After polyak_update, add:
if episode % 50 == 0:
    print(f"Target Q mean: {target_critic(sample_state, sample_action).mean():.2f}")
Expected: The value should slowly change over episodes. If it stays constant, the target network isn't being updated.
Solution:
# In training loop (AFTER gradient step):
from shared_utils.drl_utils import polyak_update
polyak_update(actor, target_actor, tau=0.001)
polyak_update(critic, target_critic, tau=0.001)
Common mistake: Forgetting to update target networks, or updating before online networks
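The solution above imports polyak_update from the course's shared_utils package. If you write the helper yourself, it is just a soft (exponential moving average) copy of the online parameters into the target network; the function below is a hedged sketch of that idea, not necessarily the exact shared_utils implementation.

import torch

def polyak_update(online_net, target_net, tau=0.001):
    # target <- tau * online + (1 - tau) * target, applied parameter by parameter
    with torch.no_grad():
        for online_param, target_param in zip(online_net.parameters(),
                                              target_net.parameters()):
            target_param.data.mul_(1.0 - tau)
            target_param.data.add_(tau * online_param.data)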
Possible Cause D: Insufficient Exploration Noise
Check:
# During training, action should have noise:
action = actor(state)
noisy_action = action + exploration_noise
print(f"Noise scale: {exploration_noise.abs().mean():.3f}")
Expected: The noise scale should be 0.05 - 0.2 early in training. If it is too small (< 0.01), the agent is not exploring enough.
Solution:
# Add Gaussian noise during training:
action = actor(state).detach().cpu().numpy()
noise = np.random.normal(0, 0.1, size=action.shape) # Mean=0, Std=0.1
action = np.clip(action + noise, -action_limit, action_limit)
Optionally: Decay noise over time:
noise_scale = max(0.01, 0.2 * (1 - episode / n_episodes))
Issue 2: Training Diverges (NaN losses)
Symptom: After some episodes (50-150), losses become NaN and training crashes
Cause A: Gradient Explosion
Check:
# After loss.backward(), before optimizer.step():
total_norm = 0
for p in actor.parameters():
    param_norm = p.grad.data.norm(2)
    total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f'Gradient norm: {total_norm:.2f}')
Expected: The gradient norm should be < 10. If it exceeds 100, the gradients are exploding.
Solution:
# Add gradient clipping:
actor_loss.backward()
torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
actor_optimizer.step()
critic_loss.backward()
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
critic_optimizer.step()
Cause B: Q-Values Exploding
Check:
# During training:
with torch.no_grad():
    q_val = critic(state_batch, action_batch)
print(f"Q-value range: [{q_val.min():.1f}, {q_val.max():.1f}]")
Expected: For MountainCar, Q-values should stay in [-100, 100]. If they exceed 1000, the Q-network is unstable.
Solutions:
- Reduce the learning rate:
  lr=1e-4  # instead of 1e-3
- Use reward scaling:
  reward = reward / 10.0  # Scale rewards to a smaller range
- Ensure target networks are used when computing the target Q-value:
  with torch.no_grad():
      target_q = reward + gamma * target_critic(next_state, target_actor(next_state))
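The target expression above omits one detail that matters in practice: at terminal states there is no future value to bootstrap from. Below is a hedged sketch of a full critic update that adds the (1 - done) mask; the tensor shapes (batch_size, 1) for rewards and dones are assumptions about how your batch is stored.

import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, critic_optimizer,
                  states, actions, rewards, next_states, dones, gamma=0.99):
    # Bootstrapped target; (1 - dones) removes the future term at terminal states
    with torch.no_grad():
        next_actions = target_actor(next_states)
        target_q = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions)

    current_q = critic(states, actions)           # shape: (batch_size, 1)
    critic_loss = F.mse_loss(current_q, target_q)

    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    return critic_loss.item()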
Cause C: Incorrect Tensor Operations
Check:
# Common mistakes:

# Wrong: mixing numpy and torch without conversion
action = actor(state)  # torch tensor
env.step(action)       # ERROR if env expects a numpy array

# Correct:
action = actor(state).detach().cpu().numpy()
env.step(action)

# Wrong: broadcasting issues
reward = torch.FloatTensor(reward)       # shape: (batch_size,)
q_value = critic(state, action)          # shape: (batch_size, 1)
loss = (q_value - reward).pow(2).mean()  # BUG: silently broadcasts to (batch_size, batch_size)

# Correct:
reward = torch.FloatTensor(reward).unsqueeze(1)  # shape: (batch_size, 1)
Issue 3: Slow Convergence
Symptom: Training works but takes 500+ episodes to solve MountainCar (should be ~200-300)
Optimization Checklist:
| Factor | Recommended Value | Check |
|---|---|---|
| Batch size | 64 - 128 | Small batch → high variance |
| Replay buffer | ≥ 100,000 | Small buffer → lack of diversity |
| Start training after | ≥ 1,000 samples | Training too early → unstable |
| Actor LR | 1e-4 | Too low → slow learning |
| Critic LR | 1e-3 | Too low → poor value estimates |
| Gamma | 0.99 | Too low → ignores future rewards |
| Tau | 0.001 | Too high → unstable targets |
| Exploration noise | 0.1 initially | Too low → not exploring |
Quick fix: Try these recommended hyperparameters first:
config = {
    'batch_size': 64,
    'buffer_capacity': 100000,
    'actor_lr': 1e-4,
    'critic_lr': 1e-3,
    'gamma': 0.99,
    'tau': 0.001,
    'noise_scale': 0.1,
}
Issue 4: Unstable Learning Curves
Symptom: Reward oscillates wildly (e.g., 80 → -50 → 90 → -30)
Causes and Fixes:
- Target networks updating too fast
  tau = 0.001  # Try a smaller value like 0.0001
- No reward averaging in evaluation
  # Instead of:
  test_reward = run_episode(env, actor)
  # Use an average over multiple episodes:
  test_rewards = [run_episode(env, actor) for _ in range(10)]
  avg_reward = np.mean(test_rewards)
- Exploration during testing
  # Ensure no noise during evaluation:
  def evaluate_policy(env, actor, n_episodes=10):
      rewards = []
      for _ in range(n_episodes):
          obs, _ = env.reset()
          total_reward = 0
          done, truncated = False, False
          while not (done or truncated):
              with torch.no_grad():
                  action = actor(torch.FloatTensor(obs)).numpy()  # NO noise
              obs, reward, done, truncated, _ = env.step(action)
              total_reward += reward
          rewards.append(total_reward)
      return np.mean(rewards)
Expected Training Curves
Healthy DDPG on MountainCarContinuous-v0
Episodes 0-50: Random exploration, reward ≈ -100 to -80
Episodes 50-100: Car starts learning, reward ≈ -80 to -50
Episodes 100-200: Rapid improvement, reward ≈ -50 to 0
Episodes 200-300: Converging to solution, reward ≈ 0 to 90
Episodes 300+: Solved, reward ≈ 90-95 (consistently)
Red flags:
- No improvement by episode 100 → Check learning rates, action scaling
- Sudden drop after improving → Target network instability
- Constant oscillation → Reduce tau, average test rewards
Verification Checklist
Before asking for help, verify these:
Environment Setup
- Using import gymnasium as gym (not gym)
- obs, info = env.reset() returns a tuple
- obs, reward, done, truncated, info = env.step(action) (5 return values)
- Actions are numpy arrays with correct shape and range
Network Architecture
- Actor outputs actions in correct range (use tanh)
- Critic takes (state, action) → Q-value (a minimal sketch follows this checklist)
- Target networks exist and are initialized correctly
- Networks moved to correct device (CPU/GPU)
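For reference, here is a minimal actor/critic pair that satisfies the items above. It is a sketch under assumptions: the two hidden layers of size 256 are a common default, not a requirement, and your own class names or layer sizes may differ.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, action_limit):
        super().__init__()
        self.action_limit = action_limit
        self.network = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state):
        # tanh keeps actions in [-action_limit, action_limit]
        return self.action_limit * torch.tanh(self.network(state))

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        # Critic takes (state, action) and returns one Q-value per sample
        return self.network(torch.cat([state, action], dim=-1))

Target networks are usually created as deep copies of these online networks (e.g., copy.deepcopy(actor)) so that online and target weights start identical.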
Training Loop
- Replay buffer has > batch_size samples before training (see the buffer sketch after this checklist)
- Target networks updated AFTER online networks
- Polyak update uses tau ≈ 0.001
- No gradient flow through target networks (with torch.no_grad())
- Actions have exploration noise during training
- No noise during evaluation
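The diagnostics later in this guide call buffer.sample(batch_size) and len(buffer). If you have not written a replay buffer yet, the class below is a minimal sketch that matches that interface; storing transitions as plain (state, action, reward, next_state, done) tuples is an assumption, not the only valid layout.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sample of stored transitions
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)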
Logging
- Losses are logged and reasonable (< 100)
- Q-values are logged and bounded
- Rewards are plotted and improving
- Random seed is set for reproducibility
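For the reproducibility item, a common pattern is to seed every source of randomness near the top of the script. The snippet below is a sketch of the usual suspects; the seed value 42 is arbitrary.

import random
import numpy as np
import torch
import gymnasium as gym

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

env = gym.make("MountainCarContinuous-v0")
obs, info = env.reset(seed=SEED)   # gymnasium environments are seeded through reset()
env.action_space.seed(SEED)        # seeds action_space.sample() used for warm-up steps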
Quick Debugging Code
Add this to your training loop for comprehensive diagnostics:
# After every 10 episodes:
if episode % 10 == 0:
    print(f"\n{'='*60}")
    print(f"Episode {episode} Diagnostics")
    print(f"{'='*60}")

    # Sample a batch
    batch = buffer.sample(64)
    states = torch.FloatTensor([s for s, _, _, _, _ in batch])
    actions = torch.FloatTensor([a for _, a, _, _, _ in batch])

    # Check Q-values
    with torch.no_grad():
        q_vals = critic(states, actions)
    print(f"Q-value range: [{q_vals.min():.2f}, {q_vals.max():.2f}]")
    print(f"Q-value mean: {q_vals.mean():.2f}")

    # Check actor outputs
    with torch.no_grad():
        pred_actions = actor(states)
    print(f"Action range: [{pred_actions.min():.2f}, {pred_actions.max():.2f}]")

    # Check recent rewards
    print(f"Recent avg reward: {np.mean(recent_rewards[-10:]):.2f}")
    print(f"Buffer size: {len(buffer)}")
    print()
Additional Resources
DDPG Paper:
- Lillicrap et al., "Continuous control with deep reinforcement learning" (2015)
- https://arxiv.org/abs/1509.02971
Implementation References:
- OpenAI Spinning Up: https://spinningup.openai.com/en/latest/algorithms/ddpg.html
- PyTorch DDPG example: Check the Lecture_18_DQN folder for similar patterns
Pro Tips
- Start simple: Get vanilla DDPG working before adding improvements
- Use TensorBoard: Log everything (losses, Q-values, rewards, gradients); see the sketch after this list
- Test components: Verify replay buffer, networks, target updates separately
- Compare to baseline: Use my demo code as reference
- Patience: DDPG can be unstable; if one run fails, try different seed
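To act on the TensorBoard tip, the snippet below sketches minimal scalar logging with torch.utils.tensorboard; the tag names, and the global_step / episode_reward counters, are placeholders for whatever your training loop already tracks.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/ddpg_mountaincar")

# Inside the training loop, after each update / at the end of each episode:
writer.add_scalar("loss/actor", actor_loss.item(), global_step)
writer.add_scalar("loss/critic", critic_loss.item(), global_step)
writer.add_scalar("train/episode_reward", episode_reward, episode)

writer.close()  # flush logs; view with: tensorboard --logdir runs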
When stuck: Come to office hours with:
- Your code
- Training curve plot
- Logged losses and Q-values
- Description of what you've tried
Good luck!