DDPG Debugging Guide

When implementing Deep Deterministic Policy Gradient (DDPG), you’ll likely encounter issues. This guide helps you diagnose and fix them.


πŸ” Debugging Workflow

Follow this systematic approach:

  1. Check Environment Setup β†’ Ensure the gymnasium API is used correctly (smoke test after this list)
  2. Verify Network Architecture β†’ Print shapes, ensure forward pass works
  3. Inspect Training Loop β†’ Add logging for losses, rewards, Q-values
  4. Analyze Learning Curves β†’ Plot rewards, identify patterns
  5. Test Component Isolation β†’ Disable parts to find the culprit
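
For step 1, a quick smoke test of the gymnasium API (a minimal sketch; the environment name matches the examples used below):

import gymnasium as gym

env = gym.make("MountainCarContinuous-v0")
obs, info = env.reset(seed=0)                           # reset returns (obs, info)
action = env.action_space.sample()                      # numpy array in the valid range
obs, reward, done, truncated, info = env.step(action)   # step returns 5 values
print(obs.shape, env.action_space.low, env.action_space.high)
env.close()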

πŸ› Common Issues and Solutions

Issue 1: Reward Not Improving

Symptom: Reward stays constant (around -100) or barely improves over 200+ episodes

Possible Cause A: Learning Rates Too High

Check:

# Add to training loop:
print(f"Actor loss: {actor_loss.item():.4f}, Critic loss: {critic_loss.item():.4f}")

Expected: Losses should be < 100 and gradually decreasing.
If losses > 1000 or NaN: Learning rates are too high.

Solution:

# Typical good values:
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-4)  # Slower for actor
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)  # Faster for critic

Why different LRs? Actor learns policy (must be stable), critic learns values (can update faster)


Possible Cause B: Incorrect Action Scaling

Check:

# Before env.step(), add:
print(f"Action range: [{action.min():.2f}, {action.max():.2f}]")

Expected: For MountainCarContinuous, actions in [-1.0, 1.0].
If outside this range: Actions are not scaled correctly.

Solution:

# In your actor network:
def forward(self, state):
    x = self.network(state)
    action = self.action_limit * torch.tanh(x)  # Squash to [-act_lim, act_lim]
    return action

# When creating actor:
actor = Actor(state_dim, action_dim, action_limit=env.action_space.high[0])
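
If you want a self-contained reference, a minimal Actor could look like this (a sketch; the two 256-unit hidden layers are an illustrative choice, not a requirement):

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, action_limit):
        super().__init__()
        # Hidden sizes are illustrative; tune them for your task
        self.network = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.action_limit = action_limit

    def forward(self, state):
        # tanh squashes to [-1, 1]; scaling maps to [-action_limit, action_limit]
        return self.action_limit * torch.tanh(self.network(state))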

Possible Cause C: Target Network Not Updating

Check:

# After polyak_update, probe the target critic on a fixed (state, action) batch:
if episode % 50 == 0:
    print(f"Target Q mean: {target_critic(sample_state, sample_action).mean():.2f}")

Expected: Should slowly change over episodes.
If constant: Target network isn’t being updated.

Solution:

# In training loop (AFTER gradient step):
from shared_utils.drl_utils import polyak_update

polyak_update(actor, target_actor, tau=0.001)
polyak_update(critic, target_critic, tau=0.001)

Common mistake: Forgetting to update target networks, or updating before online networks
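
If you don’t have shared_utils available, the standard polyak (soft) update is short enough to write yourself (a sketch consistent with tau=0.001 above):

import torch

def polyak_update(online_net, target_net, tau):
    # target <- (1 - tau) * target + tau * online, in place
    with torch.no_grad():
        for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
            p_targ.mul_(1 - tau)
            p_targ.add_(tau * p)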


Possible Cause D: Insufficient Exploration Noise

Check:

# During training, action should have noise:
action = actor(state)
noisy_action = action + exploration_noise
print(f"Noise scale: {exploration_noise.abs().mean():.3f}")

Expected: Noise scale should be 0.05 - 0.2 early in training.
If too small (< 0.01): Not exploring enough.

Solution:

# Add Gaussian noise during training:
action = actor(state).detach().cpu().numpy()
noise = np.random.normal(0, 0.1, size=action.shape)  # Mean=0, Std=0.1
action = np.clip(action + noise, -action_limit, action_limit)

Optionally, decay the noise over time:

noise_scale = max(0.01, 0.2 * (1 - episode / n_episodes))
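
Putting this fix together, a small helper for training-time action selection (a sketch; select_action is a name introduced here):

import numpy as np
import torch

def select_action(actor, obs, noise_scale, action_limit):
    # Deterministic policy output plus Gaussian exploration noise
    with torch.no_grad():
        action = actor(torch.FloatTensor(obs)).cpu().numpy()
    action += np.random.normal(0, noise_scale, size=action.shape)
    return np.clip(action, -action_limit, action_limit)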

Issue 2: Training Diverges (NaN losses)

Symptom: After some episodes (50-150), losses become NaN and training crashes

Cause A: Gradient Explosion

Check:

# After loss.backward(), before optimizer.step():
total_norm = 0
for p in actor.parameters():
    param_norm = p.grad.data.norm(2)
    total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f'Gradient norm: {total_norm:.2f}')

Expected: Should be < 10.
If > 100: Gradients are exploding.

Solution:

# Add gradient clipping:
actor_loss.backward()
torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
actor_optimizer.step()

critic_loss.backward()
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
critic_optimizer.step()
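
Handy side effect: clip_grad_norm_ returns the total gradient norm before clipping, so it can replace the manual norm loop above:

grad_norm = torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
print(f"Actor grad norm (pre-clip): {grad_norm.item():.2f}")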

Cause B: Q-Values Exploding

Check:

# During training:
with torch.no_grad():
    q_val = critic(state_batch, action_batch)
    print(f"Q-value range: [{q_val.min():.1f}, {q_val.max():.1f}]")

Expected: For MountainCar, Q-values should be in [-100, 100].
If > 1000: Q-network is unstable.

Solutions:

  1. Reduce learning rate: lr=1e-4 instead of 1e-3
  2. Use reward scaling:
    reward = reward / 10.0  # Scale rewards to smaller range
    
  3. Ensure target networks are used:
    # Target Q-value (mask bootstrapping at terminal states with (1 - done)):
    with torch.no_grad():
        target_q = reward + gamma * (1 - done) * target_critic(next_state, target_actor(next_state))
    

Cause C: Incorrect Tensor Operations

Check:

# Common mistakes:
# ❌ Mixing numpy and torch without conversion
action = actor(state)  # torch tensor
env.step(action)  # ERROR if env expects numpy

# βœ… Correct:
action = actor(state).detach().cpu().numpy()
env.step(action)

# ❌ Broadcasting issues
reward = torch.FloatTensor(reward)  # shape: (batch_size,)
q_value = critic(state, action)  # shape: (batch_size, 1)
loss = (q_value - reward).pow(2).mean()  # BUG: silently broadcasts to (batch_size, batch_size)

# βœ… Correct:
reward = torch.FloatTensor(reward).unsqueeze(1)  # shape: (batch_size, 1)
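
Putting the conversions together, a shape-correct critic update might look like this (a sketch; it assumes states, actions, rewards, next_states, dones are numpy arrays already sampled from your replay buffer):

import numpy as np
import torch
import torch.nn.functional as F

states      = torch.FloatTensor(np.array(states))
actions     = torch.FloatTensor(np.array(actions))
rewards     = torch.FloatTensor(np.array(rewards)).unsqueeze(1)   # (batch_size, 1)
next_states = torch.FloatTensor(np.array(next_states))
dones       = torch.FloatTensor(np.array(dones)).unsqueeze(1)     # (batch_size, 1)

with torch.no_grad():  # no gradient flow through target networks
    target_q = rewards + gamma * (1 - dones) * target_critic(next_states, target_actor(next_states))

q = critic(states, actions)            # (batch_size, 1)
critic_loss = F.mse_loss(q, target_q)  # shapes match: no silent broadcasting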

Issue 3: Slow Convergence

Symptom: Training works but takes 500+ episodes to solve MountainCar (should be ~200-300)

Optimization Checklist:

Factor               | Recommended value | If misconfigured
---------------------|-------------------|----------------------------------
Batch size           | 64 - 128          | Small batch β†’ high variance
Replay buffer        | β‰₯ 100,000         | Small buffer β†’ lack of diversity
Start training after | β‰₯ 1,000 samples   | Training too early β†’ unstable
Actor LR             | 1e-4              | Too low β†’ slow learning
Critic LR            | 1e-3              | Too low β†’ poor value estimates
Gamma                | 0.99              | Too low β†’ ignores future rewards
Tau                  | 0.001             | Too high β†’ unstable targets
Exploration noise    | 0.1 initially     | Too low β†’ not exploring

Quick fix: Try these recommended hyperparameters first:

config = {
    'batch_size': 64,
    'buffer_capacity': 100000,
    'actor_lr': 1e-4,
    'critic_lr': 1e-3,
    'gamma': 0.99,
    'tau': 0.001,
    'noise_scale': 0.1,
}

Issue 4: Unstable Learning Curves

Symptom: Reward oscillates wildly (e.g., 80 β†’ -50 β†’ 90 β†’ -30)

Causes and Fixes:

  1. Target networks updating too fast
    tau = 0.001  # Try smaller value like 0.0001
    
  2. No reward averaging in evaluation
    # Instead of: test_reward = run_episode(env, actor)
    # Use average over multiple episodes:
    test_rewards = [run_episode(env, actor) for _ in range(10)]
    avg_reward = np.mean(test_rewards)
    
  3. Exploration during testing
    # Ensure no noise during evaluation:
    def evaluate_policy(env, actor, n_episodes=10):
        rewards = []
        for _ in range(n_episodes):
            obs, _ = env.reset()
            total_reward = 0
            done, truncated = False, False
    
            while not (done or truncated):
                with torch.no_grad():
                    action = actor(torch.FloatTensor(obs)).cpu().numpy()  # NO noise
                obs, reward, done, truncated, _ = env.step(action)
                total_reward += reward
            rewards.append(total_reward)
        return np.mean(rewards)
    

πŸ“Š Expected Training Curves

Healthy DDPG on MountainCarContinuous-v0

Episodes 0-50:    Random exploration, reward β‰ˆ -100 to -80
Episodes 50-100:  Car starts learning, reward β‰ˆ -80 to -50
Episodes 100-200: Rapid improvement, reward β‰ˆ -50 to 0
Episodes 200-300: Converging to solution, reward β‰ˆ 0 to 90
Episodes 300+:    Solved, reward β‰ˆ 90-95 (consistently)
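
To see these phases clearly, smooth the raw episode rewards before plotting (a minimal matplotlib sketch; the window size is an assumption):

import matplotlib.pyplot as plt
import numpy as np

def plot_rewards(rewards, window=20):
    # Moving average smooths out episode-to-episode noise
    smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
    plt.plot(rewards, alpha=0.3, label="raw")
    plt.plot(np.arange(window - 1, len(rewards)), smoothed, label=f"{window}-episode average")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.legend()
    plt.show()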

Red flags:

  • No improvement by episode 100 β†’ Check learning rates, action scaling
  • Sudden drop after improving β†’ Target network instability
  • Constant oscillation β†’ Reduce tau, average test rewards

πŸ§ͺ Verification Checklist

Before asking for help, verify these:

Environment Setup

  • Using import gymnasium as gym (not gym)
  • obs, info = env.reset() returns tuple
  • obs, reward, done, truncated, info = env.step(action) (5 returns)
  • Actions are numpy arrays, correct shape and range

Network Architecture

  • Actor outputs actions in correct range (use tanh)
  • Critic takes (state, action) β†’ Q-value
  • Target networks exist and are initialized correctly
  • Networks moved to correct device (CPU/GPU)

Training Loop

  • Replay buffer has > batch_size samples before training
  • Target networks updated AFTER online networks
  • Polyak update uses tau β‰ˆ 0.001
  • No gradient flow through target networks (with torch.no_grad())
  • Actions have exploration noise during training
  • No noise during evaluation

Logging

  • Losses are logged and reasonable (< 100)
  • Q-values are logged and bounded
  • Rewards are plotted and improving
  • Random seed is set for reproducibility
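
For the last item, a minimal seeding helper (a sketch; set_seed is a name introduced here):

import random
import numpy as np
import torch

def set_seed(seed, env=None):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if env is not None:
        env.reset(seed=seed)           # seeds the environment's RNG
        env.action_space.seed(seed)    # seeds action_space.sample()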

πŸ”§ Quick Debugging Code

Add this to your training loop for comprehensive diagnostics:

# After every 10 episodes:
if episode % 10 == 0:
    print(f"\n{'='*60}")
    print(f"Episode {episode} Diagnostics")
    print(f"{'='*60}")

    # Sample a batch
    batch = buffer.sample(64)
    states = torch.FloatTensor(np.array([s for s, _, _, _, _ in batch]))
    actions = torch.FloatTensor(np.array([a for _, a, _, _, _ in batch]))

    # Check Q-values
    with torch.no_grad():
        q_vals = critic(states, actions)
        print(f"Q-value range: [{q_vals.min():.2f}, {q_vals.max():.2f}]")
        print(f"Q-value mean: {q_vals.mean():.2f}")

    # Check actor outputs
    with torch.no_grad():
        pred_actions = actor(states)
        print(f"Action range: [{pred_actions.min():.2f}, {pred_actions.max():.2f}]")

    # Check recent rewards
    print(f"Recent avg reward: {np.mean(recent_rewards[-10:]):.2f}")
    print(f"Buffer size: {len(buffer)}")
    print()

πŸ“š Additional Resources

DDPG Paper: Lillicrap et al., “Continuous control with deep reinforcement learning” (arXiv:1509.02971)



πŸ’‘ Pro Tips

  1. Start simple: Get vanilla DDPG working before adding improvements
  2. Use TensorBoard: Log everything (losses, Q-values, rewards, gradients); see the sketch after this list
  3. Test components: Verify replay buffer, networks, target updates separately
  4. Compare to baseline: Use my demo code as reference
  5. Patience: DDPG can be unstable; if one run fails, try different seed
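
For tip 2, a minimal TensorBoard setup with torch.utils.tensorboard (the log directory and tag names are illustrative):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/ddpg_mountaincar")

# Inside the training loop:
writer.add_scalar("loss/actor", actor_loss.item(), global_step)
writer.add_scalar("loss/critic", critic_loss.item(), global_step)
writer.add_scalar("reward/episode", episode_reward, episode)

writer.close()  # when training finishes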

When stuck: Come to office hours with:

  • Your code
  • Training curve plot
  • Logged losses and Q-values
  • Description of what you’ve tried

Good luck! πŸš€