DDPG Debugging Guide

When implementing Deep Deterministic Policy Gradient (DDPG), you’ll likely encounter issues. This guide helps you diagnose and fix them.


πŸ” Debugging Workflow

Follow this systematic approach:

  1. Check Environment Setup β†’ Ensure the gymnasium API is used correctly (smoke test after this list)
  2. Verify Network Architecture β†’ Print shapes, ensure forward pass works
  3. Inspect Training Loop β†’ Add logging for losses, rewards, Q-values
  4. Analyze Learning Curves β†’ Plot rewards, identify patterns
  5. Test Component Isolation β†’ Disable parts to find the culprit
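
For step 1, a quick smoke test of the gymnasium API (a minimal sketch; the environment name matches the examples used below):

import gymnasium as gym

env = gym.make("MountainCarContinuous-v0")
obs, info = env.reset(seed=0)                           # reset returns (obs, info)
action = env.action_space.sample()                      # numpy array in the valid range
obs, reward, done, truncated, info = env.step(action)   # step returns 5 values
print(obs.shape, env.action_space.low, env.action_space.high)
env.close()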

πŸ› Common Issues and Solutions

Issue 1: Reward Not Improving

Symptom: Reward stays constant (around -100) or barely improves over 200+ episodes

Possible Cause A: Learning Rates Too High

Check:

# Add to training loop:
print(f"Actor loss: {actor_loss.item():.4f}, Critic loss: {critic_loss.item():.4f}")

Expected: Losses should be < 100 and gradually decreasing.
If losses > 1000 or NaN: Learning rates are too high.

Solution:

# Typical good values:
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-4)  # Slower for actor
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)  # Faster for critic

Why different LRs? Actor learns policy (must be stable), critic learns values (can update faster)


Possible Cause B: Incorrect Action Scaling

Check:

# Before env.step(), add:
print(f"Action range: [{action.min():.2f}, {action.max():.2f}]")

Expected: For MountainCarContinuous, actions in [-1.0, 1.0].
If outside this range: Actions are not scaled correctly.

Solution:

# In your actor network:
def forward(self, state):
    x = self.network(state)
    action = self.action_limit * torch.tanh(x)  # Squash to [-act_lim, act_lim]
    return action

# When creating actor:
actor = Actor(state_dim, action_dim, action_limit=env.action_space.high[0])
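
If you want a self-contained reference, a minimal Actor could look like this (a sketch; the two 256-unit hidden layers are an illustrative choice, not a requirement):

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, action_limit):
        super().__init__()
        # Hidden sizes are illustrative; tune them for your task
        self.network = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.action_limit = action_limit

    def forward(self, state):
        # tanh squashes to [-1, 1]; scaling maps to [-action_limit, action_limit]
        return self.action_limit * torch.tanh(self.network(state))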

Possible Cause C: Target Network Not Updating

Check:

# After polyak_update, probe the target critic on a fixed (state, action) batch:
if episode % 50 == 0:
    print(f"Target Q mean: {target_critic(sample_state, sample_action).mean():.2f}")

Expected: Should slowly change over episodes.
If constant: Target network isn’t being updated.

Solution:

# In training loop (AFTER gradient step):
from shared_utils.drl_utils import polyak_update

polyak_update(actor, target_actor, tau=0.001)
polyak_update(critic, target_critic, tau=0.001)

Common mistake: Forgetting to update target networks, or updating before online networks
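
If you don’t have shared_utils available, the standard polyak (soft) update is short enough to write yourself (a sketch consistent with tau=0.001 above):

import torch

def polyak_update(online_net, target_net, tau):
    # target <- (1 - tau) * target + tau * online, in place
    with torch.no_grad():
        for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
            p_targ.mul_(1 - tau)
            p_targ.add_(tau * p)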


Possible Cause D: Insufficient Exploration Noise

Check:

# During training, action should have noise:
action = actor(state)
noisy_action = action + exploration_noise
print(f"Noise scale: {exploration_noise.abs().mean():.3f}")

Expected: Noise scale should be 0.05 - 0.2 early in training.
If too small (< 0.01): Not exploring enough.

Solution:

# Add Gaussian noise during training:
action = actor(state).detach().cpu().numpy()
noise = np.random.normal(0, 0.1, size=action.shape)  # Mean=0, Std=0.1
action = np.clip(action + noise, -action_limit, action_limit)

Optionally, decay the noise over time:

noise_scale = max(0.01, 0.2 * (1 - episode / n_episodes))
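
Putting this fix together, a small helper for training-time action selection (a sketch; select_action is a name introduced here):

import numpy as np
import torch

def select_action(actor, obs, noise_scale, action_limit):
    # Deterministic policy output plus Gaussian exploration noise
    with torch.no_grad():
        action = actor(torch.FloatTensor(obs)).cpu().numpy()
    action += np.random.normal(0, noise_scale, size=action.shape)
    return np.clip(action, -action_limit, action_limit)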

Issue 2: Training Diverges (NaN losses)

Symptom: After some episodes (50-150), losses become NaN and training crashes

Cause A: Gradient Explosion

Check:

# After loss.backward(), before optimizer.step():
total_norm = 0
for p in actor.parameters():
    param_norm = p.grad.data.norm(2)
    total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f'Gradient norm: {total_norm:.2f}')

Expected: Should be < 10.
If > 100: Gradients are exploding.

Solution:

# Add gradient clipping:
actor_loss.backward()
torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
actor_optimizer.step()

critic_loss.backward()
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
critic_optimizer.step()
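
Handy side effect: clip_grad_norm_ returns the total gradient norm before clipping, so it can replace the manual norm loop above:

grad_norm = torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
print(f"Actor grad norm (pre-clip): {grad_norm.item():.2f}")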

Cause B: Q-Values Exploding

Check:

# During training:
with torch.no_grad():
    q_val = critic(state_batch, action_batch)
    print(f"Q-value range: [{q_val.min():.1f}, {q_val.max():.1f}]")

Expected: For MountainCar, Q-values should be in [-100, 100].
If > 1000: Q-network is unstable.

Solutions:

  1. Reduce learning rate: lr=1e-4 instead of 1e-3
  2. Use reward scaling:
    reward = reward / 10.0  # Scale rewards to smaller range
    
  3. Ensure target networks are used:
    # Target Q-value (mask bootstrapping at terminal states with (1 - done)):
    with torch.no_grad():
        target_q = reward + gamma * (1 - done) * target_critic(next_state, target_actor(next_state))
    

Cause C: Incorrect Tensor Operations

Check:

# Common mistakes:
# ❌ Mixing numpy and torch without conversion
action = actor(state)  # torch tensor
env.step(action)  # ERROR if env expects numpy

# βœ… Correct:
action = actor(state).detach().cpu().numpy()
env.step(action)

# ❌ Broadcasting issues
reward = torch.FloatTensor(reward)  # shape: (batch_size,)
q_value = critic(state, action)  # shape: (batch_size, 1)
loss = (q_value - reward).pow(2).mean()  # BUG: silently broadcasts to (batch_size, batch_size)

# βœ… Correct:
reward = torch.FloatTensor(reward).unsqueeze(1)  # shape: (batch_size, 1)
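
Putting the conversions together, a shape-correct critic update might look like this (a sketch; it assumes states, actions, rewards, next_states, dones are numpy arrays already sampled from your replay buffer):

import numpy as np
import torch
import torch.nn.functional as F

states      = torch.FloatTensor(np.array(states))
actions     = torch.FloatTensor(np.array(actions))
rewards     = torch.FloatTensor(np.array(rewards)).unsqueeze(1)   # (batch_size, 1)
next_states = torch.FloatTensor(np.array(next_states))
dones       = torch.FloatTensor(np.array(dones)).unsqueeze(1)     # (batch_size, 1)

with torch.no_grad():  # no gradient flow through target networks
    target_q = rewards + gamma * (1 - dones) * target_critic(next_states, target_actor(next_states))

q = critic(states, actions)            # (batch_size, 1)
critic_loss = F.mse_loss(q, target_q)  # shapes match: no silent broadcasting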

Issue 3: Slow Convergence

Symptom: Training works but takes 500+ episodes to solve MountainCar (should be ~200-300)

Optimization Checklist:

Factor               | Recommended value | If misconfigured
---------------------|-------------------|----------------------------------
Batch size           | 64 - 128          | Small batch β†’ high variance
Replay buffer        | β‰₯ 100,000         | Small buffer β†’ lack of diversity
Start training after | β‰₯ 1,000 samples   | Training too early β†’ unstable
Actor LR             | 1e-4              | Too low β†’ slow learning
Critic LR            | 1e-3              | Too low β†’ poor value estimates
Gamma                | 0.99              | Too low β†’ ignores future rewards
Tau                  | 0.001             | Too high β†’ unstable targets
Exploration noise    | 0.1 initially     | Too low β†’ not exploring

Quick fix: Try these recommended hyperparameters first:

config = {
    'batch_size': 64,
    'buffer_capacity': 100000,
    'actor_lr': 1e-4,
    'critic_lr': 1e-3,
    'gamma': 0.99,
    'tau': 0.001,
    'noise_scale': 0.1,
}

Issue 4: Unstable Learning Curves

Symptom: Reward oscillates wildly (e.g., 80 β†’ -50 β†’ 90 β†’ -30)

Causes and Fixes:

  1. Target networks updating too fast
    tau = 0.001  # Try smaller value like 0.0001
    
  2. No reward averaging in evaluation
    # Instead of: test_reward = run_episode(env, actor)
    # Use average over multiple episodes:
    test_rewards = [run_episode(env, actor) for _ in range(10)]
    avg_reward = np.mean(test_rewards)
    
  3. Exploration during testing
    # Ensure no noise during evaluation:
    def evaluate_policy(env, actor, n_episodes=10):
        rewards = []
        for _ in range(n_episodes):
            obs, _ = env.reset()
            total_reward = 0
            done, truncated = False, False
    
            while not (done or truncated):
                with torch.no_grad():
                    action = actor(torch.FloatTensor(obs)).cpu().numpy()  # NO noise
                obs, reward, done, truncated, _ = env.step(action)
                total_reward += reward
            rewards.append(total_reward)
        return np.mean(rewards)
    

πŸ“Š Expected Training Curves

Healthy DDPG on MountainCarContinuous-v0

Episodes 0-50:    Random exploration, reward β‰ˆ -100 to -80
Episodes 50-100:  Car starts learning, reward β‰ˆ -80 to -50
Episodes 100-200: Rapid improvement, reward β‰ˆ -50 to 0
Episodes 200-300: Converging to solution, reward β‰ˆ 0 to 90
Episodes 300+:    Solved, reward β‰ˆ 90-95 (consistently)
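
To see these phases clearly, smooth the raw episode rewards before plotting (a minimal matplotlib sketch; the window size is an assumption):

import matplotlib.pyplot as plt
import numpy as np

def plot_rewards(rewards, window=20):
    # Moving average smooths out episode-to-episode noise
    smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
    plt.plot(rewards, alpha=0.3, label="raw")
    plt.plot(np.arange(window - 1, len(rewards)), smoothed, label=f"{window}-episode average")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.legend()
    plt.show()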

Red flags:

  • No improvement by episode 100 β†’ Check learning rates, action scaling
  • Sudden drop after improving β†’ Target network instability
  • Constant oscillation β†’ Reduce tau, average test rewards

πŸ§ͺ Verification Checklist

Before asking for help, verify these:

Environment Setup

  • Using import gymnasium as gym (not gym)
  • obs, info = env.reset() returns tuple
  • obs, reward, done, truncated, info = env.step(action) (5 returns)
  • Actions are numpy arrays, correct shape and range

Network Architecture

  • Actor outputs actions in correct range (use tanh)
  • Critic takes (state, action) β†’ Q-value
  • Target networks exist and are initialized correctly
  • Networks moved to correct device (CPU/GPU)

Training Loop

  • Replay buffer has > batch_size samples before training
  • Target networks updated AFTER online networks
  • Polyak update uses tau β‰ˆ 0.001
  • No gradient flow through target networks (with torch.no_grad())
  • Actions have exploration noise during training
  • No noise during evaluation

Logging

  • Losses are logged and reasonable (< 100)
  • Q-values are logged and bounded
  • Rewards are plotted and improving
  • Random seed is set for reproducibility
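
For the last item, a minimal seeding helper (a sketch; set_seed is a name introduced here):

import random
import numpy as np
import torch

def set_seed(seed, env=None):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if env is not None:
        env.reset(seed=seed)           # seeds the environment's RNG
        env.action_space.seed(seed)    # seeds action_space.sample()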

πŸ”§ Quick Debugging Code

Add this to your training loop for comprehensive diagnostics:

# After every 10 episodes:
if episode % 10 == 0:
    print(f"\n{'='*60}")
    print(f"Episode {episode} Diagnostics")
    print(f"{'='*60}")

    # Sample a batch
    batch = buffer.sample(64)
    states = torch.FloatTensor(np.array([s for s, _, _, _, _ in batch]))
    actions = torch.FloatTensor(np.array([a for _, a, _, _, _ in batch]))

    # Check Q-values
    with torch.no_grad():
        q_vals = critic(states, actions)
        print(f"Q-value range: [{q_vals.min():.2f}, {q_vals.max():.2f}]")
        print(f"Q-value mean: {q_vals.mean():.2f}")

    # Check actor outputs
    with torch.no_grad():
        pred_actions = actor(states)
        print(f"Action range: [{pred_actions.min():.2f}, {pred_actions.max():.2f}]")

    # Check recent rewards
    print(f"Recent avg reward: {np.mean(recent_rewards[-10:]):.2f}")
    print(f"Buffer size: {len(buffer)}")
    print()

πŸ“š Additional Resources

DDPG Paper: Lillicrap et al., “Continuous control with deep reinforcement learning” (arXiv:1509.02971)



πŸ’‘ Pro Tips

  1. Start simple: Get vanilla DDPG working before adding improvements
  2. Use TensorBoard: Log everything (losses, Q-values, rewards, gradients); see the sketch after this list
  3. Test components: Verify replay buffer, networks, target updates separately
  4. Compare to baseline: Use my demo code as reference
  5. Patience: DDPG can be unstable; if one run fails, try different seed
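
For tip 2, a minimal TensorBoard setup with torch.utils.tensorboard (the log directory and tag names are illustrative):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/ddpg_mountaincar")

# Inside the training loop:
writer.add_scalar("loss/actor", actor_loss.item(), global_step)
writer.add_scalar("loss/critic", critic_loss.item(), global_step)
writer.add_scalar("reward/episode", episode_reward, episode)

writer.close()  # when training finishes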

When stuck: Come to office hours with:

  • Your code
  • Training curve plot
  • Logged losses and Q-values
  • Description of what you’ve tried

Good luck! πŸš€