The Policy Gradient: Intuition and Derivation


1. Setting the Stage

In policy gradient methods, we parameterize a policy $\pi_\theta(a \mid s)$ and want to maximize the expected return:

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory and $R(\tau)$ is its total return. We want $\nabla_\theta J(\theta)$ in order to perform gradient ascent.
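To make this concrete, here is a minimal sketch of estimating $J(\theta)$ from rollouts. The `policy` callable and the Gym-style `env.reset()`/`env.step()` interface are stand-ins for whatever you actually have, not a specific library's API:

```python
import numpy as np

def estimate_J(policy, env, num_episodes=500):
    """Monte Carlo estimate of J(theta) = E[R(tau)]: the average return
    over full rollouts of the current policy."""
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        done, total = False, 0.0
        while not done:
            # Assumed toy interface: step returns (next_state, reward, done).
            state, reward, done = env.step(policy(state))
            total += reward
        returns.append(total)
    return np.mean(returns)
```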


2. The Core Difficulty

Writing the expectation explicitly (as a sum over trajectories; for continuous state or action spaces, the sum becomes an integral):

\[J(\theta) = \sum_\tau P(\tau; \theta)\, R(\tau)\]

Taking the gradient:

\[\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau; \theta)\, R(\tau)\]

Here is the problem: this is no longer an expectation. It is a sum weighted by $\nabla_\theta P$, not by $P$. We cannot sample from $\nabla_\theta P$ — it is not a probability distribution: its terms can be negative, and they sum to zero, since $\sum_\tau P(\tau; \theta) = 1$ for every $\theta$. So we cannot estimate this sum with Monte Carlo rollouts of our policy.
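A tiny numerical check makes this vivid. Take a toy "trajectory distribution": three outcomes with probabilities given by a softmax over $\theta$ (the softmax Jacobian is the standard $\partial P_i / \partial \theta_j = P_i(\delta_{ij} - P_j)$):

```python
import numpy as np

theta = np.array([0.5, -0.2, 0.1])
P = np.exp(theta) / np.exp(theta).sum()

# Jacobian of P w.r.t. theta: row i is the gradient of P_i.
grad_P = np.diag(P) - np.outer(P, P)

print(grad_P.sum(axis=0))  # ~[0, 0, 0]: since sum_i P_i = 1, the gradients sum to zero
print((grad_P < 0).any())  # True: some entries are negative
```

Negative entries that sum to zero cannot be the weights of any probability distribution, so there is nothing here to sample from.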


3. The Log-Derivative Trick

Here is the elegant move. Recall from calculus:

\[\nabla_\theta \log P(\tau; \theta) = \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}\]

Rearranging:

\[\nabla_\theta P(\tau; \theta) = P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\]

Substituting back:

\[\nabla_\theta J(\theta) = \sum_\tau P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log P(\tau; \theta)\, R(\tau)\right]\]

This is the whole point of the log. It converts the gradient back into an expectation under $\pi_\theta$, which means we can estimate it by simply rolling out our policy and averaging.
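Here is a sketch that verifies the trick on a toy problem, assuming "trajectories" are just three discrete outcomes drawn from a softmax distribution with known returns, so the exact gradient can be computed by enumeration and compared to the sampled estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.5, -0.2, 0.1])
R = np.array([1.0, 3.0, -2.0])               # return of each "trajectory"
P = np.exp(theta) / np.exp(theta).sum()

# Exact gradient by enumeration: sum_tau grad P(tau) * R(tau).
jac = np.diag(P) - np.outer(P, P)            # jac[i] = grad_theta P_i
exact = jac.T @ R

# Monte Carlo estimate via the log-derivative trick. For a softmax,
# grad_theta log P_i = e_i - P (one-hot minus probabilities).
samples = rng.choice(3, size=100_000, p=P)
score = np.eye(3)[samples] - P               # grad log P for each sample
estimate = (score * R[samples, None]).mean(axis=0)

print(exact)     # the two agree up to Monte Carlo error
print(estimate)
```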


4. Decomposing the Trajectory Probability

A trajectory’s probability factors into the initial-state distribution $\rho(s_0)$, the policy, and the transition dynamics:

\[P(\tau; \theta) = \rho(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\]

Taking the log turns the product into a sum:

\[\log P(\tau; \theta) = \log \rho(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T} \log p(s_{t+1} \mid s_t, a_t)\]

When we take $\nabla_\theta$, the initial-state distribution $\rho(s_0)$ and the transition dynamics $p(s_{t+1} \mid s_t, a_t)$ vanish because they do not depend on $\theta$. We are left with:

\[\nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\]

This is the second crucial benefit of the log: it decouples the policy from the (unknown) environment dynamics. Plugging in:

\[\boxed{\;\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]\;}\]

This is the policy gradient theorem in its REINFORCE form.
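As a concrete sketch, here is what the estimator might look like for a tabular softmax policy. Rollout collection is omitted; `episodes` is assumed to hold, for each trajectory, its visited states, the actions taken, and the total return $R(\tau)$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_gradient(theta, episodes):
    """Estimate grad J(theta) for a tabular softmax policy.

    theta:    (num_states, num_actions) array of logits.
    episodes: list of (states, actions, total_return) tuples
              collected by rolling out pi_theta.
    """
    grad = np.zeros_like(theta)
    for states, actions, R in episodes:
        for s, a in zip(states, actions):
            pi = softmax(theta[s])
            score = -pi
            score[a] += 1.0          # grad_theta[s] log pi(a|s) = e_a - pi
            grad[s] += score * R     # weight each step by the trajectory return
    return grad / len(episodes)
```

Gradient ascent is then simply `theta += learning_rate * reinforce_gradient(theta, episodes)`.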


5. Intuition: What Is This Actually Doing?

Think of $\nabla_\theta \log \pi_\theta(a \mid s)$ as a “direction in parameter space that increases the probability of taking action $a$ in state $s$.” The theorem says:

Take an action, observe its return, and nudge the parameters in a direction that makes that action more likely — with the size of the nudge proportional to how good the return was.

Good actions get reinforced; bad ones get suppressed. Beautiful.


6. Can We Remove the Log?

Short answer: no — not without losing the entire benefit. Here are the three reasons.

Reason 1: Without the log, you cannot estimate the gradient by sampling

If you tried to use $\nabla_\theta \pi_\theta(a \mid s)$ directly, you would have:

\[\nabla_\theta J = \sum_\tau \nabla_\theta P(\tau; \theta)\, R(\tau)\]

This is not an expectation under any distribution you can sample from. You would need to enumerate all trajectories — completely infeasible for any real problem with continuous states/actions or long horizons.

Reason 2: Without the log, environment dynamics do not cancel

Recall that the log turned a product into a sum, which let the unknown $p(s_{t+1} \mid s_t, a_t)$ terms drop out under $\nabla_\theta$. Without the log, you would be stuck with $\nabla_\theta$ of a product that includes transition probabilities you do not know and cannot differentiate. The log decouples policy from dynamics, which is what makes the algorithm model-free.

Reason 3: The log gives a numerically stable, scale-invariant update

Note that $\nabla_\theta \log \pi_\theta = \dfrac{\nabla_\theta \pi_\theta}{\pi_\theta}$. The division by $\pi_\theta$ has a meaningful effect: rare actions (small $\pi_\theta$) that pay off get a large gradient signal, while common actions get a smaller per-occurrence signal (but contribute more often in the expectation). It properly weights how much you should update based on how surprising the action was. Without it, frequent actions would dominate the gradient regardless of their importance. The stability point is practical as well: implementations compute $\log \pi_\theta$ directly (e.g., via a log-softmax), which avoids the underflow you would hit by computing tiny probabilities first and dividing afterward.
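A quick numerical illustration of the weighting, again with a toy softmax policy:

```python
import numpy as np

theta = np.array([2.0, 0.0, -2.0])
pi = np.exp(theta) / np.exp(theta).sum()     # roughly [0.87, 0.12, 0.02]

# Per-sample score vector grad log pi = e_a - pi for each action a.
for a in range(3):
    score = -pi.copy()
    score[a] += 1.0
    print(f"action {a}: pi = {pi[a]:.3f}, ||grad log pi|| = {np.linalg.norm(score):.3f}")

# The rarest action produces by far the largest per-sample update,
# compensating for how seldom it shows up in the expectation.
```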


7. What If You Insist on Using $\nabla_\theta \pi_\theta$ Directly?

You actually can write a valid gradient using $\nabla_\theta \pi$ — it is just the original form:

\[\nabla_\theta J = \sum_\tau R(\tau)\, \nabla_\theta P(\tau; \theta)\]

But to estimate it by sampling, you would need importance sampling against some sampling distribution $q$, with $q(\tau) > 0$ wherever the integrand is nonzero:

\[= \mathbb{E}_{\tau \sim q}\!\left[\frac{\nabla_\theta P(\tau; \theta)}{q(\tau)}\, R(\tau)\right]\]

Now you have introduced a separate distribution $q$, and you still need to compute $\nabla_\theta P$, which involves the unknown dynamics. You have made the problem strictly harder.
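For completeness, here is the importance-sampled estimator on the same toy softmax "trajectory distribution" used above. It does produce the right answer, but only because the toy $P$ is fully known in closed form; in a real MDP the `grad_P` line would require the transition dynamics:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.array([0.5, -0.2, 0.1])
R = np.array([1.0, 3.0, -2.0])
P = np.exp(theta) / np.exp(theta).sum()
grad_P = np.diag(P) - np.outer(P, P)     # requires a full model of P!

q = np.full(3, 1/3)                      # any sampling distribution with q > 0
samples = rng.choice(3, size=100_000, p=q)
weights = grad_P[samples] / q[samples, None]
estimate = (weights * R[samples, None]).mean(axis=0)

print(estimate)   # matches grad_P.T @ R up to Monte Carlo error
```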


8. Summary

The log is not a cosmetic choice — it does three jobs simultaneously:

  1. Sampling — converts the gradient into an expectation we can estimate from rollouts.
  2. Model-free property — eliminates the environment dynamics from the gradient.
  3. Proper weighting — provides the right scale-invariant weighting for actions of different probabilities.

Removing it would force you to either enumerate all trajectories, model the environment, or introduce importance sampling — all of which defeat the purpose of having a clean, model-free, sample-based algorithm.