# So, you don't have a gradient...

Many problems we would like to solve with neural networks involve discrete spaces, either in the action space or in the latent space. However, neural networks require a computation graph of differentiable components, and sampling from a discrete distribution breaks that differentiability. Policy gradient methods give us one way around this.

Let's set up a toy problem we can use to explore this. We want the gradient, with respect to our agent's parameters $\theta$, of the expected value of some black-box function $f$:

$\def\E{\operatorname*{\mathbb{E}}} \textcolor{#66BB6A}{\frac{\partial}{\partial \theta}} \E_{\textcolor{#1976D2}{p_{\theta}(z)}} \Big[ \textcolor{#FFA726}{f(z)} \Big]$

An idea you might have seen before is REINFORCE. REINFORCE is a common way of estimating black-box gradients, but it has high variance.

\def\E{\operatorname*{\mathbb{E}}} \begin{aligned} \frac{\partial}{\partial \theta} \E_{p_{\theta}(z)} \Big[ f(z) \Big] &= \frac{\partial}{\partial \theta} \sum_{z} f(z) p_{\theta}(z) \\ &= \sum_{z} f(z) \frac{\partial}{\partial \theta} p_{\theta}(z) \\ &= \sum_{z} f(z) \frac{\partial}{\partial \theta} p_{\theta}(z) \frac{1}{p_{\theta}(z)} p_{\theta}(z) \\ \end{aligned}

We can rewrite $\frac{1}{p_{\theta}(z)} \frac{\partial}{\partial \theta} p_{\theta}(z)$ using the log derivative trick:

$\frac{\partial}{\partial \theta} \log p_{\theta}(z) = \frac{1}{p_{\theta}(z)} \frac{\partial}{\partial \theta} p_{\theta}(z)$

Sidenote: Why is this called the log derivative trick? Applying the chain rule to $\log p_{\theta}(z)$ gives exactly this ratio, so the identity is just the derivative of the logarithm in disguise.

\def\E{\operatorname*{\mathbb{E}}} \begin{aligned} &= \sum_{z} f(z) \frac{\partial}{\partial \theta} \log p_{\theta}(z) \, p_{\theta}(z) \\ &= \E_{p_{\theta}(z)} \Big[ f(z) \frac{\partial}{\partial \theta} \log p_{\theta}(z) \Big] \\ \end{aligned}
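As a concrete check, here is a minimal NumPy sketch of the REINFORCE estimator for a small categorical distribution (the logits, the reward function `f`, and the sample count are all toy choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_prob(logits, z):
    # For a categorical parameterized by logits theta:
    # d/dtheta log p_theta(z) = one_hot(z) - softmax(theta)
    g = -softmax(logits)
    g[z] += 1.0
    return g

def reinforce_grad(logits, f, n_samples=200_000):
    # Monte Carlo estimate of d/dtheta E_{p_theta(z)}[f(z)]
    p = softmax(logits)
    zs = rng.choice(len(p), size=n_samples, p=p)
    grads = [f(z) * grad_log_prob(logits, z) for z in zs]
    return np.mean(grads, axis=0)

theta = np.array([0.2, -0.5, 1.0])
f = lambda z: float(z == 2)  # toy reward: 1 for picking arm 2

estimate = reinforce_grad(theta, f)

# Exact gradient for comparison: d/dtheta sum_z f(z) p_theta(z)
p = softmax(theta)
f_vals = np.array([f(z) for z in range(3)])
exact = (np.diag(p) - np.outer(p, p)) @ f_vals
```

Because the discrete space here is tiny, we can compute the exact gradient by summing over it and confirm the Monte Carlo estimate converges to it.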

REINFORCE's estimator has high variance. One way to see this intuitively is to realize that each sample's contribution is scaled by $f(z)$. Since we know nothing about $f(z)$ (it may itself vary wildly across samples), we want to build estimators that are less dependent on $f(z)$.

# Enter the control variate...

One way we can reduce variance is by subtracting a control variate: a baseline $b$ that shifts $f(z)$ toward zero without biasing the estimator, giving $\E_{p_{\theta}(z)} \big[ (f(z) - b) \frac{\partial}{\partial \theta} \log p_{\theta}(z) \big]$. Intuitively, we can get away with this because the expected score is zero.

### Why can we subtract a control variate?

\def\E{\operatorname*{\mathbb{E}}} \begin{aligned} \E_{p_{\theta}(z)} \Big[ \frac{\partial}{\partial \theta} \log p_{\theta}(z) \Big] &= \sum_{z} \frac{ \frac{\partial}{\partial \theta}p_{\theta}(z)} {p_{\theta}(z)}p_{\theta}(z) \\ &= \sum_{z} \frac{\partial}{\partial \theta}p_{\theta}(z) \\ &= \frac{\partial}{\partial \theta} \sum_{z} p_{\theta}(z) \\ &= \frac{\partial}{\partial \theta} 1 \\ &= 0. \\ \end{aligned}
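A quick numerical illustration of this: in the toy categorical sketch below (the rewards and the baseline choice $b = \E[f]$ are assumptions for demonstration), subtracting a constant baseline leaves the gradient estimate unbiased while shrinking its per-sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = np.array([0.3, -0.2, 0.8])
p = softmax(theta)
f_vals = np.array([10.0, 11.0, 12.0])  # rewards with a large common offset

n = 100_000
zs = rng.choice(3, size=n, p=p)
# Score d/dtheta log p_theta(z) for each sample: one_hot(z) - p
scores = np.eye(3)[zs] - p

b = f_vals @ p  # baseline: the mean reward under p_theta
plain = f_vals[zs, None] * scores
with_baseline = (f_vals[zs] - b)[:, None] * scores

# Exact gradient for comparison
exact = (np.diag(p) - np.outer(p, p)) @ f_vals
```

Both estimators average to the exact gradient, but because the rewards share a large offset, the baseline-subtracted version does so with far less variance.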

## Gumbel-Softmax / Concrete Distribution

Another way people have thought about sampling from discrete distributions is by smoothing the discrete samples with Gumbel noise, via the Gumbel-Softmax[^1], also known as the Concrete distribution.

\def\E{\operatorname*{\mathbb{E}}} \begin{aligned} \E_{p_{\theta}(z)} \Big[ f(z) \Big] &\approx \E_{g \sim \text{Gumbel}(0,1)} \Big[ f\big( \operatorname{softmax}\big( (\log \theta + g) / \tau \big) \big) \Big] \\ \end{aligned}
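A minimal sketch of Gumbel-Softmax sampling in NumPy (the class probabilities and temperature are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(log_probs, tau, rng):
    # Gumbel(0, 1) noise via inverse transform sampling
    u = rng.uniform(size=log_probs.shape)
    g = -np.log(-np.log(u))
    # Relaxed one-hot sample: softmax of perturbed log-probabilities
    x = (log_probs + g) / tau
    y = np.exp(x - x.max())  # subtract max for numerical stability
    return y / y.sum()

probs = np.array([0.2, 0.3, 0.5])
log_probs = np.log(probs)

# At low temperature each sample is nearly one-hot, and its argmax
# follows the original categorical distribution (the Gumbel-max trick).
samples = np.array([gumbel_softmax_sample(log_probs, tau=0.1, rng=rng)
                    for _ in range(20_000)])
freqs = np.bincount(samples.argmax(axis=1), minlength=3) / len(samples)
```

Each sample lives on the simplex (it sums to 1), so it can flow through a differentiable network, while the argmax frequencies recover the original categorical probabilities.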

In order to compute the gradient through such a relaxed sample, we use the reparameterization trick: express the sample as a deterministic, differentiable function of the parameters and some parameter-free noise. The classic example is the Gaussian:

\begin{aligned} z \sim \mathcal{N}(\mu,\,\sigma^{2}) \quad \Longleftrightarrow \quad z = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0,\,1) \end{aligned}
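The same idea in code: a sketch estimating $\frac{\partial}{\partial \mu} \E_{z \sim \mathcal{N}(\mu, \sigma^{2})}[z^2]$, whose true value is $2\mu$, by differentiating through the reparameterization $z = \mu + \sigma \epsilon$ (the objective $z^2$ is a toy choice):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.7
eps = rng.standard_normal(100_000)

# Reparameterize: z = mu + sigma * eps, so z is a deterministic,
# differentiable function of mu given the sampled noise.
z = mu + sigma * eps

# Pathwise gradient: d/dmu z^2 = 2 z * (dz/dmu) = 2 z,
# averaged over the noise samples
grad_mu = (2.0 * z).mean()
```

Unlike REINFORCE, this pathwise estimator uses the derivative of the objective itself, which is what typically gives it much lower variance.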