# So, you don't have a gradient...

Many problems we would like to solve with neural networks involve discrete
spaces, either in the action space or the latent space. However, backpropagation
requires a computation graph of differentiable components, and sampling from a
discrete distribution is not differentiable. Policy gradient methods give us one
way around this.

Let's set up a toy problem we can use to explore this. We want the gradient,
with respect to our agent's parameters $\theta$, of the expected value of some
black-box function $f$ of a sample $z \sim p_{\theta}(z)$:

$\def\E{\operatorname*{\mathbb{E}}}
\textcolor{#66BB6A}{\frac{\partial}{\partial \theta}} \E_{\textcolor{#1976D2}{p_{\theta}(z)}}
\Big[ \textcolor{#FFA726}{f(z)} \Big]$
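As a concrete running example (all names here are my own, not from any particular library), take $z$ drawn from a categorical distribution parameterized by logits $\theta$, and a fixed black-box $f$ we can evaluate but not differentiate through:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)  # logits over 3 discrete choices

def p(theta):
    """Softmax: categorical probabilities p_theta(z)."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def f(z):
    """Black-box reward; evaluable but not differentiable."""
    return float(z == 2)  # reward 1 only for the third choice

# Monte Carlo estimate of E_{p_theta(z)}[f(z)]
samples = rng.choice(3, size=10_000, p=p(theta))
est = np.mean([f(z) for z in samples])
print(est)  # close to 1/3 for uniform logits
```

The expectation itself is easy to estimate; the question the rest of this post chases is how to estimate its *gradient* with respect to $\theta$.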

An idea you might have seen before is **REINFORCE**. **REINFORCE** is a common
way of estimating black-box gradients, but it has high variance.

$\def\E{\operatorname*{\mathbb{E}}}
\begin{aligned}
\frac{\partial}{\partial \theta} \E_{p_{\theta}(z)} \Big[ f(z) \Big]
&= \frac{\partial}{\partial \theta} \sum_{z} f(z)\, p_{\theta}(z) \\
&= \sum_{z} f(z)\, \frac{\partial}{\partial \theta} p_{\theta}(z) \\
&= \sum_{z} f(z)\, \frac{\partial}{\partial \theta} p_{\theta}(z)\, \frac{1}{p_{\theta}(z)}\, p_{\theta}(z) \\
\end{aligned}$

We can rewrite $\frac{1}{p_{\theta}(z)} \frac{\partial}{\partial \theta} p_{\theta}(z)$
using the **log derivative trick**:

$\frac{\partial}{\partial \theta} \log p_{\theta}(z) =
\frac{1}{p_{\theta}(z)} \frac{\partial}{\partial \theta} p_{\theta}(z)$

*Sidenote: why is this called the log derivative trick? By the chain rule, differentiating $\log p_{\theta}(z)$ produces exactly the factor $\frac{1}{p_{\theta}(z)}$ times $\frac{\partial}{\partial \theta} p_{\theta}(z)$, so we are just recognizing that ratio as the derivative of a logarithm.*
$\def\E{\operatorname*{\mathbb{E}}}
\begin{aligned}
&= \sum_{z} f(z)\, \frac{\partial}{\partial \theta} \log p_{\theta}(z)\, p_{\theta}(z) \\
&= \E_{p_{\theta}(z)} \Big[ f(z)\, \frac{\partial}{\partial \theta} \log p_{\theta}(z) \Big] \\
\end{aligned}$

REINFORCE's estimator has high variance. One way to see this intuitively is
that each sample's contribution scales with $f(z)$. Since we know **nothing**
about $f(z)$ (it may itself be high variance), we want to build estimators that
are less dependent on $f(z)$.
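The finished derivation translates directly into code. Here is a minimal sketch of the REINFORCE estimator, continuing the categorical toy problem (all names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)  # logits over 3 discrete choices

def p(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def f(z):
    return float(z == 2)

def grad_log_p(theta, z):
    """d/dtheta log p_theta(z) for a softmax-parameterized categorical:
    indicator(z) - p_theta."""
    g = -p(theta)
    g[z] += 1.0
    return g

# REINFORCE: average f(z) * grad log p_theta(z) over samples
grads = [f(z) * grad_log_p(theta, z)
         for z in rng.choice(3, size=10_000, p=p(theta))]
grad_est = np.mean(grads, axis=0)
```

For uniform logits the true gradient is $\big({-\tfrac{1}{9}}, {-\tfrac{1}{9}}, \tfrac{2}{9}\big)$, which the estimate recovers in expectation; the point is how noisy the individual per-sample terms are.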

# Enter the control variate...

One way we can reduce variance is by subtracting a **control variate**: a
baseline term whose expectation we know, so removing it does not bias the
estimator while it soaks up some of the noise.

### Why can we subtract a control variate?

$\def\E{\operatorname*{\mathbb{E}}}
\begin{aligned}
\E_{p_{\theta}(z)} \Big[
\frac{\partial}{\partial \theta} \log p_{\theta}(z)
\Big]
&= \sum_{z} \frac{
\frac{\partial}{\partial \theta}p_{\theta}(z)}
{p_{\theta}(z)}\, p_{\theta}(z) \\
&= \sum_{z} \frac{\partial}{\partial \theta}p_{\theta}(z) \\
&= \frac{\partial}{\partial \theta} \sum_{z} p_{\theta}(z) \\
&= \frac{\partial}{\partial \theta} 1 \\
&= 0. \\
\end{aligned}$

Since the score function has zero expectation, subtracting a constant baseline
$b$ inside the estimator, $\E_{p_{\theta}(z)} \big[ (f(z) - b)\,
\frac{\partial}{\partial \theta} \log p_{\theta}(z) \big]$, leaves the
expected gradient unchanged.
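Continuing the toy sketch, we can check both claims empirically: a constant baseline (here $b$ = the average reward, a choice made purely for illustration) leaves the mean of the estimator alone while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)  # logits over 3 discrete choices

def p(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def f(z):
    return float(z == 2)

def grad_log_p(theta, z):
    g = -p(theta)
    g[z] += 1.0
    return g

def estimates(baseline, n=10_000):
    """Per-sample REINFORCE gradients with a constant control variate."""
    return np.array([
        (f(z) - baseline) * grad_log_p(theta, z)
        for z in rng.choice(3, size=n, p=p(theta))
    ])

plain = estimates(baseline=0.0)
cv = estimates(baseline=np.mean([f(z) for z in range(3)]))  # b = mean reward

# Same mean (unbiased), but smaller total variance with the baseline:
print(plain.mean(axis=0), cv.mean(axis=0))
print(plain.var(axis=0).sum(), cv.var(axis=0).sum())
```

The means agree up to Monte Carlo noise, while the summed per-component variance drops noticeably with the baseline in place.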

## Gumbel-Softmax / Concrete Distribution

Another way people have approached sampling from discrete distributions is by
relaxing the discrete samples into continuous ones, via the
**Gumbel-Softmax**[^1], also known as the **Concrete distribution**.

The guiding idea is the **reparameterization trick**. For a Gaussian,

$\begin{aligned}
z \sim \mathcal{N}(\mu,\,\sigma^{2})
\quad\Longleftrightarrow\quad
z = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\,1)
\end{aligned}$

so instead of differentiating through a random sample, we push the randomness
into a parameter-free noise source $\epsilon$ and take gradients through the
deterministic map $\mu + \sigma \epsilon$. Gumbel-Softmax applies the same idea
to categorical distributions: add Gumbel noise to the logits, then replace the
non-differentiable $\operatorname{argmax}$ with a temperature-controlled
$\operatorname{softmax}$.
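A minimal sketch of Gumbel-Softmax sampling (the temperature `tau` and all names are my own, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau, rng):
    """Continuous relaxation of a sample from Categorical(softmax(logits))."""
    g = rng.gumbel(size=logits.shape)   # g_i ~ Gumbel(0, 1)
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()                  # soft one-hot; hardens as tau -> 0

logits = np.log(np.array([0.2, 0.3, 0.5]))
soft = gumbel_softmax(logits, tau=0.1, rng=rng)

# Sanity check of the underlying Gumbel-max trick: argmax(logits + g)
# is distributed exactly as the original categorical.
counts = np.zeros(3)
for _ in range(10_000):
    counts[np.argmax(logits + rng.gumbel(size=3))] += 1
print(counts / 10_000)  # roughly [0.2, 0.3, 0.5]
```

Because the soft sample is a deterministic, differentiable function of the logits given the noise, gradients can flow through it; the temperature trades off bias (high `tau`) against gradient variance (low `tau`).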

# RELAX

# REBAR

# AutoCV

[^1]: Jang, Gu, and Poole, "Categorical Reparameterization with Gumbel-Softmax" (2017); see also Maddison, Mnih, and Teh, "The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables" (2017).