Various Control Variates

So, you don't have a gradient...

Many problems we would like to solve with neural networks deal with discrete spaces, either in the action space or in the latent space. However, neural networks require a computation graph of differentiable components, and sampling from a discrete distribution breaks differentiability. Policy gradients give us one way to estimate gradients in this setting.

Let's set up a toy problem we can use to explore this. We want the gradient, with respect to our agent's parameters $\theta$, of the expected value of some black-box function $f$ of samples drawn from our model:

$$\def\E{\operatorname*{\mathbb{E}}} \textcolor{#66BB6A}{\frac{\partial}{\partial \theta}} \E_{\textcolor{#1976D2}{p_{\theta}(z)}} \Big[ \textcolor{#FFA726}{f(z)} \Big]$$
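As a concrete sketch of this setup, consider a toy problem where $z$ ranges over a handful of discrete outcomes, $p_{\theta}(z)$ is a softmax over learnable logits, and $f$ is a fixed black-box reward per outcome. All of the specific values and names below are illustrative, not part of any particular environment:

```python
import numpy as np

# Hypothetical toy problem: z is one of 4 discrete outcomes,
# p_theta(z) is a softmax over learnable logits theta,
# and f(z) is a fixed black-box reward for each outcome.
theta = np.array([0.2, -0.5, 1.0, 0.0])   # logits (the parameters)
f = np.array([1.0, 3.0, 0.5, 2.0])        # reward f(z) for each z

def p(theta):
    """Softmax: p_theta(z) for each discrete outcome z."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

# The quantity whose gradient we want: E_{p_theta(z)}[f(z)].
expected_f = (p(theta) * f).sum()
```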

An idea you might have seen before is REINFORCE. REINFORCE is a common way of calculating black-box gradients, but it has high variance.

$$\def\E{\operatorname*{\mathbb{E}}} \begin{aligned} \frac{\partial}{\partial \theta} \E_{p_{\theta}(z)} \Big[ f(z) \Big] &= \frac{\partial}{\partial \theta} \sum_{z} f(z)\, p_{\theta}(z) \\ &= \sum_{z} f(z)\, \frac{\partial}{\partial \theta} p_{\theta}(z) \\ &= \sum_{z} f(z)\, \frac{\partial}{\partial \theta} p_{\theta}(z)\, \frac{1}{p_{\theta}(z)}\, p_{\theta}(z) \end{aligned}$$

We can rewrite $\frac{1}{p_{\theta}(z)} \frac{\partial}{\partial \theta} p_{\theta}(z)$ using the log-derivative trick:

$$\frac{\partial}{\partial \theta} \log p_{\theta}(z) = \frac{1}{p_{\theta}(z)} \frac{\partial}{\partial \theta} p_{\theta}(z)$$

Sidenote: why is this called the log derivative trick? It is just the chain rule applied to $\log$: differentiating $\log p_{\theta}(z)$ produces exactly the $\frac{1}{p_{\theta}(z)}$ factor.

Substituting this back in:

$$\def\E{\operatorname*{\mathbb{E}}} \begin{aligned} &= \sum_{z} f(z)\, \frac{\partial}{\partial \theta} \log p_{\theta}(z)\, p_{\theta}(z) \\ &= \E_{p_{\theta}(z)} \Big[ f(z)\, \frac{\partial}{\partial \theta} \log p_{\theta}(z) \Big] \end{aligned}$$

REINFORCE's estimator has high variance. One intuitive way to see this is to notice that our estimate depends directly on $f(z)$. Since we know nothing about $f(z)$ (it may itself have high variance), we want to build estimators that are less dependent on $f(z)$.
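The estimator above can be sketched in plain NumPy for a categorical $p_{\theta}$ with softmax logits. For this case the exact gradient is available in closed form, so we can check the Monte Carlo estimate against it (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.5, 1.0, 0.0])   # logits (illustrative)
f = np.array([1.0, 3.0, 0.5, 2.0])        # black-box reward per outcome

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(theta)

# Exact gradient of E[f(z)] w.r.t. softmax logits, for checking:
# d/dtheta_k = p_k * (f_k - E[f]).
exact_grad = probs * (f - probs @ f)

# REINFORCE: average f(z) * d/dtheta log p_theta(z) over samples.
# For softmax logits, grad log p(z) = onehot(z) - probs.
n = 200_000
zs = rng.choice(len(f), size=n, p=probs)
scores = np.eye(len(f))[zs] - probs
reinforce_grad = (f[zs][:, None] * scores).mean(axis=0)
```

With enough samples the estimate converges to the exact gradient, but the per-sample variance is what the rest of this post is about reducing.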

Enter the control variate...

One way we can reduce variance is by subtracting a control variate: a baseline $b$ that does not depend on the sample $z$, giving the estimator $\mathbb{E}_{p_{\theta}(z)} \big[ (f(z) - b)\, \frac{\partial}{\partial \theta} \log p_{\theta}(z) \big]$. Intuitively, subtracting $b$ shrinks the factor multiplying the score while leaving the expectation unchanged.

Why can we subtract a control variate?

$$\def\E{\operatorname*{\mathbb{E}}} \begin{aligned} \E_{p_{\theta}(z)} \Big[ \frac{\partial}{\partial \theta} \log p_{\theta}(z) \Big] &= \sum_{z} \frac{ \frac{\partial}{\partial \theta} p_{\theta}(z)}{p_{\theta}(z)}\, p_{\theta}(z) \\ &= \sum_{z} \frac{\partial}{\partial \theta} p_{\theta}(z) \\ &= \frac{\partial}{\partial \theta} \sum_{z} p_{\theta}(z) \\ &= \frac{\partial}{\partial \theta} 1 \\ &= 0. \end{aligned}$$
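Because the score has zero mean, subtracting any constant baseline leaves the gradient estimate unbiased while (often dramatically) reducing its variance. A small numerical sketch, using $b = \mathbb{E}[f]$ as the baseline and a reward with a large shared offset to make the effect obvious (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.5, 1.0, 0.0])   # logits (illustrative)
f = np.array([11.0, 13.0, 10.5, 12.0])    # reward with a large shared offset

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(theta)
n = 100_000
zs = rng.choice(len(f), size=n, p=probs)
scores = np.eye(len(f))[zs] - probs        # grad log p_theta(z)

plain = f[zs][:, None] * scores                    # REINFORCE
baseline = probs @ f                               # b = E[f], a constant
controlled = (f[zs] - baseline)[:, None] * scores  # with control variate

# Both estimators have the same mean (the score has zero expectation),
# but the baselined one has far lower variance here.
var_plain = plain.var(axis=0).sum()
var_controlled = controlled.var(axis=0).sum()
```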

Gumbel-Softmax / Concrete Distribution

Another way people have approached sampling from discrete distributions is by smoothing them into continuous relaxations using Gumbel noise, via Gumbel-Softmax[^1] or the Concrete distribution.
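A minimal sketch of Gumbel-Softmax sampling, assuming unnormalized log-probabilities (logits) and a temperature $\tau$; the specific values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

logits = np.array([0.2, -0.5, 1.0, 0.0])  # unnormalized log-probs (illustrative)
tau = 0.5                                  # temperature

def gumbel_softmax_sample(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + g) / tau), g ~ Gumbel(0, 1)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

sample = gumbel_softmax_sample(logits, tau, rng)
# As tau -> 0, samples approach exact one-hot draws: the argmax of
# (logits + g) is distributed according to softmax(logits) (the Gumbel-max trick).
```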

These relaxations let us rewrite the expectation in terms of parameter-free noise $\epsilon$ and a deterministic, differentiable function $g_{\theta}$ (written generically here):

$$\def\E{\operatorname*{\mathbb{E}}} \E_{p_{\theta}(z)} \Big[ f(z) \Big] = \E_{\epsilon} \Big[ f(g_{\theta}(\epsilon)) \Big]$$

In order to compute the gradient through a sample, consider first the familiar continuous case:

$$z \sim \mathcal{N}(\mu,\,\sigma^{2})$$

Computing the gradient of a sample drawn this way with respect to $\mu$ and $\sigma$ is not directly possible, but we can reparameterize it as $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0,\,1)$, moving the randomness into a parameter-free noise source so that gradients can flow through $\mu$ and $\sigma$.
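The reparameterization trick for the Gaussian case can be sketched numerically. Taking $f(z) = z^2$ as an illustrative choice, $\frac{\partial}{\partial \mu} \mathbb{E}[f(z)] = 2\mu$, and the pathwise estimate differentiates through the sample itself:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.8   # illustrative Gaussian parameters
n = 100_000

# Reparameterize: z = mu + sigma * eps, eps ~ N(0, 1).
eps = rng.standard_normal(n)
z = mu + sigma * eps

# For f(z) = z**2, d/dmu f(mu + sigma*eps) = 2 * z, so the
# pathwise estimate of d/dmu E[f(z)] is the average of 2 * z.
grad_mu = (2 * z).mean()   # should be close to 2 * mu
```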




[^1]: This reference footnote contains a paragraph...