So, you don't have a gradient...
Many problems we would like to solve with neural networks involve discrete
spaces, either in the action space or in the latent space. However, neural
networks require a computation graph of differentiable components, and sampling
from a discrete distribution is not differentiable. Policy gradients give us
one way to estimate a gradient through that non-differentiable step.
Let's set up a toy problem we can use to explore this. We want to find the
gradient of our agent's expected reward from some environment with respect to
its parameters $\theta$:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\left[f(x)\right]$$
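As a concrete stand-in (this setup and all its names are my own illustration, not a fixed benchmark), consider a tiny bandit: a categorical distribution $p_\theta$ over four arms, parameterized by softmax logits, where each arm pays a fixed reward $f$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: four arms, each paying a fixed reward.
f = np.array([1.0, 2.0, 3.0, 0.5])   # reward per discrete action
theta = np.zeros(4)                   # agent parameters (softmax logits)

def policy(theta):
    """Categorical distribution p_theta = softmax(theta)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_reward(theta):
    """The objective we want to differentiate: E_{x ~ p_theta}[f(x)]."""
    return policy(theta) @ f

# The forward pass involves a discrete sample -- this is the
# non-differentiable step the rest of the post is about.
x = rng.choice(4, p=policy(theta))
```

With uniform logits the objective is just the mean reward; the interesting part is that `rng.choice` gives us a sample, not a gradient.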
An idea you might have seen before is REINFORCE. REINFORCE is a common
way of calculating black-box (score-function) gradients, but it has high variance.
We can rewrite the gradient of an expectation as an expectation of a gradient:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\left[f(x)\right] = \mathbb{E}_{x \sim p_\theta}\left[f(x) \, \nabla_\theta \log p_\theta(x)\right]$$
Log Derivative Trick
Sidenote: Why is this called the log derivative trick? Because it rests on the
identity $\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$: the derivative of the log of the density is what lets us pull the gradient inside the expectation.
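We can sanity-check the identity numerically (the example below is my own sketch, reusing a small softmax policy): for a softmax over logits, the exact gradient of $\mathbb{E}[f]$ has the closed form $p_i(f_i - J)$, and the Monte Carlo score-function estimate should converge to it.

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.array([1.0, 2.0, 3.0, 0.5])        # reward per action
theta = np.array([0.1, -0.2, 0.0, 0.3])   # arbitrary logits

p = np.exp(theta - theta.max())
p /= p.sum()

# Exact gradient of J(theta) = sum_x p_x f_x for a softmax policy:
# dJ/dtheta_i = p_i * (f_i - J)
J = p @ f
exact_grad = p * (f - J)

# REINFORCE / log-derivative estimate, using
# grad_theta log p_theta(x) = onehot(x) - p for a softmax.
n = 200_000
xs = rng.choice(4, size=n, p=p)
scores = np.eye(4)[xs] - p                 # shape (n, 4)
reinforce_grad = (f[xs][:, None] * scores).mean(axis=0)
```

After 200k samples the estimate agrees with the closed form to a couple of decimal places, which is exactly the convergence the identity promises.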
REINFORCE's estimator has high variance. One way to see this intuitively is to
realize that our estimate depends directly on $f(x)$. Since we know nothing about
$f$ (potentially, it is itself high variance), we want to build estimators that are
less dependent on $f$.
Enter the control variate...
One way we can reduce variance is by subtracting a control variate $b$ from
$f(x)$ before weighting by the score:

$$\mathbb{E}_{x \sim p_\theta}\left[\left(f(x) - b\right) \nabla_\theta \log p_\theta(x)\right]$$

Intuitively, we can see that subtracting a baseline that tracks the typical
value of $f$ shrinks the magnitude of the term multiplying the score, and with
it the variance of the estimate.
Why can we subtract a control variate? Because the score function has zero mean:
$\mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\right] = \nabla_\theta \mathbb{E}_{x \sim p_\theta}[1] = 0$, so subtracting a constant $b$ leaves the estimator unbiased.
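A quick numerical illustration (my own sketch, using the simplest possible choice of a constant baseline, the mean reward): subtracting $b$ leaves the Monte Carlo mean essentially unchanged while shrinking the per-coordinate variance.

```python
import numpy as np

rng = np.random.default_rng(1)

f = np.array([1.0, 2.0, 3.0, 0.5])
p = np.full(4, 0.25)                       # uniform policy for simplicity

n = 100_000
xs = rng.choice(4, size=n, p=p)
scores = np.eye(4)[xs] - p                 # grad log p per sample

# Plain REINFORCE samples vs. baseline-subtracted samples.
plain = f[xs][:, None] * scores
b = f.mean()                               # a simple constant control variate
with_cv = (f[xs] - b)[:, None] * scores

# Both target the same gradient; the baseline only reduces variance.
mean_gap = np.abs(plain.mean(0) - with_cv.mean(0)).max()
var_ratio = with_cv.var(0).sum() / plain.var(0).sum()
```

In practice the baseline is often learned (e.g. a value-function estimate) rather than a constant, but the unbiasedness argument is the same.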
Gumbel-Softmax / Concrete Distribution
Another way people have thought about sampling from discrete distributions is by
smoothing the samples themselves with the Gumbel-Softmax[^1] or Concrete
distribution: add Gumbel noise to the logits, then replace the discrete argmax
with a temperature-controlled softmax.
In order to compute the gradient of the relaxed sample, we simply backpropagate
through the softmax: every operation in the relaxation is differentiable with
respect to the logits, at the cost of the sample no longer being exactly discrete.
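A minimal sketch (function names are my own): the Gumbel-Max trick says that $\arg\max_i(\mathrm{logits}_i + g_i)$ with i.i.d. Gumbel noise $g_i$ is an exact sample from the softmax of the logits; Gumbel-Softmax replaces that argmax with a softmax at temperature $\tau$, making the sample differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -0.5])

def sample_gumbel(shape, rng):
    """Gumbel(0, 1) noise via -log(-log(U)), U ~ Uniform(0, 1)."""
    return -np.log(-np.log(rng.uniform(size=shape)))

def gumbel_softmax(logits, tau, rng):
    """One relaxed sample: softmax((logits + Gumbel noise) / tau)."""
    y = (logits + sample_gumbel(logits.shape, rng)) / tau
    y = y - y.max()                         # numerical stability
    z = np.exp(y)
    return z / z.sum()

# Gumbel-Max: the hard argmax recovers exact categorical samples.
n = 100_000
g = sample_gumbel((n, 3), rng)
hard = (logits + g).argmax(axis=1)
freq = np.bincount(hard, minlength=3) / n

target = np.exp(logits) / np.exp(logits).sum()   # softmax(logits)

# The soft sample lives on the simplex and is differentiable in the logits.
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
```

As $\tau \to 0$ the soft sample approaches a one-hot vector (recovering the argmax), while larger $\tau$ gives smoother, lower-variance but more biased samples.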
[^1]: This reference footnote contains a paragraph...