Deep Learning is presented as Energy-Based Learning

Indeed, we train a neural network by running Backprop, thereby minimizing the model error, which is like minimizing an Energy.

But what is this Energy? Deep Learning (DL) Energy functions look nothing like a typical chemistry or physics Energy. There, we have Free Energy landscapes, which frequently form funneled landscapes: a trade-off between energetic and entropic effects.

The confusion arises from assuming Deep Learning is a non-convex optimization problem that looks similar to the zero-Temperature Energy Landscapes from spin glass theory.

I present a different view.  I believe Deep Learning is really optimizing an effective Free Energy function. And this has profound implications on Why Deep Learning Works.

This post will attempt to relate recent ideas in RBM inference to Backprop, and argue that Backprop is minimizing a dynamic, temperature dependent, ruggedly convex, effective Free Energy landscape.

This is a fairly long post, but at least it is a basic review.  I try to present these ideas in a semi-pedagogic way, to the extent I can in a post, discussing RBMs, MLPs, Free Energies, and all that entails.

#### BackProp

The Backprop algorithm lets us train a model directly on our data (X) by minimizing the predicted error $E_{train}(\mathbf{\theta})$, where the parameter set $\mathbf{\theta}$ includes the weights $(\mathbf{W})$, biases $(\mathbf{b})$, and activations $(\mathbf{a})$ of the network.

$\theta=\{\mathbf{W},\mathbf{b},\mathbf{a}\}$.

Let’s write

$E_{train}(\mathbf{\theta})=\underset{\mathbf{x}_{\mu}\in\mathbf{X}}{\sum}err(\mathbf{x}_{\mu})$,

where the error $err(x)$ could be a mean squared error (MSE), cross entropy, etc. For example, in simple regression, we can minimize the MSE

$E_{train}(\theta)=\sum_{\mu}(y_{\mu}-f(\mathbf{x}_{\mu},\theta))^{2}$,

whereas for classification, we might minimize a cross entropy, written here in its binary form,

$E_{train}(\theta)=-\sum_{\mu}(y_{\mu}\ln f_{\mu}+(1-y_{\mu})\ln(1-f_{\mu}))$

where $y_{\mu}$ are the labels and $f_{\mu}=f(\mathbf{x}_{\mu},\theta)$ is the network output for each training instance $\mu$.
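As a minimal sketch of the two training errors above, assuming the standard binary form of the cross entropy (with the overall minus sign and the $\ln(1-f_{\mu})$ term), the sums over training instances can be written in plain Python. The function names and the toy labels and outputs are hypothetical, chosen only for illustration.

```python
import math

def mse_loss(labels, outputs):
    """Sum of squared errors over the training instances."""
    return sum((y - f) ** 2 for y, f in zip(labels, outputs))

def binary_cross_entropy(labels, outputs, eps=1e-12):
    """Binary cross entropy for labels in {0, 1} and outputs in (0, 1)."""
    total = 0.0
    for y, f in zip(labels, outputs):
        f = min(max(f, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(f) + (1 - y) * math.log(1.0 - f)
    return total

# Hypothetical labels and network outputs for three training instances.
labels = [1, 0, 1]
outputs = [0.9, 0.2, 0.6]
print(mse_loss(labels, outputs))
print(binary_cross_entropy(labels, outputs))
```

Both quantities are training errors only; nothing here touches a holdout set.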

Notice that $err(\mathbf{x}_{\mu})$ is the training error for instance $\mu$, not a test or holdout error.  Notice that, unlike a Support Vector Machine (SVM) or Logistic Regression (LR), we don’t use Cross Validation (CV) during training.   We simply minimize the training error, whatever that is.

Of course, we can adjust the network parameters, regularization, etc, to tune the architecture of the network.  Although it appears that

At this point, many people say that BackProp leads to a complex, non-convex optimization problem; IMHO, this is naive.

It has been known for 20 years that Deep Learning does not suffer from local minima.

Anyone who thinks it does has never read a research paper or book on neural networks.  So what we really would like to know is, Why does Deep Learning Scale ?  Or, maybe, why does it work at all ?!

To implement Backprop, we take derivatives $\dfrac{\partial}{\partial\theta}E_{train}(\theta)$ and apply the chain rule to the network outputs $f(\mathbf{x}_{\mu},\theta)$, layer by layer.

#### Layers and Activations

Let’s take a closer look at the layers and activations.  Consider a simple 1 layer net:

The hidden activations $\mathbf{a}$ are thought to mimic the function of actual neurons, and are computed by applying an activation function $f()$ to a linear Energy function $\mathbf{W}^{T}\mathbf{x}+\mathbf{b}$.

Indeed, the sigmoid activation function $\sigma()$ was first proposed in 1968 by Jack Cowan at the University of Chicago, and is still used today in models of neural dynamics:

$\mathbf{a}=\sigma(\mathbf{W}^{T}\mathbf{x}+\mathbf{b})$

Moreover, Cowan pioneered using Statistical Mechanics to study the Neocortex.

And we will need a little Stat Mech to explain what our Energy functions are, but just a little.

#### Sigmoid Activations and Statistical Mechanics

While it seems we are simply proposing an arbitrary activation function, we can, in fact, derive the appearance of sigmoid activations, at least when performing inference on a single layer (mean field) Restricted Boltzmann Machine (RBM).

Given the (total) RBM Energy function

$E(\mathbf{v},\mathbf{h})=-\mathbf{a}^{T}\mathbf{v}-\mathbf{b}^{T}\mathbf{h}-\mathbf{v}^{T}\mathbf{W}\mathbf{h}$

the Energy is the negative log of an un-normalized probability, such that

$P(\mathbf{v},\mathbf{h})=\dfrac{1}{Z}e^{-\beta E(\mathbf{v},\mathbf{h})}$

where the normalization factor, Z, is an object from statistical mechanics called the (total) partition function

$Z(\mathbf{v},\mathbf{h})=\underset{\mathbf{v},\mathbf{h}}{\sum}e^{-\beta E(\mathbf{v},\mathbf{h})}$

and $\beta=\dfrac{1}{T}$ is an inverse Temperature.  In modern machine learning, we implicitly set $\beta=1$.

Following Larochelle, we can factor $P(\mathbf{v},\mathbf{h})$ by explicitly writing $Z(\mathbf{v},\mathbf{h})$ in terms of sums over the binary hidden activations $h_{j}\in\{0,1\}$.  This lets us write the conditional probabilities for each individual neuron as

$p(v_{i}=1|\mathbf{h})=\sigma(\sum_{j}W_{i,j}h_{j}+a_{i})$

$p(h_{j}=1|\mathbf{v})=\sigma(\sum_{i}W_{i,j}v_{i}+b_{j})$.

We note that this formulation was not obvious, and early work on RBMs used methods from statistical field theory to get this result.
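The sigmoid form of the hidden conditional can be checked numerically on a toy RBM. The sketch below assumes the standard sign convention $E=-\mathbf{a}^{T}\mathbf{v}-\mathbf{b}^{T}\mathbf{h}-\mathbf{v}^{T}\mathbf{W}\mathbf{h}$ and $\beta=1$; it sums Boltzmann weights over every hidden configuration and compares the result with the closed form. All parameter values and function names are hypothetical.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - sum_ij v_i W_ij h_j (assumed sign convention)."""
    visible_term = sum(a[i] * v[i] for i in range(len(v)))
    hidden_term = sum(b[j] * h[j] for j in range(len(h)))
    coupling = sum(v[i] * W[i][j] * h[j]
                   for i in range(len(v)) for j in range(len(h)))
    return -(visible_term + hidden_term + coupling)

def p_hidden_on_brute_force(j, v, W, a, b):
    """p(h_j = 1 | v): sum Boltzmann weights (beta = 1) over all hidden states."""
    numerator = 0.0
    denominator = 0.0
    for h in itertools.product([0, 1], repeat=len(b)):
        weight = math.exp(-energy(v, h, W, a, b))
        denominator += weight
        if h[j] == 1:
            numerator += weight
    return numerator / denominator

def p_hidden_on_sigmoid(j, v, W, b):
    """Closed form: sigma(sum_i W_ij v_i + b_j)."""
    return sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + b[j])

# Hypothetical toy parameters: 3 visible units, 2 hidden units.
W = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
a = [0.1, -0.2, 0.3]
b = [-0.4, 0.6]
v = (1, 0, 1)

for j in range(len(b)):
    print(p_hidden_on_brute_force(j, v, W, a, b), p_hidden_on_sigmoid(j, v, W, b))
```

The two columns agree because, with the visible units fixed, the hidden units decouple and each one contributes an independent two-state sum.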

#### RBM Training

We use $p(v_{i}=1|\mathbf{h})$ and $p(h_{j}=1|\mathbf{v})$ in Contrastive Divergence (CD) or other solvers as part of the Gibbs Sampling step for (unsupervised) RBM inference.

CD has been a puzzling algorithm to understand.  When first proposed, it was unclear what optimization problem CD is solving.  Indeed, Hinton is reported to have said

“‘the Microsoft Algorithm:’ It asks, ‘where do you want to go today?’ and then doesn’t let you go there.”

Specifically, we run several epochs of:

1. n steps of Gibbs sampling, or some other equilibration method, to set the neuron activations.
2. some form of gradient descent $\dfrac{\partial}{\partial\theta}$ where $\theta=\{\mathbf{W},\mathbf{b}\}$
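The two steps above can be sketched as a single-step Contrastive Divergence (CD-1) loop. This is an illustrative sketch, not a reference implementation: it uses one Gibbs step per update, the usual data-minus-reconstruction statistics as the approximate gradient, and it also updates the visible biases. The toy data, learning rate, and function names are all assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_h_given_v(v, W, b):
    return [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]

def p_v_given_h(h, W, a):
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def sample(probs, rng):
    """Draw binary states from independent Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v0, W, a, b, lr, rng):
    # Step 1: one round of Gibbs sampling, v0 -> h0 -> v1.
    ph0 = p_h_given_v(v0, W, b)
    h0 = sample(ph0, rng)
    v1 = sample(p_v_given_h(h0, W, a), rng)
    ph1 = p_h_given_v(v1, W, b)
    # Step 2: approximate gradient step (data statistics minus sampled statistics).
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])

# Hypothetical toy data: 4 visible units, 2 hidden units.
rng = random.Random(0)
data = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden

for epoch in range(50):
    for v0 in data:
        cd1_step(v0, W, a, b, lr=0.1, rng=rng)

print(a)
```

In this toy data the first visible unit is always on and the last is always off, so the visible biases drift in opposite directions during training.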

We will see below that we can cast RBM inference as directly minimizing a Free Energy, something that will prove very useful for relating RBMs to MLPs.

#### Energies and Activations

The sigmoid and tanh are old-fashioned activations; today we may prefer to use ReLUs (and Leaky ReLUs).

The sigmoid itself was, at first, just an approximation to the Heaviside step function used in neuron models.  But the presence of sigmoid activations in the total Energy suggests, at least to me, that Deep Learning Energy functions are more than just random (Morse) functions.

RBMs are a special case of unsupervised nets that still use stochastic sampling. In supervised nets, like MLPs and CNNs (and in unsupervised Autoencoders like VAEs), we use Backprop.  But the activations are not conditional probabilities.  Let’s look in detail:

##### MLP outputs

Consider a MultiLayer Perceptron, with 1 Hidden layer, and 1 output node

$f^{\mu}_{MLP}=\sigma(\underset{h\in\mathbf{h}}{\sum}\mathbf{a}_{h})$

$\mathbf{a}_{h}=\sigma(\mathbf{W}^{T}\mathbf{v}+\mathbf{b})$

where $\mathbf{v}=\mathbf{x}^{\mu}$  for each data point, leading to the layer output

$g^{\mu}_{MLP}(\theta)=\sigma(\mathbf{W}^{T}\mathbf{x}^{\mu}+\mathbf{b})$

and total MLP output

$f^{\mu}_{MLP}(\theta)=\sigma(\sum g^{\mu}_{MLP}(\theta))=\sigma(\sum\sigma(\mathbf{W}^{T}\mathbf{x}^{\mu}+\mathbf{b}))$

where $\theta=\{\mathbf{W},\mathbf{b}\}$.

If we add a second layer, we have the iterated layer output:

$g^{\mu}_{MLP}(\theta')=\sigma(\mathbf{W}^{T}\sigma(\mathbf{W'}^{T}\mathbf{x}^{\mu}+\mathbf{b'})+\mathbf{b})$

where $\theta'=\{\mathbf{W},\mathbf{W'},\mathbf{b},\mathbf{b'}\}$.

The final MLP output function has a similar form:

$f^{\mu}_{MLP}(\theta')=\sigma(\sum g^{\mu}_{MLP}(\theta'))$

$f^{\mu}_{MLP}(\theta')=\sigma(\sum\sigma(\mathbf{W}^{T}\sigma(\mathbf{W'}^{T}\mathbf{x}^{\mu}+\mathbf{b'})+\mathbf{b}))$
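The nested structure of these outputs is easy to see in code. The sketch below assumes weights stored as `W[input][output]` (so that a layer computes $\sigma(\mathbf{W}^{T}\mathbf{x}+\mathbf{b})$) and an output node that simply sums the last layer before applying a sigmoid, as in the formulas above. The parameter values and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    """sigma(W^T x + b), with W stored as W[input][output]."""
    return [sigmoid(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]

def mlp_one_hidden(x, W, b):
    """f = sigma(sum_h sigma(W^T x + b))."""
    return sigmoid(sum(layer(W, b, x)))

def mlp_two_hidden(x, W_prime, b_prime, W, b):
    """f = sigma(sum sigma(W^T sigma(W'^T x + b') + b))."""
    inner = layer(W_prime, b_prime, x)  # first layer, parameters W', b'
    g = layer(W, b, inner)              # second layer, parameters W, b
    return sigmoid(sum(g))

# Hypothetical toy parameters: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
x = [1.0, 0.0]
W_prime = [[0.2, -0.5, 0.1], [0.4, 0.3, -0.2]]
b_prime = [0.0, 0.1, -0.1]
W = [[0.3, -0.1], [0.2, 0.5], [-0.4, 0.6]]
b = [0.05, -0.05]

print(mlp_one_hidden(x, W_prime, b_prime))
print(mlp_two_hidden(x, W_prime, b_prime, W, b))
```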

So with a little bit of stat mech, we can derive the sigmoid activation function from a general energy function.  And we have these activations in RBMs as well as MLPs.

So when we apply Backprop, what problem are we actually solving ?

Are we simply finding a minimum on a random high-dimensional manifold?  Or can we say something more, given the special structure of these layers of activated energies?

#### Backprop and Energy Minimization

To train an MLP, we run several epochs of Backprop.   Backprop has 2 passes: forward and backward:

1. Forward: Propagate the inputs $\{\mathbf{x}^{\mu}\}$ forward through the network, activating the neurons
2. Backward: Propagate the errors $\{err(\mathbf{x}^{\mu})\}$ backward to compute the weight gradients $\Delta\mathbf{W}$

Each epoch usually runs small batches of inputs at a time.  (And we may need to normalize the inputs and control the variances.  These details may be important for our analysis, and we will consider them in a later post.)

After each pass, we update the weights, using something like an SGD step (or Adam, RMSProp, etc):

$\mathbf{W}\rightarrow\mathbf{W}+\eta\Delta\mathbf{W}$

For an MSE loss, we evaluate the partial derivatives over the Energy parameters $\theta=\{\mathbf{W},\mathbf{W'},\mathbf{b},\mathbf{b'}\}$.

$\dfrac{\partial}{\partial\theta}\underset{\mu}{\sum}(y^{\mu}-f^{\mu}_{MLP}(\theta))^{2}$

Backprop works by the chain rule, and given the special form of the activations, lets us transform the Energy derivatives into a sum of Energy gradients, layer by layer.

I won’t go into the details here; there are many blogs on BackProp today (which is amazing!).  I will say…

Backprop couples the activation states of the neurons to the Energy parameter gradients through the cycle of forward-backward phases.
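To make the forward-backward cycle concrete, here is a minimal sketch for the one-hidden-layer network $f=\sigma(\sum_{h}\sigma(\mathbf{W}^{T}\mathbf{x}+\mathbf{b}))$ with an MSE loss. The gradients follow from the chain rule and the identity $\sigma'=\sigma(1-\sigma)$; the update uses $\Delta\mathbf{W}$ equal to minus the gradient. The toy data, step size, and function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, b):
    """Forward pass: hidden activations a and the scalar output f."""
    a = [sigmoid(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
         for j in range(len(b))]
    return a, sigmoid(sum(a))

def loss(X, Y, W, b):
    return sum((y - forward(x, W, b)[1]) ** 2 for x, y in zip(X, Y))

def gradients(X, Y, W, b):
    """Backward pass: dLoss/dW and dLoss/db via the chain rule."""
    gW = [[0.0] * len(b) for _ in W]
    gb = [0.0] * len(b)
    for x, y in zip(X, Y):
        a, f = forward(x, W, b)
        delta_out = -2.0 * (y - f) * f * (1.0 - f)   # error at the output node
        for j in range(len(b)):
            delta_j = delta_out * a[j] * (1.0 - a[j])  # error at hidden unit j
            gb[j] += delta_j
            for i in range(len(x)):
                gW[i][j] += delta_j * x[i]
    return gW, gb

def sgd_step(W, b, gW, gb, eta):
    """W -> W + eta * Delta W, with Delta W = -gradient (same for b)."""
    for i in range(len(W)):
        for j in range(len(b)):
            W[i][j] -= eta * gW[i][j]
    for j in range(len(b)):
        b[j] -= eta * gb[j]

# Hypothetical toy problem: 2 inputs, 2 hidden units, 3 training instances.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [1.0, 0.0, 1.0]
W = [[0.1, -0.2], [0.3, 0.4]]
b = [0.0, 0.0]

print(loss(X, Y, W, b))
for epoch in range(100):
    gW, gb = gradients(X, Y, W, b)
    sgd_step(W, b, gW, gb, eta=0.5)
print(loss(X, Y, W, b))
```

Each epoch sets the activations in the forward pass and then uses them in the backward pass, which is the coupling between activations and parameter gradients described above.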

In a crude sense, Backprop resembles our more familiar RBM training procedure, where we equilibrate to set the activations, and run gradient descent to set the weights. Here, I show a direct connection, and derive the MLP functional form directly from an RBM.

#### Discriminative (Supervised) RBMs

RBMs are unsupervised; MLPs are supervised.  How can we connect them?  Crudely, we can think of an MLP as a single layer RBM with a softmax tacked on the end.   More rigorously, we can look at Generalized Discriminative RBMs, which solve the conditional probability directly, in terms of the Free Energies, cast in the soft-max form

$p(y|\mathbf{x})=\dfrac{\exp(-E_{Free}(\mathbf{x},y))}{\sum_{y^{*}}\exp(-E_{Free}(\mathbf{x},y^{*}))}$

So the question is, can we extract a Free Energy for an MLP?
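As a small illustration of this soft-max form, the sketch below turns a list of per-class Free Energies into class probabilities. It does not compute the Free Energies of a discriminative RBM; the input values are hypothetical numbers, and the function name is an assumption. Subtracting the minimum Free Energy before exponentiating is only for numerical stability and leaves the probabilities unchanged.

```python
import math

def class_probabilities(free_energies):
    """p(y|x) = exp(-F_y) / sum_y* exp(-F_y*), computed stably."""
    f_min = min(free_energies)
    weights = [math.exp(-(f - f_min)) for f in free_energies]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical Free Energies E_Free(x, y) for three candidate labels y.
free_energies = [2.0, 0.5, 3.0]
print(class_probabilities(free_energies))
```

The label with the lowest Free Energy receives the highest probability.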

#### The Backward Phase

I now consider the Backward phase, using the deterministic EMF RBM, as a starting point for understanding MLPs.

An earlier post discusses the EMF RBM, from the context of chemical physics.  For a traditional machine learning perspective, see this thesis.

In some sense, this is kind-of obvious. And yet, I have not seen a clear presentation of the ideas in this way.  I do rely upon new research, like the EMF RBM, although I also draw upon fundamental ideas from complex systems theory–something popular in my PhD studies, but which is perhaps ancient history now.

The goal is to relate RBMs, MLPs, and basic Stat Mech under a single conceptual umbrella.

In the EMF approach, we see RBM inference as a sequence of deterministic annealing steps, from 1 quasi-equilibrium state to another, consisting of 2 steps for each epoch:

1. Forward: equilibrate the neuron activations by minimizing the TAP Free Energy
2. Backward: compute weight gradients of the TAP Free Energy

At the end of each epoch, we update the weights, with weight (temperature) constraints (i.e. reset the L1 or L2 norm).  BTW, it may not be obvious that weight regularization is like a Temperature control; I will address this in a later post.

(1) The so-called Forward step solves a fixed point equation (which is similar in spirit to taking n steps of Gibbs sampling).  This leads to a pair of coupled recursion relations for the TAP magnetizations (or just nodes).   Suppose we take t+1 iterations.  Let us ignore the second order Onsager correction, and consider the mean field updates:

$h_{i}[t+1]\leftarrow\sigma\left[b_{i}+\underset{j}{\sum}w_{i,j}v_{j}[t+1]-\cdots\right]$

$v_{i}[t+1]\leftarrow\sigma\left[a_{i}+\underset{j}{\sum}w_{i,j}h_{j}[t]-\cdots\right]$

Because these are deterministic steps, we can express the $h_{i}[t+1]$ in terms of $h_{i}[t]$:

$\mathbf{h}[t+1]\leftarrow\sigma\left[\mathbf{b}+\mathbf{W}^{T}\sigma(\mathbf{b}+\mathbf{W}^{T}\mathbf{h}[t])\right]$

At the end of the recursion, we will have a forward pass that resembles a multi-layer MLP, but that shares weights and biases between layers:

$\mathbf{h}[t+1]\leftarrow\sigma\left[\mathbf{b}+\mathbf{W}^{T}\sigma(\mathbf{b}+\cdots\sigma(\mathbf{b}+\mathbf{W}^{T}\mathbf{v}))\right]$
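The unrolling of this recursion can be sketched as follows, under some stated assumptions: the Onsager correction terms are dropped (naive mean field), and the visible update uses the visible bias $\mathbf{a}$ with $\mathbf{W}$ while the hidden update uses $\mathbf{b}$ with $\mathbf{W}^{T}$, which the compact formula above folds into a single tied form. Parameter values and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_from_visible(v, W, b):
    """sigma(b + W^T v), with W stored as W[visible][hidden]."""
    return [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]

def visible_from_hidden(h, W, a):
    """sigma(a + W h)."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def mean_field_hidden(x, W, a, b, n_iterations):
    """Iterate the deterministic mean field updates, starting from the data x."""
    h = hidden_from_visible(x, W, b)  # plays the role of the first layer
    for _ in range(n_iterations):
        v = visible_from_hidden(h, W, a)
        h = hidden_from_visible(v, W, b)
    return h

# Hypothetical toy parameters: 3 visible units, 2 hidden units.
W = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
a = [0.1, -0.2, 0.3]
b = [-0.4, 0.6]
x = [1.0, 0.0, 1.0]

print(mean_field_hidden(x, W, a, b, n_iterations=2))
```

Written out explicitly, two iterations are just five nested sigmoid layers that all reuse the same $\mathbf{W}$, $\mathbf{a}$, and $\mathbf{b}$, which is the sense in which the recursion resembles a deep net with tied weights.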

We can now associate an n-layer MLP, with tied weights,

$\theta=\{\mathbf{W}=\mathbf{W'}=\cdots;\;\;\mathbf{b}=\mathbf{b'}=\cdots\}$,

to an approximate (mean field) EMF RBM,  with n fixed point iterations (ignoring the Onsager correction for now).  Of course, an MLP is supervised, and an RBM is unsupervised, so we need to associate the RBM hidden nodes with the MLP output function at the last layer ($g^{\mu}_{MLP}(\theta)$), prior to adding the MLP output node:

$g^{\mu}_{MLP}(\theta)=g^{\mu}_{RBM}(\theta)=\mathbf{h}[n](x^{\mu})$

This leads naturally to the following conjecture:

The EMF RBM and the BackProp Forward and Backward steps effectively do the same thing: minimize the Free Energy.

#### Is this right ?

This is a work in progress.

Formally, it is simple and compelling.  Is it the whole story? Probably not.  It is merely an observation, food for thought.

So far, I have only removed the visible magnetizations $\mathbf{v}[n](x^{\mu})$ to obtain the MLP layer function $g^{\mu}_{MLP}(\theta)$ as a function of the original visible units.  The unsupervised EMF RBM Free Energy, however, contains expressions in terms of both the hidden and visible magnetizations ( $\mathbf{v}[n],\mathbf{h}[n]=\mathbf{m_{v}},\mathbf{m_{h}}$ ).  To get a final expression, it is necessary to either

• unravel the network, like a variational auto encoder (VAE)
• replace the visible magnetizations with the true labels, and introduce the softmax loss

The result itself should not be so surprising, since it has already been pointed out by Kingma and Welling, Auto-Encoding Variational Bayes, that a Bernoulli MLP is like a variational decoder.  And, of course, VAEs can be formulated with BackProp.

More importantly, it is unclear how good the RBM EMF really is.  Some followup studies indicate that second order is not as good as, say, AIS, for estimating the partition function.  I have coded a Python emf_rbm.py module using the scikit-learn interface, and testing is underway.  I will blog this soon.

Note that the EMF RBM relies on the Legendre Transform, which is like a convex relaxation.  Early results indicate that this does degrade the RBM solution compared to traditional CD.  Perhaps BackProp is effectively relaxing the convexity constraint by, say, relaxing the condition that the weights are tied between layers.

Still, I hope this can provide some insight.  And there are …

#### Implications

Free Energy is a first class concept in Statistical Mechanics.  In machine learning, not always so much. It appears in much of Hinton’s work, and as a starting point for deriving methods like Variational Auto Encoders and Probabilistic Programming.

But Free Energy minimization plays an important role in non-convex optimization as well.  Free energies are a Boltzmann average of the zero-Temperature Energy landscape, and, therefore, convert a non-convex surface into something at least less non-convex.
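As a small worked example of the Boltzmann average, assume the usual definition $F(T)=-T\ln\sum_{s}e^{-E_{s}/T}$ over a discrete set of states. The sketch below evaluates it for a handful of hypothetical energy levels; it only illustrates the definition, not the landscape of an actual network. At low Temperature the Free Energy approaches the lowest Energy, and at higher Temperature the entropic contribution pulls it further below.

```python
import math

def free_energy(energies, temperature):
    """F(T) = -T * ln sum_s exp(-E_s / T), computed with a log-sum-exp shift."""
    e_min = min(energies)
    z_shifted = sum(math.exp(-(e - e_min) / temperature) for e in energies)
    return e_min - temperature * math.log(z_shifted)

# Hypothetical energy levels of a small system.
energies = [0.0, 1.0, 0.3, 2.0]

for temperature in (0.001, 0.5, 1.0, 2.0):
    print(temperature, free_energy(energies, temperature))
```

The Free Energy never exceeds the minimum Energy, and it decreases as the Temperature rises, since more states contribute to the sum.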

Indeed, in one of the very first papers on mean field Boltzmann Machines (1987), it is noted that

“An important property of the effective [free] energy function E'(V,0,T) is that it has a smoother landscape than E(S) due to the extra terms. Hence, the probability of getting stuck in a local minima decreases.”

Moreover, in protein folding, we have even stronger effects, which can lead to a ruggedly convex energy landscape.  This arises when the system runs out of configurational entropy (S), and energetic effects (E) dominate.

Most importantly, we want to understand, when does Deep Learning generalize well, and when does it overtrain ?

LeCun has very recently pointed out that Deep Nets fail when they run out of configuration entropy–an argument I also have made from theoretical analysis using the Random Energy Model.  So it is becoming more important to understand what the actual energy landscape of a deep net is, how to separate out the entropic and energetic terms, and how to characterize the configurational entropy.

Hopefully this small insight will be useful and lead to a further understanding of Why Deep Learning Works.
