Restricted Boltzmann Machines (RBMs) are like the H atom of Deep Learning.

They are basically a solved problem, and while of academic interest, not really used in complex modeling problems.  They were, upto 10 years ago, used for pretraining deep supervised nets.  Today, we can train very deep, supervised nets directly.

RBMs are the foundation of unsupervised deep learning–

an unsolved problem.

RBMs appear to outperform Variational Auto Encoders (VAEs) on simple data sets like the Omniglot set–a data set developed for one shot learning, and used in deep learning research.

RBM research continues in areas like semi-supervised learning with deep hybrid architectures, Temperature dependence,  infinitely deep RBMs, etc.

Many of basic concepts of Deep Learning are found in RBMs.

Sometimes clients ask, “how is related to Deep Learning ?”

In this post, I am going to discuss a recent advanced in RBM theory based on ideas from theoretical condensed matter physics and physical chemistry,

the Extended Mean Field Restricted Boltzmann Machine: EMF_RBM

(see: Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy

[Along the way, we will encounter several Nobel Laureates, including the physicists David J Thouless (2016) and Philip W. Anderson (1977), and the physical chemist Lars Onsager (1968).]

RBMs are pretty simple, and easily implemented from scratch.  The original EMF_RBM is in Julia; I have ported EMF_RBM to python, in the style of the scikit-learn BernoulliRBM package.

### Show me the code

https://github.com/charlesmartin14/emf-rbm/blob/master/EMF_RBM_Test.ipynb

### Theory

We examined RBMs in the last post on Cheap Learning: Partition Functions and RBMs. I will build upon that here, within the context of statistical mechanics.

RBMs are defined by the function

$E(mathbf{v},mathbf{h})=-mathbf{v}^{T}mathbf{W}mathbf{h}-mathbf{v}^{T}mathbf{a}-mathbf{b}^{T}mathbf{h}$

To train an RBM, we minimize the log likelihood ,

$mathcal{L}=-F^{c}(mathbf{v})-F(mathbf{v,h})$

the sum of the clamped and (actual) Free Energies, where

$F^{c}=ln;underset{mathbf{h}}{sum};E(mathbf{v},mathbf{h})$

and

$F=ln;Z(mathbf{v,h})=ln;underset{mathbf{v},mathbf{h}}{sum};E(mathbf{v},mathbf{h});;,$

The sums range over a space of  $mathcal{O}({2^{N}});;$,

$mathbf{v}in[0,1]^{N_v},mathbf{h}in[0,1]^{N_h}$

which is intractable in most cases.

Training an RBM requires computing the log Free Energy; this is hard.

When training an RBM,  we

1. take a few ‘steps toward equilibration’, to approximate F
2. take a gradient step, $dfrac{partial F}{partial w_{i,j}}$, to get W, a, b
3. regularize W (i.e. by weight decay)
4. update W, a, b
5. repeat until some stopping criteria is met

We don’t include label information, although a trained RBM can provide for a down-stream classifier.

#### Extended Mean Field Theory:

The Extended Mean Field (EMF)  RBM is a straightforward application of known statistical mechanics theories.

There are, literally, thousands of papers on spin glasses.

The EMF RBM is a great example of how to operationalize spin glass theory.

#### Mean Field Theory

The Restricted Boltzmann Machine has a very simple Energy function, which makes it very easy to factorize the partition function Z , explained by Hugo Larochelle, to obtain the conditional probabilities

$p(h_{i}=1|v)=sigma(b_{i}+mathbf{W}_{i}mathbf{h})$

$p(v_{i}=1|h)=sigma(a_{i}+mathbf{W}_{i}mathbf{v})$

The conditional probabilities let us apply Gibbs Sampling, which is simply

• hold $mathbf{v}$ fixed, sample $p(mathbf{h}|mathbf{v})$
• hold $mathbf{h}$ fixed, sample $p(mathbf{h}|mathbf{h})$
• repeat for 1 or more equlibiration steps

In statistical mechanics, this is called a mean field theory.  This means that the Free Energy (in $mathbf{v}$) can be written as a simple linear average over the hidden units

$F^{RBM}=mathbf{a}^{T}mathbf{v}+mathbf{b}^{T}mathbf{h}+mathbf{v}^{T}(mathbf{Wh})$.

where $mathbf{Wh}$ is the mean field of the hidden units.

At high Temp., for a spin glass, a mean field model seems very sensible because the spins (i.e. activations) become uncorrelated.

$langle s_{i}cdots s_{j}rangleunderset{Trightarrowinfty}{rightarrow}langle s_{i}ranglecdotslangle s_{j}rangle$

This is non-trivial to prove

Theoreticians use mean field like the p-spin spherical spin glass to study deep learning because of their simplicity.  Computationally, we frequently need more.

How we can go beyond mean field theory ?

#### The Onsager Correction

Onsager was awarded 1968 the Nobel Prize in Chemistry for the development of the Onsager Reciprocal Relations, sometimes called the ‘4th law of Thermodynamics’

The Onsager relations provides the theory to treat thermodynamic systems that are in a quasi-stationary, local equilibrium.

Onsager was the first to show how to relate the correlations in the fluctuations to the linear response.  And by tying a sequence of quasi-stationary systems together, we can describe an irreversible process…

..like learning.  And this is exactly what we need to train an RBM.

In an RBM, the fluctuations are variations in the hidden and visible nodes.

In a BernoulliRBM, the activations can be 0 or 1, so the fluctuation vectors are

$(mathbf{h}-mathbf{0}), (mathbf{h}-mathbf{1}) ;; and;; (mathbf{v}-mathbf{0}),(mathbf{v}-mathbf{1})$

The simplest correction to the mean field Free Energy, at each step in training, are the correlations in these fluctuations:

$left[(mathbf{h}-mathbf{0})^{T}(mathbf{h}-mathbf{1})^{T}mathbf{W}^{2} (mathbf{v}-mathbf{0})(mathbf{v}-mathbf{1})right]$

where W is the Energy weight matrix.

Unlike normal RBMs, here is we work in an Interaction Ensemble, so the hidden and visible units become hidden and visible magnetizations:

$mathbf{v}leftrightarrowmathbf{m}^{v},;mathbf{h}leftrightarrowmathbf{m}^{h}$

To simplify (or confuse?) the presentations here, I don’t write magnetizations (until the Appendix).

The corrections make sense under the stationarity constraints, that the Extended Mean Field RBM Free Energy ($F^{EMF}$) is at a critical point

$left[dfrac{dF^{EMF}}{dmathbf{v}},dfrac{dF^{EMF}}{dmathbf{h}}right]=0$,

That is, small changes in the activations $(mathbf{v},mathbf{h})$ do not change the Free Energy.

We will show that we can write

$F^{EMF}=S+beta;F^{MF}+frac{1}{2}beta^{2}F^{Onsager}+cdots$

as a Taylor series in $beta$, the inverse Temperature, where $S$ is the Entropy

$S=mathbf{v};ln(mathbf{v})+(mathbf{1}-mathbf{v});ln(mathbf{1}-mathbf{v})+mathbf{h};ln(mathbf{h})+(mathbf{1}-mathbf{h});ln(mathbf{1}-mathbf{h})$,

$F^{MF}$ is the standard, mean field RBM Free energy

$F=mathbf{a}^{T}mathbf{v}+mathbf{b}^{T}mathbf{h}+mathbf{v}^{T}mathbf{W}mathbf{h}$,

and $F^{Onsager}$ is the Onsager correction

$F^{Onsager}=left[(mathbf{h}-mathbf{h}^{2})^{T}mathbf{W}^{2} (mathbf{v}-mathbf{v}^{2})right]$.

Given the expressions for the Free Energy, we must now evaluate it.

#### Thouless-Anderson-Palmer TAP Theory

The Taylor series above is a result of the TAP theory — the Thouless-Anderson-Palmer approach developed for spin glasses.

The TAP theory is outlined in the Appendix; here it is noted that

#### Temperature Dependence and Weight Decay

Being a series in inverse Temperature $beta$, the theory applies at low $beta$, or high Temperature.   For fixed $beta$, this also corresponds to small weights W.

Specifically, the expansion applies at Temperatures above the glass transition–a concept which I describe in a recent video blog.

Here, to implement the EMF_RBM, we set

$beta=1;;$,

and, instead, apply weight decay to keep the weights W from exploding

$mathbf{dW}rightarrowmathbf{dW}+deltaVertmathbf{W}Vert$

where $Vertmathbf{W}Vert$ may be an L1 or L2 norm.

Weight Decay acts to keep the Temperature high.

#### RBMs with explicit Temperature

Early RBM computational models were formulated using statistical mechanics  (see the Appendix) language, and so included a Temperature parameter, and were solved using techniques like simulated annealing and the (mean field) TAP equations (described below).

Adding  Temperature allowed the system to ‘jump’ out of the spurious local minima.  So any usable model required a non-zero Temp, and/or some scheme to avoid local minima that generalized poorly.  (See:  Learning Deep Architectures for AI, by Bengio)

These older approaches did not work well –then — so Hinton proposed the Contrastive Divergence (CD) algorithm.  Note that researchers struggled for some years to ‘explain’ what optimization problem CD actually solves.

More that recent work on Temperature Bases RBMs also suggests that higher T solutions perform better, and that

“temperature is an essential parameter controlling the selectivity of the firing neurons in the hidden layer.”

#### Training without Sampling:  Fixed Point equations

Standard RBM training approximates the (unconstrained) Free Energy, F=ln Z, in the mean field approximation, using (one or more steps of) Gibbs Sampling.  This is usually implemented as Contrastive Divergence (CD), or Persistent Contrastive Divergence (PCD).

Using techniques of statistical mechanics, however, it is possible to train an RBM directly, without sampling, by solving a set of deterministic fixed point equations.

Indeed, this approach clarifies how to view an RBM as solving a (determinisitic) fixed point equation of the form

$f(mathbf{X})rightarrowmathbf{X}$

Consider each step, at at fixed ($mathbf{W}, mathbf{a}, mathbf{b}$), as a Quasi-Stationary system, which is close to equilibrium, but we don’t need to evaluate ln Z(v,h) exactly.

We can use the stationary conditions to derive a pair of coupled, non-linear equations

$v_{i}simeqsigmaleft[a_{i}+underset{j}{sum}w_{i,j}h_{j}-(v_{i}-frac{1}{2})w^{2}_{i,j}(h_{j}-(h_{j})^{2})+cdotsright]$

$h_{i}simeqsigmaleft[b_{i}+underset{j}{sum}w_{i,j}v_{j}-(h_{i}-frac{1}{2})w^{2}_{i,j}(v_{j}-(v_{j})^{2})+cdotsright]$

They extend the standard formula of sigmoid linear activations $sigma$ with additional, non-linear, inter-layer interactions.

They differs significantly from (simple) Deep Learning activation functions because the activation for each layer explicitly includes information from other layers.

This extension couples the mean ($v_{i}-frac{1}{2}$) and total fluctuations ($v_{j}-(v_{j})^{2}$) between layers.  Higher order correlations could also be included, even to infinite order, using techniques from field theory.

We can not satisfy both equations simultaneously, but we can satisfy each condition $left[dfrac{dF^{EMF}}{dmathbf{v}},dfrac{dF^{EMF}}{dmathbf{h}}right]=0$ individually, letting us write a set of recursion relations

$v_{i}[t+1]leftarrowsigmaleft[a_{i}+underset{j}{sum}w_{i,j}h_{j}[t]-(v_{i}[t]-frac{1}{2})w^{2}_{i,j}(h_{j}[t]-(h_{j}[t])^{2})right]$

$h_{i}[t+1]leftarrowsigmaleft[b_{i}+underset{j}{sum}w_{i,j}v_{j}[t+1]-(h_{i}[t+1]-frac{1}{2})w^{2}_{i,j}(v_{j}[t+1]-(v_{j}[t+1])^{2})right]$

These fixed point equations converge to the stationary solution, leading to a local equilibrium.  Like Gibbs Sampling, however,  we only need a few iterations (say t= to 5). Unlike Sampling, however, the EMF RBM is deterministic.

#### Implementation

https://github.com/charlesmartin14/emf-rbm/

If there is enough interest, I can do a pull request on sklearn to include it.

The next blog post will demonstrate how the python code in action.

### Appendix

#### the Statistical Mechanics of Inference

##### Early work on Hopfield Associative Memories

Most older physicists will remember the Hopfield model.  They peaked in 1986, although interesting work continued into the late 90s (when I was a post doc).

Originally, Boltzmann machines were introduced as a way to avoid spurious local minima while including ‘hidden’ features into Hopfield Associative Memories (HAM).

Hopfield himself was a theoretical chemist, and his simple model HAMs were of  great interest to theoretical chemists and physicists.

Hinton explains Hopfield nets in his on-line lectures on Deep Learning.

The Hopfield Model is a kind of spin glass, which acts like a ‘memory’ that can recognize ‘stored patterns’.  It was originally developed as a quasi-stationary solution of more complex, dynamical models of neuronal firing patterns (see the Cowan-Wilson model).

Early theoretical work on HAMs studied analytic approximations to ln Z to compute their capacity ($rho/N$), and their phase diagram.  The capacity is simply the number of patterns  $rho$ a network of size N can memorize without getting confused.

The Hopfield model was traditionally run at T=0.

Looking at the T=0 line, at extremely low capacity, the system has stable mixed states that correspond to ‘frozen’ memories.  But this is very low capacity, and generally unusable.  Also, when the capacity too large, $dfrac{p}{N}ge 0.138$, (which is really not that large),  the system abruptly breaks down completely.

There is a small window of capacity, $dfrac{p}{N}le~0.05$,with stable pattern equilibria, dominated by frozen out, spin glass states.   The problem is, for any realistic system, with correlated data, the system is dominated by spurious local minima which look like low energy spin glass states.

So the Hopfield model suggested that

glass states can be useful minima, but

we want to avoid low energy  (spurious) glassy states.

One can try to derive direct mapping between Hopfield Nets and RBMs (under reasonable assumptions). Then the RBM capacity is proportional to the number of  Hidden nodes.    After that, the analogies stop.

The intuition about RBMs is different since (effectively) they operate at a non-zero Temperature.  Additionally, it is unclear to this blogger if the proper description of deep learning is a mean field spin glass, with many useful local minima, or a strongly correlated system, which may behave very differently, and more like a funneled energy landscape.

Thouless-Anderson-Palmer Theory of Spin Glasses

The TAP theory is one of the classic analytic tools used to study spin glasses and even the Hopfield Model.

We will derive the EMF RBM method following the Thouless-Anderson-Palmer (TAP) approach to spin glasses.

On a side note he TAP method introduces us to 2 more Nobel Laureates:

• David J Thouless, 2016 Nobel Prize in Physics
• Phillip W Anderson, 1977 Nobel Prize in Physics

The TAP theory, published in 1977, presented a formal approach to study the thermodynamics of mean field spin glasses.  In particular, the TAP theory provides an expression for the average spin

$langle s_{i}rangle=tanh(C+beta;MeanField)-beta^{2};Onsager)$

where $tanh()$ is like an activation function, C is a constant, and the MeanField and Onsager terms are like our terms above.

In 1977, they argued that the TAP approach would hold for all Temperatures (and external fields), although it was only proven until 25 years later by Talagrand.  It is these relatively new, rigorous approaches that are cited by Deep Learning researchers like LeCun , Chaudhari, etc.    But many of the cited results have been suggested using the TAP approach.  In particular, the structure of the Energy Landscape, has been understood looking at the stationary points of the TAP free energy.

More importantly, the TAP approach can be operationalized, as a new RBM solver.

#### The TAP Equations

We start with the RBM Free Energy $F$.

Introduce an ‘external field’ $mathbf{q}={q_{i}}$ which couples to the spins, adding a linear term to the Free Energy

$beta F[mathbf{q}]=ln;sumlimits_{mathbf{s}}exp(-beta[ E(mathbf{s})-mathbf{q}^{T}mathbf{s}])$

Physically, $mathbf{q}$ would be an external magnetic field which drives the system out-of-equilibrium.

As is standard in statistical mechanics, we take the Legendre Transform,  in terms of a set of conjugate variables $mathbf{m}={m_{ij}}$.  These are the magnetizations of each spin under the applied field,  and describe how the spins behave outside of equilibrium

$-beta;Gamma[mathbf{m}]=-beta;underset{mathbf{q}}{max}left[F[mathbf{q}]+mathbf{m}^{T}mathbf{q}right]$

The transform which effectively defines a new interaction ensemble $Gamma[mathbf{m}]$.  We now set $mathbf{q}=0$, and note

$-beta F=beta;F[mathbf{q}=0]=beta;underset{mathbf{q}}{min}left[Gamma[mathbf{m}]right]=-beta;Gamma[mathbf{m}^{*}]$

Define an interaction Free Energy to describe the interaction ensemble

$A(beta,mathbf{m})=beta;Gamma[mathbf{m}]$

which equals the original Free Energy when $mathbf{q}$.

Note that because we have visible and hidden spins (or nodes), we will identify magnetizations for each

$mathbf{m}=[mathbf{m^{v},m^{h}}]$

Now, recall we want to avoid the glassy phase; this means we keep the Temperature high.   Or $beta$ low.

We form a low Taylor series expansion $beta$  in the new ensemble

$A(beta,mathbf{m})=A(0,mathbf{m})+betadfrac{partial A(beta,mathbf{m})}{partialbeta}Bigvert_{beta=0}+dfrac{beta}{2}dfrac{partial^{2} A(beta,mathbf{m})}{partialbeta^{2}}Bigvert_{beta=0}+cdots$

which, at low order in $beta$, we expect to be reasonably accurate even away from equilibrium, at least at high Temp.

This leads to an order-by-order expansion for the Free Energy $A(beta,mathbf{m})$.  The first order ($mathcal{O}(beta)$) correction is the mean field term.  The second order term ($mathcal{O}(beta^{2})$) is the Onsager correction.

##### Second Order Free Energy

Upto $mathcal{O}(beta^{2})$, we have

$A(0,mathbf{m^{v}},mathbf{m^{h}})=S(mathbf{m^{v}})+S(mathbf{m^{h}})$

$dfrac{partial}{partialbeta}A(beta,mathbf{m^{v}},mathbf{m^{h}})=mathbf{a^{T}m^{v}}+mathbf{b^{T}m^{h}}+mathbf{(m^{v})^{T}Wm^{h}}$

$dfrac{partial^{2}}{partialbeta^{2}}A(beta,mathbf{m^{v}},mathbf{m^{h}})=frac{1}{2}mathbf{(m^{v}-(m^{v})^{2})^{T}W^{2}(m^{h}-(m^{h})^{2})}$

or

$A(0,mathbf{m^{v}})=underset{i}{sum}m^{v};ln;m^{v}+(1-m^{v})ln(1-m^{v})$

$A(0,mathbf{m^{h}})=underset{i}{sum}m^{h};ln;m^{h}+(1-m^{h})ln(1-m^{h})$

$dfrac{partial}{partialbeta}A(beta,mathbf{m^{v}},mathbf{m^{h}})=underset{i}{sum}a_{i}m^{v}_{i}+underset{i}{sum}b_{i}m^{h}_{i}+underset{i,j}{sum}m^{v}_{i}w_{i,j}m^{h}_{j}$

$dfrac{partial^{2}}{partialbeta^{2}}A(beta,mathbf{m^{v}},mathbf{m^{h}})=frac{1}{2}underset{i,j}{sum}(m^{v}_{i}-(m^{v})^{2}_{i})w^{2}_{i,j}(m^{h}_{j}-(m^{h})^{2}_{j})$

##### Stationary Condition

Rather than assume equilibrium, we assume that, at each step during inference –at fixed ($mathbf{W}, mathbf{a}, mathbf{b}$–the system satisfies a quasi-stationary condition.  Each step reaches a a local saddle point in phase space, s.t.

$dfrac{dA}{dmathbf{m}}=left[dfrac{dA}{dmathbf{m^{v}}},dfrac{dA}{dmathbf{m^{h}}}right]=0$

##### Stationary Magnetizations

Applying the stationary conditions lets us write coupled equations for the individual magnetizations that effectively define the (second order), high Temp, quasi-equilibrium states

$m^{v}_{i}simeqsigmaleft[a_{i}+underset{j}{sum}w_{i,j}m^{h}_{j}-(m^{v}_{i}-frac{1}{2})w^{2}_{i,j}(m^{h}_{j}-(m^{h}_{j})^{2})right]$

$m^{h}_{i}simeqsigmaleft[b_{i}+underset{j}{sum}w_{i,j}m^{v}_{j}-(m^{h}_{i}-frac{1}{2})w^{2}_{i,j}(m^{v}_{j}-(m^{v}_{j})^{2})right]$

Notice that the $m^{v}_{i},m^{h}_{i}$ resemble the RBM conditional probabilities; in fact, at first order in $beta$, they are the same.

##### Mean Field Theory

$beta=1$ gives a mean field theory, and

$P(h|v)=m^{h}_{i}$

$P(v|h)=m^{v}_{i}$

And in the late 90’s, mean field TAP theory was attempted, unsuccessfully,  to create an RBM solver.

##### Fixed Point Equations

At second order, the magnetizations $m^{v}_{i},m^{h}_{i}$ are coupled through the Onsager corrections.  To solve them, we can write down the fixed point equations, shown above.

### Higher Order Corrections

We can include higher order correction the Free Energy $A$ by including more terms in the Taylor Series. This is called a Plefka expansion.  The terms can be represented using diagrams

Plefka derived these terms in 1982, although it appears he only published up to the Onsager correction; a recent paper shows how to obtain all high order terms.

The Diagrammatic expansion appears not to have been fully worked out, and is only sketched above.

I can think of at least 3 ways to include these higher terms:

• Include all terms at a higher order in $beta$ .   The Sphinx Boltzmann.jl packages includes all terms up to $mathcal{O}(beta^{3})$ .  This changes both the Free Energy and the associated fixed point equations for the magnetizations.
• Resum all diagrams of a certain kind. This is an old-school renormalization procedure, and it would include an $infty$ number of terms.   This requires working out the analytic expressions for some class of diagrams, such as all circle (RPA-like) diagrams

This is similar, in some sense, to the infinite RBM by Larochelle, which uses an resummation trick to include an infinite number of  Hidden nodes.

• Use a low order Pade Approximant, to improve the Taylor Series.   The idea is to include higher order terms implicitly by expanding the Free Energy as a ratio of polynomial functions  P, Q

$A^{P}(beta,mathbf{m})=dfrac{P(beta,mathbf{m})}{Q(beta,mathbf{m})}$  , where

$A(beta,mathbf{m}) -A^{P}(beta,mathbf{m})=0$,

Obviously there are lots of interesting things to try.

### Beyond Binary Spins; Real-Valued RBMs

The current python EMF_RBM only treats binary data, just like the scikit learn BernoulliRBM.  So for say MNIST, we have to use the binarized MNIST.

There is some advantage to using Binarized Neural Networks on GPUs.

Still, a non-binary RBM may be useful.  Tremel et. al. have suggested how to use real-valued data in the EMF_RBM, although in the context of Compressed Sensing Using Generalized Boltzmann Machines.