Restricted Boltzmann Machines (RBMs) are like the H atom of Deep Learning.

They are basically a solved problem, and while of academic interest, not really used in complex modeling problems. They were, up to 10 years ago, used for pretraining deep supervised nets. Today, we can train very deep, supervised nets directly.

RBMs are the foundation of unsupervised deep learning–

an unsolved problem.

RBMs appear to outperform Variational Auto Encoders (VAEs) on simple data sets like the Omniglot set–a data set developed for one-shot learning and used in deep learning research.

Omniglot Dataset

RBM research continues in areas like semi-supervised learning with deep hybrid architectures, Temperature dependence,  infinitely deep RBMs, etc.

Many of the basic concepts of Deep Learning are found in RBMs.

Sometimes clients ask, “how is Physical Chemistry related to Deep Learning?”

In this post, I am going to discuss a recent advance in RBM theory based on ideas from theoretical condensed matter physics and physical chemistry,

the Extended Mean Field Restricted Boltzmann Machine: EMF_RBM

(see: Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy)

[Along the way, we will encounter several Nobel Laureates, including the physicists David J Thouless (2016) and Philip W. Anderson (1977), and the physical chemist Lars Onsager (1968).]

RBMs are pretty simple, and easily implemented from scratch.  The original EMF_RBM is in Julia; I have ported EMF_RBM to python, in the style of the scikit-learn BernoulliRBM package.

Show me the code

https://github.com/charlesmartin14/emf-rbm/blob/master/EMF_RBM_Test.ipynb

Theory

We examined RBMs in the last post on Cheap Learning: Partition Functions and RBMs. I will build upon that here, within the context of statistical mechanics.

RBMs are defined by the Energy function

$E(\mathbf{v},\mathbf{h})=-\mathbf{v}^{T}\mathbf{W}\mathbf{h}-\mathbf{v}^{T}\mathbf{a}-\mathbf{b}^{T}\mathbf{h}$
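For concreteness, here is a minimal numpy sketch of this Energy function for a single binary configuration (the toy sizes and random weights are just for illustration; this is not part of the EMF_RBM code):

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    """E(v, h) = - v^T W h - v^T a - b^T h"""
    return - v @ W @ h - v @ a - b @ h

# toy example: 6 visible and 4 hidden binary units
rng = np.random.RandomState(0)
W = 0.01 * rng.randn(6, 4)
a, b = np.zeros(6), np.zeros(4)
v = rng.randint(0, 2, size=6).astype(float)
h = rng.randint(0, 2, size=4).astype(float)
print(rbm_energy(v, h, W, a, b))
```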

To train an RBM, we minimize the (negative) log likelihood,

$\mathcal{L}=-F^{c}(\mathbf{v})-F(\mathbf{v},\mathbf{h})$

the sum of the clamped and (actual) Free Energies, where

$F^{c}=\ln\;\underset{\mathbf{h}}{\sum}\;e^{-E(\mathbf{v},\mathbf{h})}$

and

$F=\ln\;Z(\mathbf{v},\mathbf{h})=\ln\;\underset{\mathbf{v},\mathbf{h}}{\sum}\;e^{-E(\mathbf{v},\mathbf{h})}\;.$

The sums range over a space of $\mathcal{O}(2^{N})$ configurations,

$\mathbf{v}\in\{0,1\}^{N_v},\;\mathbf{h}\in\{0,1\}^{N_h}$

which is intractable in most cases.

Training an RBM requires computing the log Free Energy; this is hard.

When training an RBM,  we

  1. take a few ‘steps toward equilibration’, to approximate $F$
  2. take a gradient step, $\dfrac{\partial F}{\partial w_{i,j}}$, to get $\mathbf{W}, \mathbf{a}, \mathbf{b}$
  3. regularize $\mathbf{W}$ (i.e. by weight decay)
  4. update $\mathbf{W}, \mathbf{a}, \mathbf{b}$
  5. repeat until some stopping criterion is met

We don’t include label information, although a trained RBM can provide features for a down-stream classifier.
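Schematically, the training loop looks something like this in python/numpy. This is a sketch of generic RBM training following steps 1-5, with a deterministic mean field negative phase standing in for step 1 (CD/PCD would sample instead, and the EMF_RBM iterates fixed point equations, described below); it is not the actual emf-rbm code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(X, n_hidden, n_epochs=5, lr=0.05, decay=1e-4, n_steps=3, seed=0):
    """Schematic RBM training loop following steps 1-5 above."""
    rng = np.random.RandomState(seed)
    n_visible = X.shape[1]
    W = 0.01 * rng.randn(n_visible, n_hidden)
    a, b = np.zeros(n_visible), np.zeros(n_hidden)

    for _ in range(n_epochs):
        for v0 in X:
            h0 = sigmoid(b + v0 @ W)                   # clamped (data) phase
            # 1. a few 'steps toward equilibration' to approximate F
            v, h = v0.astype(float), h0.copy()
            for _ in range(n_steps):
                v = sigmoid(a + W @ h)
                h = sigmoid(b + v @ W)
            dW = np.outer(v0, h0) - np.outer(v, h)     # 2. gradient step dF/dW
            dW -= decay * W                            # 3. regularize W (weight decay)
            W += lr * dW                               # 4. update W, a, b
            a += lr * (v0 - v)
            b += lr * (h0 - h)
            # 5. repeat until some stopping criterion is met
    return W, a, b
```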

Extended Mean Field Theory

The Extended Mean Field (EMF)  RBM is a straightforward application of known statistical mechanics theories.

There are, literally, thousands of papers on spin glasses.

The EMF RBM is a great example of how to operationalize spin glass theory.

Mean Field Theory

The Restricted Boltzmann Machine has a very simple Energy function, which makes it very easy to factorize the partition function $Z$, as explained by Hugo Larochelle, to obtain the conditional probabilities

$p(h_{i}=1|\mathbf{v})=\sigma(b_{i}+\mathbf{W}_{i}\mathbf{v})$

$p(v_{i}=1|\mathbf{h})=\sigma(a_{i}+\mathbf{W}_{i}\mathbf{h})$

The conditional probabilities let us apply Gibbs Sampling, which is simply

  • hold $\mathbf{v}$ fixed, sample $p(\mathbf{h}|\mathbf{v})$
  • hold $\mathbf{h}$ fixed, sample $p(\mathbf{v}|\mathbf{h})$
  • repeat for 1 or more equilibration steps
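In numpy, one round of block Gibbs sampling is simply (a minimal sketch, with $\mathbf{W}$, $\mathbf{a}$, $\mathbf{b}$ and sigmoid as above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    # hold v fixed, sample h ~ p(h|v)
    h = (rng.rand(W.shape[1]) < sigmoid(b + v @ W)).astype(float)
    # hold h fixed, sample v ~ p(v|h)
    v = (rng.rand(W.shape[0]) < sigmoid(a + W @ h)).astype(float)
    return v, h

# repeat for 1 or more equilibration steps:
# rng = np.random.RandomState(0)
# for _ in range(n_steps):
#     v, h = gibbs_step(v, W, a, b, rng)
```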

In statistical mechanics, this is called a mean field theory.  This means that the Free Energy (in $\mathbf{v}$) can be written as a simple linear average over the hidden units

$F^{RBM}=\mathbf{a}^{T}\mathbf{v}+\mathbf{b}^{T}\mathbf{h}+\mathbf{v}^{T}(\mathbf{W}\mathbf{h})\;,$

where $\mathbf{W}\mathbf{h}$ is the mean field of the hidden units.

At high Temp., for a spin glass, a mean field model seems very sensible because the spins (i.e. activations) become uncorrelated

$\langle s_{i}\cdots s_{j}\rangle\;\underset{T\rightarrow\infty}{\longrightarrow}\;\langle s_{i}\rangle\cdots\langle s_{j}\rangle$

This is non-trivial to prove.

Theoreticians use mean field models like the p-spin spherical spin glass to study deep learning because of their simplicity.  Computationally, we frequently need more.

How can we go beyond mean field theory?

The Onsager Correction

Onsager was awarded the 1968 Nobel Prize in Chemistry for the development of the Onsager Reciprocal Relations, sometimes called the ‘4th law of Thermodynamics’.

The Onsager relations provide the theory to treat thermodynamic systems that are in a quasi-stationary, local equilibrium.

Onsager was the first to show how to relate the correlations in the fluctuations to the linear response.  And by tying a sequence of quasi-stationary systems together, we can describe an irreversible process…

..like learning.  And this is exactly what we need to train an RBM.

In an RBM, the fluctuations are variations in the hidden and visible nodes.


In a BernoulliRBM, the activations can be 0 or 1, so the fluctuation vectors are

$(\mathbf{h}-\mathbf{0}),\;(\mathbf{h}-\mathbf{1})\;\;\mathrm{and}\;\;(\mathbf{v}-\mathbf{0}),\;(\mathbf{v}-\mathbf{1})$

The simplest correction to the mean field Free Energy, at each step in training, is the correlation between these fluctuations:

$\left[(\mathbf{h}-\mathbf{0})^{T}(\mathbf{h}-\mathbf{1})^{T}\mathbf{W}^{2}\,(\mathbf{v}-\mathbf{0})(\mathbf{v}-\mathbf{1})\right]$

where $\mathbf{W}$ is the Energy weight matrix.

Unlike normal RBMs, here we work in an Interaction Ensemble, so the hidden and visible units become hidden and visible magnetizations:

$\mathbf{v}\leftrightarrow\mathbf{m}^{v},\;\mathbf{h}\leftrightarrow\mathbf{m}^{h}$

To simplify (or confuse?) the presentation here, I don’t write magnetizations (until the Appendix).

The corrections make sense under the stationarity constraint that the Extended Mean Field RBM Free Energy ($F^{EMF}$) is at a critical point

$\left[\dfrac{dF^{EMF}}{d\mathbf{v}},\dfrac{dF^{EMF}}{d\mathbf{h}}\right]=0\;,$

That is, small changes in the activations $(\mathbf{v},\mathbf{h})$ do not change the Free Energy.

We will show that we can write

$F^{EMF}=S+\beta\;F^{MF}+\frac{1}{2}\beta^{2}F^{Onsager}+\cdots$

as a Taylor series in $\beta$, the inverse Temperature, where $S$ is the Entropy

$S=\mathbf{v}\;\ln(\mathbf{v})+(\mathbf{1}-\mathbf{v})\;\ln(\mathbf{1}-\mathbf{v})+\mathbf{h}\;\ln(\mathbf{h})+(\mathbf{1}-\mathbf{h})\;\ln(\mathbf{1}-\mathbf{h})\;,$

$F^{MF}$ is the standard, mean field RBM Free energy

$F^{MF}=\mathbf{a}^{T}\mathbf{v}+\mathbf{b}^{T}\mathbf{h}+\mathbf{v}^{T}\mathbf{W}\mathbf{h}\;,$

and $F^{Onsager}$ is the Onsager correction

$F^{Onsager}=\left[(\mathbf{h}-\mathbf{h}^{2})^{T}\mathbf{W}^{2}\,(\mathbf{v}-\mathbf{v}^{2})\right]\;.$
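As a rough numpy sketch (illustrative only, not the emf-rbm implementation), the three terms can be evaluated directly from the current visible and hidden values, assumed here to lie in (0, 1):

```python
import numpy as np

def emf_free_energy(v, h, W, a, b, beta=1.0, eps=1e-8):
    """F^EMF = S + beta*F^MF + 0.5*beta^2*F^Onsager (second order / TAP)."""
    v, h = np.clip(v, eps, 1 - eps), np.clip(h, eps, 1 - eps)
    S = (v * np.log(v) + (1 - v) * np.log(1 - v)).sum() \
      + (h * np.log(h) + (1 - h) * np.log(1 - h)).sum()       # entropy term
    F_mf = a @ v + b @ h + v @ W @ h                          # mean field term
    F_onsager = (v - v**2) @ (W**2) @ (h - h**2)              # Onsager correction
    return S + beta * F_mf + 0.5 * beta**2 * F_onsager
```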

Given the expressions for the Free Energy, we must now evaluate it.

Thouless-Anderson-Palmer (TAP) Theory

The Taylor series above is a result of the TAP theory — the Thouless-Anderson-Palmer approach developed for spin glasses.


The TAP theory is outlined in the Appendix; here it is noted that

Thouless just shared the 2016 Nobel Prize in Physics (for his work on topological phase transitions).

Temperature Dependence and Weight Decay

Being a series in the inverse Temperature $\beta$, the theory applies at low $\beta$, or high Temperature.  For fixed $\beta$, this also corresponds to small weights $\mathbf{W}$.

Specifically, the expansion applies at Temperatures above the glass transition–a concept which I describe in a recent video blog.

Here, to implement the EMF_RBM, we set

$\beta=1\;,$

and, instead, apply weight decay to keep the weights $\mathbf{W}$ from exploding

$\mathbf{dW}\rightarrow\mathbf{dW}+\delta\Vert\mathbf{W}\Vert$

where $\Vert\mathbf{W}\Vert$ may be an L1 or L2 norm.

Weight Decay acts to keep the Temperature high.
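In code, the decay is just an extra term in the weight gradient. A minimal sketch, with delta playing the role of the decay strength:

```python
import numpy as np

def decayed_gradient(dW, W, delta=1e-4, norm="L2"):
    # damp the weights to keep them small, i.e. keep the effective Temperature high
    if norm == "L2":
        return dW - delta * W           # gradient of (delta/2) * ||W||_2^2
    return dW - delta * np.sign(W)      # (sub)gradient of delta * ||W||_1
```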

RBMs with explicit Temperature

Early RBM computational models were formulated using the language of statistical mechanics (see the Appendix), and so included a Temperature parameter, and were solved using techniques like simulated annealing and the (mean field) TAP equations (described below).

Adding Temperature allowed the system to ‘jump’ out of spurious local minima.  So any usable model required a non-zero Temp, and/or some scheme to avoid local minima that generalized poorly.  (See: Learning Deep Architectures for AI, by Bengio)

These older approaches did not work well at the time, so Hinton proposed the Contrastive Divergence (CD) algorithm.  Note that researchers struggled for some years to ‘explain’ what optimization problem CD actually solves.

More recent work on Temperature Based RBMs also suggests that higher T performs better, and that

 “temperature is an essential parameter controlling the selectivity of the firing neurons in the hidden layer.”

Training without Sampling: Fixed Point Equations

Standard RBM training approximates the (unconstrained) Free Energy, $F=\ln Z$, in the mean field approximation, using (one or more steps of) Gibbs Sampling.  This is usually implemented as Contrastive Divergence (CD), or Persistent Contrastive Divergence (PCD).

Using techniques of statistical mechanics, however, it is possible to train an RBM directly, without sampling, by solving a set of deterministic fixed point equations.

Indeed, this approach clarifies how to view an RBM as solving a (deterministic) fixed point equation of the form

$f(\mathbf{X})\rightarrow\mathbf{X}$

Consider each step, at fixed $(\mathbf{W}, \mathbf{a}, \mathbf{b})$, as a Quasi-Stationary system which is close to equilibrium, but for which we don’t need to evaluate $\ln Z(\mathbf{v},\mathbf{h})$ exactly.

We can use the stationary conditions to derive a pair of coupled, non-linear equations

$v_{i}\simeq\sigma\left[a_{i}+\underset{j}{\sum}w_{i,j}h_{j}-(v_{i}-\frac{1}{2})\,w^{2}_{i,j}(h_{j}-(h_{j})^{2})+\cdots\right]$

$h_{i}\simeq\sigma\left[b_{i}+\underset{j}{\sum}w_{i,j}v_{j}-(h_{i}-\frac{1}{2})\,w^{2}_{i,j}(v_{j}-(v_{j})^{2})+\cdots\right]$

They extend the standard sigmoid activations $\sigma$ (of a linear function of the units) with additional, non-linear, inter-layer interactions.

They differ significantly from (simple) Deep Learning activation functions because the activation for each layer explicitly includes information from other layers.

This extension couples the mean $(v_{i}-\frac{1}{2})$ and total fluctuations $(v_{j}-(v_{j})^{2})$ between layers.  Higher order correlations could also be included, even to infinite order, using techniques from field theory.

We can not satisfy both equations simultaneously, but we can satisfy each condition $\left[\dfrac{dF^{EMF}}{d\mathbf{v}},\dfrac{dF^{EMF}}{d\mathbf{h}}\right]=0$ individually, letting us write a set of recursion relations

$v_{i}[t+1]\leftarrow\sigma\left[a_{i}+\underset{j}{\sum}w_{i,j}h_{j}[t]-(v_{i}[t]-\frac{1}{2})\,w^{2}_{i,j}(h_{j}[t]-(h_{j}[t])^{2})\right]$

$h_{i}[t+1]\leftarrow\sigma\left[b_{i}+\underset{j}{\sum}w_{i,j}v_{j}[t+1]-(h_{i}[t+1]-\frac{1}{2})\,w^{2}_{i,j}(v_{j}[t+1]-(v_{j}[t+1])^{2})\right]$

These fixed point equations converge to the stationary solution, leading to a local equilibrium.  Like Gibbs Sampling, we only need a few iterations (say $t=3$ to $5$); unlike sampling, however, the EMF RBM is deterministic.
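A minimal numpy sketch of this recursion (with $\mathbf{W}$, $\mathbf{a}$, $\mathbf{b}$ as before; a simplified illustration, not the actual emf-rbm code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def emf_fixed_point(v, h, W, a, b, n_iter=3):
    """A few deterministic EMF/TAP fixed point iterations (t = 3 to 5)."""
    W2 = W ** 2
    for _ in range(n_iter):
        # v[t+1] uses h[t]; h[t+1] then uses the freshly updated v[t+1]
        v = sigmoid(a + W @ h - (v - 0.5) * (W2 @ (h - h**2)))
        h = sigmoid(b + v @ W - (h - 0.5) * ((v - v**2) @ W2))
    return v, h
```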

Implementation

https://github.com/charlesmartin14/emf-rbm/

If there is enough interest, I can do a pull request on sklearn to include it.

The next blog post will demonstrate the python code in action.
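As a preview of the scikit-learn style the port follows, here is the analogous usage with sklearn's own BernoulliRBM; the EMF_RBM class in the repo is intended to be used the same way (treat that equivalence as an assumption until the next post):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
# from emf_rbm import EMF_RBM   # hypothetical import of the drop-in replacement

# binarized toy data
X = (np.random.RandomState(0).rand(500, 64) > 0.5).astype(float)

rbm = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=10, random_state=0)
rbm.fit(X)
H = rbm.transform(X)   # hidden-unit features for a down-stream classifier
```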

Appendix

The Statistical Mechanics of Inference

Early work on Hopfield Associative Memories

Most older physicists will remember the Hopfield model.  Interest in Hopfield models peaked in 1986, although interesting work continued into the late 90s (when I was a post doc).

Originally, Boltzmann machines were introduced as a way to avoid spurious local minima while including ‘hidden’ features into Hopfield Associative Memories (HAM).

Hopfield himself was a theoretical chemist, and his simple model HAMs were of  great interest to theoretical chemists and physicists.

Hinton explains Hopfield nets in his on-line lectures on Deep Learning.

The Hopfield Model is a kind of spin glass, which acts like a ‘memory’ that can recognize ‘stored patterns’.  It was originally developed as a quasi-stationary solution of more complex, dynamical models of neuronal firing patterns (see the Wilson-Cowan model).

Early theoretical work on HAMs studied analytic approximations to $\ln Z$ to compute their capacity $(p/N)$, and their phase diagram.  The capacity is simply the number of patterns $p$ a network of size $N$ can memorize without getting confused.

Phase diagram of a Hopfield Network

The Hopfield model was traditionally run at T=0.

Looking at the $T=0$ line, at extremely low capacity, the system has stable mixed states that correspond to ‘frozen’ memories.  But this is very low capacity, and generally unusable.  Also, when the capacity is too large, $\dfrac{p}{N}\ge 0.138$ (which is really not that large), the system abruptly breaks down completely.

There is a small window of capacity, $\dfrac{p}{N}\lesssim 0.05$, with stable pattern equilibria, dominated by frozen out, spin glass states.  The problem is, for any realistic system with correlated data, the system is dominated by spurious local minima which look like low energy spin glass states.

So the Hopfield model suggested that

glass states can be useful minima, but

we want to avoid low energy  (spurious) glassy states.

One can try to derive a direct mapping between Hopfield Nets and RBMs (under reasonable assumptions). Then the RBM capacity is proportional to the number of Hidden nodes.  After that, the analogies stop.

The intuition about RBMs is different since (effectively) they operate at a non-zero Temperature.  Additionally, it is unclear to this blogger if the proper description of deep learning is a mean field spin glass, with many useful local minima, or a strongly correlated system, which may behave very differently, and more like a funneled energy landscape.

Thouless-Anderson-Palmer Theory of Spin Glasses

The TAP theory is one of the classic analytic tools used to study spin glasses and even the Hopfield Model.

We will derive the EMF RBM method following the Thouless-Anderson-Palmer (TAP) approach to spin glasses.

On a side note, the TAP method introduces us to 2 more Nobel Laureates:

  • David J Thouless, 2016 Nobel Prize in Physics
  • Philip W Anderson, 1977 Nobel Prize in Physics

The TAP theory, published in 1977, presented a formal approach to study the thermodynamics of mean field spin glasses.  In particular, the TAP theory provides an expression for the average spin

$\langle s_{i}\rangle=\tanh\left(C+\beta\;\mathrm{MeanField}-\beta^{2}\;\mathrm{Onsager}\right)$

where $\tanh()$ is like an activation function, $C$ is a constant, and the MeanField and Onsager terms are like our terms above.

In 1977, they argued that the TAP approach would hold for all Temperatures (and external fields), although it was only proven some 25 years later by Talagrand.  It is these relatively new, rigorous approaches that are cited by Deep Learning researchers like LeCun, Chaudhari, etc.  But many of the cited results had already been suggested using the TAP approach.  In particular, the structure of the Energy Landscape has been understood by looking at the stationary points of the TAP free energy.

More importantly, the TAP approach can be operationalized, as a new RBM solver.

The TAP Equations

We start with the RBM Free Energy $F$.

Introduce an ‘external field’ $\mathbf{q}=\{q_{i}\}$ which couples to the spins, adding a linear term to the Free Energy

$\beta F[\mathbf{q}]=\ln\;\sum\limits_{\mathbf{s}}\exp(-\beta[E(\mathbf{s})-\mathbf{q}^{T}\mathbf{s}])$

Physically, $\mathbf{q}$ would be an external magnetic field which drives the system out-of-equilibrium.

As is standard in statistical mechanics, we take the Legendre Transform, in terms of a set of conjugate variables $\mathbf{m}=\{m_{i}\}$.  These are the magnetizations of each spin under the applied field, and describe how the spins behave outside of equilibrium

$-\beta\;\Gamma[\mathbf{m}]=-\beta\;\underset{\mathbf{q}}{\max}\left[F[\mathbf{q}]+\mathbf{m}^{T}\mathbf{q}\right]$

The transform effectively defines a new interaction ensemble $\Gamma[\mathbf{m}]$.  We now set $\mathbf{q}=0$, and note

$-\beta F=-\beta\;F[\mathbf{q}=0]=-\beta\;\underset{\mathbf{m}}{\min}\left[\Gamma[\mathbf{m}]\right]=-\beta\;\Gamma[\mathbf{m}^{*}]$

Define an interaction Free Energy to describe the interaction ensemble

$A(\beta,\mathbf{m})=\beta\;\Gamma[\mathbf{m}]$

which equals the original Free Energy when $\mathbf{q}=0$.

Note that because we have visible and hidden spins (or nodes), we will identify magnetizations for each

$\mathbf{m}=[\mathbf{m}^{v},\mathbf{m}^{h}]$

Now, recall we want to avoid the glassy phase; this means we keep the Temperature high.  Or $\beta$ low.

We form a low order Taylor series expansion in $\beta$ in the new ensemble

$A(\beta,\mathbf{m})=A(0,\mathbf{m})+\beta\,\dfrac{\partial A(\beta,\mathbf{m})}{\partial\beta}\Big\vert_{\beta=0}+\dfrac{\beta^{2}}{2}\,\dfrac{\partial^{2} A(\beta,\mathbf{m})}{\partial\beta^{2}}\Big\vert_{\beta=0}+\cdots$

which, at low order in $\beta$, we expect to be reasonably accurate even away from equilibrium, at least at high Temp.

This leads to an order-by-order expansion for the Free Energy $A(\beta,\mathbf{m})$.  The first order ($\mathcal{O}(\beta)$) correction is the mean field term.  The second order term ($\mathcal{O}(\beta^{2})$) is the Onsager correction.

Second Order Free Energy

Up to $\mathcal{O}(\beta^{2})$, we have

$A(0,\mathbf{m}^{v},\mathbf{m}^{h})=S(\mathbf{m}^{v})+S(\mathbf{m}^{h})$

$\dfrac{\partial}{\partial\beta}A(\beta,\mathbf{m}^{v},\mathbf{m}^{h})=\mathbf{a}^{T}\mathbf{m}^{v}+\mathbf{b}^{T}\mathbf{m}^{h}+(\mathbf{m}^{v})^{T}\mathbf{W}\mathbf{m}^{h}$

$\dfrac{\partial^{2}}{\partial\beta^{2}}A(\beta,\mathbf{m}^{v},\mathbf{m}^{h})=\frac{1}{2}\,(\mathbf{m}^{v}-(\mathbf{m}^{v})^{2})^{T}\mathbf{W}^{2}\,(\mathbf{m}^{h}-(\mathbf{m}^{h})^{2})$

or

$A(0,\mathbf{m}^{v})=\underset{i}{\sum}\;m^{v}_{i}\,\ln\,m^{v}_{i}+(1-m^{v}_{i})\,\ln(1-m^{v}_{i})$

$A(0,\mathbf{m}^{h})=\underset{i}{\sum}\;m^{h}_{i}\,\ln\,m^{h}_{i}+(1-m^{h}_{i})\,\ln(1-m^{h}_{i})$

$\dfrac{\partial}{\partial\beta}A(\beta,\mathbf{m}^{v},\mathbf{m}^{h})=\underset{i}{\sum}a_{i}m^{v}_{i}+\underset{i}{\sum}b_{i}m^{h}_{i}+\underset{i,j}{\sum}m^{v}_{i}w_{i,j}m^{h}_{j}$

$\dfrac{\partial^{2}}{\partial\beta^{2}}A(\beta,\mathbf{m}^{v},\mathbf{m}^{h})=\frac{1}{2}\underset{i,j}{\sum}(m^{v}_{i}-(m^{v}_{i})^{2})\,w^{2}_{i,j}\,(m^{h}_{j}-(m^{h}_{j})^{2})$

Stationary Condition

Rather than assume equilibrium, we assume that, at each step during inference–at fixed $(\mathbf{W}, \mathbf{a}, \mathbf{b})$–the system satisfies a quasi-stationary condition.  Each step reaches a local saddle point in phase space, s.t.

$\dfrac{dA}{d\mathbf{m}}=\left[\dfrac{dA}{d\mathbf{m}^{v}},\dfrac{dA}{d\mathbf{m}^{h}}\right]=0$

Stationary Magnetizations

Applying the stationary conditions lets us write coupled equations for the individual magnetizations that effectively define the (second order), high Temp, quasi-equilibrium states

$m^{v}_{i}\simeq\sigma\left[a_{i}+\underset{j}{\sum}w_{i,j}m^{h}_{j}-(m^{v}_{i}-\frac{1}{2})\,w^{2}_{i,j}(m^{h}_{j}-(m^{h}_{j})^{2})\right]$

$m^{h}_{i}\simeq\sigma\left[b_{i}+\underset{j}{\sum}w_{i,j}m^{v}_{j}-(m^{h}_{i}-\frac{1}{2})\,w^{2}_{i,j}(m^{v}_{j}-(m^{v}_{j})^{2})\right]$

Notice that the $m^{v}_{i},m^{h}_{i}$ resemble the RBM conditional probabilities; in fact, at first order in $\beta$, they are the same.

Mean Field Theory

Truncating at first order in $\beta$ (and setting $\beta=1$) gives the standard mean field theory, where

$P(h_{i}=1|\mathbf{v})=m^{h}_{i}$

$P(v_{i}=1|\mathbf{h})=m^{v}_{i}$

And in the late 90’s, mean field TAP theory was used, unsuccessfully, to try to create an RBM solver.

Fixed Point Equations

At second order, the magnetizations $m^{v}_{i},m^{h}_{i}$ are coupled through the Onsager corrections.  To solve them, we can write down the fixed point equations, shown above.

Higher Order Corrections

We can include higher order corrections to the Free Energy $A$ by including more terms in the Taylor Series. This is called a Plefka expansion.  The terms can be represented using diagrams

Plefka (Diagrammatic) Expansion for the Free Energy

Plefka derived these terms in 1982, although it appears he only published up to the Onsager correction; a recent paper shows how to obtain all higher order terms.

The Diagrammatic expansion appears not to have been fully worked out, and is only sketched above.

I can think of at least 3 ways to include these higher terms:

  • Include all terms at a higher order in $\beta$.  The Sphinx Boltzmann.jl package includes all terms up to $\mathcal{O}(\beta^{3})$.  This changes both the Free Energy and the associated fixed point equations for the magnetizations.
Third Order Corrections
Fourth Order Corrections
  • Resum all diagrams of a certain kind. This is an old-school renormalization procedure, and it would include an infinite number of terms.  This requires working out the analytic expressions for some class of diagrams, such as all circle (RPA-like) diagrams
    All RPA-like (circle) diagrams, to infinite order

    This is similar, in some sense, to the infinite RBM by Larochelle, which uses a resummation trick to include an infinite number of Hidden nodes.

  • Use a low order Pade Approximant, to improve the Taylor Series.  The idea is to include higher order terms implicitly by expanding the Free Energy as a ratio of polynomial functions $P, Q$ (a toy sketch follows below)

$A^{P}(\beta,\mathbf{m})=\dfrac{P(\beta,\mathbf{m})}{Q(\beta,\mathbf{m})}$, where

$A(\beta,\mathbf{m})-A^{P}(\beta,\mathbf{m})=0$ to the order of the Taylor expansion.
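As a toy illustration of the Pade idea (a scalar series in $\beta$ only, using scipy's pade helper; the coefficients are made up, standing in for the entropy, mean field and Onsager terms):

```python
import numpy as np
from scipy.interpolate import pade

# Taylor coefficients in beta: [A(0), dA/dbeta, (1/2) d^2A/dbeta^2, ...]
coeffs = [1.0, -0.8, 0.3, -0.05]

p, q = pade(coeffs, 2)      # [1/2] Pade approximant: P of degree 1, Q of degree 2
beta = 1.0
print(p(beta) / q(beta))    # the rational approximation, evaluated at beta = 1
```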

Obviously there are lots of interesting things to try.

Beyond Binary Spins; Real-Valued RBMs

The current python EMF_RBM only treats binary data, just like the scikit-learn BernoulliRBM.  So for, say, MNIST, we have to use binarized MNIST.

There is some advantage to using Binarized Neural Networks on GPUs.

Still, a non-binary RBM may be useful.  Tramel et al. have suggested how to use real-valued data in the EMF_RBM, although in the context of Compressed Sensing Using Generalized Boltzmann Machines.


