Why does deep and cheap learning work so well?

This is the question posed by a recent article.  Deep Learning seems to require knowing the Partition Function–at least in old fashioned Restricted Boltzmann Machines ().

Here, I will discuss some aspects of this paper,  in the context of RBMs.


We can use RBMs for unsupervised learning, as a clustering algorithm, for pretraining larger nets, and for generating sample data.   Mostly, however, RBMs are an older, toy model useful for understanding unsupervised Deep Learning.


We define an RBM with an Energy Function


and it’s associated Partition Function

Z(mathbf{v},mathbf{h})=sumlimits_{mathbf{v},mathbf{h}} e^{-E(mathbf{v},mathbf{h}})

The joint probability is then


and the probability of  the visible units is computed by marginalizing over the hidden units


Note we also mean the probability of observing the X={v}, given the weights W.


The Likelihood is just the log of the probability


We can break this into 2 parts:


 F(mathbf{v,h}) is just the standard Free Energy


We call F^{c}(mathbf{v}) the clamped Free Energy


because it is like a Free Energy, but with the visible units clamped to the data X.

The clamped F^{c}(mathbf{v}) is easy to evaluate in the RBM formalism, whereas F(mathbf{v,h}) is computationally intractable.

Knowing the Z(mathbf{v,h}) is ‘like’ knowing the equilibrium distribution function, and methods like RBMs appear to approximate ln;Z(mathbf{v,h}) in some form or another.

RBM Training

Training an RBM proceeds iteratively by approximating the Free Energies at each step,


and then updating W with a gradient step


RBMs are usually trained via Contrastive Divergence (CD or PCD).  The Energy function, being quadratic, lets us readily factor Z using a mean field approximation, leading to simple expressions for the conditional probabilities



and the weight update rule

dmathbf{w}=langle mathbf{v}^{T}mathbf{h} rangle_{0}-langlemathbf{v}^{T}mathbf{h}rangle_{infty}

positive and negative phases

RBM codes may use the terminology of positive and negative phases:

positive;;langlemathbf{v,h}rangle^{+}rightarrow negative;;langlemathbf{v,h}rangle^{-}

langlemathbf{v,h}rangle^{+}:  The expectation langlemathbf{v,h}rangle_{0} is evaluated, or clamped, on the data.

langlemathbf{v,h}rangle^{-}:   The expectation langlemathbf{v,h}rangle_{infty} is to be evaluated on the prior distribution p(mathbf{X}|mathbf{W}).  We also say langledotrangle_{infty} is evaluated in the limit of infinite sampling, at the so-called equilibrium distribution.  But we don’t take the infinite limit.

CD approximates langlemathbf{v}^{T}mathbf{h}rangle_{infty} –effectively evaluating the (mean field) Free Energy   by running only 1 (or more) steps of Gibbs Sampling.

So we may see

dmathbf{w}=langle mathbf{v}^{T}mathbf{h} rangle_{0}-langlemathbf{v}^{T}mathbf{h}rangle_{1}

or, more generally, and in some code bases, something effectively like

dmathbf{w}=langle mathbf{v}^{T}mathbf{h} rangle^{+}-langlemathbf{v}^{T}mathbf{h}rangle^{-}

Pseudocode is:

Initialize the positive  (mathbf{v,h})^{+} and negative (mathbf{v,h})^{-}

Run N iterations of:

Run 1 Step of Gibbs Sampling to get the negative  (mathbf{v,h})^{-}:

sample the hiddens given the (current) visibles: p(h^{-}_{i}=1|mathbf{v})

sample the visibles given the hiddens (above): p(v^{-}_{i}=1|mathbf{h})

Calculate the weight gradient:

dw_{i,j}=langle mathbf{v}^{T}mathbf{h} rangle^{+}-langlemathbf{v}^{T}mathbf{h}rangle^{-}

Apply Weight decay or other regularization (optional):


Apply a momentum (optional):


Update the Weights:


Energy Renormalizations

What is Cheap about learning ?  A technical proof in the Appendix notes that

knowing the Partition function Z(mathbf{v,h}) is not the same as knowing the underlying distribution p(mathbf{v}).

This is because the Energy E(mathbf{v,h}) can be rescaled, or renormalized, in many different ways, without changing Z(mathbf{v,h}) .

This is a also key idea in Statistical Mechanics.

The Partition function is a generating function; we can write all the macroscopic, observable thermodynamic quantities as partial derivatives of ln;Z.  And we can do this without knowing the exact distribution or energies–just their renormalized forms.

Of course, our W update rule is a derivative of ln;Z(mathbf{v,h})

The proof is technically straightforward, albeit a bit odd at first.

Matching Z(y) does not imply matching p(y)

Let’s start with the visible units mathbf{v}=.  Write



We now introduce the hidden units, mathbf{h}, into the model, so that we have a new, joint probability distribution


and a new, Renormalized , partition function


Where RG means Renormalization Group.  We have already discussed that the general RBM approach resembles the Kadanoff Variational Renormalization Group (VRG) method, circa 1975. This new paper points out a small but important technical oversight made in the ML literature, namely that

having Z^{RG}(mathbf{v,h})=Z(mathbf{v})does not imply tilde{p}(mathbf{v})=sumlimits_{mathbf{h}}tilde{p}(mathbf{v})=p(mathbf{v})

That is, just because we can estimate the Partition function well does not mean we know the probability distributions.

Why?   Define an arbitrary non-constant function K(mathbf{v}) and write


K is for Kadanoff RG Transform, and ln K is the normalization.

We can now write an joint Energy E(mathbf{v,h}) with the same Partition function as our RBM E(mathbf{h}), but with completely different joint probability distributions.  Let


Notice what we are actually doing.  We use the K matrix to define the RBM joint Energy function.  In RBM theory, we restrict E(mathbf{v,h}) to a quadratic form, and use variational procedure to learn the weights , thereby learning K.

In a VRG approach, we have the additional constraint that we restrict the form of K to satisfy constraints on it’s partition function, or, really, how the Energy function is normalized.  Hence the name ‘Renormalization.‘   This is similar, in spirit, but not necessarily in form, to how the RBM training regularizes the weights (above).

Write the total, or renormalized, Z as


Expanding the Energy function explicitly, we have


where the Kadanoff normalization factor tilde{Z} appears now the denominator.

We can can break the double sum into sums over and h


Identify tilde{Z} in the numerator


which factors out, giving a very simple expression in h


In the technical proof in the paper, the idea is that since h is just a dummy variable, we can replace h with v.  We have to be careful here since this seems to only applies to the case where we have the same number of hidden and visible units–a rare case.  In an earlier post on VRG, I explain more clearly how to construct an RG transform for RBMs.  Still,  the paper is presenting a counterargument for arguments sake, so, following the argument in the paper,  let’s say

sumlimits_{mathbf{h}}e^{-E(mathbf{h})} =sumlimits_{mathbf{v}}e^{-E(mathbf{v})}

This is like saying we constrain the Free Energy at each layer to be the same.


This is also another kind of Layer Normalization–a very popular method for modern Deep Learning methods these days.

So, by construction, the renormalized and data Partition functions are identical

Z^{RG}(mathbf{v,h}) =Z(mathbf{v})

The goal of Renormalization Group theory is to redefine the Energy function on a difference scale, while retaining the macroscopic observables.

But , and apparently this has been misstated in some ML papers and books, the marginalized probabilities can be different !

To get the marginals, let’s integrate out only the h variables


Looking above, we can write this in terms of and its normalization tilde{Z}


which implies

tilde{p}(mathbf{v})ne p(mathbf{v})  

So what is cheap about deep learning ?

RBMs let us represent data using a smaller set of hidden features.  This is, effectively, Variational Renormalization Group algorithm,  in which we approximate the Partition function, at each step in the RBM learning procedure, without having to learn the underlying joining probability distribution.  And this is easier.  Cheaper.

In other words, Deep Learning is not Statistics.  It is more like Statistical Mechanics.  

And the hope is that we can learn from this old scientific field — which is foundational to chemistry and physics — to improve our deep learning models.

Post-Post Comments

Shortly after this paper came out, Comment on “Why does deep and cheap learning work so well?”that the proof in the Appended is indeed wrong–as I suspected and pointed out above.

It is noted that the point of the RG theory is to preserve the Free Energy form one layer to another, and, in VRG, this is expressed as a trace condition on the Transfer operator


where -T(mathbf{v},mathbf{h})=E(mathbf{v})+K(mathbf{v})+lntilde{Z}

It is, however, technically possible to preserve the Free Energy and not preserve the trace condition.  Indeed, because K(mathbf{y}) is not-constant, thereby violating the trace condition.

From this bloggers perspective, the idea of preserving Free Energy, via either a trace condition, or, say, by layer normalization, is the import point.  And this may mean to only approximately satisfy the trace condition.

In Quantum Chemistry, there is a similar requirement, referred to as a Size-Consistency and/ or Size-Extensivity condition.  And these requirements proven essential to obtaining highly accurate, ab initio solutions of the molecular electronic Schrodinger equation–whether implemented exactly or approximately.

And, I suspect, a similar argument, at least in spirit if not in proof, is at play in Deep Learning.

Please chime in our my YouTube Community Channel

see also: https://m.reddit.com/r/MachineLearning/comments/4zbr2k/what_is_your_opinion_why_is_the_concept_of/)

Source link


Please enter your comment!
Please enter your name here