Hinton introduced Free Energies in his 1994 paper.

This paper, along with his wake-sleep algorithm, set the foundations for modern variational learning. Free Energies appear in his RBMs and, more recently, in Variational AutoEncoders (VAEs).

Of course, Free Energies come from Chemical Physics.  And this is not surprising, since Hinton’s graduate advisor was a famous theoretical chemist.

They are so important that Karl Friston has proposed The Free Energy Principle: A Unified Brain Theory?

(see also the Wikipedia article and this 2013 review)

What are Free Energies, and why do we use them in Deep Learning?

#### The Free Energy is a Temperature Weighted Average Energy

In (Unsupervised) Deep Learning, Energies are quadratic forms over the weights. In an RBM, one has

$E(\mathbf{h},\mathbf{v})=\mathbf{v}^{T}\mathbf{a}+\mathbf{b}^{T}\mathbf{h}+\mathbf{v}^{T}\mathbf{Wh}$

This is the T=0 configurational Energy, where each configuration is some $(\mathbf{h},\mathbf{v})$ pair.  In chemical physics, these Energies resemble an Ising model.
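To make "each configuration is some (h, v) pair" concrete, here is a minimal sketch that enumerates every binary configuration of a tiny RBM and evaluates the Energy above. The parameter values and the helper name `rbm_energy` are made up for illustration, and the sign convention simply follows the formula as written here (many references put a minus sign in front of each term).

```python
from itertools import product

def rbm_energy(v, h, a, b, W):
    """E(h, v) = v.a + b.h + v^T W h, with the sign convention used in the text."""
    visible_term = sum(vi * ai for vi, ai in zip(v, a))
    hidden_term = sum(hj * bj for hj, bj in zip(h, b))
    coupling = sum(v[i] * W[i][j] * h[j]
                   for i in range(len(v)) for j in range(len(h)))
    return visible_term + hidden_term + coupling

# Toy parameters (assumed, not from any trained model): 2 visible, 2 hidden units.
a = [0.5, -0.5]
b = [0.1, 0.2]
W = [[1.0, -1.0],
     [0.5, 2.0]]

# Every binary (v, h) pair is one configuration with its own Energy.
energies = {
    (v, h): rbm_energy(v, h, a, b, W)
    for v in product([0, 1], repeat=2)
    for h in product([0, 1], repeat=2)
}

lowest = min(energies, key=energies.get)
print("lowest-energy configuration (v, h):", lowest, energies[lowest])
```

Even this toy model has 16 configurations; the count doubles with every added unit, which is why one works with averages over configurations rather than listing them.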

The Free Energy $F$ is a weighted average of all the global and local minima $E_{i}$

$e^{-\beta F}=\sum\limits_{i}e^{-\beta E_{i}}$

##### Zero Temperature Limit

Note: as $T\rightarrow 0$, the Free Energy becomes the T=0 global Energy minimum $E_{0}$.  In the limit of zero Temperature, all the terms in the sum approach zero

$e^{-\beta E_{i}}\rightarrow \dfrac{1}{e^{\infty}}\dfrac{1}{e^{E_{i}}}\rightarrow 0$

and only the largest term, the one with the lowest (most negative) Energy, survives.

$F(T\rightarrow 0)\rightarrow E_{0}$
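A quick numerical check of this limit, as a sketch: the energy levels below are hypothetical, and `free_energy` is just a helper name chosen here. It evaluates $F=-T\ln\sum_{i}e^{-E_{i}/T}$ with the lowest Energy factored out, so the exponentials stay finite even at very small T.

```python
import math

def free_energy(energies, T=1.0):
    """F = -T * ln(sum_i exp(-E_i / T)), computed with a log-sum-exp shift."""
    beta = 1.0 / T
    e_min = min(energies)
    # Factor out the lowest energy so the exponentials never overflow.
    shifted = sum(math.exp(-beta * (e - e_min)) for e in energies)
    return e_min - T * math.log(shifted)

levels = [-2.0, -1.5, 0.0, 1.0]   # hypothetical energy minima E_i

for T in (10.0, 1.0, 0.1, 0.01):
    print(f"T={T:<5} F={free_energy(levels, T):.4f}")
```

At high T every level contributes and F sits well below the lowest Energy; as T shrinks, F climbs toward the ground state value of -2.0 and the other minima drop out of the sum.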

##### Other Notation

We may also see F written in terms of the partition function Z:

$-\beta F=\langle\;\ln\;Z\;\rangle$

$Z=\sum\limits_{i}e^{-\beta E_{i}}$

where the brackets $\langle\cdots\rangle$ denote an equilibrium average, an expected value $\mathbb{E}_{P}[\cdots]$ over some equilibrium probability distribution $\mathbb{P}$ (we don’t normalize with 1/N here; in principle, the sum could be infinite.)

Of course, in deep learning, we may be trying to determine the distribution $\mathbb{P}$, and/or we may approximate it with some simpler distribution $\mathbb{Q}\sim\mathbb{P}$ during inference. (From now on, I just write P and Q for convenience.)

But there is more to Free Energy learning than just approximating a distribution.

#### The Free Energy is an average solution to a non-convex optimization problem

In a chemical system, the Free Energy averages over all global and local minima below the Temperature T–with barriers below T as well.  It is the Energy available to do work.

##### Being Scale Free: T=1

For convenience, Hinton explicitly set T=1.  Of course, he was doing inference, and did not know the scale of the weights W.  Since we don’t specify the Energy scale, we learn the scale implicitly when we learn W.  We call this being scale-free.

So in the T=1, scale free case, the Free Energy implicitly averages over all Energy minima where $E_{i}<1$, as we learn the weights  W.   Free Energies solve the problem of Neural Nets being non-convex by averaging over the global minima and nearby local minima.

##### Highly degenerate non-convex problems

Because Free Energies provide an average solution, they can even provide solutions to highly degenerate non-convex optimization problems.

##### When do Free Energy solutions fail ?

They will fail, however, when the barriers between Energy basins are larger than the Temperature.

This can happen if the effective Temperature drops close to zero during inference.  Since T=1 implicitly in inference, this means when the weights W are exploding.
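Here is a small sketch of why growing weights act like a falling Temperature. It assumes, purely for illustration, that weight growth shows up as an overall multiplicative factor `c` on every Energy; the levels and the helper name `boltzmann` are hypothetical. Scaling the Energies by `c` at T=1 gives exactly the same equilibrium distribution as keeping the Energies fixed and cooling to T=1/c.

```python
import math

def boltzmann(energies, T=1.0):
    """Equilibrium probabilities p_i proportional to exp(-E_i / T)."""
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / T) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

levels = [0.0, 0.3, 0.5]          # hypothetical energy minima at unit weight scale

for c in (1.0, 10.0, 100.0):      # c mimics an overall growth in the weights W
    scaled = [c * e for e in levels]
    p_scaled = boltzmann(scaled, T=1.0)      # larger energies, T fixed at 1
    p_cold = boltzmann(levels, T=1.0 / c)    # original energies, colder T
    assert all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)
               for x, y in zip(p_scaled, p_cold))
    print(c, [round(p, 4) for p in p_scaled])
```

As `c` grows, nearly all of the probability collapses onto the single lowest minimum, so the average over nearby minima is effectively lost.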

See: Normalization in Deep Learning

Systems may also get trapped if the Energy barriers grow very large, as, say, in the glassy phase of a mean field spin glass, or in a supercooled liquid (the so-called Adam-Gibbs phenomenon).  I will discuss this in a future post.

In either case, if the system, or solver, gets trapped in a single Energy basin, it may appear to be convex, and/or flat (the Hessian has lots of zeros).  But this is probably not the optimal solution to learning when using a Free Energy method.

#### Free Energies produce Ruggedly Convex Landscapes

It is sometimes argued that Deep Learning is a non-convex optimization problem.  And yet, it has been known for over 20 years that networks like CNNs don’t suffer from the problems of local minima.  How can this be?

At least for unsupervised learning, it has been clear since 1987 that:

An important  property of the effective [Free] Energy function E(V,0,T) is that it has a smoother landscape than E(S) [T=0] …

Hence, the probability of getting stuck in a local minima decreases

This is not, however, specifically how Hinton argued for the Free Energy a decade later.

#### The Hinton Argument for Free Energies

Why do we use Free energy methods ? Hinton used the bits-back argument:

Imagine we are encoding some training data and sending it to someone for decoding.  That is, we are building an Auto-Encoder.

If we have only 1 possible encoding, we can use any vanilla encoding method and the receiver knows what to do.

But what if we have 2 or more equally valid codes?

Can we save 1 bit by being a little vague ?

##### Stochastic Complexity

Suppose we have N possible encodings $[h_{1},h_{2},\cdots]$, each with Energy $E_{i}$.  We say the data has stochastic complexity.

Pick a coding with probability $p_{i}$ and send it to the receiver.   The expected cost of encoding is

$\langle cost\rangle_{encode}=\sum\limits_{i}p_{i}E_{i}$

Now the receiver must guess which encoding $h_{i}$ we used.  The decoding cost of the receiver is

$\langle cost\rangle_{decode}=\sum\limits_{i}p_{i}E_{i}-H$

where H is the Shannon Entropy of the random encoding

$H=-\sum\limits_{i}p_{i}\ln(p_{i})$

The decoding cost looks just like a Helmholtz Free Energy.
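The bits-back accounting can be checked numerically. The sketch below takes H to be the non-negative Shannon Entropy, $H=-\sum_{i}p_{i}\ln p_{i}$, sets T=1, and uses three hypothetical codes (two of them tied in cost). The names `variational_free_energy` and `candidates` are my own, not from the paper.

```python
import math

def variational_free_energy(p, energies):
    """<E>_p - H(p) at T=1, with H(p) = -sum_i p_i ln p_i (in nats)."""
    expected_energy = sum(pi * e for pi, e in zip(p, energies))
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return expected_energy - entropy

energies = [1.0, 1.0, 3.0]   # hypothetical code costs E_i; the first two are tied

Z = sum(math.exp(-e) for e in energies)
true_free_energy = -math.log(Z)

candidates = {
    "always_first": [1.0, 0.0, 0.0],                    # one fixed code
    "coin_flip": [0.5, 0.5, 0.0],                       # random pick among the tied codes
    "boltzmann": [math.exp(-e) / Z for e in energies],  # p_i proportional to exp(-E_i)
}

for name, p in candidates.items():
    print(f"{name:<13} {variational_free_energy(p, energies):.4f}")

# Being "a little vague" between the two tied codes saves ln(2) nats, i.e. 1 bit.
saving_in_bits = (variational_free_energy(candidates["always_first"], energies)
                  - variational_free_energy(candidates["coin_flip"], energies)) / math.log(2)
print("saving (bits):", round(saving_in_bits, 6))
```

The Boltzmann choice attains the lowest cost, which equals $-\ln Z$; any other distribution, including a deliberately simple one, gives an upper bound on it.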

Moreover, we can use a sub-optimal encoding, and the paper suggests using a Factorized (i.e. mean field) Feed Forward Net to do this.

To understand this better,  we need to relate

#### Thermodynamics and Inference

In 1957, Jaynes formulated the MaxEnt principle which considers equilibrium thermodynamics and statistical mechanics as inference processes.

In 1995, Hinton formulated the Helmholtz Machine and showed us how to define a quasi-Free Energy.

In Thermodynamics, the Helmholtz Free Energy F(T,V,N) is an Energy that depends on Temperature instead of Entropy.  We need

$E(S,V,N)\rightarrow F(T,V,N)$

and F is defined as

$F(T,V,N) = E(S,V,N) - TS(V,N)$

In ML, we set T=1. Really, the Temperature equals how much the Energy changes with a change in Entropy (at fixed V and N)

$T=\left(\dfrac{\partial E}{\partial S}\right)_{N,V}$

Variables like E and S depend on the system size N.  That is,

as $N\rightarrow 2N$

$E(2N)=2E(N),\;\;S(2N)=2S(N),\;\;T(2N)=T(N)=T$

We say S and T are a conjugate pair;  S is extensive, T is intensive.

(see more on this in the Appendix)

##### Legendre Transform

The conjugate pairs are used to define Free Energies via the  Legendre Transform:

Helmholtz Free Energy:  F(T) = E(S) – TS

We switch the Energy from depending on S to T, where $T=\left(\dfrac{\partial E}{\partial S}\right)$.

Why ? In a physical system, we may know the Energy function E, but we can’t directly measure or vary the Entropy S.  However, we are free to change and measure the Temperature–the derivative of E w/r.t. S:

$T=\left(\dfrac{\partial E}{\partial S}\right)_{N,V}$

This is a powerful and general mathematical concept.

Say we have a convex function f(x,y,z), but we can’t actually vary x. But we do know the slope, w, everywhere along x

$w=\left(\dfrac{\partial f}{\partial x}\right)_{y,z}$.

Then we can form the Legendre Transform, which gives g(w,y,z) as

the ‘Tangent Envelope’ of f() along x

$f(x,y,z)\rightarrow g(w,y,z)$,

$g(w,y,z)=f(x,y,z)-x\left(\dfrac{\partial f}{\partial x}\right)_{y,z}$.

or, simply

$g(w)=f(x)-wx$.

Note: we have converted a convex function into a concave one.  The Legendre transform is concave in the intensive variables and convex in the extensive variables.
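A one-variable worked example may help here. Taking $f(x)=x^{2}$ (my choice, purely for illustration), the slope is $w=2x$, so $g(w)=f(x)-wx=-w^{2}/4$. The sketch below evaluates this and compares second differences to show the flip from convex to concave.

```python
def f(x):
    return x * x            # a convex function of the "extensive" variable x

def legendre(w):
    """g(w) = f(x) - w*x, evaluated at the x where df/dx = w."""
    x = w / 2.0             # invert the slope relation w = df/dx = 2x
    return f(x) - w * x

ws = [-2.0, -1.0, 0.0, 1.0, 2.0]
gs = [legendre(w) for w in ws]       # equals -w**2 / 4 for this particular f

# Second differences: positive for the convex f, negative for the concave g.
f_curvature = f(-1.0) - 2 * f(0.0) + f(1.0)
g_curvature = legendre(-1.0) - 2 * legendre(0.0) + legendre(1.0)

print("g(w):", gs)
print("f curvature:", f_curvature, " g curvature:", g_curvature)
```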

Of course, the true Free Energy F is convex; this is central to Thermodynamics (see Appendix).  But that is because while it is concave in T, we evaluate it at constant T.

But what if the Energy function is not convex in the Entropy?  Or suppose we extract a pseudo-Entropy from sampling some data, and we want to define a free energy potential (e.g., as in protein folding).  These postulates also fail in systems like those in the blog post on spin chains.

How can we  always form a convex Free energy ?

Answer:  Take the convex hull

##### Legendre Fenchel Transform

When a convex Free Energy cannot readily be defined as above, we can use the generalized Legendre-Fenchel Transform, which provides a convex relaxation via

the Tangent Envelope

$g(w)=\min\limits_{x}\left(f(x)-wx\right)$.

The Legendre-Fenchel Transform can provide a Free Energy, convexified along the direction of the internal (configurational) Entropy, allowing the Temperature to control how many local Energy minima are sampled.
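To see the convex relaxation at work, here is a brute-force sketch on a grid. It uses a double-well $f(x)=(x^{2}-1)^{2}$ as a stand-in for a non-convex Energy (my choice, not from any of the papers), and keeps the sign convention $g=f-wx$, for which the envelope is a minimum over x (the negative of the usual convex conjugate $\sup_{x}(wx-f(x))$). Transforming back with a maximum over w yields the convex hull of f.

```python
def f(x):
    return (x * x - 1.0) ** 2        # double well: minima at x = -1 and x = +1

xs = [i / 100 for i in range(-200, 201)]   # grid over the "extensive" variable
ws = [j / 10 for j in range(-250, 251)]    # grid over the conjugate slope

def g(w):
    """Envelope g(w) = min_x (f(x) - w*x), in the sign convention g = f - w*x."""
    return min(f(x) - w * x for x in xs)

g_table = [(w, g(w)) for w in ws]

def convex_hull(x):
    """Transform back: max_w (g(w) + w*x), the convex hull of f on the grid."""
    return max(gw + w * x for w, gw in g_table)

for x in (0.0, 0.5, 1.0, 1.5):
    print(f"x={x:<4} f={f(x):.4f} hull={convex_hull(x):.4f}")
```

Between the two wells the hull is flat at zero, so the barrier at x=0 (where f=1) is gone, while outside the wells the hull coincides with f. The grid ranges are arbitrary and only need to cover the slopes that f attains on `xs`.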

#### Practical Applications

Variational Inference is a growing area with lots of open source codes.  A few highlights:

Thanks again for reading and feedback is welcome.

Happy Fourth of July

#### Appendix

Extra stuff I just wanted to write down…

##### Convexity in Thermodynamics and Statistical Physics
I summarize the discussion in Israel and the introduction by Wightman.

Gibbs formulated Thermodynamics in 1873, without the guidance of Statistical principles.  Using just the second law of thermodynamics, he reasoned that the stability of Equilibrium states implies:

1. S and V (or U and V) are the coordinates for the manifold of Equilibrium states
2. The Energy U is a convex function of S and V, and the Entropy S is a concave function of U and V
3. The Temperature $T=\left(\dfrac{\partial E}{\partial S}\right)_{N,V}$

Convexity has always been fundamental to thermodynamics and equilibrium stability.  Gibbs reasoned this from the properties of convex bodies.  And 20th Century statistical physics relied heavily on formal convex constructs like tangent potentials.

##### Extensivity and Weight Constraints

If we assume T=1 at all times, and we assume our Deep Learning Energies are extensive–as they would be in an actual thermodynamic system–then the weight norm constraints act to enforce the size-extensivity.

as $n\rightarrow Mn$,

if $E(Mn)\rightarrow ME(n)$,

and $E(n)\sim\Vert\mathbf{W}_{n}\Vert$,

then W should remain bounded to prevent the Energy E(n) from growing faster than Mn.  And, of course, most Deep Learning algorithms do bound W in some form.

##### Back to Stat Mech

Jaynes equates the Gibbs and Shannon Entropies.  This is controversial.

There is another, more direct way to get the Entropy from statistical mechanics.

If we define the configurational Entropy $S_{c}(E)$ as the log of the number of Energy configurations, or density of states, $\Omega(E)$

$S_{c}(E)=k\ln\Omega(E)$  (and let k = 1)

Then the Temperature-dependent canonical partition function is the Laplace transform of the density of states

$Z(\beta)=\int_{0}^{\infty}dE\,\Omega(E)e^{-\beta E}$
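For a system with discrete Energy levels the integral becomes a sum, which is easy to check by hand. The sketch below uses a toy model of N independent two-level units (an assumption made for illustration, not an RBM), where $\Omega(E)$ is a binomial coefficient and the partition function has the closed form $(1+e^{-\beta})^{N}$. It covers only the forward transform, not the inverse.

```python
import math

N = 20   # number of independent two-level units (toy system)

def density_of_states(E):
    """Omega(E): number of configurations with exactly E excited units."""
    return math.comb(N, E)

def configurational_entropy(E):
    """S_c(E) = ln Omega(E), with k = 1."""
    return math.log(density_of_states(E))

def partition_function(beta):
    """Discrete Laplace transform: Z(beta) = sum_E Omega(E) * exp(-beta * E)."""
    return sum(density_of_states(E) * math.exp(-beta * E) for E in range(N + 1))

beta = 1.0
closed_form = (1.0 + math.exp(-beta)) ** N   # known result for this toy model

print("Z from density of states:", partition_function(beta))
print("Z from closed form:      ", closed_form)
print("S_c at E = N/2:          ", configurational_entropy(N // 2))
```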

If we know the Free Energy as a function of T in terms of the partition function (and not just by Legendre Transform, see part II of this blog)

$\beta F(\beta)=-\ln Z(\beta)$

then we can reconstruct the configurational Entropy (in principle, numerically) by taking an inverse Laplace Transform

$\Omega(E)=\dfrac{1}{2\pi i}\int_{C}d\beta\,e^{\beta[E-F(\beta)]}$

where C denotes the contour of integration.

This is important in the theory of glasses–part III of this post (2 holidays away).
