A friend from grad school pointed out a great foundational paper on Boltzmann Machines. It is a paper from complex systems theory,

A Mean Field Theory Learning Algorithm for Neural Networks

published just a couple of years after Hinton’s seminal 1985 paper, “A Learning Algorithm for Boltzmann Machines”.

What I really like is how it shows that the foundations of deep learning arose from statistical physics and theoretical chemistry. My top 10 favorite takeaways are:

• The relation between Boltzmann Machines and the nearly forgotten Hopfield Associative Memory. And why hidden nodes made a difference in just coming up with a reasonable algorithm.

• What an actual mean field theory (MFT) is. They don’t just factor the Energy function or use a bipartite graph. They introduce continuous fields (U,V) via the delta function, and then take a saddle point approximation. Today we only see MFTs expressed as the resulting factorized models like RBMs and Sum-Product Networks; we don’t see the fields.

• That an annealing schedule was not about adjusting the learning rate; it was about adjusting the temperature. And that the annealing schedule is a hugely flexible part of the model. Yes, Boltzmann Machines were originally Temperature dependent.

• How to derive the learning rules for neural nets using Markov chains and the principle of microscopic reversibility.

• Where the tanh activation function came from. It comes from the MFT. So today, sure, we use ReLUs. But we don’t just have a random Energy function. There is a deep reason for these activation functions.

• How the MFT here is also a quenched approximation. This comes up all the time in analyzing the replica symmetry of mean field spin glasses (i.e. replica symmetry breaking (RSB)). You can’t understand the phase diagram of a spin glass without understanding this.

•  Why the Free Energy is smoother (i.e. closer to convex) than the highly non-convex, T=0 Energy landscape.

• We do not anneal to T=0. RBMs, and, I suspect, Deep Learning in general, are not about traversing the T=0 Energy Landscape.

• The paper presents a very simple example that anyone can code up and run in a day:  2 input nodes, 1 output node, 4 hidden nodes.

• The references! Great stuff.
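
To make the tanh and annealing points above a bit more concrete, here is a minimal sketch of the mean field equations, V_i = tanh(Σ_j W_ij V_j / T), iterated to a fixed point while the temperature is lowered along a schedule. This is my own illustration, not the paper's code: the function name `mean_field_anneal`, the random fully connected symmetric weights, the particular schedule, and the asynchronous update order are all assumptions, and no learning rule is included. It only shows where the tanh comes in and that the schedule stops at a finite T.

```python
import numpy as np

def mean_field_anneal(W, temperatures, clamped=None, sweeps_per_temp=20, seed=0):
    """Iterate V_i = tanh(sum_j W_ij V_j / T) while stepping T along a schedule.

    W            : symmetric (n, n) weight matrix with zero diagonal
    temperatures : sequence of T > 0, usually decreasing (the annealing schedule)
    clamped      : optional dict {unit index: fixed value in [-1, 1]}
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    clamped = clamped or {}
    rng = np.random.default_rng(seed)
    # Start near zero (the high-temperature solution) with a small random tilt.
    V = 0.01 * rng.standard_normal(n)
    for i, value in clamped.items():
        V[i] = value
    free = [i for i in range(n) if i not in clamped]
    for T in temperatures:
        for _ in range(sweeps_per_temp):
            for i in free:
                # Mean field acting on unit i, followed by the tanh response.
                V[i] = np.tanh(W[i] @ V / T)
    return V

# Toy usage with 2 input, 4 hidden, and 1 output unit (7 units total).
# Assumption: random symmetric weights, fully connected, purely for illustration.
rng = np.random.default_rng(1)
n = 7
W = rng.standard_normal((n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
schedule = [4.0, 2.0, 1.0, 0.5]  # ends at T = 0.5, not at T = 0
V = mean_field_anneal(W, schedule, clamped={0: 1.0, 1: -1.0})
print(np.round(V, 3))
```

The returned V values are continuous averages in [-1, 1] rather than binary states, which is the sense in which the mean field picture replaces stochastic sampling with a deterministic set of equations.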

This is, all in all, a fantastic paper from the statistical physics point of view.  And if there is interest, I can go through the details here.

Happy New Year everyone!