A friend from grad school pointed out a great foundational paper on Boltzmann Machines. It is a 1987 paper from complex systems theory:

A Mean Field Theory Learning Algorithm for Neural Networks

published just a couple of years after Hinton’s seminal 1985 paper, “A Learning Algorithm for Boltzmann Machines”.

What I really like is seeing how the foundations of deep learning arose from statistical physics and theoretical chemistry. My top 10 favorite takeaways are:

• The relation between Boltzmann Machines and the nearly forgotten Hopfield Associative Memory. And why hidden nodes made a big difference in just coming up with a reasonable training algorithm.

• What an actual mean field theory (MFT) is. They don’t just factor the Energy function or use a bipartite graph. They introduce continuous fields (U, V) via the delta function, and then take a saddle point approximation. Today we only see MFTs expressed as the resulting factorized models like RBMs and Sum-Product Networks; we don’t see the fields.

• That an annealing schedule was not about adjusting the learning rate; it was about adjusting the temperature. And that the annealing schedule is a hugely flexible part of the model. Yes, Boltzmann Machines were originally temperature dependent.

• How to derive the learning rules for neural nets using Markov chains and the principle of microscopic reversibility.

• Where the tanh activation function came from. It comes from the MFT. So today, sure, we use ReLUs. But we don’t just have a random Energy function. There is a deep reason for these activation functions.

• How the MFT here is also a quenched approximation.  This comes up all the time in analyzing the replica symmetric solutions of mean field spin glasses (i.e. replica symmetry breaking (RSB)).  You can’t understand the phase diagram of a spin glass without understanding this.

•  Why the Free Energy is smoother (i.e. closer to convex) than the highly non-convex, T=0 Energy landscape.

• We do not anneal to T=0. RBMs, and, I suspect, Deep Learning in general, are not about traversing the T=0 Energy Landscape.

• The paper presents a very simple example that anyone can code up and run in a day:  2 input nodes, 1 output node, 4 hidden nodes.

• The references! Great stuff.
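
To make the MFT and tanh takeaways above a bit more concrete, here is a rough sketch of the saddle point step, written from memory rather than copied from the paper. The notation is an assumption on my part: S_i = ±1 are the binary units, w_ij are symmetric couplings with zero diagonal, T is the temperature, and C is a constant that does not matter here.

```latex
\begin{align}
E(S) &= -\tfrac{1}{2}\sum_{i,j} w_{ij} S_i S_j ,
\qquad
Z = \sum_{S} e^{-E(S)/T} \\
% insert delta functions \delta(S_i - V_i), written as integrals over U_i
Z &= C \int \prod_i dV_i \int \prod_i dU_i \; e^{-E'(V,U,T)} \\
E'(V,U,T) &= \frac{E(V)}{T} + \sum_i \bigl[\, U_i V_i - \log\cosh U_i \,\bigr] \\
% saddle point conditions
\frac{\partial E'}{\partial U_i} = 0
  &\;\Rightarrow\; V_i = \tanh U_i \\
\frac{\partial E'}{\partial V_i} = 0
  &\;\Rightarrow\; U_i = -\frac{1}{T}\frac{\partial E}{\partial V_i}
   = \frac{1}{T}\sum_j w_{ij} V_j \\
&\;\Rightarrow\; V_i = \tanh\!\Bigl(\frac{1}{T}\sum_j w_{ij} V_j\Bigr)
\end{align}
```

The last line is the familiar tanh unit, and the temperature T sits right inside it.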

This is, all in all, a fantastic paper from the statistical physics point of view.  And if there is interest, I can go through the details here.
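
Until then, here is a minimal sketch of just the settling step: iterating V_i = tanh(Σ_j w_ij V_j / T) while lowering T along a schedule that stops above zero. This is not the paper's code. The function names, the random fully connected couplings, the schedule values, and the number of sweeps are all assumptions for illustration, and the weight learning rule is left out entirely. The network size (2 inputs, 4 hidden, 1 output) follows the paper's small example.

```python
import math
import random


def mean_field_anneal(weights, values, clamped, temperatures, sweeps=20):
    """Iterate V_i = tanh(sum_j w_ij V_j / T) over a temperature schedule.

    weights      -- symmetric matrix (list of lists) with zero diagonal
    values       -- initial mean field values V_i in [-1, 1]
    clamped      -- set of unit indices held fixed (e.g. the input units)
    temperatures -- decreasing schedule; the last entry stays above zero
    """
    v = list(values)
    n = len(v)
    for temp in temperatures:
        for _ in range(sweeps):
            for i in range(n):
                if i in clamped:
                    continue
                # local field felt by unit i from all other units
                field = sum(weights[i][j] * v[j] for j in range(n))
                v[i] = math.tanh(field / temp)
    return v


def random_symmetric_weights(n, scale, rng):
    """Hypothetical couplings: small, symmetric, no self-coupling."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.uniform(-scale, scale)
    return w


rng = random.Random(0)
n_units = 2 + 4 + 1                  # inputs, hidden, output
weights = random_symmetric_weights(n_units, 1.0, rng)
# clamp the two inputs to (+1, -1); start the other units near zero
start = [1.0, -1.0] + [rng.uniform(-0.1, 0.1) for _ in range(5)]
schedule = [4.0, 2.0, 1.0, 0.5]      # assumed values; ends at T = 0.5, not T = 0
result = mean_field_anneal(weights, start, clamped={0, 1}, temperatures=schedule)
print([round(x, 3) for x in result])
```

Note that the only thing the schedule changes is T inside the tanh; there is no learning rate anywhere in this step.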

Happy New Year everyone!
