I recently gave a talk (called Deep Learning Without The Hype), and in that talk I showed a classic slide (inserted below) from the classic Stanford’s Andrew Ng’s MOOC Machine Learning course . After the talk I was approached by some of the students and they had a number of interrelated questions raised specifically by this slide, and at the time I don’t believe I gave (or even had) a completely satisfactory answer to them.
So since then I formulated a (hopefully) more satisfactory answer to it. And it surprisingly showed a few fundamental holes in my understanding (even though I’m a successful enough practitioner of using this stuff in industry settings). This isn’t anything new – but it is a hopefully much more thorough revision of some fundamental stuff. I thought to share in the hope it’s interesting to take this step back and really aim for full clarity, especially when teaching, and to maybe receive any further insights from people.
So here goes the revised questions and answers.
Q: “Do you actually have that surface? If so why not just take the minimum of it?”
Q: “Do you really have all the gradients? If not then how do you do gradient descent? And if so, why not also just pick one where the gradient is zero?”
One-liner Answer: It’s due to computational limitations we hit with large models that we use gradient descent and nothing to do with it being a special “learning” algorithm — just repeatedly choosing random sets of parameters and keeping the best performing one would work if we had a powerful enough machine+enough time.
Abridged Answer: You mathematically and theoretically have the entire Cost surface — it’s functional form is fully specified (that’s what your model and cost function are), all you have to do is put actual values in for your model parameters and compute it. And you mathematically and theoretically have the entire vector field of gradients for that surface — i.e. you have all the gradients. But in reality you only have what you are able to compute with your limited resources and time. The aim of the game is to find a set of parameters that make for a low error, and there’s no rules on how you do that. But trying to do a grid search (which is akin to plotting/computing the loss surface) in a 1million+ dimension parameter space is going to be a slow progress (think taking years even on current state of the art computers). It is precisely a computation limitation problem that gradient descent overcomes, not a mathematical one. In reality it might be better to think of it as a “search” algorithm rather than an “optimisation” algorithm, i.e. it efficiently searches about in parameter space for areas of low(er) loss.A nice way to think about the extent of the computational efficiency/gains that gradient descent gives you over say a grid search over the model parameters is to think about the dimension of the space each traverses. If your model has a million paramters then a volume in that space is a million dimensional volume and we know that volumes grow like V=l^D — that’s going to be painfully big. But gradient descent only traverses a line through parameter space (by following gradients), and lines are only 1D. Hopefully that’s persuasive to you.
Ulta-Extended And Complete With Coded Examples Answer:
See this notebook
Link to post on Medium https://medium.com/@DBCerigo/on-why-gradient-descent-is-even-needed-25160197a635