The classic explanation for the effectiveness of regularization in NNs is that it prevents overfitting by simplifying the model (Occam's razor). I have an additional explanation and am curious to hear the community's thoughts.
Imagine a weight whose gradient is positive on some batches and negative on others, but close to zero on average. Training without regularization would not drive this weight toward any optimal value; instead it would stay near its initialization or drift in a random walk. At test time, such a weight will on average decrease the accuracy of the model. Regularization pushes it toward zero, preventing this negative interference.
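To illustrate what I mean, here is a toy sketch (my own construction, not a real network): a single weight receives a pure-noise batch gradient with mean zero. Plain SGD does a random walk around the initialization, while adding an L2 penalty (weight decay, which contributes `lambda * w` to the gradient) pulls the weight into a small neighborhood of zero. The step counts and decay strength are arbitrary choices for the demo.

```python
import random

def sgd(steps=10000, lr=0.01, weight_decay=0.0, seed=0):
    """Simulate SGD on one weight whose data gradient is zero-mean noise."""
    rng = random.Random(seed)
    w = 1.0  # initialize away from zero
    for _ in range(steps):
        g = rng.gauss(0.0, 1.0)   # noisy batch gradient, E[g] = 0
        g += weight_decay * w     # L2 regularization adds lambda * w
        w -= lr * g
    return w

# Without regularization the weight wanders, staying order 1;
# with weight decay it hovers near zero.
print(sgd(weight_decay=0.0), sgd(weight_decay=1.0))
```

Averaged over many seeds, the regularized weight ends up much closer to zero than the unregularized one, which is the "failing to converge" behavior I am describing.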
This differs from the classic explanation of regularization: it claims that in some cases the weights are not overfitting but rather failing to converge.
Do such weights exist in practice?
Does such a random weight (which will not converge under SGD) decrease the accuracy of the model?