Until very recently, the remarkable generalization of deep neural networks has often been attributed to flat minima. That is, “flat minima” are said to generalize better, with the intuitive explanation that flatter minima are more robust to perturbations (such as those induced by train-set sampling). For some time, this discovery has been credited to (Keskar et al., 2017), often accompanied by their sketch of the intuition.
This “intuitive” explanation has been disputed on scientific grounds multiple times since, and I do not know where the scientific consensus currently stands. Because of this, I never got too deep into the topic, and I only knew the attribution to Keskar et al. However, it turns out that J. Schmidhuber and S. Hochreiter came up with this idea in… 1994! Hell, they even have a paper literally named “Flat Minima” (Hochreiter & Schmidhuber, 1997). Even better, one of their papers on the topic was presented at NIPS’94 (Hochreiter & Schmidhuber, 1994). To me personally, this sets a whole new standard on being Schmidhubered… (Fortunately, though, the two papers by Schmidhuber and co. did, and still do, get properly cited.)
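To make the robustness intuition concrete, here is a tiny numerical sketch of my own (not from any of the cited papers): two toy one-dimensional “minima” with different curvatures, perturbed by the same small shift. The flat one barely moves in loss; the sharp one blows up.

```python
# Toy 1-D loss landscapes, both attaining loss 0 at w = 0.
# The curvature coefficients (0.1 and 10.0) are arbitrary illustrative choices.
flat = lambda w: 0.1 * w**2    # low curvature -> "flat" minimum
sharp = lambda w: 10.0 * w**2  # high curvature -> "sharp" minimum

# Perturb the minimizer by the same small amount, standing in for the
# train/test distribution shift in the usual intuition.
eps = 0.5
print(f"flat minimum,  loss after perturbation: {flat(eps):.3f}")   # 0.025
print(f"sharp minimum, loss after perturbation: {sharp(eps):.3f}")  # 2.500
```

The same perturbation costs the sharp minimum two orders of magnitude more loss, which is the whole of the intuitive argument, for better or worse.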
Keskar, N. S., et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. Proceedings of the International Conference on Learning Representations, Feb 2017.

Hochreiter, S., & Schmidhuber, J. Flat Minima. Neural Computation, Jan 1997.

Hochreiter, S., & Schmidhuber, J. Simplifying Neural Nets by Discovering Flat Minima. In Advances in Neural Information Processing Systems, Jan 1994.