Naftali Tishby, The Hebrew University of Jerusalem, Israel


The Information Theory of Deep Learning


Based on analytic and numerical studies, I will describe a novel theoretical understanding of Deep Neural Networks (DNNs), derived from analyzing the representations of the layers in the Information Plane: the mutual information between each layer and the input, and between each layer and the desired output. I will show that standard Stochastic Gradient Descent (SGD) training of DNNs has two distinct phases: (1) a fast drift phase, which fits the training data and thus increases the mutual information between the layers and the desired labels; (2) a slow diffusion phase, which compresses the representation of the input and reduces the mutual information between the layers and the input. The compression phase, which, as we prove, dramatically improves the generalization power of the network, is in fact the more important aspect of SGD. This analysis also provides a new understanding of the benefit of the hidden layers, the lack of overfitting, and the features the layers represent.

Based on joint work with Ravid Schwartz-Ziv and Noga Zaslavsky.
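The Information Plane coordinates I(X;T) and I(T;Y) for a layer's representation T can be estimated from samples once activations are discretized. The sketch below is an illustrative assumption, not the authors' exact procedure: it bins a toy layer's activations into a finite alphabet and computes mutual information from the resulting joint histogram (all names, bin counts, and the toy data are hypothetical).

```python
import numpy as np

def mutual_information(x_ids, t_ids):
    """Estimate I(X;T) in bits from paired discrete samples.

    x_ids, t_ids: equal-length integer arrays; each entry identifies
    the (binned) value of X and T for one sample.
    """
    n = len(x_ids)
    # Joint distribution via a 2-D contingency table.
    joint = np.zeros((x_ids.max() + 1, t_ids.max() + 1))
    for x, t in zip(x_ids, t_ids):
        joint[x, t] += 1
    joint /= n
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x)
    pt = joint.sum(axis=0, keepdims=True)  # marginal p(t)
    nz = joint > 0                         # avoid log(0) terms
    return float((joint[nz] * np.log2(joint[nz] / (px @ pt)[nz])).sum())

# Toy usage: a hypothetical 3-unit layer driven by a 4-valued input.
rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=5000)  # discrete input ids
activations = np.tanh(x[:, None] + 0.1 * rng.standard_normal((5000, 3)))
# Discretize activations into 8 bins per unit (an assumed bin count);
# each unique binned activation vector is one value of T.
bins = np.digitize(activations, np.linspace(-1.0, 1.0, 8))
_, t = np.unique(bins, axis=0, return_inverse=True)
print(round(mutual_information(x, t), 3))
```

Tracking such estimates per layer over training epochs is what produces the drift-then-compression trajectories described above; note that binning-based estimates are sensitive to the chosen bin count.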