This blog post is a summary of the paper Meta-Learning Update Rules for Unsupervised Representation Learning that was authored by Metz, L., Maheswaranathan, N., Cheung, B., & Sohl-Dickstein, J., and was published in the International Conference on Learning Representations in 2019.
Motivation and Background
In learning scenarios with input-output pairs, supervised learning procedures have been reliable in terms of performance. Nevertheless, when labels are unknown or unavailable, unsupervised (representation) learning is the means of choice. Its goal is to uncover latent representations of the data at hand.
Unsupervised learning, however, comes with its own pitfalls. Such algorithms exhibit a target task mismatch where the optimization of unsuited objectives leads to meaningful representations only collaterally. This issue can be counteracted by using meta-learning approaches where learning the task itself is the focal point instead of optimizing an explicit objective. Ultimately, an update rule rather than features is learned.
The idea of meta-learning dates back to the 1980s, when the need was voiced for approaches that do not merely learn features. Such systems include at least two levels of problem solving [2], referred to as an outer loop (or meta-training loop) and an inner loop. Optimization of the latter is parametrized by meta-parameters and evaluated by a meta-objective in the outer loop. In the 1990s, a parametric learning rule was proposed with a focus on simulating biological learning. Arguing that traditional backpropagation does not lie within the space of biologically plausible learning rules, the authors defined a supervised synaptic change model that incorporated pre-synaptic and post-synaptic signals for weight updates [3, 4].
Figure 1: Biological synaptic model [4]
Unlike previous work in this area, Metz et al. specifically developed a meta-learning scheme with an unsupervised inner loop to learn representations from unlabeled data and compensate for the aforementioned situations without access to labels. Simultaneously, they aimed at biological plausibility and making the model perform well in few-shot learning scenarios, i.e. when only few datapoints are available.
Methodology
Concept Overview
The general concept is displayed in Figure 2 and consists of an outer loop for training the meta-objective with labeled data, and an inner loop operating on the base model and the unsupervised update rule – using unlabeled data only. The update rule itself is updated in the outer loop by stochastic gradient descent on the meta-objective.
Figure 2: Meta-learning scheme overview [1]
Base Model
The base model is a multi-layer perceptron (MLP) with batch normalization and ReLU activation functions. As depicted in Figure 3, an (unlabeled) input x_0 is fed into the network in order to yield a representation x^L that is supposed to perform well in few-shot scenarios. The base model incorporates forward weights W^l together with biases b^l and separate backward weights V^l. All base model weights are denoted by \phi.
Figure 3: Base model architecture [1]
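The forward pass of such a base model can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the layer widths are made up, batch normalization is applied before the ReLU without learned scale or shift, and the backward weights are simply given the same shape as the forward weights. None of these details are taken from the paper.

```python
import numpy as np

def forward(x0, weights, biases, eps=1e-5):
    """Run an MLP and keep pre- (z) and post-nonlinearity (x) activations per layer."""
    zs, xs = [], [x0]
    x = x0
    for W, b in zip(weights, biases):
        z = x @ W + b
        # Batch normalization over the batch axis, without learned scale/shift (assumption).
        z = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
        x = np.maximum(z, 0.0)  # ReLU
        zs.append(z)
        xs.append(x)
    return zs, xs

rng = np.random.default_rng(0)
sizes = [8, 16, 4]  # hypothetical layer widths
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
# Separate backward weights V^l, independent of W^l (same shape here by assumption).
backward_weights = [rng.normal(size=W.shape) for W in weights]

zs, xs = forward(rng.normal(size=(32, 8)), weights, biases)
representation = xs[-1]  # x^L, the representation used for few-shot evaluation
```

Keeping both `zs` and `xs` around matters because the update networks described next consume the pre- and post-nonlinearity activations of every layer.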
Each neuron in each layer is associated with a so-called update network, an MLP that takes as input the pre- and post-nonlinearity activations, the backward weights and the error signal of the previous (upper) layer. Its output is a hidden state whose linear projection generates an error signal \delta. Note that there is no actual loss, only a learned top-down signal that is propagated backward.
The fact that meta-parameters \theta are shared in all update networks is crucial for generalization. \theta includes hyperparameters such as the network and optimizer weights.
Weight Updates and Backward Error Propagation
The weights at the next step are the exponential moving averages of the updates where the latter are computed using the unsupervised update rule:
W^l_{t+1} = W^l_t (1-\lambda)+\Delta W^l\lambda \\ V^l_{t+1} = V^l_t (1-\lambda)+\Delta V^l\lambda
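As a minimal sketch, the moving-average step itself is a one-liner. The function name `ema_update` and the example values are hypothetical, and `delta_W` is just a stand-in for whatever the learned update rule would output.

```python
import numpy as np

def ema_update(weights, delta, lam):
    """Blend the current weights with the proposed update: W (1 - lam) + delta * lam."""
    return weights * (1.0 - lam) + delta * lam

# Hypothetical values for illustration only.
W = np.zeros((2, 3))
delta_W = np.ones((2, 3))  # stand-in for the output of the learned update rule
W_next = ema_update(W, delta_W, lam=0.1)
```

The same function applies unchanged to the backward weights V^l with their own \Delta V^l. A small \lambda keeps the weights close to their previous values, while \lambda = 1 would replace them outright.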
More concretely, \Delta W^l and \Delta V^l are functions of the previous and current layer's hidden states (and previous weights). The update computation employs neuron-local terms where the hidden states are generated by post- and pre-nonlinearity activations as mentioned above, which enhances similarity to biological synaptic updates. Decorrelation terms are also utilized to avoid neurons capturing identical information, therefore leading to higher generalizability.
Figure 4: Backward pass of the base model
Another aspect elementary to biological plausibility is the backpropagation procedure, displayed in Figure 4, using backward weights independent from the forward weights. The top-down learning signal d^L is calculated by a network operating on the top-level representation x^L, which yields a vector instead of a scalar loss. The procedure is recursive and similar to traditional backprop.
To get the current layer's error signal, the current learning signal is multiplied element-wise with the post-nonlinearity activation and summed with the linear projection of the hidden state (note again that \theta is shared across all layers):
\delta^l_{ijd} = d^l_{ijd} \odot \sigma(z^l_{ijd}) + \sum\limits_{k}^{64}(\theta_{errorPropW})_{kd}h^l_{ijk}+(\theta_{errorPropB})_d
(l: layer index, i: batch index, j: neuron index, d: feature index)
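To make the index bookkeeping concrete, here is a hedged NumPy sketch of the error-signal formula. It assumes that \sigma is the ReLU of the base model and that d, z and the result all have shape (batch, neurons, features), as the indices i, j, d suggest. The function name and the array sizes are invented for illustration.

```python
import numpy as np

def error_signal(d, z, h, theta_w, theta_b):
    """delta = d * relu(z) + h @ theta_w + theta_b, following the indices (i, j, d)."""
    gated = d * np.maximum(z, 0.0)                    # element-wise product with sigma(z)
    projected = np.einsum("ijk,kd->ijd", h, theta_w)  # sum over the 64 hidden-state entries
    return gated + projected + theta_b                # bias broadcasts over batch and neurons

rng = np.random.default_rng(0)
batch, neurons, features, hidden = 2, 3, 4, 64  # hypothetical sizes; 64 matches the sum's upper limit
d = rng.normal(size=(batch, neurons, features))
z = rng.normal(size=(batch, neurons, features))
h = rng.normal(size=(batch, neurons, hidden))
theta_w = rng.normal(size=(hidden, features))
theta_b = rng.normal(size=(features,))
delta = error_signal(d, z, h, theta_w, theta_b)
```

Because `theta_w` and `theta_b` carry no layer or neuron index, the same projection is reused everywhere, which mirrors the parameter sharing noted in the text.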
A detailed hidden state computation is not further specified here but it essentially consists of unit and batch dimension convolution sequences while incorporating statistics across those as well.
By multiplication of the upper backward weights V^{l+1} and the upper error signal \delta^{l+1}, plus a normalization step, the next learning signal is obtained as follows:
\tilde{d}^l_{imd} = \sum\limits_j^{N^{l+1}} \delta_{ijd}^{l+1}(V^{l+1})_{mj} \\ d^l_{imd} = \tilde{d}^l_{imd}\left(\frac{1}{32} \sum\limits_a^{32} \tilde{d}^l_{ima}\right)^{-\frac{1}{2}}
(l: layer index, i: batch index, m: neuron index, d: feature index, N^{l+1}: previous layer hidden size)
Meta-Objective
In contrast to the base model, the meta-objective is computed with labeled data in the outer loop and measures the quality of the representations. The authors formulated their meta-objective as linear (ridge) regression. The weight estimation and performance evaluation are done on different data batches to further encourage generalization. Due to centering and normalization of the targets for stability reasons, the distance between estimation and target corresponds to the cosine distance.
\textrm{MetaObjective}(\cdot;\phi)=\textrm{CosDist}(y_b, \hat{v}^Tx^L_b)\\ \hat{v} = \arg\min\limits_{v}(\|y_a-v^Tx_a^L\|^2+\lambda\|v\|^2)
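The two formulas can be sketched with the closed-form ridge solution. This is an illustration under assumptions rather than the authors' implementation: rows are examples, the cosine distance is taken as one minus the cosine similarity and averaged over batch b, and all function names and the toy data are hypothetical.

```python
import numpy as np

def ridge_fit(x_a, y_a, lam):
    """Closed-form ridge regression: v = (X^T X + lam I)^{-1} X^T y."""
    n_features = x_a.shape[1]
    gram = x_a.T @ x_a + lam * np.eye(n_features)
    return np.linalg.solve(gram, x_a.T @ y_a)

def cosine_distance(y, y_hat, eps=1e-8):
    """Mean of (1 - cosine similarity) over the rows of y and y_hat."""
    num = np.sum(y * y_hat, axis=1)
    den = np.linalg.norm(y, axis=1) * np.linalg.norm(y_hat, axis=1) + eps
    return float(np.mean(1.0 - num / den))

def meta_objective(x_a, y_a, x_b, y_b, lam=1.0):
    """Fit the readout on batch a, then score its predictions on batch b."""
    v_hat = ridge_fit(x_a, y_a, lam)
    return cosine_distance(y_b, x_b @ v_hat)

# Toy data in which the targets are an exact linear function of the representation.
rng = np.random.default_rng(0)
v_true = rng.normal(size=(5, 3))
x_a, x_b = rng.normal(size=(20, 5)), rng.normal(size=(10, 5))
loss = meta_objective(x_a, x_a @ v_true, x_b, x_b @ v_true, lam=1e-6)
```

Fitting on one batch and scoring on another is what lets the objective reward representations whose linear readout transfers to unseen examples, rather than ones that merely fit batch a.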
Meta-optimization of \theta is done by minimizing the sum of meta-objectives (t: iteration step) over the task expectation, where \phi depends on \theta:
\theta^* = \arg\min\limits_{\theta} \mathbb{E}_{task}\Big[\sum\limits_{t}\textrm{MetaObjective}(\phi_t)\Big]
Experimental Setup and Results
Data Distribution, Architecture Distribution and Stability
Training data samples came from CIFAR-10 [5], ImageNet [6] and Glyph/Alphabet [1] with image resolutions of 16x16 or less. The test data was sampled from MNIST [7], Fashion MNIST [8], hold-out classes from ImageNet as well as movie reviews from IMDB [9], with 14x14 and 28x28 image resolutions. Note that the evaluation contained binary sentiment classification text data even though training was done exclusively on image data!
Variation within the base model architectures was ensured by sampling the number of layers (uniformly from [2,5]) and the number of units per layer (logarithmically from [64,512]).
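A sampling scheme of this kind could look as follows. The sketch assumes integer layer counts with both endpoints included, a single width shared by all layers of one sampled model, and rounding of the log-uniform draw to the nearest integer. The function name is hypothetical.

```python
import math
import random

def sample_architecture(rng):
    """Draw (number of layers, units per layer) for one base model."""
    num_layers = rng.randint(2, 5)  # uniform over {2, 3, 4, 5}, endpoints inclusive
    # Uniform in log-space, so each doubling of the width is equally likely.
    log_units = rng.uniform(math.log(64), math.log(512))
    units = int(round(math.exp(log_units)))
    return num_layers, units

rng = random.Random(0)
architectures = [sample_architecture(rng) for _ in range(5)]
```

Sampling widths in log-space avoids the bias of a plain uniform draw, under which most models would land in the upper part of the [64,512] range.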
The application of dropout and data augmentation improved generalization. More precisely, input perturbations such as shifts, rotations and noise were applied. To further overcome the difficulty of training an update rule instead of mere features, Metz et al. went to great lengths to ensure stability. In addition to the previously mentioned choices, gradients were approximated with truncated backpropagation and stabilized with gradient clipping, along with further sampling strategies, e.g. for the number of unrolling steps. The maximum number of inner loop iterations was restricted, and 512 workers trained the model in a distributed fashion.
With this setup, training comprised approximately 200 thousand updates and took eight days.
Results
The meta-optimization was assessed using a rolling average over all datasets, architectures and unrolling step numbers. As visible in Figure 5, the steady decrease in the training meta-objective, even after 200 hours, indicates that the approximations are effective. Similar declining trends can be seen in testing for all datasets except IMDB (text domain), on which the update rule appears to overfit.
Figure 5: Meta-Objective for (left) training and (right) testing distributions ("Mini"/"Tiny" refers to resolution size) [1]
In order to examine how the learned update rule generalizes to other datasets and domains, few-shot classification was done with several methods (learned optimizer, random initialization, variational autoencoder, supervised model) on the same base model. Comparing the values (see Figure 6), the learned optimizer outperforms all others, which is remarkable as none of the evaluation datasets had been used in training and some of them have a larger image resolution than the training data.
Despite the inherent domain mismatch of IMDB, accuracy improves by a notable 10% within the first thirty hours, although it decreases again later on. Note again that there were no text training samples, only images!
Figure 6: Accuracy for (left) different models on evaluation data and (right) IMDB data [1]
When assessing the generalization over network architectures, good accuracy was maintained for both depths of up to eleven layers and widths of up to 10^4 units per layer (see Figure 7). Notably, the architectures sampled for training had much lower values. In addition, the learned update rule improved on random initialization for all tested activation functions. Overall, most functions reach decent accuracy even though the model was trained exclusively with ReLU.
Figure 7: Accuracy for (left) network architectures and (right) activation functions [1]
Conclusion
Altogether, Metz et al. managed to meta-learn an unsupervised update rule where biological constraints were adopted as proposed in earlier works, and several measures and strategies were deployed for the sake of generalization and stability. In the end, the update rule does not only have the ability to generalize to unseen datasets and varying domains (even to text data) as well as to different network architectures, but it also outperforms or at least matches existing methods.
Personal Review
The main paper is neatly structured in meaningful sections, which is a great benefit in terms of overview. Furthermore, the authors provided a large amount of detail in the appendices. This helps the reader's understanding, since the methodology and setup in particular are explained in a granular fashion. Another impressive aspect is that it was the first proposal of this kind, while simultaneously achieving such great results. Especially the fact that the learned update rule was able to generalize to text data, a domain entirely different from training, is highly interesting.
Nevertheless, certain weaknesses have not gone unnoticed. While the paper does contain a "discussion" section, it is rather a summary of the paper, contrary to what the title suggests. The authors failed to critically examine their work and highlight aspects that could have been further improved. On top of that, no future work was mentioned that could build on the existing approach.
Another point to be aware of is that reproducibility depends on resources. An institution wanting to reproduce the training approach would need computing power comparable to what Google Brain had available for this research. A related issue is that the code is only partly available [10].
Moreover, the paper displayed inconsistencies, e.g. the discrepancy of the meta-optimizer in the main paper (SGD) and the appendices (Adam), which leads to confusion.
Overall, the paper is definitely worth reading, so take a look!
References
[1] Metz, L., Maheswaranathan, N., Cheung, B., & Sohl-Dickstein, J. (2018). Meta-Learning Update Rules for Unsupervised Representation Learning. International Conference on Learning Representations. arXiv preprint arXiv:1804.00222.
[2] Schmidhuber, J. (1987). Evolutionary Principles in Self-referential Learning. Diploma Thesis. Department of Informatics, Technical University Munich.
[3] Bengio, Y., Bengio, S., & Cloutier, J. (1991). Learning a Synaptic Learning Rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks (Vol. 2, p. 969). IEEE.
[4] Bengio, S., Bengio, Y., Cloutier, J., & Gecsei, J. (1992). On the Optimization of a Synaptic Learning Rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks (Vol. 2). Univ. of Texas.
[5] Krizhevsky, A., & Hinton, G. (2009). Learning Multiple Layers of Features from Tiny Images.
[6] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi:10.1007/s11263-015-0816-y.
[7] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[8] Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747.
[9] Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142-150).
[10] Tensorflow, Learning Unsupervised Learning Rules (2018). GitHub Repository. https://github.com/tensorflow/models/tree/master/research/learning_unsupervised_learning [Accessed on July 1st, 2020]