This is the blog post for the paper 'Bayesian Meta-Learning for the Few-Shot Setting via Deep Kernels'.

Written by Massimiliano Patacchiola, Jack Turner, Elliot J. Crowley, Michael O’Boyle and Amos Storkey

Introduction

Most machine learning methods require a large amount of labeled data to recognize patterns, which can be hard to acquire in real-life scenarios. In the few-shot setting, where the data is insufficient to constrain the problem, one of the most prevalent approaches is meta-learning. Meta-learning means learning to learn on new tasks from old ones, and it typically follows a hierarchical framework: task-specific parameters are learned in a lower hierarchy and task-common parameters in an upper hierarchy [1] [2].

One representative meta-learning algorithm is MAML, which learns from multiple old tasks a good parameter initialization for new tasks. The core idea is to propagate the gradients from multiple tasks back to a common starting parameter, so that this common parameter captures general knowledge and generalizes across tasks. Another group of algorithms widely used for few-shot classification comprises MatchingNet, ProtoNet, RelationNet, and Baseline++. Broadly speaking, these methods learn to project inputs into a new space and measure class similarity there; their differences lie in the strategies adopted to compare distances.

In contrast to the aforementioned methods, this work proposes a Bayesian integral over the meta-learning lower hierarchy, motivated by the observation that the human ability for few-shot inductive reasoning could derive from a Bayesian inference mechanism [3] [4]. This integral over the individual tasks' parameters is implicitly implemented via a Gaussian process model with a deep kernel. The learned deep kernel can be transferred to new tasks, hence the name Deep Kernel Transfer (DKT).

Methodology

Gaussian Process and Kernel

In machine learning, different models can be used to capture the input-output relationship. Among them, Gaussian process models make predictions based on a prior belief shaped by a measure of similarity between points (the kernel), and they also provide uncertainty estimates along with their predictions.

A Gaussian process defines a distribution over functions that is fully specified by a mean function m(x) and a covariance kernel k(x,x’), written as:

f(x)∼GP(m(x), k(x,x’))

where x and x’ are any two training instances [5]. Kernels are chosen based on the structure of the training data and can take a variety of forms: linear, Radial Basis Function (RBF), Matérn, polynomial, Cosine Similarity (CosSim), BatchNorm CosSim (BNCosSim), and spectral mixture [6]. A kernel can be generally written as k’(x, x’|θ), with some hyperparameters θ to learn, and serves as the base kernel for deep kernel learning.
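As a concrete illustration of a base kernel with learnable hyperparameters, the sketch below computes an RBF kernel matrix k’(x, x’|θ) in plain PyTorch, with θ being a single lengthscale. This is a generic sketch for illustration, not the paper's implementation (which relies on GPyTorch).

```python
import torch

def rbf_kernel(x1, x2, lengthscale):
    """RBF base kernel k'(x, x'|theta) with theta = lengthscale.

    x1: (n, d) tensor, x2: (m, d) tensor -> (n, m) kernel matrix.
    """
    sq_dist = torch.cdist(x1, x2) ** 2          # pairwise squared distances
    return torch.exp(-0.5 * sq_dist / lengthscale ** 2)

# Example: kernel matrix between 5 and 3 random 2-D points,
# with the lengthscale as a learnable hyperparameter theta.
theta = torch.tensor(1.0, requires_grad=True)
K = rbf_kernel(torch.randn(5, 2), torch.randn(3, 2), theta)
print(K.shape)  # torch.Size([5, 3])
```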

Deep Kernel Learning

A deep kernel is formulated as:

k(x, x’|θ, 𝜙) = k’(F_𝜙(x), F_𝜙(x’)|θ)

where the inputs are transformed via a non-linear mapping F_𝜙 with a deep architecture parameterized by a set of weights 𝜙 (e.g. a convolutional neural network for image inputs) [7] [8]. When the Gaussian process model is applied in this work, the deep kernel replaces the original kernel, so θ and 𝜙 are learned jointly across tasks by maximizing the log marginal likelihood of the Bayesian meta-learning model described in the following sections.
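Correspondingly, a minimal sketch of a deep kernel is given below: a small MLP stands in for the feature extractor F_𝜙 (an illustrative assumption; the paper uses convolutional backbones for images), and the RBF form above serves as the base kernel, so that θ (the lengthscale) and 𝜙 (the network weights) live in one module and can be learned together.

```python
import torch
import torch.nn as nn

class DeepRBFKernel(nn.Module):
    """Deep kernel k(x, x'|theta, phi) = k'(F_phi(x), F_phi(x')|theta)."""

    def __init__(self, in_dim, feat_dim=16):
        super().__init__()
        # F_phi: non-linear feature extractor parameterized by weights phi.
        self.feature_extractor = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, feat_dim)
        )
        # theta: log-lengthscale of the RBF base kernel, learned jointly with phi.
        self.log_lengthscale = nn.Parameter(torch.zeros(()))

    def forward(self, x1, x2):
        z1, z2 = self.feature_extractor(x1), self.feature_extractor(x2)
        sq_dist = torch.cdist(z1, z2) ** 2
        return torch.exp(-0.5 * sq_dist / torch.exp(self.log_lengthscale) ** 2)

kernel = DeepRBFKernel(in_dim=1)
K = kernel(torch.randn(10, 1), torch.randn(10, 1))  # (10, 10) kernel matrix
```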

Meta-Learning

Meta-learning is about learning how to learn: experience is gained from a set of training tasks and evaluated on a set of test tasks. Each task T = \{S, Q\} consists of a support set S = \{(x_l, y_l)\}_{l=1}^L for learning how to solve the task and a query set Q = \{(x_m, y_m)\}_{m=1}^M for evaluating the performance on it. L and M are the numbers of instances in each set, and M is usually one order of magnitude greater than L. A collection of such tasks defines the dataset D = \{T_t\}_{t=1}^N.
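For concreteness, a task and an episodic dataset can be represented by simple containers such as the ones below; the class and field names are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass
import torch

@dataclass
class Task:
    """A few-shot task T = {S, Q}: a small support set and a larger query set."""
    support_x: torch.Tensor  # shape (L, ...)
    support_y: torch.Tensor  # shape (L,)
    query_x: torch.Tensor    # shape (M, ...), typically M >> L
    query_y: torch.Tensor    # shape (M,)

# Toy regression dataset D = {T_t}: N = 100 tasks with L = 5 and M = 50.
dataset = [
    Task(torch.randn(5, 1), torch.randn(5), torch.randn(50, 1), torch.randn(50))
    for _ in range(100)
]
```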

Bayesian Treatment    

Some state-of-the-art meta-learning methods differentiate through the task-specific parameters of the lower hierarchy to obtain derivatives of the task-common parameters in the upper hierarchy, which can cause instability problems [9]. Instead, the present paper proposes to replace the lower hierarchy with a Bayesian integral. This is a maximum likelihood type II (empirical Bayes) approach and is formulated as:

P(T_t^y|T_t^x,θ,𝜙) = \int \Big[ \prod_k P(y_k|x_k,θ,𝜙,ρ_t) \Big] P(ρ_t) \,{\rm d}ρ_t

where T_t^x and T_t^y denote the input and output data for task t respectively, k enumerates the elements x_k ∈ T_t^x, y_k ∈ T_t^y, ρ_t denotes the task-specific parameters for task t, and P(ρ_t) is the prior over those parameters. The task-specific integral is implicitly implemented by using a Gaussian process model with a deep kernel [7]. Hence, a closed-form expression for P(T_t^y|T_t^x, θ, 𝜙) can be obtained from this Gaussian process model without explicitly performing the integration. Since the tasks are independently and identically distributed, the marginal likelihood over the ensemble of input and output data (D_x and D_y) across all tasks can be written as:

P(D_y|D_x,θ,𝜙) = \prod_t P(T_t^y|T_t^x,θ,𝜙)

Gaussian Process Regression

In regression tasks, we are interested in a continuous output y_∗ generated by a clean signal f_∗(x_∗) corrupted by homoscedastic Gaussian noise 𝜀 with variance 𝜎^2. According to [5], since a Gaussian process model is applied, the predictive distribution over function values f_∗ is Gaussian with the following closed-form mean and covariance:

E[f_∗] = k_∗^T(K + 𝜎^2I)^{−1}y
cov(f_∗) = k_{∗∗}− k_∗^T(K + 𝜎^2I)^{−1}k_∗

where {\bf{k}}_∗ = k(x_∗, {\bf{x}}), k_{∗∗} = k(x_∗, x_∗), and K = k({\bf{x}}, {\bf{x}}), with {\bf{x}} denoting the training points in D and k(·) identifying the kernel.
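These predictive equations translate directly into code. Below is a minimal sketch assuming a kernel callable like the deep kernel sketched earlier and a scalar noise variance σ²; it is an illustration of the closed form, not the released implementation.

```python
import torch

def gp_posterior(kernel, train_x, train_y, test_x, noise_var):
    """Closed-form GP posterior mean E[f_*] and covariance cov(f_*) at test_x."""
    K = kernel(train_x, train_x)            # K = k(x, x)
    k_star = kernel(train_x, test_x)        # columns of k_* = k(x_*, x)
    k_star_star = kernel(test_x, test_x)    # k_** = k(x_*, x_*)
    A = K + noise_var * torch.eye(K.shape[0])
    alpha = torch.linalg.solve(A, train_y)  # (K + sigma^2 I)^{-1} y
    mean = k_star.T @ alpha                 # k_*^T (K + sigma^2 I)^{-1} y
    v = torch.linalg.solve(A, k_star)       # (K + sigma^2 I)^{-1} k_*
    cov = k_star_star - k_star.T @ v        # k_** - k_*^T (K + sigma^2 I)^{-1} k_*
    return mean, cov
```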

The noise variance σ^2 (entering through the covariance σ^2{\bf{I}}) is treated as part of the learnable hyperparameters θ of the kernel. The aforementioned marginal likelihood can then be expressed as:

log P(D_y|D_x,θ, 𝝓) = \sum_{t} \Big( - \underbrace{ \tfrac{1}{2} {\bf{y}}_t^T[K_t(θ,𝝓)]^{−1}{\bf{y}}_t }_{\text{data-fit}} - \underbrace{ \tfrac{1}{2} \log|K_t(θ, 𝝓)| }_{\text{penalty}} + c \Big)

where y_t contains all the target data of task t, K_t denotes the kernel matrix for task t, and c is a constant. The expression pleasingly separates into automatically calibrated data-fit and penalty terms.
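The per-task objective can likewise be written down directly; a minimal sketch is given below, dropping the constant c and absorbing the noise variance into K_t as described above.

```python
import torch

def gp_log_marginal_likelihood(kernel, x, y, noise_var):
    """log P(y|x, theta, phi) up to a constant: data-fit plus log-determinant penalty."""
    K = kernel(x, x) + noise_var * torch.eye(x.shape[0])  # K_t with noise absorbed
    data_fit = -0.5 * y @ torch.linalg.solve(K, y)        # -1/2 y^T K^{-1} y
    penalty = -0.5 * torch.logdet(K)                      # -1/2 log|K|
    return data_fit + penalty
```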

Stochastic Gradient Learning

With the obtained closed-form expressions, stochastic gradient learning is performed to learn the kernel parameters. Each batch contains the data of a single task, and θ and 𝝓 are jointly optimized by maximizing the log marginal likelihood. At test time, the learned parameters are used to make predictions on the query set Q_∗ conditioned on the support set S_∗. The corresponding pseudocode is presented in Algorithm 1.

Algorithm 1: pseudocode for the stochastic gradient learning of DKT
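Algorithm 1 can be summarized by the plain-PyTorch sketch below, which reuses the deep kernel, toy dataset, and helper functions from the earlier sketches; the optimizer, learning rate, and epoch count are illustrative assumptions (the released implementation builds on GPyTorch instead).

```python
import torch

kernel = DeepRBFKernel(in_dim=1)                      # holds theta and phi
log_noise_var = torch.nn.Parameter(torch.tensor(-2.0))
optimizer = torch.optim.Adam(list(kernel.parameters()) + [log_noise_var], lr=1e-3)

for epoch in range(100):
    for task in dataset:                              # one task per batch
        # Maximize the marginal likelihood over all points of the task.
        x = torch.cat([task.support_x, task.query_x])
        y = torch.cat([task.support_y, task.query_y])
        loss = -gp_log_marginal_likelihood(kernel, x, y, torch.exp(log_noise_var))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Test time: condition on the support set S_* and predict the query set Q_*.
new_task = dataset[0]
mean, cov = gp_posterior(kernel, new_task.support_x, new_task.support_y,
                         new_task.query_x, torch.exp(log_noise_var))
```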

Classification Based on Label Regression

For classification tasks, a classifier is derived based on label regression (LR) [10], which treats the classification problem as if it were a regression one, so that the obtained closed-form expressions can be reused.

In the simplest binary classification setting with class label c ∈ \{0, 1\}, the model is trained as a regressor with target y_+ = 1 for c = 1 and y_− = −1 for c = 0. Classification of an input x_∗ is made by selecting:

c_∗= argmax_c (σ(m_c(x_∗)))

where m_c(x_∗) is the predictive mean for class c calculated via Gaussian process regression, σ(·) is the sigmoid function and c_∗ ∈ \{0,1\}.

For the more general multi-class setting, a one-versus-rest scheme is applied: C binary classifiers each classify one class against all the rest, and the multi-class prediction is made analogously to the binary case with c_∗ ∈ \{1, ..., C\}.
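A minimal sketch of this one-versus-rest label-regression classifier is shown below, reusing the GP posterior helper from earlier; support_y is assumed to hold integer class labels in {0, ..., C-1}.

```python
import torch

def predict_classes(kernel, support_x, support_y, query_x, noise_var, num_classes):
    """One-vs-rest label regression: argmax over per-class GP predictive means."""
    scores = []
    for c in range(num_classes):
        # Regression targets: +1 for the current class, -1 for all the rest.
        targets = (support_y == c).float() * 2.0 - 1.0
        mean, _ = gp_posterior(kernel, support_x, targets, query_x, noise_var)
        scores.append(torch.sigmoid(mean))            # sigma(m_c(x_*))
    return torch.stack(scores, dim=1).argmax(dim=1)   # predicted class per query point
```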

Experiments and Results

General Setting

To realize a fair comparison between methods, the proposed algorithm is integrated into the framework released by [11], using PyTorch and GPyTorch [12]. Moreover, this work applies relatively shallow backbone networks for the deep kernel since, according to [11], shallow backbones highlight the differences between methods.

Regression

Datasets and Settings

Two regression tasks are performed: amplitude prediction for unknown periodic functions with the sine-wave dataset that motivated MAML [13], and head-pose trajectory estimation from images with the Queen Mary University of London multiview face dataset (QMUL [14]). Their corresponding backbone networks are a two-layer MLP and a three-layer convolutional neural network, respectively. Algorithms are tested under two data-range settings: one where the training and test sets cover the same range (in-range setting) and one where they cover different ranges (out-of-range setting). The average Mean-Squared Error (MSE) between predictions and true values is used as the metric; a sketch of how the sine-wave tasks can be sampled is given after the list below. The following algorithms are compared:

- DKT with a RBF kernel

- DKT with a spectral kernel

- feature transfer

- MAML [13]

- DKBaseline (a baseline where a deep kernel is trained from scratch on the support points of every task without transfer)

Performances of ADKL [15], R2-D2 [16], and ALPaCA [17], obtained on a similar task (as defined in [18]), are also included in the comparison.
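As mentioned above, sine-wave tasks for the amplitude-prediction experiment can be generated in the style of the MAML sinusoid benchmark [13]; the sketch below is illustrative, with assumed amplitude/phase ranges and shot counts rather than the paper's exact settings.

```python
import torch

def sample_sine_task(num_support=5, num_query=50, x_range=(-5.0, 5.0)):
    """Sample one task y = A * sin(x + b) with random amplitude A and phase b."""
    amplitude = torch.empty(1).uniform_(0.1, 5.0)      # assumed range
    phase = torch.empty(1).uniform_(0.0, 3.14159)      # assumed range
    x = torch.empty(num_support + num_query, 1).uniform_(*x_range)
    y = (amplitude * torch.sin(x + phase)).squeeze(-1)
    return (x[:num_support], y[:num_support],          # support set S
            x[num_support:], y[num_support:])          # query set Q
```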

Results

Results for the regression experiments are summarized in Table 1. DKT with a spectral kernel outperforms all the other methods for both tasks under both settings with the lowest MSE.

The DKBaseline performs significantly worse than DKT in all conditions, confirming the necessity of kernel transfer for few-shot problems. Moreover, since DKT with an RBF kernel performs significantly worse than DKT with a spectral kernel when predicting the periodic functions (which is expected, as the spectral kernel is well suited to recurrent data patterns), the choice of an appropriate base kernel for the deep kernel proves to be important.

Table 1: Average MSE and standard deviation (three runs) for the regression tasks

The DKT method also has the advantage of quantifying prediction uncertainty, as shown in the experiment where one head-pose trajectory input is corrupted with Cutout noise [19]. Qualitative results in Figure 1 show that DKT predicts a mean value (red line) close to the true one (blue line) while indicating a high level of uncertainty (red shadow), whereas feature transfer performs poorly at the same location.

Figure 1: qualitative uncertainty quantification in the head trajectory estimation
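For reference, Cutout [19] simply masks out a square patch of the input image at a random location; a minimal sketch is shown below (the patch size is an illustrative assumption).

```python
import torch

def cutout(image, patch_size=16):
    """Zero out a random square patch of a (C, H, W) image tensor."""
    _, h, w = image.shape
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y0, y1 = max(0, cy - patch_size // 2), min(h, cy + patch_size // 2)
    x0, x1 = max(0, cx - patch_size // 2), min(w, cx + patch_size // 2)
    corrupted = image.clone()
    corrupted[:, y0:y1, x0:x1] = 0.0
    return corrupted
```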

Within- and Cross-Domain Classification

Datasets and Settings

Both within-domain and cross-domain classification settings are studied. The Caltech-UCSD Birds dataset with 200 bird species (CUB-200 [20]) and mini-ImageNet [21] are used for within-domain classification. Cross-domain classification is carried out on mini-ImageNet→CUB (training set from mini-ImageNet, validation/test sets from CUB) and on Omniglot (characters from 50 different languages) [22] → EMNIST [23] (handwritten digits and English characters).

DKT with different kernels as well as various state-of-the-art methods are compared, including:

- MAML [13]

- ProtoNets [24]

- MatchingNet [25]

- RelationNet [26]

- feature transfer

- Baseline++ from [11]

All these methods have been trained from scratch with the same backbone and learning schedule.   

Results  

The results for the more challenging 1-shot case on CUB and for cross-domain classification, using a four-layer CNN as the backbone, are reported in Table 2. DKT achieves the highest accuracy in all three settings.

Table 2: Average accuracy and standard deviation (percentage) (three runs) for the classification tasks

Conclusion

This work introduces DKT, a simple and highly flexible Bayesian meta-learning method based on deep kernel learning. DKT has many advantages: it is straightforward to implement as a single optimizer, it provides uncertainty quantification, and it does not require the estimation of task-specific parameters. It outperforms several few-shot learning algorithms in regression as well as within- and cross-domain classification while providing a measure of uncertainty, and can thus replace more complex meta-learning routines.

Own Review

This work is informative and provides a novel idea: using a Gaussian process to replace the explicit integration over task-specific parameters, which is otherwise analytically intractable. Since a Gaussian process model can be regarded as a weighted aggregation of functions, it effectively performs this integration in closed form. Moreover, by leveraging the regressive nature of Gaussian process models and converting classification tasks into regression ones, the proposed method is applicable to both regression and classification, whereas some other few-shot learning methods suit only one type of task. The integration of a deep kernel into the Gaussian process model exploits the expressive power of deep architectures, enabling the model to capture more complex data patterns such as high-dimensional images; meanwhile, Gaussian process models are non-parametric, or rather infinite-parametric, so the deep architecture can be viewed as having a Gaussian process layer of unbounded width whose number of parameters grows with the data.

However, some mathematical terminology and derivations are not explained in this work, probably due to length limits, which may hinder the accessibility of the proposed method. Additional self-study and literature research were therefore necessary to interpret this work, and some supplementary explanations have been integrated into this blog post.

Although in standard Gaussian processes different kernels suit different data structures, the use of a deep architecture may make the choice of the base kernel harder to explain. More research could therefore be done to regain model interpretability for DKT. Furthermore, this work applies relatively shallow backbones in order to obtain clear comparisons, but it has not yet been explored whether deeper networks would yield a performance boost for DKT, which is worth investigating. Another direction for future work is to apply DKT to further tasks such as object detection and image segmentation, preferably under the cross-domain setting, since this setting is often encountered in real-life scenarios.

References

[1]  C. Finn, K. Xu, and S. Levine, “Probabilistic model-agnostic meta-learning,” in Advances in Neural Information Processing Systems, 2018, pp. 9516–9527.


[2]  E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths, “Recasting gradient-based meta-learning as hierarchical bayes,” arXiv preprint arXiv:1801.08930, 2018.


[3]  M. Steyvers, T. L. Griffiths, and S. Dennis, “Probabilistic inference in human semantic memory,” Trends in cognitive sciences, vol. 10, no. 7, pp. 327–334, 2006. 


[4]  J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman, “How to grow a mind: Statistics, structure, and abstraction,” science, vol. 331, no. 6022, pp. 1279–1285, 2011. 


[5]  C. E. Rasmussen, “Gaussian processes in machine learning,” in Summer School on Machine Learning. Springer, 2003, pp. 63–71. 


[6]  A. Wilson and R. Adams, “Gaussian process kernels for pattern discovery and extrapolation,” in International conference on machine learning, 2013, pp. 1067–1075.


[7]  G. E. Hinton and R. R. Salakhutdinov, “Using deep belief nets to learn covariance kernels for gaussian processes,” Advances in neural information processing systems, vol. 20, pp. 1249–1256, 2007.


[8]  A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, “Deep kernel learning,” in Artificial intelligence and statistics, 2016, pp. 370–378. 


[9]  A. Antoniou, H. Edwards, and A. Storkey, “How to train your maml,” arXiv preprint arXiv:1810.09502, 2018. 


[10]  M. Kuss, “Gaussian process models for robust regression, classification, and reinforcement learning,” Ph.D. dissertation, Technische Universität Darmstadt, Darmstadt, Germany, 2006.


[11]  W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A closer look at few-shot classification,” arXiv preprint arXiv:1904.04232, 2019. 


[12]  J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson, “Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration,” Advances in Neural Information Processing Systems, vol. 31, pp. 7576–7586, 2018. 


[13]  C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017. 


[14]  S. Gong, S. McKenna, and J. J. Collins, “An investigation into face pose distributions,” in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition. IEEE, 1996, pp. 265–270. 


[15]  P. Tossou, B. Dura, F. Laviolette, M. Marchand, and A. Lacoste, “Adaptive deep kernel learning,” arXiv preprint arXiv:1905.12131, 2019. 


[16]  L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” arXiv preprint arXiv:1805.08136, 2018. 


[17]  J. Harrison, A. Sharma, and M. Pavone, “Meta-learning priors for efficient online bayesian regression,” in International Workshop on the Algorithmic Foundations of Robotics. Springer, 2018, pp. 318–337.


[18]  J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn, “Bayesian model-agnostic meta-learning,” Advances in Neural Information Processing Systems, vol. 31, pp. 7332–7342, 2018.


[19]  T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017. 


[20]  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011. 


[21]  S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” 2016.


[22]  B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum, “One shot learning of simple visual concepts,” in Proceedings of the annual meeting of the cognitive science society, vol. 33, no. 33, 2011. 


[23]  G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik, “Emnist: Extending mnist to handwritten letters,” in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 2921–2926.


[24]  J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Advances in neural information processing systems, 2017, pp. 4077–4087. 


[25]  O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in Advances in neural information processing systems, 2016, pp. 3630–3638. 


[26]  F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208. 



   

