This is a blog post about the paper 'Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning', published at CVPR 2019.

The paper was written by Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi and Roozbeh Mottaghi.

Introduction

For traditional deep learning models, training and inference are strictly separated: model parameters are updated only during training and remain frozen during inference, which can make it very difficult to generalize to unseen data distributions or environments. In contrast, there is no clear boundary between training and testing for human beings. People often learn a new task with explicit external supervision at first, and afterwards keep adapting to new settings through interaction with the environment.

This paper explores learning to learn, i.e., adaptation during both training and testing, in the context of visual navigation. The goal of visual navigation is to move from an initial position to a target object or region with the help of visual information. The key challenge is how the agent can generalize to a new scene that has not been observed during training. To tackle this challenge, a self-adaptive visual navigation (SAVN) model is proposed, which learns to adapt during inference without any external supervision by minimizing a self-supervised interaction loss.

Formally, the solution is a meta-reinforcement learning approach. During training, the agent minimizes not only a navigation loss that uses external supervision, but also an interaction loss, which is learned by a neural network and encouraged to have a gradient similar to that of the navigation loss. This makes it possible to keep minimizing the interaction loss during inference and to continue updating the model parameters.

The paper performs experiments in the AI2-THOR[1] framework and shows that SAVN outperforms the non-adaptive baseline as well as hand-crafted interaction losses in terms of success rate and SPL.

Methodology

Task Definition

The input consists of a set of scenes S, an initial position p, and a target class o. The agent is required to navigate to the target object using only egocentric RGB images. The target class is given as a GloVe embedding[2].

At each time t, the agent must take an action a chosen from the action set A until the termination action is issued. An episode is regarded as successful if the target is sufficiently close and visible when the termination action is issued. The termination is triggered if sufficient similarity is detected in two consecutive states.

Learning

The architecture of the model is illustrated in Figure 1.

Figure 1: Architecture overview

To handle the RGB scene images, they are first fed into a ResNet18[3] pretrained on ImageNet[4] to extract features; the parameters of the ResNet are kept frozen throughout. Meanwhile, the target word is embedded and transformed into a representation via a fully connected (FC) layer. Next, a joint feature map of the scene and the target word is obtained with a pointwise convolution. The output is then flattened and fed into a Long Short-Term Memory network (LSTM); a linear layer on top of the LSTM produces the policy distribution \pi_\theta(s_t) and the value of the state v_\theta(s_t).
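
A minimal PyTorch sketch of this forward pass may make the data flow concrete. The module names, channel sizes, and the way the target embedding is tiled and fused are illustrative assumptions, not the authors' exact implementation:

import torch
import torch.nn as nn
import torchvision.models as models

class NavModel(nn.Module):
    # Sketch of the SAVN-style backbone: frozen ResNet18 features, a GloVe
    # target embedding passed through an FC layer, pointwise-convolution
    # fusion, an LSTM, and policy/value heads.
    def __init__(self, num_actions=6, glove_dim=300, hidden=512):
        super().__init__()
        resnet = models.resnet18(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # 512x7x7 features for 224x224 inputs
        for p in self.backbone.parameters():
            p.requires_grad = False                                   # ResNet stays frozen
        self.target_fc = nn.Linear(glove_dim, 64)                     # embed the target word
        self.fuse = nn.Conv2d(512 + 64, 64, kernel_size=1)            # pointwise convolution
        self.lstm = nn.LSTMCell(64 * 7 * 7, hidden)
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, image, glove_target, state=None):
        feat = self.backbone(image)                                   # (B, 512, 7, 7)
        tgt = self.target_fc(glove_target)                            # (B, 64)
        tgt = tgt[:, :, None, None].expand(-1, -1, 7, 7)              # tile over spatial dims
        joint = self.fuse(torch.cat([feat, tgt], dim=1))              # joint feature map
        h, c = self.lstm(joint.flatten(1), state)
        return self.policy_head(h), self.value_head(h), (h, c)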

The output of the LSTM is compared with supervision labels to compute the navigation loss L_{nav}, a standard supervised actor-critic navigation loss[5][6], while the hidden state of the LSTM and the policy output are concatenated and used for the interaction loss L_{int}^\phi parameterized by \phi.

Learning to Learn

During training for visual navigation, the agent learns how to adapt from interaction; this is accomplished with a training objective based on the Model-Agnostic Meta-Learning (MAML)[7] algorithm. If the distributions of training and testing tasks are sufficiently similar, the network is capable of adapting to novel test tasks.

For a navigation task \tau with only one gradient update, D^{int}_\tau denotes the actions, observations, and internal state representations for the first k steps of the agent's trajectory, while D^{nav}_\tau denotes the same information for the remainder of the trajectory. The training objective is given by:

\min \limits_{\theta} \sum \limits_{\tau \in T_{train}} L_{nav}(\theta-\alpha\nabla_\theta L_{int}(\theta, D_\tau^{int}),D_\tau^{nav})

First, the agent interacts with the environment for k steps and then adapts to it by using the gradient of the interaction loss to update the parameters before the navigation loss is evaluated. As in [8], the objective can be written as follows via a first-order Taylor expansion:

\min \limits_{\theta} \sum \limits_{\tau \in T_{train}} L_{nav}(\theta,D_\tau^{nav})-\alpha\langle \nabla_\theta L_{int}(\theta,D_\tau^{int}), \nabla_\theta L_{nav}(\theta,D_\tau^{nav})\rangle

\langle\cdot , \cdot\rangle denotes an inner product. Minimizing this objective therefore also maximizes the inner product, which encourages the gradients of the two losses to be similar. If they are similar, 'training' can continue during inference even though labels are not available and the navigation loss cannot be computed.
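
The one-step adaptation in the objective above can be sketched in PyTorch as follows. The helper functions interaction_loss_fn and navigation_loss_fn, which take a parameter dictionary and the stored trajectory data, are hypothetical placeholders, not the paper's code:

import torch

def meta_train_step(model, interaction_loss_fn, navigation_loss_fn,
                    D_int, D_nav, alpha=1e-4):
    # Only the trainable parameters are adapted (the frozen ResNet is excluded).
    params = {n: p for n, p in model.named_parameters() if p.requires_grad}

    # Inner step: one gradient step on the self-supervised interaction loss,
    # computed on the first k steps of the trajectory (D_int).
    L_int = interaction_loss_fn(params, D_int)
    grads = torch.autograd.grad(L_int, list(params.values()), create_graph=True)
    adapted = {n: p - alpha * g for (n, p), g in zip(params.items(), grads)}

    # Outer step: the supervised navigation loss on the rest of the trajectory
    # (D_nav) is evaluated with the adapted parameters and backpropagated
    # through the inner update, as in MAML.
    L_nav = navigation_loss_fn(adapted, D_nav)
    L_nav.backward()
    return L_nav.detach()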

However, it is difficult to hand-craft an interaction loss that fulfills these requirements, so the best option is to learn it.

Learning to Learn How to Learn

To learn the interaction loss, one-dimensional temporal convolutions are used as its architecture: two layers, the first with 10\times1 filters and the second with 1\times1 filters. The l_2 norm of the output is taken to obtain a scalar objective[9]. The interaction loss is minimized during both training and inference, so the model parameters keep being updated at all times.
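
A rough sketch of such a learned loss is shown below. Reading "10\times1" as ten filters of size 1, and placing a ReLU between the two layers, are assumptions for illustration:

import torch
import torch.nn as nn

class LearnedInteractionLoss(nn.Module):
    # Sketch of the learned interaction loss: two 1D temporal convolutions over
    # the k-step sequence of concatenated LSTM hidden states and policy outputs,
    # with the l2 norm of the output as the scalar objective.
    def __init__(self, in_channels):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels, 10, kernel_size=1)
        self.conv2 = nn.Conv1d(10, 1, kernel_size=1)

    def forward(self, traj):              # traj: (1, in_channels, k)
        x = torch.relu(self.conv1(traj))
        x = self.conv2(x)
        return x.norm(p=2)                # scalar self-supervised objective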

Two hand-crafted interaction losses are also experimented with to provide a fair comparison with the learned loss. The first is a diversity loss, which encourages the agent to take varied actions:

L_{int}^{div}(\theta, D^{int}_\tau)=\sum \limits_{i<j\le k} g(s_i, s_j) \log\big(\pi_\theta^{(a_i)}(s_j)\big),

where s_t is the agent's state at time t, a_t is the action the agent takes at time t, and g(s_i, s_j) is 1 if the pixel difference between s_i and s_j is below a certain threshold and 0 otherwise.
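
A direct, unoptimized sketch of this diversity loss is given below. The similarity test g (mean absolute pixel difference below a threshold) and the tensor shapes are assumptions for illustration:

import torch

def diversity_loss(log_probs, frames, actions, pixel_threshold=0.01):
    # Penalize assigning high probability, in state s_j, to the action a_i that
    # was already taken in a visually similar earlier state s_i.
    # log_probs: (k, num_actions) log pi_theta(s_t); frames: (k, C, H, W)
    # observations; actions: (k,) long tensor of actions taken.
    k = log_probs.shape[0]
    loss = log_probs.new_zeros(())
    for i in range(k):
        for j in range(i + 1, k):
            similar = (frames[i] - frames[j]).abs().mean() < pixel_threshold
            if similar:                           # g(s_i, s_j) = 1
                loss = loss + log_probs[j, actions[i]]
    return loss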

The second is a prediction loss, in which the agent tries to predict the success of each action. Let \pi_\theta(s_t) be the policy distribution and q_\theta(s_t) the predicted success probabilities. The action is then sampled from \pi_\theta(s_t) \times q_\theta(s_t) instead of \pi_\theta(s_t), and the loss becomes:

L^{pred}_{int}(\theta, D^{int}_\tau)=\sum \limits_{t=0}^{k-1} H\big(q_\theta^{(a_t)}(s_t), 1-g(s_t, s_{t+1})\big)

where H(\cdot,\cdot) denotes the binary cross-entropy.
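
A sketch of the prediction loss under the same assumptions about g as above; q_values here stands for the predicted success probabilities q_\theta(s_t) over all actions:

import torch
import torch.nn.functional as F

def prediction_loss(q_values, frames, actions, pixel_threshold=0.01):
    # Binary cross-entropy between the predicted success probability of the
    # action taken at time t and the observed outcome, where an action counts
    # as successful if the observation actually changed (g(s_t, s_{t+1}) = 0).
    # q_values: (k, num_actions); frames: (k+1, C, H, W); actions: (k,) long tensor.
    k = actions.shape[0]
    taken = q_values[torch.arange(k), actions]                       # q_theta^{(a_t)}(s_t)
    unchanged = torch.stack([((frames[t] - frames[t + 1]).abs().mean()
                              < pixel_threshold).float() for t in range(k)])
    target = 1.0 - unchanged                                         # 1 - g(s_t, s_{t+1})
    return F.binary_cross_entropy(taken, target)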

Experiments and Evaluations

Experiment Settings

The experiments are conducted in the AI2-THOR environment, which provides indoor 3D synthetic scenes of four room categories: kitchen, living room, bedroom, and bathroom. For each room type, 20 scenes are used for training, 5 for validation, and 5 for testing. The action set A consists of MoveAhead, RotateLeft, RotateRight, LookDown, LookUp, and Done. Horizontal rotations turn the camera by 45 degrees, and looking up or down tilts it by 30 degrees. A navigation task is completed successfully if the Done (termination) action is issued while an instance of the target class is within 1 meter of the agent's camera and within its field of view.

For the navigation reward, the agent receives a reward of 5 for finding the object and -0.01 for each step taken. The interaction gradient is computed every k = 6 steps, and inference is performed on a total of 1000 episodes with different initial positions.
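
As a small illustration, the reported reward scheme could be written as below; the success-check arguments are stand-ins for AI2-THOR's actual visibility and distance checks:

def step_reward(done_issued: bool, target_visible: bool, target_distance: float) -> float:
    # +5 when the episode ends successfully (Done issued with the target visible
    # and within 1 meter), and a -0.01 penalty for every other step.
    if done_issued and target_visible and target_distance <= 1.0:
        return 5.0
    return -0.01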

Evaluation Metrics

Two metrics are used to evaluate the results: Success Rate (SR) and Success weighted by Path Length (SPL)[10], defined respectively as:

SR=\frac{1}{N}\sum\limits_{i=1}^{N} S_i

SPL=\frac{1}{N}\sum \limits_{i=1}^N S_i \frac{L_i}{\max(P_i, L_i)}

N is the number of episodes, S_i is a binary indicator of success in episode i, P_i is the length of the path taken by the agent, and L_i is the length of the optimal trajectory.
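
Both metrics follow directly from per-episode statistics, as in the short sketch below (the tuple format for episodes is an assumed data layout):

def success_rate_and_spl(episodes):
    # episodes: list of (success, path_length, optimal_length) tuples, where
    # success is S_i (0 or 1), path_length is P_i, and optimal_length is L_i.
    n = len(episodes)
    sr = sum(s for s, _, _ in episodes) / n
    spl = sum(s * l / max(p, l) for s, p, l in episodes) / n
    return sr, spl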

Results

The model is compared with several baselines. The first is a Random agent (Random), which samples actions uniformly at random. The Nearest Neighbor (NN) baseline selects the most similar visual observation among the scenes in the training set. The non-adaptive baseline (A3C)[11] uses no interaction gradient and no interaction loss. The two hand-crafted interaction losses described above are also included in the comparison. The results are shown in Table 1.

Table 1: Results of different models

SAVN outperforms all other models in both SR and SPL. With the self-supervised objective, the agent learns to navigate not only more effectively but also more efficiently.

Ablation Study

To gain further insight into the results, several ablation studies are performed. First, the non-adaptive baseline is augmented with a memory module that performs self-attention over the latest k hidden states of the LSTM (A3C w/ mem). Second, the prediction loss is added to the training loss of the non-adaptive baseline (A3C w/ prediction loss). These experiments show that SAVN's gains are not simply a consequence of additional losses.

In addition, since issuing the termination action at the correct location plays a significant role in navigation, the termination signal is also provided by the ground truth of the environment (GT obj); SAVN still outperforms all other methods in this setting.

Table 2: Ablation results

Conclusions

This paper introduces a self-adaptive visual navigation agent (SAVN) that learns during both training and inference, a novel approach to the generalization problems encountered by many deep learning models. The key idea is to learn a self-supervised interaction loss that can be used when no supervision is available during inference. The model achieves the best results compared to non-adaptive baselines and hand-crafted losses.

References

[1] Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017.

[2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[5] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.

[6] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments. In ICLR, 2017.

[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

[8] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI, 2018.

[9] Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. In RSS, 2018.

[10] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents. arXiv, 2018. 

[11] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.