This blog post covers the topic of SSL representation learning by/for object detection. The self-supervised learning (SSL) method of contrastive learning is introduced and discussed, and three papers using part/object-level contrastive learning techniques are presented and compared, together with their respective experimental results.

Blog post author: Unknown User (ge32zix)

Tutor: Unknown User (ga59mat)


1. Introduction

The past years have seen the power of supervised learning, with many deep neural networks (DNNs) widely used in computer vision tasks (like object detection, image classification, semantic segmentation, etc.) and natural language processing tasks (like sentiment prediction, text categorization and classification, named entity recognition, etc.). These supervised learning approaches often rely on large-scale datasets like ImageNet. However, the collection and annotation of these large-scale datasets are expensive and time-consuming.

To avoid this, many self-supervised learning methods have been proposed. SSL aims to learn representations without relying on human annotations; the learning is often driven by objectives defined by pretext tasks. In early attempts, pretext tasks were hand-designed, and they can be divided into generation-based ones (like image colorization, image inpainting, image generation with GANs, etc.) and context-based ones (like clustering, image jigsaw puzzles, context prediction, etc.), see Figure 1.

Figure 1. SSL methods

Although the downstream application may have nothing to do with colorization or inpainting, the intuition behind these hand-designed tasks is that in order to predict the color (colorization) or the missing patch (inpainting), the network needs to understand the context of an image and thus learns useful representations. However, hand-designed pretext tasks may limit the generality of the learned representations.

Recently, contrastive learning broke away from this tradition in that its pretext task is data-driven: it requires the representation of each sample to be distinguishable from those of others. Intuitively, the model learns representations by comparing different samples. More details and examples are given in the next section.

2. Related Work

By definition, contrastive learning aims to learn an embedding space where positive pairs (e.g. two augmented views of the same image) are pulled closer and negative pairs (e.g. views from different images) are pushed apart. Figure 2 provides an illustration of contrastive learning.



Figure 2. The contrastive learning paradigm [2]

According to MoCo [3], we can also cast this as a key-query matching problem: a query should be similar to its matching key and dissimilar to all others. Based on this, the commonly used contrastive loss InfoNCE [4] can be defined as:

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}$$

where q represents the query, k_+ represents the positive key, τ is the temperature parameter, and · denotes the dot product, which measures similarity. The similarity of the query and the positive key is normalized by the sum of similarities over all query-key pairs, including the positive one.
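To make this concrete, below is a minimal sketch of the InfoNCE loss in PyTorch. This is my own illustration rather than any paper's code; the function name and the assumption that all inputs are L2-normalized are mine.

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos_k, neg_k, tau=0.07):
    """Minimal InfoNCE loss (illustrative sketch).

    q:     (B, D) queries, assumed L2-normalized
    pos_k: (B, D) positive keys, one per query, assumed L2-normalized
    neg_k: (N, D) negative keys shared by all queries, assumed L2-normalized
    """
    # Similarity of each query with its own positive key: (B, 1)
    l_pos = torch.sum(q * pos_k, dim=1, keepdim=True)
    # Similarity of each query with every negative key: (B, N)
    l_neg = q @ neg_k.t()
    # The positive key sits at index 0 among the (1 + N) candidates,
    # so InfoNCE reduces to cross-entropy with target class 0
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```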

Though contrastive learning methods could have similar learning paradigms, they may differ in strategies to generate instances.

End-to-end approaches generate samples within a mini-batch, and the query and key encoders both receive gradient flow. But the batch size limits the number of negative samples, which are believed to play an important role in learning useful representations. To address this problem, Wu et al. [5] proposed a memory bank that stores the features of all samples computed in previous steps, so that negative samples can be drawn from it. He et al. [3] further proposed MoCo, which uses a momentum encoder to encode keys and a queue to store computed representations. The momentum encoder is updated by an exponential moving average of the query encoder's parameters, which keeps the key representations consistent. These different approaches are shown in Figure 3.

Figure 3. Different strategies to generate instances [3]
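As a rough sketch of the momentum-encoder idea (a simplified illustration, not He et al.'s official implementation; `momentum_update` and `enqueue` are hypothetical helper names):

```python
import torch

@torch.no_grad()
def momentum_update(query_enc, key_enc, m=0.999):
    # The key encoder slowly tracks the query encoder via an
    # exponential moving average; it receives no gradients itself.
    for p_q, p_k in zip(query_enc.parameters(), key_enc.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue(queue, new_keys, max_size=65536):
    # Prepend the newest keys and drop the oldest ones, so the queue of
    # negatives keeps a fixed size that is decoupled from the batch size.
    queue = torch.cat([new_keys, queue], dim=0)
    return queue[:max_size]
```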

Contrastive learning methods may also vary in their contrastive losses. Besides InfoNCE [4] and its variants, Zbontar et al. [6] propose an objective that makes the representations of a positive pair similar while minimizing the redundant information between them. Li et al. [7] introduce the ProtoNCE loss, an extension of InfoNCE that encourages sample representations to be closer to their assigned prototypes. The prototypes are cluster centroids, obtained by clustering samples in an embedding space.

However, Hénaff et al. [10] raise a question: is instance-wise contrastive learning too simple to generate good representations? These methods define an instance as an augmented view of a whole image, which neglects the fact that images are composed of many objects, and risks washing out the identities, relationships, and layout of those objects in the learned representations. With this question in mind, this blog introduces three recently proposed methods that all focus on part- or object-level contrastive learning and achieve state-of-the-art results.


3. Methodology

3.1 Contrastive Part Discovery [8]

This method aims to discover and segment parts in an unsupervised manner. First, it is important to understand the differences between parts and objects:

  1. Objects appear arbitrarily in scenes, while parts are more constrained (e.g. arms must connect to a torso)
  2. Objects in a scene are orderless and typeless, while parts refer to specific, nameable entities (like the head of a bird)

To define a meaningful part, the authors set a list of rules:

  1. Parts should have uniform feature information
  2. Parts should be consistent across images and distinctive from other parts
  3. Parts should be invariant to geometric and photometric transformations
  4. Parts should be visually consistent

To that end, the authors propose a proxy task that fulfills the above rules, yielding four losses, one per rule:

Figure 4. Training objectives of the paper [8]

where the feature loss corresponds to rule 1, the contrastive loss to rule 2, the equivariance loss to rule 3, and the visual consistency loss to rule 4.

For implementation, two separate networks are needed: a deep neural network f that receives images and produces segmentation masks, and a perceptual network Φ that extracts features.

The mask M softly assigns each pixel u to the K parts, such that for all pixels u:

$$\sum_{k = 1}^{K} M^k_u = 1$$

where K is the number of parts. Combining the mask M with the features extracted by Φ, the authors define an average part descriptor z_k for each part:

$$z_k = \frac{\sum_{u \in \Omega} M^k_u \, \Phi(I)_u}{\sum_{u \in \Omega} M^k_u}$$

where Ω represents the set of foreground pixels in the image. Each part is thus described by a single descriptor that is differentiable with respect to the mask.
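A small sketch of how such a mask-weighted average could be computed in PyTorch (my own illustration; the function name and tensor shapes are assumptions, not the paper's code):

```python
import torch

def part_descriptors(feats, mask, eps=1e-8):
    """Mask-weighted average features, one descriptor per part.

    feats: (D, H, W) per-pixel features from the perceptual network Phi
    mask:  (K, H, W) soft part assignments, summing to 1 over K per pixel
    returns: (K, D) part descriptors z
    """
    # Weighted sum of features inside each part
    weighted = torch.einsum('khw,dhw->kd', mask, feats)
    # Normalize by the (soft) area of each part
    area = mask.sum(dim=(1, 2)).clamp(min=eps).unsqueeze(1)  # (K, 1)
    return weighted / area
```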

A. Feature loss

Feature consistency can then be enforced by minimizing the within-part variance of the features, i.e. the mask-weighted distance of each pixel's feature to its part descriptor:

$$\mathcal{L}_{feat} = \sum_{k = 1}^{K} \sum_{u \in \Omega} M^k_u \left\| \Phi(I)_u - z_k \right\|^2$$

By doing so, pixels are assigned to the same part if they have similar representations. 

B. Contrastive loss

The contrastive loss takes the same form as the InfoNCE loss, applied to part descriptors:

$$\mathcal{L}_{cont} = -\log \frac{\exp(z_k \cdot \hat{z}_k / \tau)}{\exp(z_k \cdot \hat{z}_k / \tau) + \sum_{j} \exp(z_k \cdot z_j / \tau)}$$

The loss is computed over N images and K parts, where the positive sample ẑ_k is the descriptor of the same part in another image, and the negatives z_j are descriptors of different parts from different images.

C. Equivariance loss

The idea behind this loss is that geometric and photometric transformations should not change part assignments: segmenting a transformed image should give the same result as transforming the segmentation. The objective minimizes the Kullback–Leibler divergence between the two:

$$\mathcal{L}_{eq} = D_{KL}\left( f(T(I)) \,\|\, T(f(I)) \right)$$

where T is a randomly selected transformation and f is the network that maps images to segmentations.

D. Visual consistency loss

This term encourages parts to be roughly uniformly colored by minimizing the within-part variance of pixel colors:

$$\mathcal{L}_{vis} = \sum_{k = 1}^{K} \sum_{u \in \Omega} M^k_u \left\| I_u - c_k \right\|^2$$

where c_k is the average color of part k and pixels within a part are assumed to be i.i.d. samples from identical Gaussians.

3.2 DetCon: Contrastive Detection [9]

The authors propose contrastive detection, which encourages the model to learn representations by performing object-level contrastive learning. The method is presented in Figure 5:

Figure 5. The contrastive detection method [9]

First, an off-the-shelf segmentation algorithm is run to obtain segmentation masks, which are shown in different colors in Figure 5. Two data augmentations are then applied to each image, and the same augmentations are applied to the masks.

Next, compute a mask-pooled hidden vector for each mask m by averaging the encoder's feature map over the mask:

$$h_m = \frac{\sum_{u} m_u \, h_u}{\sum_{u} m_u}$$

These mask-pooled vectors are further projected to latents v_m by a multi-layer perceptron, and an InfoNCE-style contrastive loss is defined over them:

$$\mathcal{L} = -\log \frac{\exp(v_m \cdot v'_{m} / \tau)}{\exp(v_m \cdot v'_{m} / \tau) + \sum_{n} \exp(v_m \cdot v_{n} / \tau)}$$

where v_m and v'_m are the latents of the same mask in the two augmented views, and the negatives v_n come from other masks in the same image and from other images.
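A hedged sketch of the pooling step (the function name and shapes are my assumptions, not DeepMind's implementation):

```python
import torch

def mask_pool(hidden, masks, eps=1e-8):
    """Pool a feature map into one hidden vector per mask.

    hidden: (B, D, H, W) feature map from the encoder
    masks:  (B, K, H, W) binary or soft masks, augmented like the images
    returns: (B, K, D) mask-pooled hidden vectors
    """
    # Sum features under each mask, then divide by the mask area
    pooled = torch.einsum('bkhw,bdhw->bkd', masks, hidden)
    area = masks.sum(dim=(2, 3)).clamp(min=eps).unsqueeze(-1)  # (B, K, 1)
    return pooled / area
```

The pooled vectors for the same mask in the two views then form a positive pair, and all other masks serve as negatives, much as in the instance-level InfoNCE loss from Section 2.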

The authors also explore different segmentation algorithms, including spatial heuristics, Felzenszwalb-Huttenlocher (FH), Multiscale Combinatorial Grouping (MCG), and human-annotated masks. Here's an example:

Figure 6. Example masks used by DetCon

3.3 Odin: Object Discovery and Representation Networks [10]

Instead of using an off-the-shelf segmentation algorithm, Odin proposes an object discovery network to obtain segmentations and object representation networks to learn representations.

Figure 7. Object discovery and representation networks [10]

Object Discovery Network

The object discovery network takes as input a spanning view, i.e. the smallest crop that spans both augmented view 1 and augmented view 2, and produces a feature map, which is then clustered using K-means to generate segmentation masks.
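For intuition, the discovery step could look roughly like the following K-means sketch (my own simplified illustration; Odin's actual clustering details may differ):

```python
import torch

def discover_masks(feats, k=8, iters=10):
    """Cluster per-pixel features into k segments via plain K-means.

    feats: (D, H, W) feature map from the object discovery network
    returns: (H, W) integer map of cluster (i.e. object) assignments
    """
    d, h, w = feats.shape
    x = feats.permute(1, 2, 0).reshape(-1, d)             # (H*W, D) pixel features
    centroids = x[torch.randperm(x.size(0))[:k]].clone()  # random initialization
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)  # nearest centroid
        for j in range(k):
            pts = x[assign == j]
            if len(pts) > 0:
                centroids[j] = pts.mean(dim=0)            # recompute centroid
    assign = torch.cdist(x, centroids).argmin(dim=1)
    return assign.reshape(h, w)
```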

Object Representation Networks

In the object representation networks, each augmented view is fed into two networks: an online network (whose feature extractor is f_θ in Figure 7) and a target network (whose feature extractor is f_ξ in Figure 7). The same data augmentations are applied to the masks, which are also input into these two networks.

The idea is that an object-level feature in one view should be predictable from the other view. First, compute a mask-pooled hidden vector for each mask (i.e. an object-level feature):

$$h_\theta^{k,l} = \frac{\sum_{u} m^k_u \, h_\theta(x^l)_u}{\sum_{u} m^k_u}$$

where h_θ^{k,l} denotes the mask-pooled hidden vector computed by the online network, k indexes the k-th object, and l \in \{1, 2\} indexes view 1 or view 2.

In the online network, object-level features are first computed, then two MLPs are applied to obtain predictions, while in the target network only one MLP is applied to obtain the targets. Here is a figure that hopefully illustrates the structure of the object representation networks better:

Figure 8. Structure of object representation networks

Next, a cosine similarity is defined between predictions and targets:

$$s^{k}_{1 \to 2} = \frac{\langle q_\theta(z_\theta^{k,1}), \, z_\xi^{k,2} \rangle}{\left\| q_\theta(z_\theta^{k,1}) \right\| \left\| z_\xi^{k,2} \right\|}$$

This can be interpreted as predicting the k-th object in view 2 from view 1, where q_θ(z_θ^{k,1}) is the prediction of the k-th object feature in view 2 and z_ξ^{k,2} is the target for that prediction.

A contrastive loss is further defined over these similarities, treating the matching prediction-target pair as positive and all other pairs as negatives:

$$\ell^{k}_{1 \to 2} = -\log \frac{\exp(s^{k}_{1 \to 2} / \tau)}{\exp(s^{k}_{1 \to 2} / \tau) + \sum_{n} \exp(s^{k,n}_{1 \to 2} / \tau)}$$

where s^{k,n}_{1→2} denotes the similarity between the prediction for object k and the target of a non-matching object n. The overall objective sums over all objects and both prediction directions (summation across images is omitted for simplicity):

$$\mathcal{L} = \sum_{k = 1}^{K} \left( \ell^{k}_{1 \to 2} + \ell^{k}_{2 \to 1} \right)$$

But you may wonder why the target network uses a different set of parameters from the online network. The reason is that the authors wish to stabilize the targets, which means their parameters should vary slowly. Inspired by BYOL [11], the authors update the parameters of the target network using an exponential moving average (EMA) of the parameters of the online network. In fact, the object discovery network is also updated using EMA:

$$\xi \leftarrow \lambda \xi + (1 - \lambda)\theta, \qquad \tau \leftarrow \lambda' \tau + (1 - \lambda')\theta$$

where θ, ξ, τ denote the parameters of the online network, the target network, and the object discovery network respectively, and λ, λ' are decay rates close to 1. Only the online network is optimized by the contrastive loss.
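A minimal sketch of such EMA updates (my illustration; the decay values shown are hypothetical, not Odin's published hyperparameters):

```python
import torch

@torch.no_grad()
def ema_update(online, target, decay=0.996):
    # Move each target parameter slowly toward its online counterpart;
    # the target is never updated by gradient descent.
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(decay).add_(p_o.data, alpha=1.0 - decay)

# After every optimization step on the online network, one might run:
# ema_update(online_net, target_net, decay=0.996)     # target network (xi)
# ema_update(online_net, discovery_net, decay=0.999)  # discovery network (tau)
```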

4. Experiments

4.1 Settings

| Methods | Datasets | Networks | Evaluation Metrics | Training Tricks |
| --- | --- | --- | --- | --- |
| Contrastive Part Discovery | CUB-200-2011, DeepFashion, PASCAL-Part | f: DeepLab-v2 with ResNet50; Φ: VGG19 | Keypoint Regression Error (↓), NMI (↑), ARI (↑) | f and Φ both pretrained on ImageNet-1K; Φ is kept fixed; foreground masks are also used |
| DetCon | Pretrain on ImageNet, finetune on COCO, PASCAL, Cityscapes, NYU-Depth v2 | Encoder: ResNet50 | Object detection & instance segmentation: AP (↑); semantic segmentation: mIoU (↑); depth estimation: accuracy (↑) | Off-the-shelf segmentation algorithms are applied |
| Odin | Pretrain on ImageNet, finetune on COCO, PASCAL, Cityscapes | f: ResNet50 / Swin Transformer with Feature Pyramid Network; g and q: 2-layer MLP | Object detection & instance segmentation: AP (↑); semantic segmentation: mIoU (↑); object discovery: ABO (↑), OR (↑) | Target network and object discovery network updated by an exponential moving average of the online network |

Table 1: Settings of the three methods. For the evaluation metrics, ↑ means larger is better and ↓ means lower is better.

4.2 Results

Due to space limitations, only representative results are shown in this section rather than all the results reported in the papers.

Results of Contrastive Part Discovery [8]:

Here we only show the results on CUB-200 compared to the state of the art (with K = 4):

Interestingly, the authors do not consider keypoint regression error a good metric, because a simple model that predicts only one single part correctly can still yield a very low error (see the baseline methods in the first three rows). The authors therefore propose NMI and ARI as evaluation metrics (FG means foreground only). We can see that the proposed method beats the state of the art in every metric.

The following image shows qualitative comparisons between the proposed method and the previous state of the art, SCOPS. The proposed method finds clearer part boundaries even in difficult poses, e.g. open wings.

Figure 9. Visualization of results of SCOPS and the proposed methods [8]

Results of DetCon [9]

The authors pretrain networks with SimCLR [12], DetCon, or supervised learning on ImageNet for different numbers of epochs, and fine-tune them on four downstream tasks:

Figure 10. Pretraining results compared to SimCLR and supervised pretraining [9]

The results show that DetCon outperforms SimCLR and supervised pretraining, with up to 10× higher pretraining efficiency. Note that the masks used are FH masks [13].

Results of Odin [10]

Here, all methods pretrain a ResNet-50 on ImageNet before fine-tuning on COCO with Mask R-CNN [14] for 12 epochs (1× schedule) or 24 epochs (2× schedule). Average precision on object detection (AP^bb) and instance segmentation (AP^mk) is reported:

The top-performing baselines are ReLIC v2 and DetConB, which make heavy use of saliency or segmentation information in their learning paradigms, but Odin outperforms them both without any such prior knowledge.

5. Comparison and Discussion

First, we compare the three papers in their methodology:

  • Similarities:
  1. They all average representations to get mask-pooled hidden vectors

  2. They all perform object/part-level contrastive learning
  • Differences
  1. Contrastive Part Discovery (CPD) tackles a different task from the other two: not only does CPD focus on parts while the other two focus on objects, but CPD also learns only part segmentation rather than representations, since its feature extraction network is kept fixed

  2. CPD uses a network to produce segmentation, while Odin gets segmentation by clustering computed features

  3. CPD defines a positive pair from two images, while the other two define it from two augmented views of the same image
  4. To keep learning stable, CPD freezes the feature extraction network, while Odin adopts an exponential moving average update
  5. Prior knowledge is injected in CPD and DetCon, but not in Odin. Recall from Section 4.1 that CPD uses pretrained networks as well as ground-truth foreground masks, and DetCon uses off-the-shelf segmentation algorithms


Next, as the author of this blog, I would like to raise several interesting questions for discussion. Most comments reflect my own view; any corrections or suggestions are welcome.

       1. Why does contrastive learning work?

       A reason may be that the pretext task of contrastive learning is hard enough: there are many negative samples, and several augmentations are applied to the data. Low-level or local features alone are not enough to solve such a task. This is also observed in SimCLR [12]: 'When composing augmentations, the contrastive prediction task becomes harder, but the quality of representation improves dramatically'. A typical augmentation pipeline is sketched below.
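       For illustration, a SimCLR-style pipeline composes several transforms; this sketch uses torchvision with commonly used (but here assumed) parameter values:

```python
from torchvision import transforms

simclr_style_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    # Color jitter and grayscale remove color shortcuts
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])
# Two independent draws of simclr_style_aug on the same image form a positive pair.
```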

       2. Why does object-level contrastive learning work?

       Besides leveraging object-level features, another important reason may be that ImageNet contains a large number of objects, which provide plenty of negative samples for contrastive learning and could make the learned features more robust.

       3. Why use projections after representations in Odin?

       According to BYOL [11], which inspired the authors of Odin, the predictor can prevent representations from collapsing. And according to SimCLR [12], a projection head can prevent representations from losing information: if representations were fed into the contrastive loss directly, some information would be lost, because the representations are forced to be invariant to data augmentations.

       4. Can clustering of features provide meaningful segmentations?

       The authors of Contrastive Part Discovery argue that clustering pre-computed features alone is unlikely to yield meaningful segmentations, because 'grouping may sometimes highlight self-similar structures such as region boundaries instead of parts.' However, in Odin, clustering features does seem to provide valuable segmentations. In my view, both comments may be correct, because part segmentation is a more challenging task than object segmentation, in that parts are defined with more degrees of freedom. Considering this, it is likely that merely clustering features cannot provide meaningful segmentations for part discovery, but it can in the case of object segmentation.

       5. Baking prior knowledge back into SSL?

       Among the three papers discussed above, only Odin injects no prior knowledge into SSL. Though some methods do achieve impressive results by introducing some degree of prior knowledge, as DetCon does, we should reflect on whether this violates the spirit of SSL. What's worse, such prior knowledge may harm the generalization of models, as the off-the-shelf model may itself suffer from insufficient generalization ability. Since SSL tries to get rid of annotations, a fully automatic approach like Odin may be the right one to pursue.

       6. It would be interesting to see linear evaluation results

       Linear evaluation means training a linear classifier on top of frozen features and measuring how well the model performs on a classification task (often on ImageNet). It is widely used in self-supervised learning methods such as [4][12][15]. I think it would be interesting to know whether these object-level contrastive methods also perform well in downstream classification tasks, not only in the object-related tasks reported by the papers. A generic sketch of this protocol follows.
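       As a generic sketch of the linear evaluation protocol (not taken from any of the papers; the function name, shapes, and hyperparameters are my assumptions):

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feat_dim, num_classes, loader, epochs=10, lr=0.1):
    """Train only a linear classifier on top of a frozen encoder.

    encoder is assumed to map images to (B, feat_dim) feature vectors.
    """
    for p in encoder.parameters():
        p.requires_grad = False   # freeze the pretrained encoder
    encoder.eval()
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)   # frozen features, no gradients
            loss = ce(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```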

6. Summary

  • Advantages of the proposed methods:
  1. Leverage part/object-level features
  2. The network learns object discovery and good representations together
  3. Easy to transfer to downstream tasks without changing network architectures, i.e. the encoder can be used directly
  • Disadvantages of the proposed methods:
  1. Contrastive Part Discovery: the uniform part appearance assumption does not always hold. See the example images below: the tail of the black cat looks more similar to the neck of the black swan than to the tail of another cat.
  2. Parts/objects discovered may not agree with human intuition. This is due to the parameter K, which controls the granularity of the output segmentations and needs to be carefully set for different inputs.

Figure 11. An example [16] of failure cases of the uniform part appearance assumption

7. References

[1] Jing, Longlong, and Yingli Tian. "Self-supervised visual feature learning with deep neural networks: A survey." IEEE transactions on pattern analysis and machine intelligence 43.11 (2020): 4037-4058.

[2] Jaiswal, Ashish, et al. "A survey on contrastive self-supervised learning." Technologies 9.1 (2020): 2.

[3] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[4] Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018).

[5] Wu, Zhirong, et al. "Unsupervised feature learning via non-parametric instance discrimination." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[6] Zbontar, Jure, et al. "Barlow twins: Self-supervised learning via redundancy reduction." International Conference on Machine Learning. PMLR, 2021.

[7] Li, Junnan, et al. "Prototypical contrastive learning of unsupervised representations." arXiv preprint arXiv:2005.04966 (2020).

[8] Choudhury, Subhabrata, et al. "Unsupervised part discovery from contrastive reconstruction." Advances in Neural Information Processing Systems 34 (2021): 28104-28118.

[9] Hénaff, Olivier J., et al. "Efficient visual pretraining with contrastive detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[10] Hénaff, Olivier J., et al. "Object discovery and representation networks." arXiv preprint arXiv:2203.08777 (2022).

[11] Grill, Jean-Bastien, et al. "Bootstrap your own latent-a new approach to self-supervised learning." Advances in neural information processing systems 33 (2020): 21271-21284.

[12] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.

[13] Felzenszwalb, Pedro F., and Daniel P. Huttenlocher. "Efficient graph-based image segmentation." International journal of computer vision 59.2 (2004): 167-181.

[14] He, Kaiming, et al. "Mask r-cnn." Proceedings of the IEEE international conference on computer vision. 2017.

[15] Zhang, Richard, Phillip Isola, and Alexei A. Efros. "Colorful image colorization." European conference on computer vision. Springer, Cham, 2016.

[16] Image sources: https://nzbirdsonline.org.nz/species/black-swan and https://www.tuftandpaw.com/blogs/cat-guides/the-definitive-guide-to-cat-behavior-and-body-language






