Short Description

Training classifiers for object detection requires large amounts of data. These datasets are expensive to generate, mainly because of the manual annotation. To avoid this, we use synthetic data and apply three different methods to try to close the gap between real and synthetic images: photo-realistic rendering, domain adaptation and domain randomization. We examine the effect of training known object detection methods with data generated using these three methods and compare their performance.

Documentation

Closing the Gap to Real Data

Overview

Our approach relies on generating data using domain adaptation or domain randomization methods as well as photo-realistic rendering. We use these datasets to train a deep learning method and compare its performance on a real dataset (see the figure below).

Overview of the approach

Photo-Realistic Rendering 

The main idea of photo-realistic rendering is to render images that model the physics of the real world well enough, including reflections, noise, light variations and so on. One possible renderer is the Mitsuba Renderer. Mitsuba renders images more slowly than other, simpler renderers. It is also possible to create very complicated scenes with Blender. For our experiments we use a set of real annotated images; this means we assume the best-case scenario, in which photo-realistic rendering would render exactly the same scene as in reality.
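As an illustration of how such a renderer is driven, here is a minimal sketch using Mitsuba's Python API (assuming Mitsuba 3; the scene file is a placeholder, not one of our actual scenes):

```python
import mitsuba as mi

# Select a rendering backend/variant before loading anything.
mi.set_variant("scalar_rgb")

# Load a scene description (placeholder path) and render it with a
# moderate number of samples per pixel.
scene = mi.load_file("scene.xml")
image = mi.render(scene, spp=64)

# Write the rendered image to disk.
mi.util.write_bitmap("rendered.png", image)
```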

We also consider another case in which we scan the textured models using the EinScan scanner. We then place the scanned models into background images of the real scene to generate datasets. The figure below shows both possibilities for this kind of dataset: on the left the real images, on the right the images with the scanned models.

Left: real image taken from the scene. Right: scanned models placed on random real backgrounds.
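A simple way to realize the second case (scanned models on real backgrounds) is alpha compositing; below is a minimal sketch with Pillow, where the file names and the random placement are illustrative assumptions, not our actual pipeline:

```python
import random
from PIL import Image

# Real background photo of the scene and a rendered view of a scanned
# model with an alpha channel (both file names are placeholders).
background = Image.open("background.jpg").convert("RGBA")
model_view = Image.open("scanned_model.png").convert("RGBA")

# Paste the model at a random position; its alpha channel acts as mask.
x = random.randint(0, background.width - model_view.width)
y = random.randint(0, background.height - model_view.height)
background.paste(model_view, (x, y), mask=model_view)

background.convert("RGB").save("composite.jpg")
```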

Domain Adaptation

Domain adaptation relies on transforming images from one domain into another. There are two possibilities: if paired images from both domains are available, the images should correspond to each other pixel by pixel; otherwise the images do not have to be matched at the pixel level. We use two different methods, pix2pix and CycleGAN. Pix2pix is an example of a conditional GAN that is trained on pairs of corresponding images (see the figure below).

Real image of the scene and the corresponding simulated image of the scene
We train both the CycleGAN and the pix2pix model with 104 pairs of images. We then use the trained models to generate new datasets. Examples of images generated with both models are displayed in the table below.



Simulated scene and its transformation into the target domain with pix2pix

Simulated scene and its transformation into the target domain with CycleGAN
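To make the difference between the two settings concrete: CycleGAN does not need corresponding image pairs because it adds a cycle-consistency term, i.e. translating a simulated image to the real domain and back should reconstruct the original. A minimal PyTorch sketch of that term (the generators here are placeholder layers, not the networks we actually trained):

```python
import torch
import torch.nn as nn

# Placeholder generators; in CycleGAN these are encoder-decoder CNNs.
G_sim2real = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # simulated -> real
G_real2sim = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # real -> simulated

l1 = nn.L1Loss()

sim_batch = torch.rand(4, 3, 256, 256)   # unpaired simulated images
real_batch = torch.rand(4, 3, 256, 256)  # unpaired real images

# Cycle consistency: sim -> real -> sim and real -> sim -> real should
# each reconstruct the input, so no paired data is needed.
cycle_loss = l1(G_real2sim(G_sim2real(sim_batch)), sim_batch) \
           + l1(G_sim2real(G_real2sim(real_batch)), real_batch)

# In the full objective this term is combined with adversarial losses
# from two discriminators, one per domain.
```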

Domain Randomization

The idea behind domain randomization is to add random effects to the images in order to emulate real-world variability. We can use a simple renderer such as PyBullet and then augment the rendered images with a library such as imgaug (a small sketch follows the list below). Augmentations may include:

  • adding different kinds of noise (Gaussian, Poisson)
  • blur
  • adding/removing random pixels
  • changing hue and saturation
  • adding random backgrounds
Some examples of augmented images are shown below.
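As referenced above, a minimal augmentation pipeline with imgaug could look as follows (the chosen augmenters and parameter ranges are illustrative assumptions, not the values we used):

```python
import numpy as np
import imgaug.augmenters as iaa

# Pipeline covering the effects listed above, applied in random order.
seq = iaa.Sequential([
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),     # Gaussian noise
    iaa.AdditivePoissonNoise(lam=(0, 16)),                # Poisson noise
    iaa.GaussianBlur(sigma=(0.0, 1.5)),                   # blur
    iaa.CoarseDropout(p=(0.0, 0.05), size_percent=0.02),  # remove pixels
    iaa.SaltAndPepper(p=(0.0, 0.03)),                     # add random pixels
    iaa.AddToHueAndSaturation(value=(-20, 20)),           # hue/saturation
], random_order=True)

# A batch of rendered images (random placeholder data here).
images = np.random.randint(0, 255, size=(8, 256, 256, 3), dtype=np.uint8)
augmented = seq(images=images)
```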


Object Detection Methods

Mask R-CNN

Mask R-CNN is an instance segmentation method. Instance segmentation combines semantic segmentation, where each pixel in the image is classified, and object detection, where objects are localized with bounding boxes: every pixel is not only classified, but different instances of the same class are labeled separately. Mask R-CNN extends Faster R-CNN with an additional branch that predicts the object's mask. We use the following implementation. We train the network with the datasets from the different data generation methods. Below are two examples of the outputs of a trained network when fed with a real image of the scene.
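For illustration only, a minimal inference sketch with torchvision's Mask R-CNN (assuming a recent torchvision; this is not necessarily the implementation referenced above):

```python
import torch
import torchvision

# Mask R-CNN with a ResNet-50 FPN backbone. The pretrained COCO weights
# are only a stand-in; in the experiments the network is trained on the
# generated datasets instead.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Dummy RGB image tensor in [0, 1]; in practice this would be a real
# photo of the scene.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    outputs = model([image])  # one dict per input image

# Each dict contains boxes, labels, scores and per-instance masks.
boxes = outputs[0]["boxes"]    # (N, 4) bounding boxes
scores = outputs[0]["scores"]  # (N,) confidence scores
masks = outputs[0]["masks"]    # (N, 1, H, W) soft instance masks
```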


Experiments

| Exp. # | method(s) | # images | precision | recall | comment |
| --- | --- | --- | --- | --- | --- |
| 1 | CycleGAN (DA) | 400 | 104 / 134 | 104 / 124 | |
| 2 | pix2pix (DA) | 400 | 99 / 114 | 99 / 124 | |
| 3 | photo-realistic rendering | 105 | 113 / 115 | 113 / 124 | using real images to train the network |
| 4 | photo-realistic rendering | 300 | 101 / 108 | 101 / 124 | scanned models with real cell backgrounds |
| 5 | DR | 400 | 105 / 124 | 105 / 124 | |
| 6 | DR | 300 | 109 / 146 | 109 / 124 | using textured models |
| 7 | photo-realistic rendering + simulated data | 105 + 200 | 117 / 117 | 117 / 124 | no augmentation for training; real images are used as a substitute for photo-realistic rendering |
| 8 | photo-realistic rendering + DR | 105 + 200 | 118 / 118 | 118 / 124 | real images are used as a substitute for photo-realistic rendering |
| 9 | photo-realistic rendering + CycleGAN (DA) | 105 + 200 | 112 / 112 | 112 / 124 | real images are used as a substitute for photo-realistic rendering |
| 10 | domain randomization + CycleGAN (DA) | 105 + 200 | 106 / 128 | 106 / 124 | |

Most important takeaways include:

  • for the Domain Adaptation experiments
    • the precision is better than the recall for each experiment
  • for the Domain Randomization experiments
  • for the Photo-Realistic Rendering experiments
  • for the combined experiments
