Training classifiers for object detection requires large amounts of data. These datasets are expensive to generate, mainly because of the manual annotation. To avoid that cost, we use synthetic data and apply three different methods to try to close the gap between real and synthetic images: photo-realistic rendering, domain adaptation and domain randomization. We examine the effect of training known object detection methods with data generated by each of the three methods and compare their performance.


Closing the Gap to Real Data


Our approach relies on generating data using domain adaptation or randomization methods as well as photo-realistic rendering. We use these datasets to train a deep learning method and compare the performance on a real dataset (see the figure below).

[Figure: overview of the approach]

Photo-Realistic Rendering 

The main idea behind photo-realistic rendering is to render images that model the physics of the real world well enough, including reflections, noise, light variations and so on. One possible renderer is the Mitsuba renderer; because it is physically based, Mitsuba renders images more slowly than simpler renderers. It is also possible to create very complicated scenes with Blender. For our experiments we use a set of real annotated images instead. This means we assume the best-case scenario, in which photo-realistic rendering would reproduce exactly the same scene as in reality.

We also consider another case, in which we scan the textured models using the EinScan scanner. We then place the scanned models into background images of the real scene to generate datasets. The figure below shows both possibilities for this kind of dataset: on the left the real images, on the right the images with the scanned models.

[Left: real image taken from the scene. Right: scanned models placed on random real backgrounds]

Domain Adaptation

Domain adaptation relies on transforming images from one domain to another. There are two settings: if paired images from both domains are available, the images must correspond to each other; otherwise the images do not have to be matched at the pixel level. We use two different methods, pix2pix and CycleGAN. Pix2pix is an example of a conditional GAN, which is trained by providing pairs of corresponding images (see the figure below).

[Left: real image of the scene. Right: simulated image of the scene]
We train both the CycleGAN and the pix2pix model with 104 pairs of images. We then use the trained models to generate new datasets. Examples of images generated with both models are displayed below.
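The key difference between the two models is that pix2pix needs the paired images described above, while CycleGAN can learn from unpaired sets by enforcing that an image translated to the other domain and back comes out unchanged. Below is a minimal numpy sketch of that cycle-consistency idea; the two "generators" are toy invertible linear maps standing in for the trained CNNs, and all names are illustrative, not part of our pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two generators: G maps simulated -> real,
# F maps real -> simulated. Here they are fixed linear maps purely
# for illustration; in CycleGAN both are trained networks.
A = rng.normal(size=(8, 8))
A_inv = np.linalg.inv(A)

def G(x):  # simulated -> "real"
    return A @ x

def F(y):  # "real" -> simulated
    return A_inv @ y

def cycle_consistency_loss(x):
    # L1 distance between an image and its round trip F(G(x)).
    # CycleGAN minimizes this (together with adversarial losses),
    # which is what allows training on unpaired image sets.
    return np.abs(F(G(x)) - x).mean()

x = rng.normal(size=(8,))
print(cycle_consistency_loss(x))  # ~0, since F exactly inverts G here
```

In the real model the loss is of course not zero; it is one term in the objective that the generators are trained to drive down.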

[Left: simulated scene. Right: the same scene transformed into the target domain with pix2pix]

[Left: simulated scene. Right: the same scene transformed into the target domain with CycleGAN]

Domain Randomization

The idea behind domain randomization is to add random effects to images in order to emulate real-world variability. We can use a simple renderer such as PyBullet and then augment the images with a library such as imgaug. Augmentations may include:

  • adding different kinds of noise (Gaussian, Poisson)
  • blur
  • adding/removing random pixels
  • changing hue and saturation
  • adding random backgrounds

Some examples of augmented images are shown below.
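To make the augmentation step concrete, here is a small numpy-only sketch of three of the operations listed above (Gaussian noise, pixel dropout, random background compositing). This is not our actual imgaug pipeline, just an illustration of what each augmentation does to a rendered image; all function names and parameter values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian_noise(img, sigma=10.0):
    # "adding different kinds of noise" -- per-pixel Gaussian noise.
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def dropout_pixels(img, p=0.05):
    # "adding/removing random pixels" -- zero out a random fraction.
    drop = rng.random(img.shape[:2]) < p
    out = img.copy()
    out[drop] = 0
    return out

def paste_on_background(fg, fg_mask, bg):
    # "adding random backgrounds" -- composite a rendered object
    # (fg where fg_mask is True) onto an arbitrary background image.
    out = bg.copy()
    out[fg_mask] = fg[fg_mask]
    return out

# Tiny synthetic example: a 4x4 "rendered" image with a 2x2 object.
fg = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
bg = rng.integers(0, 255, size=(4, 4, 3), dtype=np.uint8)

aug = add_gaussian_noise(dropout_pixels(paste_on_background(fg, mask, bg)))
print(aug.shape)  # (4, 4, 3)
```

Chaining several such randomized operations per image is what gives domain randomization its variability.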

Object Detection Methods

Mask R-CNN

Mask R-CNN is an instance segmentation method. Instance segmentation combines semantic segmentation, where each pixel in the image is classified, with object detection, where the goal is to detect objects using bounding boxes: every pixel is not only classified, but different instances of the same class are labeled differently. Mask R-CNN extends Faster R-CNN by adding another branch that outputs the object's mask. We use the following implementation. We train the network with the different data generation methods. Below are two examples of the output of a trained network when fed a real image of the scene.
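The relationship between the two tasks is easy to see in code: once a per-instance binary mask is predicted, the bounding box follows directly from the mask's extent. The sketch below is only an illustration of that relationship with hand-made masks, not the actual Mask R-CNN branch.

```python
import numpy as np

def mask_to_bbox(mask):
    # A per-instance binary mask determines its bounding box:
    # the box is just the extent of the True pixels. This is why
    # instance segmentation subsumes plain box detection.
    ys, xs = np.where(mask)
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

# Two instances of the same class get separate masks (and boxes),
# unlike semantic segmentation, where they would share one label.
mask_a = np.zeros((6, 6), dtype=bool); mask_a[1:3, 1:4] = True
mask_b = np.zeros((6, 6), dtype=bool); mask_b[4:6, 0:2] = True

print(mask_to_bbox(mask_a))  # (1, 1, 3, 2)  as (x1, y1, x2, y2)
print(mask_to_bbox(mask_b))  # (0, 4, 1, 5)
```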


| Exp. # | methods | #images | precision | recall | comment |
|---|---|---|---|---|---|
| 1 | CycleGAN (DA) | | 104 / 134 | 104 / 124 | |
| 2 | pix2pix (DA) | 400 | 99 / 114 | 99 / 124 | |
| 3 | photo-realistic rendering | 105 | 113 / 115 | 113 / 124 | using real images to train the network |
| 4 | photo-realistic rendering | 300 | 101 / 108 | 101 / 124 | scanned models with real cell backgrounds |
| 5 | DR | 400 | 105 / 124 | 105 / 124 | |
| 6 | DR | 300 | 109 / 146 | 109 / 124 | using textured models |
| 7 | photo-realistic rendering + simulated data | 105 + 200 | 117 / 117 | 117 / 124 | no augmentation for training, real images are used as a substitute for photo-realistic rendering |
| 8 | photo-realistic rendering + DR | 105 + 200 | 118 / 118 | 118 / 124 | real images are used as a substitute for photo-realistic rendering |
| 9 | photo-realistic rendering + CycleGAN (DA) | 105 + 200 | 112 / 112 | 112 / 124 | real images are used as a substitute for photo-realistic rendering |
| 10 | domain randomization + CycleGAN (DA) | 105 + 200 | 106 / 128 | 106 / 124 | |
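Assuming the "x / y" entries in the table read as correct detections over a total (the number of detections the network made for precision, and the 124 annotated objects in the real test set for recall), the metrics are computed as follows; the small helper below is only for illustrating that notation.

```python
# Assumed reading of the table's "x / y" notation:
# precision = true positives / detections made by the network,
# recall    = true positives / 124 annotated ground-truth objects.
def precision(tp, detections):
    return tp / detections

def recall(tp, ground_truth=124):
    return tp / ground_truth

# Experiment 5 (DR, 400 images): 105 correct out of 124 detections.
print(round(precision(105, 124), 3))  # 0.847
print(round(recall(105), 3))          # 0.847
```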

Most important takeaways include:

  • Domain adaptation experiments: the precision is better than the recall in each experiment.
  • Domain randomization experiments
  • Photo-realistic rendering experiments
  • Combined experiments