Hui Tang 1,2 | Kui Jia ✉ 1
Code [GitHub] | Dataset [COVE] | Paper [arXiv]
Figure: Sample images from the synthetic domain (left) and the real domains of our introduced S2RDA-49 (middle) and S2RDA-MS-39 (right); dataset statistics are given in the Dataset Construction section below.
Image classification still lacks comprehensive studies of synthetic data and sufficient exploration of synthetic-to-real transfer. In this paper, we exploit synthetic datasets rendered with Blender to explore questions on model generalization, benchmark pre-training strategies for domain adaptation (DA), and build a large-scale benchmark dataset, S2RDA, for synthetic-to-real transfer, which can push forward future DA research. Specifically, we make the following contributions: (i) under the well-controlled, IID data setting enabled by 3D rendering, we systematically verify typical, important learning insights, e.g., shortcut learning, and discover new laws of various data regimes and network architectures in generalization; (ii) we further investigate the effect of image formation factors on generalization, e.g., object scale, material texture, illumination, camera viewpoint, and background in a 3D scene; (iii) we use simulation-to-reality adaptation as a downstream task to compare the transferability of synthetic and real data when used for pre-training, which demonstrates that synthetic data pre-training is also promising for improving real test results; finally, (iv) we develop a new large-scale synthetic-to-real benchmark for image classification, termed S2RDA, which poses more significant challenges for transfer from simulation to reality.
R1: Fixed-Dataset Periodic Training vs. Training on Non-Repetitive Samples
With strong data augmentation, the test results on synthetic data without background are good enough to show that the synthetically trained models do not learn shortcut solutions relying on context clues.
Table: Training on a fixed dataset vs. non-repetitive samples. FD: Fixed Dataset, True (T) or False (F). DA: Data Augmentation, None (N), Weak (W), or Strong (S). BG: BackGround.
Figure: Learning process. (a-c): Training ResNet-50 on a fixed dataset (blue) or non-repetitive samples (red) with no, weak, and strong data augmentation. (d): Training ResNet-50 (red), ViT-B (green), and Mixer-B (blue) on non-repetitive samples with strong data augmentation.
Figure: Attention maps of randomly selected IID test samples, obtained from the ViT-B trained on a fixed dataset or on non-repetitive samples with no data augmentation, at the 20-th, 200-th, 2K-th, 20K-th, and 200K-th training iterations.
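To make the two data regimes concrete, below is a minimal PyTorch-style sketch of fixed-dataset vs. non-repetitive-sample training; the dataset classes and the render_fn hook standing in for the Blender pipeline are illustrative assumptions, not the paper's actual implementation:

from torch.utils.data import Dataset, IterableDataset, DataLoader

class FixedRenderedDataset(Dataset):
    """Fixed-dataset regime: a finite pool of pre-rendered images, revisited every epoch."""
    def __init__(self, images, labels, transform=None):
        self.images, self.labels, self.transform = images, labels, transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        x, y = self.images[idx], self.labels[idx]
        return (self.transform(x) if self.transform else x), y

class NonRepetitiveRenderedStream(IterableDataset):
    """Non-repetitive regime: every sample is rendered on the fly and never reused."""
    def __init__(self, render_fn, transform=None):
        # render_fn is a hypothetical stand-in for the Blender rendering pipeline.
        self.render_fn, self.transform = render_fn, transform

    def __iter__(self):
        while True:  # an endless stream of fresh samples
            x, y = self.render_fn()
            yield (self.transform(x) if self.transform else x), y

# Usage: the training loop is identical in both regimes; only the data source changes.
# loader = DataLoader(FixedRenderedDataset(imgs, lbls, aug), batch_size=128, shuffle=True)
# loader = DataLoader(NonRepetitiveRenderedStream(render_fn, aug), batch_size=128)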
R2: Evaluating Various Network Architectures
In IID tests, ViT performs surprisingly poorly regardless of the data augmentation used, and even tripling the number of training epochs does not improve it much.
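For reference, the three architecture families compared here can be instantiated and evaluated with a few lines of the timm library; the model names, class count, and the simple top-1 accuracy loop below are assumptions for illustration, not the paper's training code:

import timm
import torch

# The class count (10) is a placeholder; set it to the number of rendered categories.
models = {
    "ResNet-50": timm.create_model("resnet50", pretrained=False, num_classes=10),
    "ViT-B":     timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=10),
    "Mixer-B":   timm.create_model("mixer_b16_224", pretrained=False, num_classes=10),
}

@torch.no_grad()
def iid_accuracy(model, loader, device="cuda"):
    """Top-1 accuracy on an IID test loader."""
    model.eval().to(device)
    correct = total = 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total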
R3: Impact of Model Capacity & Impact of Training Data Quantity
There is always a bottleneck in generalizing from synthetic data to OOD/real data: beyond a point, increasing data size and model capacity brings no further benefit, so domain adaptation to bridge the distribution gap is indispensable, short of evolving the image generation pipeline to synthesize more realistic images.
Figure: Generalization accuracy w.r.t. model capacity.
Figure: Generalization accuracy w.r.t. training data quantity.
R4: Impact of Data Augmentations
For data-unrepeatable training, IID and OOD generalization behave as a kind of zero-sum game with respect to the strength of data augmentation.
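As an illustration of what "weak" vs. "strong" might look like, here are two torchvision augmentation pipelines; the specific operations and magnitudes are assumed examples, not the paper's exact recipes:

from torchvision import transforms

weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),  # operates on tensors, hence placed after ToTensor
])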
Here we explore how image variation factors affect model generalization, including object scale, material texture, illumination, camera viewpoint, and background. We find that different rendering variation factors, and even different values of the same factor, have uneven importance for model generalization. The observations also stress the under-explored topic of data generation, namely weighted rendering (e.g., AutoSimulate), which learns the distributions of image variation factors from real/target data.
Table: Fix vs. randomize image variation factors (ResNet-50).
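For readers unfamiliar with how such factors are controlled, here is a minimal Blender (bpy) sketch that fixes or randomizes a chosen subset of variation factors before rendering; the object, camera, and light names, the value ranges, and the scene setup are placeholder assumptions:

import math
import random
import bpy  # Blender's Python API; run inside Blender

def randomize_factors(obj_name="object", cam_name="Camera", light_name="Light",
                      randomize=("scale", "viewpoint", "illumination")):
    """Randomize a chosen subset of image variation factors; factors not listed keep their current (fixed) values."""
    obj = bpy.data.objects[obj_name]
    cam = bpy.data.objects[cam_name]    # assumed to track the object, e.g., via a Track To constraint
    light = bpy.data.objects[light_name]

    if "scale" in randomize:
        s = random.uniform(0.5, 1.5)
        obj.scale = (s, s, s)
    if "viewpoint" in randomize:
        # Orbit the camera around the object at a random azimuth/elevation.
        azim = random.uniform(0.0, 2.0 * math.pi)
        elev = random.uniform(0.1, 0.5 * math.pi)
        r = 3.0
        cam.location = (r * math.cos(azim) * math.cos(elev),
                        r * math.sin(azim) * math.cos(elev),
                        r * math.sin(elev))
    if "illumination" in randomize:
        light.data.energy = random.uniform(200.0, 1200.0)

def render(filepath):
    bpy.context.scene.render.filepath = filepath
    bpy.ops.render.render(write_still=True)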
R1: The Importance of Pre-training for DA
DA fails without pre-training. With no pre-training, the simple No Adaptation baseline, which trains the model only on the labeled source data, outperforms all compared DA methods in overall accuracy, despite having the worst mean class precision.
Table: Comparing different pre-training schemes. ⋆: official checkpoint. Green or red: best Acc. or Mean in each row. Ours w. SelfSup: supervised pre-training combined with contrastive learning.
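A minimal sketch of the setup implied above, i.e., initializing a ResNet-50 backbone from a chosen pre-training checkpoint (or from scratch) and running the No Adaptation source-only baseline; the checkpoint path, class count, and state-dict key names are assumptions:

import torch
import torchvision

def build_backbone(pretrain_ckpt=None, num_classes=49):
    """ResNet-50 optionally initialized from a pre-training checkpoint.

    pretrain_ckpt is a hypothetical path to weights from ImageNet, SubImageNet,
    or synthetic-data pre-training; None means training from scratch.
    """
    model = torchvision.models.resnet50(weights=None, num_classes=num_classes)
    if pretrain_ckpt is not None:
        state = torch.load(pretrain_ckpt, map_location="cpu")
        # Drop the classifier head, whose shape depends on the pre-training label space.
        state = {k: v for k, v in state.items() if not k.startswith("fc.")}
        model.load_state_dict(state, strict=False)
    return model

def no_adaptation_step(model, x_src, y_src, optimizer):
    """The 'No Adaptation' baseline: supervised loss on labeled source data only."""
    loss = torch.nn.functional.cross_entropy(model(x_src), y_src)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()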
R2: Effects of Different Pre-training Schemes
Different DA methods exhibit different relative advantages under different pre-training data. When pre-training on our synthesized data, MCD achieves the best results; when pre-training on Ours+SubImageNet, DisClusterDA outperforms the others; when pre-training on ImageNet⋆, SRDC yields the best performance. Worse still, this leaves the reliability of existing DA evaluation criteria unguaranteed: under different pre-training schemes, the best performance is achieved by different DA methods.
R3: Synthetic Data Pre-training vs. Real Data Pre-training
Synthetic data pre-training is comparable to or better than real data pre-training, i.e., synthetic data pre-training is promising. Under the same experimental configuration, SynSL pre-training for 24 epochs is comparable to or better than pre-training on ImageNet for 120 epochs.
Figure: Learning process (Mean) of MCD (left) and DisClusterDA (right) when varying the pre-training scheme.
R4: Implications for Pre-training Data Setting
Big Synthesis Small Real is worth researching in depth. Ours+SubImageNet, which augments our synthetic data with a small amount of real data, achieves a remarkable performance gain over Ours, suggesting a promising paradigm of supervised pre-training: Big Synthesis Small Real. On the other hand, under limited computing resources, pre-train on the target classes first: with 200K pre-training iterations, SubImageNet performs much better than ImageNet (10 Epochs).
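A minimal sketch of the "Big Synthesis Small Real" data setting, simply taking the union of a large synthetic set and a small real subset with PyTorch's ConcatDataset; the directory paths and layout are assumptions:

from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

tf = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()])

# Assumed layouts: class subfolder names must match across the two roots so that
# ImageFolder assigns the same label index to the same class in both sets.
synthetic = datasets.ImageFolder("data/synthetic", transform=tf)     # large rendered set
small_real = datasets.ImageFolder("data/subimagenet", transform=tf)  # small real subset of target classes

pretrain_set = ConcatDataset([synthetic, small_real])
pretrain_loader = DataLoader(pretrain_set, batch_size=256, shuffle=True, num_workers=8)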
R5: The Improved Generalization of DA Models
Real data pre-training with extra non-target classes, fine-grained target subclasses, or our synthesized data added for the target classes helps DA. ImageNet (120 Epochs), which involves both target and non-target classes in pre-training, is better than SubImageNet, which involves only target classes, indicating that learning rich category relationships helps downstream transfer. With 200K pre-training iterations, ImageNet-990 performs much worse than ImageNet, implying that pre-training in a fine-grained visual categorization manner may bring surprising benefits. Ours+SubImageNet, which adds our synthesized data for the target classes to SubImageNet, produces significant improvements and is close to ImageNet (120 Epochs); ImageNet-990+Ours improves over ImageNet-990, suggesting that synthetic data may further improve performance.
R6: Convergence Analysis
For the same DA method, convergence under different pre-training schemes differs in speed, stability, and accuracy. SynSL with 24 epochs significantly outperforms ImageNet with 120 epochs; notably, SynSL is on par with or better than ImageNet⋆.
R1: Dataset Construction
Our proposed synthetic-to-real benchmark for more practical visual DA, termed S2RDA, includes two challenging transfer tasks, S2RDA-49 and S2RDA-MS-39. In each task, source/synthetic domain samples are synthesized by rendering 3D models from ShapeNet; the 3D models share the label space of the target/real domain, and each class has 12K rendered RGB images. The real domain of S2RDA-49 comprises 60,535 images of 49 classes, collected from the ImageNet validation set, ObjectNet, the VisDA-2017 validation set, and the web. For S2RDA-MS-39, the real domain collects 41,735 natural images exclusively for 39 classes from MetaShift; these contain complex and distinct contexts, e.g., object presence (co-occurrence of different objects), general contexts (indoor or outdoor), and object attributes (color or shape), leading to a much harder task. Compared to VisDA-2017, S2RDA contains more categories, more realistically synthesized source domain data that comes for free, and more complicated target domain data collected from diverse real-world sources, setting a more practical and challenging benchmark for future DA research.
Figure: The distribution of the number of images per class in each real domain, which exhibits a long-tailed distribution where a small number of classes dominate.
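Assuming a standard class-per-subfolder layout (the actual released structure may differ), the two S2RDA domains could be loaded with torchvision as follows; paths are placeholders:

from torchvision import datasets, transforms

tf = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()])

source = datasets.ImageFolder("S2RDA-49/synthetic", transform=tf)  # rendered, labeled source domain
target = datasets.ImageFolder("S2RDA-49/real", transform=tf)       # real target domain; labels for evaluation only
print(len(source.classes), len(target.classes))                    # expect 49 shared classes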
R2: Benchmarking DA Methods
We report the results on S2RDA in the table below and show t-SNE visualizations in the figure below. SRDC outperforms the No Adaptation baseline by ∼10% on S2RDA-49 and DisClusterDA outperforms it by ∼5%, verifying the efficacy of these DA methods and demonstrating that S2RDA can benchmark different DA methods. Compared to SubVisDA-10, SRDC degrades by ∼7% on S2RDA-49, which is reasonable since our real domain contains more practical images from real-world sources, even though our synthetic data offer much more diversity, e.g., in background. In contrast, S2RDA-MS-39, on which accuracy drops by more than 20% relative to S2RDA-49, evaluates DA approaches on the worst/extreme cases, enabling a more comprehensive comparison and acting as a touchstone to examine and advance DA algorithms. Reducing the domain gap between simple and difficult backgrounds is by nature one of the key issues in simulation-to-real transfer. To sum up, S2RDA is a better benchmark than VisDA-2017 and leaves larger room for improvement.
Table: Domain adaptation performance on S2RDA (ResNet-50).
Figure: The t-SNE visualization of target domain features extracted by different models on S2RDA-49 (a-b) and S2RDA-MS-39 (c-d).
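The t-SNE visualization above can be reproduced along the following lines, extracting target-domain features from a trained backbone and embedding them with scikit-learn; the feature extractor, perplexity, and plotting choices are illustrative assumptions:

import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    """Collect features of target-domain samples.

    backbone is assumed to output feature vectors, e.g., a ResNet-50 whose final
    fc layer has been replaced by nn.Identity().
    """
    backbone.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def plot_tsne(feats, labels, out_path="tsne.png"):
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
    plt.figure(figsize=(6, 6))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap="tab20")
    plt.axis("off")
    plt.savefig(out_path, dpi=200)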
@InProceedings{tang2023a,
author = {Tang, Hui and Jia, Kui},
title = {A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023}}
Based on a template by Keyan Chen.