nielsr HF Staff committed
Commit 6ae2afb · verified · 1 Parent(s): 901ab05

Add comprehensive model card for USP


This PR adds a comprehensive model card for the USP model. It includes:
- A link to the paper: https://huggingface.co/papers/2503.06132
- A link to the official GitHub repository: https://github.com/GD-ML/USP
- The appropriate `pipeline_tag` (`unconditional-image-generation`) to make the model discoverable on the Hugging Face Hub.
- The `library_name` (`diffusers`) to indicate compatibility with the Diffusers library, enabling the "how to use" widget.
- Relevant `tags` for better discoverability.
- Key information about the model's features, performance tables, and usage instructions, taken directly from the paper abstract and GitHub README.

Please review and merge this PR if everything looks good.

Files changed (1)
  1. README.md +99 -0
README.md ADDED
@@ -0,0 +1,99 @@
+ ---
+ pipeline_tag: unconditional-image-generation
+ library_name: diffusers
+ license: unknown
+ tags:
+ - diffusion-model
+ - self-supervised-learning
+ - dit
+ - sit
+ ---
+
+ # USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
+
+ This repository contains the weights for **USP: Unified Self-Supervised Pretraining for Image Generation and Understanding**, as described in our paper: [https://huggingface.co/papers/2503.06132](https://huggingface.co/papers/2503.06132).
+
+ Find our official code and more details on GitHub: [https://github.com/GD-ML/USP](https://github.com/GD-ML/USP).
+
+ ## Abstract
+
+ Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.
+
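+ To make the pretraining objective concrete, here is a minimal, illustrative sketch of masked latent modeling in a VAE latent space. It is not the repository's actual training code: the `vae` and `model` modules, the masking ratio, and the token layout are placeholder assumptions.
+
+ ```python
+ import torch
+
+ # Illustrative sketch only (not USP's real training loop):
+ # MAE-style masked modeling, applied to VAE latents instead of pixels.
+ def masked_latent_step(vae, model, images, mask_ratio=0.75):
+     with torch.no_grad():
+         # Encode images into the VAE latent space (diffusers AutoencoderKL API).
+         latents = vae.encode(images).latent_dist.sample()  # (B, C, H, W)
+     B, C, H, W = latents.shape
+     tokens = latents.flatten(2).transpose(1, 2)            # (B, N, C), N = H*W
+     N = tokens.shape[1]
+     # Randomly choose which latent tokens stay visible.
+     ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)
+     keep = ids[:, : int(N * (1 - mask_ratio))]
+     visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, C))
+     # `model` is a placeholder encoder-decoder that predicts all N tokens
+     # from the visible subset; see the GitHub repo for the real architecture.
+     pred = model(visible, keep, N)                         # (B, N, C)
+     return ((pred - tokens) ** 2).mean()                   # reconstruction loss
+ ```
+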
+ ## Model Architecture and Convergence
+
+ ![Model Architecture](https://raw.githubusercontent.com/GD-ML/USP/main/method.png)
+
+ USP significantly improves convergence speed with weight initialization from pretraining alone:
+ ![Convergence Speed](https://raw.githubusercontent.com/GD-ML/USP/main/XL_converge.png)
+
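+ As a hedged illustration of what "weight initialization from pretraining" means in practice, the sketch below copies matching tensors from a pretrained checkpoint into a DiT-style model before finetuning. The checkpoint path and state-dict layout are hypothetical; the actual conversion logic lives in the GitHub repository.
+
+ ```python
+ import torch
+
+ # Illustrative sketch: initialize a DiT-style model from a USP-pretrained
+ # checkpoint. The path and state-dict layout here are hypothetical.
+ def init_from_pretrained(dit_model, ckpt_path="usp_pretrain.pth"):
+     state = torch.load(ckpt_path, map_location="cpu")
+     model_state = dit_model.state_dict()
+     # Keep only tensors whose names and shapes match the generation model;
+     # pretraining-specific heads are dropped.
+     matched = {k: v for k, v in state.items()
+                if k in model_state and v.shape == model_state[k].shape}
+     dit_model.load_state_dict(matched, strict=False)
+     print(f"Initialized {len(matched)}/{len(model_state)} tensors from pretraining.")
+     return dit_model
+ ```
+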
+ ## Finetuning Weights and Evaluation Results
+
+ Finetuning weights for image generation tasks are available. All weights were pretrained for 1600 epochs and then finetuned for 400K steps.
+
+ Using the above weights and following the inference and evaluation procedures outlined in [GENERATION.md](https://github.com/GD-ML/USP/blob/main/generation/GENERATION.md), we obtained the following evaluation results:
+
+ | Model | Pretraining | Finetuning | FID ↓ | IS ↑ | sFID ↓ |
+ |------------|----------------|----------------|---------|---------|----------|
+ | DiT-B/2 | 1600 epochs | 400K steps | 27.22 | 50.47 | 7.60 |
+ | DiT-L/2 | 1600 epochs | 400K steps | 15.05 | 80.11 | 6.41 |
+ | DiT-XL/2 | 1600 epochs | 400K steps | 9.64 | 112.93 | 6.30 |
+ | SiT-B/2 | 1600 epochs | 400K steps | 22.10 | 61.59 | 5.88 |
+ | SiT-XL/2 | 1600 epochs | 400K steps | 7.35 | 128.50 | 5.00 |
+
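+ If you want to sanity-check numbers like these yourself, one common tool is the `torch-fidelity` package, sketched below over folders of generated and reference images. This is an assumption about tooling, not necessarily the exact procedure used in GENERATION.md; the folder paths are hypothetical.
+
+ ```python
+ from torch_fidelity import calculate_metrics
+
+ # Illustrative: FID and Inception Score between a folder of generated
+ # samples and a folder of reference images. Folder paths are hypothetical.
+ metrics = calculate_metrics(
+     input1="samples/usp_sit_xl2",   # generated images
+     input2="data/imagenet_ref",     # reference images
+     cuda=True,
+     fid=True,
+     isc=True,
+ )
+ print(metrics["frechet_inception_distance"], metrics["inception_score_mean"])
+ ```
+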
+ Our method is largely orthogonal to other DINO-based acceleration methods. Results when combined with external-model-based methods:
+
+ | Model | Params | Steps | FID ↓ | IS ↑ |
+ |----------------|--------|------------|---------------|---------------|
+ | SiT-XL/2 | 130M | 400K | 16.97 | 77.50 |
+ | **USP** | 130M | 400K | **7.38** | **127.96** |
+ | REPA | 130M | 400K | 7.9 | 122.6 |
+ | **USP + REPA** | 130M | 400K | **6.26** | **139.84** |
+ | VAVAE | 130M | 64 epochs | 5.18/2.15† | 132.4/245.1† |
+ | **USP + VAVAE**| 130M | 64 epochs | **4.2/1.81†** | **144/261.0†** |
+
+ *Table: Results combined with external-model-based methods. †: with CFG = 10.0.*
+
+ ## Usage
+
+ You can use this model with the `diffusers` library for unconditional image generation. The snippet below is a minimal sketch and assumes the checkpoint is packaged as a loadable `DiffusionPipeline`; check the repository for the exact entry point.
+
+ ```python
+ import torch
+ from diffusers import DiffusionPipeline
+
+ # Load the USP image-generation pipeline.
+ # Replace "GD-ML/USP-Image_Generation" with the actual repo ID if it differs.
+ pipeline = DiffusionPipeline.from_pretrained(
+     "GD-ML/USP-Image_Generation", torch_dtype=torch.float16
+ )
+ pipeline.to("cuda")
+
+ # Generate a single image.
+ image = pipeline(num_inference_steps=50).images[0]
+
+ # Save the result.
+ image.save("usp_generated_image.png")
+ print("Generated image saved as usp_generated_image.png")
+ ```
+
+ For detailed instructions on pretraining and image generation tasks, please refer to the following guides in the [official GitHub repository](https://github.com/GD-ML/USP):
+ * [PRETRAIN.md](https://github.com/GD-ML/USP/blob/main/pretrain/PRETRAIN.md)
+ * [GENERATION.md](https://github.com/GD-ML/USP/blob/main/generation/GENERATION.md)
+
+ ## Acknowledgement
+
+ Our code is based on [MAE](https://github.com/facebookresearch/mae), [DiT](https://github.com/facebookresearch/DiT), [SiT](https://github.com/willisma/SiT), and [VisionLLaMA](https://github.com/Meituan-AutoML/VisionLLaMA). Thanks for their great work.
+
+ ## Citation
+
+ If you find USP useful in your research or applications, please consider citing our paper:
+
+ ```bibtex
+ @misc{chu2025uspunifiedselfsupervisedpretraining,
+       title={USP: Unified Self-Supervised Pretraining for Image Generation and Understanding},
+       author={Xiangxiang Chu and Renda Li and Yong Wang},
+       year={2025},
+       eprint={2503.06132},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2503.06132},
+ }
+ ```