Add comprehensive model card for USP
This PR adds a comprehensive model card for the USP model. It includes:
- A link to the paper: https://huggingface.co/papers/2503.06132
- A link to the official GitHub repository: https://github.com/GD-ML/USP
- The appropriate `pipeline_tag` (`unconditional-image-generation`) to make the model discoverable on the Hugging Face Hub.
- The `library_name` (`diffusers`) to indicate compatibility with the Diffusers library, enabling the "how to use" widget.
- Relevant `tags` for better discoverability.
- Key information about the model's features, performance tables, and usage instructions, taken directly from the paper abstract and GitHub README.
Please review and merge this PR if everything looks good.
README.md
ADDED
---
pipeline_tag: unconditional-image-generation
library_name: diffusers
license: unknown
tags:
- diffusion-model
- self-supervised-learning
- dit
- sit
---

# USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

This repository contains the weights for **USP: Unified Self-Supervised Pretraining for Image Generation and Understanding**, as described in our paper: [https://huggingface.co/papers/2503.06132](https://huggingface.co/papers/2503.06132).

Find our official code and more details on GitHub: [https://github.com/GD-ML/USP](https://github.com/GD-ML/USP).

## Abstract

Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.
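
As a rough illustration of this objective (not the official implementation), the sketch below runs MAE-style masked modeling on stand-in VAE latents; the toy backbone, patch size, and random tensors are all assumptions made for the example.

```python
# Illustrative sketch of masked latent modeling: MAE-style masking and reconstruction
# carried out on VAE latents rather than raw pixels. Random tensors stand in for real
# VAE encodings; this is NOT the official USP training code.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 4, 4, 32, 32      # toy batch of latents (e.g. 256x256 images -> 32x32x4 latents)
patch, mask_ratio = 2, 0.75    # 2x2 latent patches; 75% of patches are masked

latents = torch.randn(B, C, H, W)  # stand-in for vae.encode(images).latent_dist.sample()

# Patchify the latents into tokens: (B, C, H, W) -> (B, N, D)
tokens = F.unfold(latents, kernel_size=patch, stride=patch).transpose(1, 2)
N, D = tokens.shape[1], tokens.shape[2]

# Per-sample random mask over patches (True = hidden from the model)
mask = torch.rand(B, N).argsort(dim=1) < int(N * mask_ratio)

# Toy backbone standing in for the ViT/DiT encoder being pretrained;
# masked patches are replaced by a learned mask token before encoding.
mask_token = nn.Parameter(torch.zeros(1, 1, D))
backbone = nn.Sequential(nn.Linear(D, 256), nn.GELU(), nn.Linear(256, 256))
head = nn.Linear(256, D)

inputs = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), tokens)
pred = head(backbone(inputs))

# Reconstruction loss only on the masked latent patches (MAE-style, but in latent space)
loss = F.mse_loss(pred[mask], tokens[mask])
loss.backward()
print(f"masked-latent reconstruction loss: {loss.item():.4f}")
```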
## Model Architecture and Convergence



USP significantly improves convergence speed using only weight initialization from pretraining:

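
To make the "weight initialization only" recipe concrete, here is a hypothetical helper that copies matching tensors from a pretraining checkpoint into a DiT/SiT generator before finetuning; the constructor and checkpoint path in the comments are placeholders, not the official USP scripts.

```python
# Hypothetical sketch of the "initialize from pretraining, then finetune" recipe.
# The model constructor and checkpoint file below are placeholders, not official USP code.
import torch

def init_from_pretrain(model: torch.nn.Module, pretrained_state: dict) -> None:
    """Copy every pretrained tensor whose name and shape match the target generator."""
    own = model.state_dict()
    matched = {k: v for k, v in pretrained_state.items()
               if k in own and v.shape == own[k].shape}
    model.load_state_dict(matched, strict=False)  # layers without a pretrained counterpart keep random init
    print(f"initialized {len(matched)}/{len(own)} tensors from pretraining")

# Example usage (placeholders):
# model = build_dit_xl_2()                                    # your DiT/SiT constructor
# state = torch.load("usp_pretrain.pth", map_location="cpu")  # hypothetical checkpoint file
# init_from_pretrain(model, state)
```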
## Finetuning Weights and Evaluation Results

Finetuning weights for image generation tasks are available. All weights were pretrained for 1600 epochs and then finetuned for 400K steps.
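
One way to fetch these checkpoints locally is `huggingface_hub` (a sketch; the repo ID below is assumed to be this card's repository and should be adjusted if the weights live elsewhere):

```python
# Sketch: fetch every file in the weight repository with huggingface_hub.
# "GD-ML/USP-Image_Generation" is assumed to be this card's repo ID; adjust if different.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="GD-ML/USP-Image_Generation")
print(f"Checkpoints downloaded to: {local_dir}")
```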
Using the above weights and following the inference and evaluation procedures outlined in [GENERATION.md](https://github.com/GD-ML/USP/blob/main/generation/GENERATION.md), we obtained the following evaluation results:

| Model Name | Pretrain    | Finetuning  | FID (↓) | IS (↑)  | sFID (↓) |
|------------|-------------|-------------|---------|---------|----------|
| DiT_B-2    | 1600 epochs | 400K steps  | 27.22   | 50.47   | 7.60     |
| DiT_L-2    | 1600 epochs | 400K steps  | 15.05   | 80.11   | 6.41     |
| DiT_XL-2   | 1600 epochs | 400K steps  | 9.64    | 112.93  | 6.30     |
| SiT_B-2    | 1600 epochs | 400K steps  | 22.10   | 61.59   | 5.88     |
| SiT_XL-2   | 1600 epochs | 400K steps  | 7.35    | 128.50  | 5.00     |
Our method is largely orthogonal to other DINO-based acceleration methods and can be combined with them. Results when combined with external-model-based methods:

| Model           | Params | Training    | FID (↓)       | IS (↑)         |
|-----------------|--------|-------------|---------------|----------------|
| SiT-XL/2        | 130M   | 400K steps  | 16.97         | 77.50          |
| **USP**         | 130M   | 400K steps  | **7.38**      | **127.96**     |
| REPA            | 130M   | 400K steps  | 7.9           | 122.6          |
| **USP + REPA**  | 130M   | 400K steps  | **6.26**      | **139.84**     |
| VAVAE           | 130M   | 64 epochs   | 5.18/2.15†    | 132.4/245.1†   |
| **USP + VAVAE** | 130M   | 64 epochs   | **4.2/1.81†** | **144/261.0†** |

*Table: Results Combined with External-Model-Based Methods. †: w/ CFG=10.0.*
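
The numbers above follow the reference evaluation protocol in GENERATION.md. For a quick sanity check on a small batch of generated samples (not a reproduction of the table), a generic FID implementation such as the one in `torchmetrics` can be used, for example:

```python
# Rough FID sanity check with torchmetrics (NOT the reference evaluation behind the table).
# Requires `pip install torchmetrics[image]`. Random uint8 images stand in for real data here.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

real_images = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=64)  # small feature dim keeps this toy check fast
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID on this tiny sample: {fid.compute().item():.2f}")
```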
## Usage

You can use this model with the `diffusers` library for unconditional image generation.

```python
from diffusers import DiffusionPipeline
import torch

# Load the USP image generation pipeline
# Replace "GD-ML/USP-Image_Generation" with the actual repo ID if different
pipeline = DiffusionPipeline.from_pretrained("GD-ML/USP-Image_Generation", torch_dtype=torch.float16)
pipeline.to("cuda")

# Generate an image
image = pipeline(num_inference_steps=50).images[0]

# Save or display the image
image.save("usp_generated_image.png")
print("Generated image saved as usp_generated_image.png")
```
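For reproducible samples, a seeded `torch.Generator` can be passed to the call above, assuming the loaded pipeline accepts the standard `generator` argument as most `diffusers` pipelines do:

```python
import torch

# Continues from the snippet above; `pipeline` is the loaded DiffusionPipeline.
# The `generator` argument is assumed to be supported, as in most diffusers pipelines.
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipeline(num_inference_steps=50, generator=generator).images[0]
image.save("usp_seeded_sample.png")
```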
For detailed instructions on pre-training and image generation tasks, please refer to the following guides in the [official GitHub repository](https://github.com/GD-ML/USP):
* [PRETRAIN.md](https://github.com/GD-ML/USP/blob/main/pretrain/PRETRAIN.md)
* [GENERATION.md](https://github.com/GD-ML/USP/blob/main/generation/GENERATION.md)

## Acknowledgement

Our code is based on [MAE](https://github.com/facebookresearch/mae), [DiT](https://github.com/facebookresearch/DiT), [SiT](https://github.com/willisma/SiT), and [VisionLLaMA](https://github.com/Meituan-AutoML/VisionLLaMA). Thanks for their great work.

## Citation

If you find USP useful in your research or applications, please consider citing our paper:

```bibtex
@misc{chu2025uspunifiedselfsupervisedpretraining,
      title={USP: Unified Self-Supervised Pretraining for Image Generation and Understanding},
      author={Xiangxiang Chu and Renda Li and Yong Wang},
      year={2025},
      eprint={2503.06132},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06132},
}
```