Abstract
A reinforcement learning and on-policy distillation approach enhances the visual quality and instruction-following capabilities of a diffusion model for image generation and editing tasks.
We present Qwen-Image-2.0-RL, a post-training pipeline that applies reinforcement learning from human feedback (RLHF) and on-policy distillation (OPD) to improve both the visual quality and instruction-following capability of the Qwen-Image-2.0 diffusion model. To provide reliable reward signals, we construct task-specific composite reward models by fine-tuning vision-language models with a pointwise scoring paradigm and chain-of-thought reasoning. For text-to-image (T2I) generation, the reward models cover the alignment, aesthetics, and portrait fidelity dimensions. For image editing tasks, the reward system addresses instruction-following accuracy and face identity preservation. Building on this reward system, we develop a scalable GRPO-based RL training framework that incorporates a hybrid classifier-free guidance (CFG) strategy to preserve pre-trained knowledge, prompt curation via intra-group reward range filtering, and per-category reward weight calibration. To merge the task-specialized RL policies for T2I and editing, we propose on-policy distillation as the final training stage, which consolidates multiple teachers into a single student model through trajectory-level velocity matching. Extensive evaluation shows that Qwen-Image-2.0-RL achieves an overall score of 57.84 on Qwen-Image-Bench (+2.61 over the base model) and Elo ratings of 1193 in the text-to-image arena (+78) and 1349 in the image edit arena (+93), demonstrating consistent gains in aesthetic quality, prompt adherence, and editing accuracy.
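As a rough illustration of the reward-side ideas named in the abstract (composite rewards with per-category weights, intra-group reward range filtering, and GRPO-style group-relative advantages), here is a minimal sketch. It is not the authors' implementation: the function names, the weight values, the `min_range` threshold, and the assumption that filtering drops prompts whose sampled group has too little reward spread are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def composite_reward(scores, weights):
    """Weighted average of per-dimension reward scores.

    `scores` and `weights` are dicts keyed by dimension name.
    The weight values used below are illustrative, not from the paper.
    """
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total


def keep_prompt(group_rewards, min_range=0.1):
    """Assumed filtering rule: keep a prompt only if the rewards of its
    sampled group span at least `min_range` (hypothetical threshold)."""
    return max(group_rewards) - min(group_rewards) >= min_range


def group_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Hypothetical per-dimension scores for three images sampled from one T2I prompt.
weights = {"alignment": 0.5, "aesthetics": 0.3, "portrait": 0.2}
samples = [
    {"alignment": 0.9, "aesthetics": 0.6, "portrait": 0.7},
    {"alignment": 0.4, "aesthetics": 0.8, "portrait": 0.5},
    {"alignment": 0.7, "aesthetics": 0.7, "portrait": 0.9},
]
rewards = [composite_reward(s, weights) for s in samples]
if keep_prompt(rewards):
    advantages = group_advantages(rewards)
```

A group in which every sample receives the same reward yields all-zero advantages and therefore no learning signal, which is one plausible motivation for filtering prompts by reward range before training.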
Community
Qwen3.7-Max/Plus is already live as a closed API; any plans for open-weight releases of the 3.7 family? (like 3.6-35B-A3B / 3.6-27B alongside 3.6-Max)
Would love to run it locally via llama.cpp / GGUF.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ERNIE-Image Technical Report (2026)
- Qwen-Image-2.0 Technical Report (2026)
- Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions (2026)
- AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward (2026)
- SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training (2026)
- ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning (2026)
- AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO (2026)
Hi there,
Thanks for the great work!
I have a quick question regarding the "face identity consistency reward" section in the paper. It mentions:
"We therefore introduce a dedicated model-based face identity scorer."
Could you share more details about this scorer?