ETCHR-FLUX.2-klein-9B

๐Ÿ“–Paper | ๐Ÿ Homepage | ๐Ÿค—ETCHR-FLUX.2-klein-9B Model | ๐Ÿค—ETCHR SFT-400K Dataset | ๐Ÿค—ETCHR GRPO-10K Dataset | ๐Ÿค—DL3DV-2K Benchmark

ETCHR-FLUX.2-klein-9B is a novel question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models. By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.

๐Ÿ“ข News

๐ŸŒˆ Overview

We are thrilled to introduce ETCHR (Editing To Clarify and Harness Reasoning), a novel question-conditioned, reasoning-aware image editor built on FLUX.2-klein-base-9B designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs). By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.

Teaser

๐Ÿ’ก Highlights

  • ๐Ÿ”ฅ Decoupled & Plug-and-Play: ETCHR functions as a separate module, allowing it to assist diverse downstream MLLMs (such as Qwen3-VL-8B, Gemini-3.1-Flash-Lite, or Kimi K2.5) without requiring any task-specific fine-tuning on the understanding models themselves.
  • ๐Ÿ”ฅ Naturally Reflective Pipeline: Introduces an Edit-Verify-Reason inference mechanism where the understanding model filters out noisy or flawed edits, reverting safely to the original image when verification fails.

๐Ÿ“Š Results

We evaluate ETCHR across five distinct task families spanning fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding. Across all evaluated backbones, ETCHR consistently yields major improvements in Pass@1 accuracy:

Pipeline

๐Ÿ› ๏ธ Evaluation

Prepare your environment:

git clone https://github.com/InternLM/ETCHR.git
conda create -n ETCHR python==3.11
conda activate ETCHR
cd RL/Pref-GRPO
bash env_setup.sh fastvideo
pip install "vllm>=0.11.0"
pip install qwen-vl-utils==0.0.14

We Provide an example code running ETCHR on DL3DV-2K Benchmark in Evaluation/inference_dl3dv.py, you can start the evaluation with the following two steps:

Step 1: start a VLLM server for an understanding model (eg. Qwen3-VL-8B, Kimi K2.5, ...).

cd Evaluation
bash launch_vllm.sh

Step 2: Run ETCHR atop any understanding model

python inference_dl3dv.py

Cases

ETCHR can assist with a broad spectrum of understanding tasks, including fine-grained perception, chart reasoning, maze navigation, jigsaw puzzles, and 3D spatial understanding.

case3D

casejigsaw

casejigsaw

casejigsaw

๐Ÿ“„ License

Our work is based on FLUX.2-klein-base-9B, so please follow FLUX Non-Commercial License.

โœ’๏ธCitation

If you find this project useful, please kindly cite:

@article{zhang2026etchr,
  title={ETCHR: Editing To Clarify and Harness Reasoning},
  author={Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin},
  journal={arXiv preprint arXiv:2605.23897},
  year={2026}
}

โค๏ธ Acknowledgement

The base model is FLUX.2-klein-base-9B, a powerful image-to-image model.

The work is built upon DiffSynth-Studio and Pref-GRPO, two excellent codebases for Diffusion models training!


Downloads last month
125
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Collection including internlm/ETCHR-FLUX.2-klein-9B

Paper for internlm/ETCHR-FLUX.2-klein-9B