Proposal: Modulation Guidance — making AdaLN text-aware for quality steering

#122

by sorryhyun - opened Apr 10

Apr 10

Summary

Anima's AdaLN modulation path is entirely text-blind — shift/scale/gate coefficients are functions of timestep only. Text conditioning enters exclusively via cross-attention. Based on Starodubcev et al., "Rethinking Global Text Conditioning in Diffusion Transformers" (ICLR 2026), injecting a pooled text embedding into the modulation path and applying guidance in modulation space yields quality improvements orthogonal to CFG.

We ran pre-implementation validation experiments on the frozen Anima model to check whether this approach is viable. The results suggest it is — sharing the findings here in case they're useful.

Current state

Component	Text-dependent?	Notes
Cross-attention KV	Yes	Qwen3 → LLMAdapter → 28 blocks
AdaLN shift/scale/gate	No	`t_embedder` sees only timestep
CFG	Yes	Noise-space guidance (cond - uncond)

Validation results

Pooling strategy for global text representation

Evaluated 5 pooling strategies on crossattn_emb using K-Means clustering NMI against artist labels (1,416 images, 37 artists):

Strategy	Source	KMeans NMI
Max pool	crossattn_emb (post-LLMAdapter)	0.926
Mean pool	crossattn_emb (post-LLMAdapter)	0.551
Mean pool	prompt_embeds (pre-LLMAdapter)	0.400
EOS token	prompt_embeds (pre-LLMAdapter)	0.170
EOS token	crossattn_emb (post-LLMAdapter)	0.089

Max pooling on post-adapter embeddings dramatically outperforms alternatives. EOS is near-useless — Qwen3's causal LM EOS captures tokenization artifacts, not semantics. Mean pool drowns discriminative features in shared content tokens. The LLMAdapter itself concentrates discriminative information, making post-adapter pooling far richer than pre-adapter.
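
The difference between the pooling strategies can be illustrated on a toy token sequence. This is a dependency-free sketch, not the evaluation code behind the table: the function names and toy values are made up for illustration, and the real `crossattn_emb` would be a `(B, T, 1024)` tensor rather than nested lists.

```python
from typing import List

Sequence = List[List[float]]  # shape (T, D): one embedding vector per token


def max_pool(seq: Sequence) -> List[float]:
    """Per-dimension maximum over tokens (analogous to tensor.max(dim=1).values)."""
    return [max(col) for col in zip(*seq)]


def mean_pool(seq: Sequence) -> List[float]:
    """Per-dimension average over tokens."""
    return [sum(col) / len(col) for col in zip(*seq)]


def last_token_pool(seq: Sequence) -> List[float]:
    """Take only the final token (a stand-in for EOS pooling)."""
    return list(seq[-1])


# Toy sequence: three shared "content" tokens plus one token carrying a
# strong feature in dimension 2 (imagine an artist tag).
toy = [
    [0.1, 0.2, 0.0],
    [0.1, 0.1, 0.0],
    [0.2, 0.1, 4.0],  # discriminative token
    [0.1, 0.2, 0.0],  # last token
]

print(max_pool(toy))         # the spike in dimension 2 survives
print(mean_pool(toy))        # the spike is diluted by the other tokens
print(last_token_pool(toy))  # the spike is absent
```

In this toy case the discriminative feature survives max pooling, is diluted by mean pooling, and is lost by last-token pooling, which matches the ordering in the table.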

Quality direction consistency across content

The quality direction max_pool(p+) - max_pool(p-) was tested across 8 diverse content types (solo character, group scene, landscape, action, mecha, still life, abstract, portrait):

Metric	Value
Average pairwise cosine similarity	0.814
Minimum pairwise cosine similarity	0.770

All 28 pairwise cosine similarities exceed 0.77. A single global guidance direction generalizes across content — no need for content-conditioned directions.
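
For reference, the pairwise statistic can be computed as follows. This is a minimal sketch assuming one quality-direction vector per content type; the `directions` values are placeholders, not the measured embeddings.

```python
import math
from itertools import combinations
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_stats(directions: List[List[float]]) -> Tuple[int, float, float]:
    """Return (number of pairs, average cosine, minimum cosine)."""
    sims = [cosine(a, b) for a, b in combinations(directions, 2)]
    return len(sims), sum(sims) / len(sims), min(sims)


# Placeholder directions for 8 content types (illustrative values only).
directions = [[1.0, 0.1 * i, 0.05 * (i % 3)] for i in range(8)]

n_pairs, avg_sim, min_sim = pairwise_stats(directions)
print(n_pairs)  # 28, i.e. 8 choose 2
```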

Injection point comparison

Compared three injection points for the projected pooled text vector by measuring noise prediction MSE at varying perturbation scales:

Injection point	MSE @ α=2.0	MSE @ α=8.0	Growth α=4→8
Before t_embedding_norm	4.77e-4	1.08e-2	6.1x
After t_embedding_norm	4.76e-3	1.89e-1	7.2x
Into adaln_lora branch	4.29e-3	4.06e-2	2.6x

After normalization is optimal: ~10x more sensitive than before norm at α=2.0 (RMSNorm rescales the perturbation) and ~4.7x more dynamic range than the adaln_lora branch at α=8.0. All injection points remain stable at high α (no collapse).
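
The two ratios quoted above appear to come straight from the table columns. A quick check, using only the MSE values listed there (the dictionary keys are just labels chosen for this sketch):

```python
# MSE values copied from the injection point table, keyed by alpha.
mse = {
    "before_norm": {2.0: 4.77e-4, 8.0: 1.08e-2},
    "after_norm": {2.0: 4.76e-3, 8.0: 1.89e-1},
    "adaln_lora": {2.0: 4.29e-3, 8.0: 4.06e-2},
}

# Sensitivity: after-norm vs. before-norm response at alpha = 2.0.
sensitivity_ratio = mse["after_norm"][2.0] / mse["before_norm"][2.0]

# Dynamic range: after-norm vs. adaln_lora response at alpha = 8.0.
range_ratio = mse["after_norm"][8.0] / mse["adaln_lora"][8.0]

print(f"{sensitivity_ratio:.1f}x")  # 10.0x
print(f"{range_ratio:.1f}x")        # 4.7x
```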

Quality/resolution correlation in embedding space

Metric	Value
Standalone quality <> resolution cosine	0.021 (nearly orthogonal)
Per-content quality <> resolution cosine	0.496 (correlated)

Standalone directions are orthogonal, but per-content they correlate — high-resolution training images tend to be higher quality. This means including resolution tags (absurdres, highres) in the positive guidance prompt is beneficial, but using a separate resolution guidance direction on top of quality guidance would interfere.

Recommended guidance prompts:

p+: "absurdres, highres, masterpiece, best quality, score_7, score_8, score_9"
p-: "worst quality, low quality, score_1, score_2, score_3"

Proposed architecture change

A small projection MLP (~6.3M params, 0.3% of model) injected after t_embedding_norm:

pooled = crossattn_emb.max(dim=1).values          # (B, 1024)
t_embedding_B_T_D = t_embedding_B_T_D + pooled_text_proj(pooled).unsqueeze(1)

Zero-initialized output layer means no effect before distillation training. The distillation follows the paper's Section 5: freeze the model, train only the projection using teacher (full cross-attention) vs. student (unconditional cross-attention, projection active) with MSE loss. ~4K iterations on the existing training dataset.
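
The zero-initialization property can be checked with a tiny stand-in for the projection. This is a sketch under assumptions: the two-layer shape, the SiLU activation, and the toy dimensions are illustrative choices, not the actual `pooled_text_proj` architecture, and plain lists stand in for tensors.

```python
import math
import random
from typing import List


def silu(x: float) -> float:
    """SiLU activation (assumed here; the real activation is not stated)."""
    return x / (1.0 + math.exp(-x))


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """y = W x + b for a single vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def make_proj(d_in: int, d_hidden: int, d_out: int, seed: int = 0):
    """Hidden layer gets small random weights; output layer is all zeros."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_hidden)]
    b1 = [0.0] * d_hidden
    w2 = [[0.0] * d_hidden for _ in range(d_out)]  # zero-initialized output
    b2 = [0.0] * d_out
    return w1, b1, w2, b2


def pooled_text_proj(pooled: List[float], params) -> List[float]:
    w1, b1, w2, b2 = params
    hidden = [silu(h) for h in linear(pooled, w1, b1)]
    return linear(hidden, w2, b2)


params = make_proj(d_in=4, d_hidden=8, d_out=6)
pooled = [0.5, -1.0, 2.0, 0.0]
t_emb = [0.1 * i for i in range(6)]

# Injection as in the proposal: t_embedding + projection(pooled).
injected = [t + p for t, p in zip(t_emb, pooled_text_proj(pooled, params))]
print(injected == t_emb)  # True: untrained projection leaves the model unchanged
```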

Once trained, inference-time modulation guidance (Eq. 3 from the paper) steers quality through AdaLN coefficients — orthogonal to CFG, composable with LoRA/T-LoRA, negligible latency cost.
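
The thread does not restate Eq. 3, so the following is only a sketch under the assumption that it has a CFG-like form: the pooled vector is shifted along the quality direction `pooled(p+) - pooled(p-)` by a weight `w`. The function name, the weight name, and whether the shift is applied before or after the projection are assumptions made for illustration.

```python
from typing import List


def modulation_guidance(
    pooled: List[float],
    pooled_pos: List[float],
    pooled_neg: List[float],
    w: float,
) -> List[float]:
    """Shift the pooled text vector along the quality direction.

    ASSUMPTION: guided = pooled + w * (pooled_pos - pooled_neg).
    This mirrors CFG in form but acts on the modulation input, not on noise.
    """
    return [p + w * (a - b) for p, a, b in zip(pooled, pooled_pos, pooled_neg)]


# Toy 2-dimensional vectors (illustrative values only).
pooled = [1.0, 2.0]      # max-pooled embedding of the user prompt
pooled_pos = [3.0, 3.0]  # max-pooled embedding of p+
pooled_neg = [1.0, 4.0]  # max-pooled embedding of p-

unchanged = modulation_guidance(pooled, pooled_pos, pooled_neg, w=0.0)
steered = modulation_guidance(pooled, pooled_pos, pooled_neg, w=2.0)
print(unchanged)  # [1.0, 2.0]
print(steered)    # [5.0, 0.0]
```

With `w = 0` the modulation input is untouched, so guidance can be switched off without affecting cross-attention or CFG.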

Compatibility

Feature	Interaction
T-LoRA	Orthogonal (different parameter spaces)
CFG	Complementary (noise space vs. AdaLN space, they stack)
HydraLoRA	Shared pooling (`crossattn_emb.max(dim=1).values`)
Spectrum	Compatible (guidance applies to `emb_B_T_D` before blocks)

nagarago

Apr 10

Are you talking about this?
https://github.com/Anzhc/Anima-Mod-Guidance-ComfyUI-Node

sorryhyun

Apr 10

@nagarago Yeah, I've read that implementation, but I wanted to verify whether it would work properly and how I should implement the details. The experiments I wrote up are the grounding, so to speak, for the design decisions I made.

sorryhyun

Apr 10

@nagarago For example, I found that max pooling can be more helpful than the conventional EOS pooling that comfy clip uses. And because I don't know how quality (masterpiece, score_9...) and resolution (highres, absurdres...) tags were trained in cross_emb, I tried some guidance prompt variants.

sorryhyun

Apr 15

Here is an update: I tried a dynamic guidance strategy inspired by the original authors' appendix, and their 'i8_skip27' strategy works quite well. Here is a drop-in implementation as a replacement for the KSampler block. https://github.com/sorryhyun/ComfyUI-Spectrum-KSampler

ArranEye

Apr 15

@sorryhyun It seems that the default config of your Mod Guidance is pretty different from Anzhc's. The outputs of "KSampler (Spectrum + Mod Guidance)" and "Anima Mod Guidance + KSampler (Spectrum)" are significantly different, and color drift appears with some schedulers like bong_tangent.

sorryhyun

Apr 15

@ArranEye Yeah, those will be quite different since the default config was adjusted to my personal preference. Sorry for the bit of messiness.
The outputs should differ: I use a different mod guidance weight (trained personally), and Spectrum has also been adjusted for best quality.
And yes, I agree that color drift will happen with some schedulers. I haven't tested anything except the simple scheduler (with the er_sde method).

guri06

Apr 20

Thank you for your great work. While it’s still a bit unstable (e.g., color bleeding/burning issues or sampler/scheduler compatibility), it feels more powerful than implementing Modulation Guidance with CLIP-L. I have two questions:

Could you implement a stop_caching_step parameter for the Spectrum implementation? According to the paper and related implementations, skipping interpolation and performing direct computation for the final few steps (e.g., last 3 steps) is more effective for preserving detail.

Can I include artist tags like @big chungus in p+? Since the anima model’s Qwen 0.6B has learned many artist tags, I’m wondering if this could help with the prompt dilution issue—where artist tags tend to lose their influence as the prompt gets longer in standard Text Conditioning.

sorryhyun

Apr 20

@guri06 Thanks for your comment!

It was implemented internally as follows:

        self.step_idx = -1
        self.last_sigma: Optional[float] = None
        self.mode = "actual"
        self.curr_ws = window_size
        self.consec_cached = 0
        self.fwd_count = 0

        # Forecasters keyed by cond_or_uncond value (0=cond, 1=uncond)
        self.forecasters: Dict[int, SpectrumPredictor] = {}
        self.captured_feat: Optional[torch.Tensor] = None

    def should_cache(self) -> bool:
        if self.step_idx < self.warmup_steps:
            return False
        stop_at = self.num_steps - 3
        if self.step_idx >= stop_at:
            return False
        return (self.consec_cached + 1) % max(1, math.floor(self.curr_ws)) != 0

But I'll expose this as a Comfy node field. Thanks for the feedback.
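
As a rough illustration of what exposing the cutoff could look like, here is a standalone version of the same check with the hard-coded `3` turned into a parameter. The parameter name `stop_caching_step` follows the request above and is hypothetical, and the `consec_cached` bookkeeping in `schedule` (increment on a cached step, reset on an actual step) is an assumption, since the excerpt does not show how that counter is updated.

```python
import math
from typing import List


def should_cache(
    step_idx: int,
    num_steps: int,
    warmup_steps: int,
    consec_cached: int,
    window_size: float,
    stop_caching_step: int = 3,  # hypothetical name for the exposed field
) -> bool:
    """Same decision logic as the excerpt, with a configurable tail cutoff."""
    if step_idx < warmup_steps:
        return False  # always compute during warmup
    if step_idx >= num_steps - stop_caching_step:
        return False  # always compute the final steps to preserve detail
    return (consec_cached + 1) % max(1, math.floor(window_size)) != 0


def schedule(num_steps: int, warmup_steps: int, window_size: float,
             stop_caching_step: int = 3) -> List[str]:
    """Simulate which steps are cached (ASSUMED counter bookkeeping)."""
    modes, consec = [], 0
    for step in range(num_steps):
        if should_cache(step, num_steps, warmup_steps, consec,
                        window_size, stop_caching_step):
            modes.append("cached")
            consec += 1
        else:
            modes.append("actual")
            consec = 0
    return modes


modes = schedule(num_steps=10, warmup_steps=2, window_size=2)
print(modes)
```

Under these assumptions, a 10-step run with 2 warmup steps and a window of 2 alternates cached and actual steps in the middle and computes the last 3 steps directly.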

Actually, the quality tags I've been using, like masterpiece, best quality, and absurdres, were chosen arbitrarily. Since modulation guidance was trained to 'reproduce cross-attn', a tag should work as long as the base model knows it. So, I think @big chungus would work.

guri06

Apr 20

Oh, I didn't realize automated processing was already implemented. Sorry.

Based on my experience with a previous CLIP-L implementation, I’d like to suggest a few improvements. While this is based on empirical observation, I believe they could be beneficial:

AdaLN Modulation via Separate Encoding
In the previous CLIP-L implementation, I could create a separate Text Encode node to exclude "masterpiece" or "best quality" from the base conditioning and only include "1girl." However, the current structure doesn't allow for this. I'm not an expert on the underlying mechanics, but when "masterpiece" is present in both p+ and p(base), the image quality seems to degrade or "break" easily. On the other hand, removing quality tags from the base entirely causes them to be omitted from CFG conditioning.

Therefore, I would love to see an advanced node that accepts p(+), p(base), and p(-) as separate inputs. Thanks, and have a great day!

ArranEye

19 days ago

I've noticed you've been updating recently, and I'd like to ask what the main new developments are. How well does the pooled_text_proj-0429 model perform?

sorryhyun

19 days ago

@ArranEye I selected the checkpoint with the best eval loss, and the main new addition is dcw correction. Well... the project has drifted quite a bit from where it started as pure modulation guidance. Anyway, I aim for general quality improvement without additional inference costs.
