Proposal: Modulation Guidance — making AdaLN text-aware for quality steering
Summary
Anima's AdaLN modulation path is entirely text-blind — shift/scale/gate coefficients are functions of timestep only. Text conditioning enters exclusively via cross-attention. Based on Starodubcev et al., "Rethinking Global Text Conditioning in Diffusion Transformers" (ICLR 2026), injecting a pooled text embedding into the modulation path and applying guidance in modulation space yields quality improvements orthogonal to CFG.
We ran pre-implementation validation experiments on the frozen Anima model to check whether this approach is viable. The results suggest it is — sharing the findings here in case they're useful.
Current state
| Component | Text-dependent? | Notes |
|---|---|---|
| Cross-attention KV | Yes | Qwen3 → LLMAdapter → 28 blocks |
| AdaLN shift/scale/gate | No | t_embedder sees only timestep |
| CFG | Yes | Noise-space guidance (cond - uncond) |
Validation results
Pooling strategy for global text representation
Evaluated 5 pooling strategies on crossattn_emb using K-Means clustering NMI against artist labels (1,416 images, 37 artists):
| Strategy | Source | KMeans NMI |
|---|---|---|
| Max pool | crossattn_emb (post-LLMAdapter) | 0.926 |
| Mean pool | crossattn_emb (post-LLMAdapter) | 0.551 |
| Mean pool | prompt_embeds (pre-LLMAdapter) | 0.400 |
| EOS token | prompt_embeds (pre-LLMAdapter) | 0.170 |
| EOS token | crossattn_emb (post-LLMAdapter) | 0.089 |
Max pooling on post-adapter embeddings dramatically outperforms alternatives. EOS is near-useless — Qwen3's causal LM EOS captures tokenization artifacts, not semantics. Mean pool drowns discriminative features in shared content tokens. The LLMAdapter itself concentrates discriminative information, making post-adapter pooling far richer than pre-adapter.
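A minimal sketch of the three pooling strategies compared above, using NumPy arrays as a stand-in for the model's tensors. The function names, the padding mask, and the right-padding convention are assumptions for illustration; the post does not say how padding tokens were handled.

```python
import numpy as np

def max_pool(emb, mask):
    """Per-channel max over valid tokens. emb: (B, T, D), mask: (B, T) bool."""
    masked = np.where(mask[..., None], emb, -np.inf)  # padding can never win the max
    return masked.max(axis=1)

def mean_pool(emb, mask):
    """Average over valid tokens only."""
    m = mask[..., None].astype(emb.dtype)
    return (emb * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1.0)

def last_token(emb, mask):
    """Embedding of the last valid token (the EOS position, assuming right padding)."""
    last = mask.sum(axis=1) - 1
    return emb[np.arange(emb.shape[0]), last]

# Toy batch: 2 prompts, 6 token slots, 1024 channels (random values, not real embeddings)
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 6, 1024)).astype(np.float32)
mask = np.array([[1, 1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1, 1]], dtype=bool)

pooled = {
    "max": max_pool(emb, mask),
    "mean": mean_pool(emb, mask),
    "eos": last_token(emb, mask),
}
```

Each entry of `pooled` has shape (B, 1024), so any of them could be clustered and scored against artist labels in the same way.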
Quality direction consistency across content
The quality direction max_pool(p+) - max_pool(p-) was tested across 8 diverse content types (solo character, group scene, landscape, action, mecha, still life, abstract, portrait):
| Metric | Value |
|---|---|
| Average pairwise cosine similarity | 0.814 |
| Minimum pairwise cosine similarity | 0.770 |
All 28 pairwise cosine similarities (8 content types give 8·7/2 = 28 pairs) are at least 0.77. This suggests that a single global guidance direction generalizes across content, with no need for content-conditioned directions.
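The consistency check above can be sketched as follows. The helper names are hypothetical and the directions are synthetic (a shared axis plus noise), so the printed numbers illustrate the procedure only and do not reproduce the reported 0.814 and 0.770.

```python
import itertools
import numpy as np

def quality_direction(pos_emb, neg_emb):
    """max_pool(p+) - max_pool(p-) for one content type; inputs are (T, D)."""
    return pos_emb.max(axis=0) - neg_emb.max(axis=0)

def pairwise_cosine(directions):
    """Cosine similarity for every unordered pair of direction vectors."""
    unit = [d / np.linalg.norm(d) for d in directions]
    return [float(a @ b) for a, b in itertools.combinations(unit, 2)]

# Synthetic stand-in for the 8 per-content quality directions
rng = np.random.default_rng(0)
shared_axis = rng.normal(size=64)
directions = [shared_axis + 0.3 * rng.normal(size=64) for _ in range(8)]

sims = pairwise_cosine(directions)
print(f"pairs={len(sims)} mean={np.mean(sims):.3f} min={min(sims):.3f}")
```

With real embeddings, each entry of `directions` would come from `quality_direction` applied to the p+ and p- embeddings of one content prompt.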
Injection point comparison
Compared three injection points for the projected pooled text vector by measuring noise prediction MSE at varying perturbation scales:
| Injection point | MSE @ α=2.0 | MSE @ α=8.0 | Growth α=4→8 |
|---|---|---|---|
| Before t_embedding_norm | 4.77e-4 | 1.08e-2 | 6.1x |
| After t_embedding_norm | 4.76e-3 | 1.89e-1 | 7.2x |
| Into adaln_lora branch | 4.29e-3 | 4.06e-2 | 2.6x |
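Injecting after normalization is the best of the three: at α=2.0 it produces ~10x the MSE of pre-norm injection (a perturbation added before RMSNorm is rescaled by the norm, which presumably damps it), and at α=8.0 it produces ~4.7x the MSE of the adaln_lora branch, giving more dynamic range. All injection points remain stable at high α (no collapse).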
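As a quick arithmetic check, the two ratios quoted above can be recomputed from the table. The dictionary keys are hypothetical labels, and the α=4.0 values are only implied by the growth column, not measured values from the post.

```python
# MSE values copied from the injection-point table
mse = {
    "before_norm": {2.0: 4.77e-4, 8.0: 1.08e-2},
    "after_norm": {2.0: 4.76e-3, 8.0: 1.89e-1},
    "adaln_lora": {2.0: 4.29e-3, 8.0: 4.06e-2},
}
growth_4_to_8 = {"before_norm": 6.1, "after_norm": 7.2, "adaln_lora": 2.6}

# "~10x more sensitive": after-norm vs. before-norm at alpha = 2.0
sensitivity_ratio = mse["after_norm"][2.0] / mse["before_norm"][2.0]

# "~4.7x more dynamic range": after-norm vs. adaln_lora at alpha = 8.0
range_ratio = mse["after_norm"][8.0] / mse["adaln_lora"][8.0]

# MSE at alpha = 4.0 implied by the growth column (derived, not reported)
implied_mse_at_4 = {k: mse[k][8.0] / g for k, g in growth_4_to_8.items()}

print(f"sensitivity ratio: {sensitivity_ratio:.2f}")
print(f"range ratio: {range_ratio:.2f}")
```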
Quality/resolution correlation in embedding space
| Metric | Value |
|---|---|
| Standalone quality <> resolution cosine | 0.021 (nearly orthogonal) |
| Per-content quality <> resolution cosine | 0.496 (correlated) |
Standalone directions are orthogonal, but per-content they correlate — high-resolution training images tend to be higher quality. This means including resolution tags (absurdres, highres) in the positive guidance prompt is beneficial, but using a separate resolution guidance direction on top of quality guidance would interfere.
Recommended guidance prompts:
- p+: "absurdres, highres, masterpiece, best quality, score_7, score_8, score_9"
- p-: "worst quality, low quality, score_1, score_2, score_3"
Proposed architecture change
A small projection MLP (~6.3M params, 0.3% of model) injected after t_embedding_norm:
pooled = crossattn_emb.max(dim=1).values # (B, 1024)
t_embedding_B_T_D = t_embedding_B_T_D + pooled_text_proj(pooled).unsqueeze(1)
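The two lines above can be expanded into a self-contained sketch. NumPy is used here as a stand-in for torch so that it runs without the model. The layer sizes (1024 → 2048 → 2048), the SiLU activation, and the class name are assumptions: they are not given in the post, but that shape happens to give about 6.3M parameters.

```python
import numpy as np

class PooledTextProj:
    """Assumed layout: Linear(1024 -> 2048) -> SiLU -> Linear(2048 -> 2048, zero-init)."""

    def __init__(self, in_dim=1024, hidden_dim=2048, out_dim=2048, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=in_dim ** -0.5, size=(in_dim, hidden_dim)).astype(np.float32)
        self.b1 = np.zeros(hidden_dim, dtype=np.float32)
        # Zero-initialized output layer: the projection contributes nothing until trained
        self.w2 = np.zeros((hidden_dim, out_dim), dtype=np.float32)
        self.b2 = np.zeros(out_dim, dtype=np.float32)

    def __call__(self, pooled):
        h = pooled @ self.w1 + self.b1
        h = h / (1.0 + np.exp(-h))  # SiLU: x * sigmoid(x)
        return h @ self.w2 + self.b2

    def num_params(self):
        return sum(p.size for p in (self.w1, self.b1, self.w2, self.b2))

# Toy tensors with the shapes used in the proposal
rng = np.random.default_rng(1)
crossattn_emb = rng.normal(size=(2, 16, 1024)).astype(np.float32)   # (B, T, 1024)
t_embedding_B_T_D = rng.normal(size=(2, 1, 2048)).astype(np.float32)

pooled_text_proj = PooledTextProj()
pooled = crossattn_emb.max(axis=1)                                   # (B, 1024)
out = t_embedding_B_T_D + pooled_text_proj(pooled)[:, None, :]      # (B, 1, 2048)
print(pooled_text_proj.num_params())
```

In torch, the same layout would presumably be two `nn.Linear` layers with the second one's weight and bias zeroed.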
The output layer is zero-initialized, so the projection has no effect before distillation training. The distillation follows the paper's Section 5: freeze the model and train only the projection with an MSE loss between the teacher (full cross-attention) and the student (unconditional cross-attention, projection active). ~4K iterations on the existing training dataset.
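To make the teacher/student setup concrete, here is a toy sketch in NumPy. The "frozen model" is a made-up linear map, the projection is a single matrix `P`, and the dimensions, learning rate, and iteration count are arbitrary assumptions. It shows only the structure of the objective, not the actual training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_txt, d_mod, d_out = 64, 4, 8, 8

# Frozen toy "model": output = mod @ A + ctx @ B (stand-in for the DiT, never updated)
A = rng.normal(size=(d_mod, d_out)) / np.sqrt(d_mod)
B = rng.normal(size=(d_txt, d_out)) / np.sqrt(d_txt)

def frozen_model(mod, ctx):
    return mod @ A + ctx @ B

t_emb = rng.normal(size=(N, d_mod))     # timestep embedding (modulation input)
text_ctx = rng.normal(size=(N, d_txt))  # conditional cross-attention input
null_ctx = np.zeros_like(text_ctx)      # unconditional cross-attention input
pooled = text_ctx                       # toy pooled text vector

P = np.zeros((d_txt, d_mod))            # the only trainable weights, zero-initialized

def loss_and_grad(P):
    teacher = frozen_model(t_emb, text_ctx)               # full text conditioning
    student = frozen_model(t_emb + pooled @ P, null_ctx)  # text enters via modulation only
    resid = student - teacher
    loss = np.mean(resid ** 2)
    grad = 2.0 * pooled.T @ resid @ A.T / resid.size      # d(loss)/dP
    return loss, grad

initial_loss, _ = loss_and_grad(P)
for _ in range(200):
    _, grad = loss_and_grad(P)
    P -= 0.3 * grad                                       # plain gradient descent on P only
final_loss, _ = loss_and_grad(P)
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Because `P` starts at zero, the student initially equals the unconditional model, which mirrors the no-effect-before-training property of the zero-initialized output layer.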
Once trained, inference-time modulation guidance (Eq. 3 from the paper) steers quality through AdaLN coefficients — orthogonal to CFG, composable with LoRA/T-LoRA, negligible latency cost.
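A minimal sketch of how the two guidance mechanisms could coexist, assuming modulation guidance takes the CFG-like form "base pooled vector plus a weighted quality direction". The exact form of Eq. 3 should be checked against the paper. The function names and the `weight` parameter are hypothetical.

```python
import numpy as np

def modulation_guidance(pooled_base, pooled_pos, pooled_neg, weight):
    """Shift the pooled text vector along the quality direction max_pool(p+) - max_pool(p-).

    The result would be fed to the projection MLP in place of the plain pooled vector.
    """
    return pooled_base + weight * (pooled_pos - pooled_neg)

def cfg(noise_cond, noise_uncond, scale):
    """Standard classifier-free guidance, applied to noise predictions."""
    return noise_uncond + scale * (noise_cond - noise_uncond)

# Toy vectors: the two functions operate on different quantities, so they can be stacked
pooled_base = np.array([1.0, 2.0])
pooled_pos = np.array([3.0, 0.0])
pooled_neg = np.array([1.0, 1.0])
guided_pooled = modulation_guidance(pooled_base, pooled_pos, pooled_neg, weight=2.0)

guided_noise = cfg(np.array([2.0]), np.array([1.0]), scale=3.0)
print(guided_pooled, guided_noise)
```

Keeping the base, positive, and negative pooled vectors as separate arguments also leaves room for encoding each prompt independently.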
Compatibility
| Feature | Interaction |
|---|---|
| T-LoRA | Orthogonal (different parameter spaces) |
| CFG | Complementary (noise space vs. AdaLN space, they stack) |
| HydraLoRA | Shared pooling (crossattn_emb.max(dim=1).values) |
| Spectrum | Compatible (guidance applies to emb_B_T_D before blocks) |
- Here is an update: I tried a dynamic guidance strategy, inspired by the original authors' appendix, and their 'i8_skip27' strategy works quite well. Here is a drop-in implementation that replaces the KSampler block: https://github.com/sorryhyun/ComfyUI-Spectrum-KSampler
@sorryhyun It seems that the default config of your Mod Guidance is quite different from Anzhc's. The output of "KSampler (Spectrum + Mod Guidance)" differs significantly from that of "Anima Mod Guidance + KSampler (Spectrum)", and it shows color drift with some schedulers such as bong_tangent.
@ArranEye Yeah, those will be quite different, since the default config was adjusted to my personal preference. Sorry for the messiness.
The outputs should differ: I use different mod guidance weights (trained personally), and Spectrum has also been adjusted for best quality.
And yes, I agree that color drift will happen with some schedulers. I haven't tested anything other than the simple scheduler (with the er_sde method).
Thank you for your great work. While it’s still a bit unstable (e.g., color bleeding/burning issues or sampler/scheduler compatibility), it feels more powerful than implementing Modulation Guidance with CLIP-L. I have two questions:
Could you implement a stop_caching_step parameter for the Spectrum implementation? According to the paper and related implementations, skipping interpolation and performing direct computation for the final few steps (e.g., last 3 steps) is more effective for preserving detail.
Can I include artist tags like @big chungus in p+? Since the anima model’s Qwen 0.6B has learned many artist tags, I’m wondering if this could help with the prompt dilution issue—where artist tags tend to lose their influence as the prompt gets longer in standard Text Conditioning.
@guri06 Thanks for your comment!
- It is already implemented internally, as follows:
        self.step_idx = -1
        self.last_sigma: Optional[float] = None
        self.mode = "actual"
        self.curr_ws = window_size
        self.consec_cached = 0
        self.fwd_count = 0
        # Forecasters keyed by cond_or_uncond value (0=cond, 1=uncond)
        self.forecasters: Dict[int, SpectrumPredictor] = {}
        self.captured_feat: Optional[torch.Tensor] = None

    def should_cache(self) -> bool:
        if self.step_idx < self.warmup_steps:
            return False
        stop_at = self.num_steps - 3
        if self.step_idx >= stop_at:
            return False
        return (self.consec_cached + 1) % max(1, math.floor(self.curr_ws)) != 0
But I'll expose this as a ComfyUI node field. Thanks for the feedback.
- Actually, the quality tags I've been using, like `masterpiece, best quality, absurdres`, were chosen arbitrarily. Since modulation guidance was trained to 'reproduce cross-attn', any tag the base model knows should work. So I think `@big chungus` would work.
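The caching rule from the first answer can be simulated in isolation to see which steps would be forecast and which computed. This is a sketch under assumptions: the `stop_caching_steps` parameter is the hypothetical exposed field (the snippet hard-codes 3), the window size is held fixed, and the counter update in `advance` is a guess, since that bookkeeping is not shown in the thread.

```python
import math

class CacheSchedule:
    """Toy re-creation of the should_cache logic with a configurable tail of actual steps."""

    def __init__(self, num_steps, warmup_steps, window_size, stop_caching_steps=3):
        self.num_steps = num_steps
        self.warmup_steps = warmup_steps
        self.curr_ws = window_size
        self.stop_caching_steps = stop_caching_steps  # hypothetical node field
        self.step_idx = -1
        self.consec_cached = 0

    def should_cache(self):
        if self.step_idx < self.warmup_steps:
            return False  # always compute during warmup
        if self.step_idx >= self.num_steps - self.stop_caching_steps:
            return False  # always compute the final steps to preserve detail
        return (self.consec_cached + 1) % max(1, math.floor(self.curr_ws)) != 0

    def advance(self):
        # Assumed bookkeeping: count consecutive cached steps, reset on an actual step
        self.step_idx += 1
        cached = self.should_cache()
        self.consec_cached = self.consec_cached + 1 if cached else 0
        return cached

def plan(num_steps, warmup_steps, window_size, stop_caching_steps=3):
    sched = CacheSchedule(num_steps, warmup_steps, window_size, stop_caching_steps)
    return ["cached" if sched.advance() else "actual" for _ in range(num_steps)]

schedule = plan(num_steps=12, warmup_steps=3, window_size=2)
print([i for i, kind in enumerate(schedule) if kind == "cached"])
```

Under these assumptions, a window size of 2 alternates cached and actual steps between the warmup and the final three steps, and a window size of 1 disables caching entirely.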
Oh, I didn't realize automated processing was already implemented. Sorry.
Based on my experience with a previous CLIP-L implementation, I’d like to suggest a few improvements. While this is based on empirical observation, I believe they could be beneficial:
- AdaLN Modulation via Separate Encoding
In the previous CLIP-L implementation, I could create a separate Text Encode node to exclude "masterpiece" or "best quality" from the base conditioning and only include "1girl." However, the current structure doesn't allow for this. I'm not an expert on the underlying mechanics, but when "masterpiece" is present in both p+ and p(base), the image quality seems to degrade or "break" easily. On the other hand, removing quality tags from the base entirely causes them to be omitted from CFG conditioning.
Therefore, I would love to see an advanced node that accepts p(+), p(base), and p(-) as separate inputs. Thanks, and have a great day!
I've noticed you've been updating recently, and I'd like to ask what the main new developments are. How well does the pooled_text_proj-0429 model perform?