Full creative production suite that runs entirely on your GPU.
No cloud. No API keys. No subscriptions. Every model runs on-device.
Download Visione | Documentation | GitHub
## Why Visione
Most AI creative tools are fragmented: one app for image gen, another for video, another for audio, each with its own cloud dependency and pricing tier. Visione puts the entire pipeline, from concept to final export, inside a single desktop application that runs on a consumer NVIDIA GPU (16GB VRAM).
You own your hardware, your models, and your outputs. Nothing is transmitted externally. Ever.
## What You Can Do
| Component | Description |
|---|---|
| Imagine | Text-to-image generation with 90+ style LoRAs across 3 model tiers (Z-Image Turbo, Klein 9B). Character @mentions for consistent subjects. |
| Animate | Image-to-video and text-to-video via LTX 2.3. 5 workflow modes: standard I2V/T2V, Best 3-stage, first-last-frame, audio-conditioned. |
| Retouch | Full image editor: inpainting, upscaling, reframing, face swap (InsightFace + FaceFusion), background removal, LUT color grading, optical realism effects, multi-reference compositing, and smart selection (SAM). |
| Retexture | Apply any of 90+ preset styles to existing images via LoRA, or transfer the style of a reference image using depth-conditioned generation. |
| Enhance | SeedVR2 video enhancement (3B/7B models), Real-ESRGAN upscaling, and RIFE frame interpolation. |
| Storyboard | 12-stage AI filmmaking pipeline: concept development with multi-agent LLM collaboration, character library, shot-by-shot generation, and ZIP export. |
| Sound Studio | ACE-Step music generation, Qwen3-TTS voiceover (28 preset voices + clone + design), and HunyuanVideo-Foley for video-to-audio. |
| Characters | Persistent character library with full-body 5-shot reference generation for visual consistency across shots and components. |
| Styles | Browse and install LoRAs from CivitAI directly inside the app. Manage custom styles with per-preset strength tuning (see the sketch below the table). |
| Gallery | Unified asset browser across all components with metadata, output modal, and send-to integration for cross-component workflows. |
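For a sense of how per-preset strength tuning maps onto code, here is a minimal sketch using the diffusers adapter API. The pipeline class, file path, and weight value are illustrative stand-ins, not Visione's actual models or internals:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Illustrative base pipeline; Visione's own model tiers differ.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load a style LoRA (e.g., one installed from CivitAI) under an adapter name.
pipe.load_lora_weights("styles/my_style.safetensors", adapter_name="my_style")

# Per-preset strength: each preset stores its own adapter weight.
pipe.set_adapters(["my_style"], adapter_weights=[0.8])

image = pipe("a portrait in the selected style", num_inference_steps=30).images[0]
image.save("styled.png")
```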
## Key Features
- 90+ style presets: LoRA-based styles spanning cinematic, illustration, animation, photography, design, and artist-specific looks. Browse and install more from CivitAI directly inside the app.
- Character consistency: Generate a persistent character once, then reference them by name across Imagine, Retouch, and Storyboard with @mentions.
- Smart VRAM management: Models load and unload sequentially to fit within 16GB. One active model at a time, with no manual memory management needed (see the sketch after this list).
- Multilingual UI: English, Italian, Spanish, French, German. (COMING SOON)
- Local LLM + VLM: Qwen3.5-4B handles prompt enhancement, image captioning, and storyboard agents. Falls back to Llama 3.2 3B on CPU if needed. No external API calls.
- Image Edit: client-side film emulation with grain, halation, vignette, pro-mist, chromatic aberration, highlight roll-off, and color temperature/tint controls.
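As a rough illustration of the sequential-loading idea, the sketch below keeps at most one model resident in VRAM and evicts the previous one before loading the next. Class and method names are hypothetical, not Visione's actual code:

```python
import gc
import torch

class SequentialModelManager:
    """Keep at most one model resident in VRAM at a time."""

    def __init__(self):
        self._name = None
        self._model = None

    def acquire(self, name, loader):
        # Reuse the resident model when a component asks for the same backbone.
        if self._name == name:
            return self._model
        # Evict whatever is loaded before bringing in the next model.
        if self._model is not None:
            self._model = None  # drop the last reference so it can be freed
            gc.collect()
            torch.cuda.empty_cache()
        self._model = loader().to("cuda")
        self._name = name
        return self._model
```

A component would call something like `manager.acquire("image-backbone", load_image_pipeline)` and never touch CUDA memory directly.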
## Architecture
Visione is a local client-server desktop app. The React frontend talks to a FastAPI backend over localhost; real-time progress streams via SSE. Heavy inference runs in-process (diffusers/PyTorch) or through a headless ComfyUI subprocess for video pipelines. The Tauri 2 shell wraps it as a native window and manages the backend lifecycle.
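The progress-streaming part of that design can be sketched in a few lines. This is a generic FastAPI SSE endpoint under assumed names (`/progress`, a fake job loop), not Visione's actual API:

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_job():
    # Stand-in for an inference run reporting per-step progress.
    for step in range(1, 11):
        await asyncio.sleep(0.5)
        yield {"step": step, "total": 10}

@app.get("/progress")
async def progress():
    async def event_stream():
        async for update in fake_job():
            # SSE frames are "data: <payload>\n\n"
            yield f"data: {json.dumps(update)}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

On the frontend this pairs with a plain `EventSource` listener pointed at the localhost backend.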
Models are shared across components wherever possible: the same image generation backbone serves Imagine, Retouch, Retexture, and Storyboard. All assets, models, and outputs stay on local storage.
Stack: Python 3.12 + FastAPI + SSE / React 18 + TypeScript + Zustand / Tauri 2 / ComfyUI headless / PyTorch 2.7 + CUDA
## Hardware Requirements
| | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA 12GB VRAM (RTX 3060) | NVIDIA 16GB VRAM (RTX 4080) |
| RAM | 16GB | 32GB |
| Storage | ~50GB (core models) | ~210GB (all models) |
| OS | Windows 10/11 | Windows 11 |
## License
MIT
## FAQ
Will this OOM my PC? There is a chance. I built in as many safeguards and as much memory management as possible (78 OOMs in 3 weeks really do something to a man), and everything has been stress-tested on my hardware, but that isn't a guarantee.
What are the minimum specs? Visione requires an NVIDIA GPU with at least 12GB VRAM (RTX 3060 or equivalent), 16GB of RAM, and approximately 50GB of free storage for the core models. Windows 10 or 11 is supported. For the full model set and the experience Visione was designed and tested around, the recommended setup is an NVIDIA GPU with 16GB VRAM (RTX 4080), 32GB of RAM, and around 210GB of storage. The Settings page includes a model manager where you can see what is installed, download what you need, and get prompted automatically when a feature requires a model you don't have yet.
Does it run on Mac or Linux? Not at this time. Visione is built on a CUDA stack and the models it runs (video generation, video enhancement, audio foley, and more) have no meaningful support outside of NVIDIA hardware. It requires a Windows machine with an NVIDIA GPU. Nothing is ruled out for future versions, but it isn't on the current roadmap.
Is this safe? The application runs completely offline once the models are downloaded; an internet connection is only needed to browse and install new styles. Furthermore, nothing in the installer is encrypted, so you can fully unpack it and check for yourself. I am just delivering a final package to remove the hassle of CLI installations.
Why is it free then? Because from the moment I wrote the first line of the design document, this was always conceived as a free application. If you find it useful, and without any obligation, there is a Ko-fi link hidden at the bottom of the settings.
But... I really wanna see the code now, are you publishing it? Yes. As soon as Visione reaches 1.0 in the coming weeks, the whole codebase will be released on a dedicated GitHub repository (which already exists).
Why do I need to use those specific models? While I understand you might be keen to experiment with your own models (and BYOM is a feature I'm considering), everything has been built and tested around those specific models for now.
Are you planning to include "X" in v1.0? Here is what didn't make the cut for this initial release but has already been scoped and defined, in no particular order and without commitment: multilingual support; Characters in Animate; Elements for Imagine and Animate; a video editor; Session Mode; and a few more things. One I'm particularly looking forward to: automatic hardware detection that identifies your GPU tier and tailors settings accordingly, so Visione adapts to your machine rather than the other way around.
Feedback or suggestions? Please, do. Feel free to reach out directly here, or via any of the other channels linked. Your feedback is much appreciated.