MPT-VC — Matryoshka Phase-on-Torus Video Codec

MPT-VC is a neural video tokenizer built almost entirely from deterministic operators: windowed axial 3D rotary position embeddings (RoPE), finite scalar quantization (FSQ), and a bit-exact arithmetic entropy coder. It compresses video into a single rate-ordered bitstream that truncates from a coarse preview to full resolution, and whose latent decodes back to pixels — so the compressed artifact is both a storage format and an information-complete, AI-friendly latent.

Code: https://github.com/k-kolomeitsev/mpt-vc
This repo: the production checkpoint (inference-only) + this model card.

This card ships the trained weights only. All code (encoder/decoder, entropy coder, CLIs, tests) lives in the GitHub repository above.

Highlights

Rate-scalable Matryoshka bitstream. One .mptvc file is read as a prefix: L_0 → preview, L_0+L_1 → medium, L_0+L_1+L_2 → full. No re-encode for an ABR ladder.
Windowed axial 3D RoPE. Relative RoPE inside local axial attention → every pyramid level is translation-equivariant and resolution-flexible; cost is linear in token count.
Scene-aware phase-jump. A per-scene virtual temporal offset decorrelates attention across shot cuts geometrically — a differentiable, mask-free attention gate.
Bit-exact entropy coding. The rate minimized in training equals the stored bytes to within 1%.
Invertible latent. The latent reconstructs the source video, and is also a native generation target (compact, discrete, collapse-free FSQ codebook).
Compact. ~5.0M parameters total.

Model details


Architecture	Matryoshka Laplacian pyramid, `K=3` levels, local axial 3D-RoPE attention
Parameters	5,039,641 total (backbone 4,354,201 + entropy/hyperprior 685,440)
Patch / window	`p=8`, `p_t=4`, chunk `T=32` (1.33 s @ 24 fps)
Channels / FSQ levels	`C=(64, 96, 128)`, `L=(8, 8, 10)`
Canonical tile	`480 × 864` @ 24 fps, RGB
RoPE	`θ_base = 10000`, scene-cut `θ_jump = 10000`
Entropy model	Ballé scale hyperprior, `σ_floor = 0.3`, `hyper_channels = 32`
Bitstream	`.mptvc` format v2 (rate-ordered prefix, CRC32)
Training	progressive curriculum (per-level → joint), bf16 main graph, fp32 entropy/RoPE paths
Provenance	joint fine-tune step 1,350,000, stage `joint`

The checkpoint embeds the exact training Config (resolution, pyramid, FSQ, σ_floor, …). The CLIs rehydrate it from the checkpoint so the range coder stays bit-exact — you do not pass any model hyperparameters by hand.

Files

File	Size	SHA-256
`mpt-vc-290h.pt`	20.36 MB	`96970b029fd3c938f1014b70a1d9bc0d13a903e992f2461089a4d227657d5ea2`

mpt-vc-290h.pt is an inference-only checkpoint: a PyTorch dict with keys model, entropy, cfg, step, stage. The optimizer state has been stripped — these are inference weights.

Intended use

Reconstruction / compression of video into a rate-scalable, invertible latent.
A frozen latent backbone for downstream video understanding (detection, tracking) on tokens.
A native generation substrate: synthesize video directly as MPT-VC latents, then decode.

Out of scope. MPT-VC has a fixed-size latent and does not adapt its rate to content, so it is not rate-distortion competitive with mature handcrafted codecs (HEVC/VVC) on smooth, high-frame-rate sequences. Long-range temporal context is, by design, the job of a model built on the latent, not of the compressor.

How to use

Inference (encode/decode) needs the Rust range coder built once. The model card weights work with the CLIs in the code repo.

git clone https://github.com/k-kolomeitsev/mpt-vc
cd mpt-vc/codec

# dependencies (PyTorch + ffmpeg assumed present)
pip install -r requirements.txt

# build the mandatory native range coder (bit-exact arithmetic coder; needs a Rust toolchain)
cd range_coder_rs && cargo build --release && cd ..

# download the weights from this HF repo, e.g. via huggingface_hub:
#   huggingface_hub.hf_hub_download(repo_id="kkolomeitsev/mpt-vc", filename="mpt-vc-290h.pt")

Encode a video → `.mptvc`

python encode_cli.py \
    --input source.mp4 \
    --ckpt  mpt-vc-290h.pt \
    --output clip.mptvc

Decode `.mptvc` → video (with rate adaptation)

--k-max truncates the Matryoshka prefix: 0 = preview, 1 = medium, omit = full.

# full quality (all 3 levels)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output full.mp4

# medium (L_0 + L_1)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output medium.mp4 --k-max 1

# preview (L_0 only)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output preview.mp4 --k-max 0

Add --fp32 to either CLI to force full-precision inference (default is bf16 on RTX 40/50/Hopper).

Results

Matryoshka rate–distortion (held-out, 32 unseen clips)

The single bitstream trades quality for rate monotonically; the training-to-storage rate gap stays below 1%.

Truncation	bpp	PSNR-Y
preview (`L_0`)	0.011	27.85 dB
medium (`L_0+L_1`)	0.103	32.76 dB
full (`L_0+L_1+L_2`)	0.351	40.26 dB

UVG — MPT-VC vs. NVIDIA Cosmos (CV8×8×8)

Full UVG benchmark (16 sequences, native 4K, native frame rate), identical 864×480 tiling, PSNR measured in-tensor. MPT-VC reconstructs at higher Y-PSNR on every sequence.

	MPT-VC Y	Cosmos Y	ΔY
Mean (16 seq.)	38.07 dB	36.35 dB	+1.72 dB

Per-sequence margin ranges from +0.60 dB (FlowerPan) to +4.28 dB (YachtRide); high-motion sequences (Jockey, ReadySetGo) do not reduce the advantage. Full per-sequence table is in the paper.

Throughput & memory (one 4K clip, 32 frames, 25 tiles, single GPU)

Path	Time	PSNR-Y
MPT-VC, full (with entropy coding)	105 s	33.26
MPT-VC, latent-direct (no entropy)	17 s	33.26
Cosmos CV8×8×8	34 s	30.98

On the matched encode–decode basis MPT-VC is ~2× faster. Reconstruction is bit-identical with and without the entropy coder. A single 4K window through the Cosmos encoder requests ~94 GiB; MPT-VC's windowed-local attention tiles at ~4.7 GiB per tile.

Training data

Trained on approximately 290 hours of video.

Limitations

Fixed-rate latent; not RD-competitive with HEVC/VVC on easy, smooth content (see Intended use).
Native-resolution evaluation uses overlapping tiling, which adds overlap bits and removes cross-tile context.

Citation

@misc{kolomeitsev2026mptvc,
  title         = {MPT-VC: A Rate-Scalable Video Tokenizer with Windowed Axial RoPE and Scene-Aware Positional Encoding},
  author        = {Konstantin Kolomeitsev},
  year          = {2026},
  howpublished  = {\url{https://github.com/k-kolomeitsev/mpt-vc}}
}

License

Apache-2.0. See the code repository for the full text.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support