MPT-VC — Matryoshka Phase-on-Torus Video Codec
MPT-VC is a neural video tokenizer built almost entirely from deterministic operators: windowed axial 3D rotary position embeddings (RoPE), finite scalar quantization (FSQ), and a bit-exact arithmetic entropy coder. It compresses video into a single rate-ordered bitstream that truncates from a coarse preview to full resolution, and whose latent decodes back to pixels — so the compressed artifact is both a storage format and an information-complete, AI-friendly latent.
- Code: https://github.com/k-kolomeitsev/mpt-vc
- This repo: the production checkpoint (inference-only) + this model card.
This card ships the trained weights only. All code (encoder/decoder, entropy coder, CLIs, tests) lives in the GitHub repository above.
Highlights
- Rate-scalable Matryoshka bitstream. One
.mptvcfile is read as a prefix:L_0→ preview,L_0+L_1→ medium,L_0+L_1+L_2→ full. No re-encode for an ABR ladder. - Windowed axial 3D RoPE. Relative RoPE inside local axial attention → every pyramid level is translation-equivariant and resolution-flexible; cost is linear in token count.
- Scene-aware phase-jump. A per-scene virtual temporal offset decorrelates attention across shot cuts geometrically — a differentiable, mask-free attention gate.
- Bit-exact entropy coding. The rate minimized in training equals the stored bytes to within 1%.
- Invertible latent. The latent reconstructs the source video, and is also a native generation target (compact, discrete, collapse-free FSQ codebook).
- Compact. ~5.0M parameters total.
Model details
| Architecture | Matryoshka Laplacian pyramid, K=3 levels, local axial 3D-RoPE attention |
| Parameters | 5,039,641 total (backbone 4,354,201 + entropy/hyperprior 685,440) |
| Patch / window | p=8, p_t=4, chunk T=32 (1.33 s @ 24 fps) |
| Channels / FSQ levels | C=(64, 96, 128), L=(8, 8, 10) |
| Canonical tile | 480 × 864 @ 24 fps, RGB |
| RoPE | θ_base = 10000, scene-cut θ_jump = 10000 |
| Entropy model | Ballé scale hyperprior, σ_floor = 0.3, hyper_channels = 32 |
| Bitstream | .mptvc format v2 (rate-ordered prefix, CRC32) |
| Training | progressive curriculum (per-level → joint), bf16 main graph, fp32 entropy/RoPE paths |
| Provenance | joint fine-tune step 1,350,000, stage joint |
The checkpoint embeds the exact training Config (resolution, pyramid, FSQ, σ_floor, …). The CLIs
rehydrate it from the checkpoint so the range coder stays bit-exact — you do not pass any model
hyperparameters by hand.
Files
| File | Size | SHA-256 |
|---|---|---|
mpt-vc-290h.pt |
20.36 MB | 96970b029fd3c938f1014b70a1d9bc0d13a903e992f2461089a4d227657d5ea2 |
mpt-vc-290h.pt is an inference-only checkpoint: a PyTorch dict with keys
model, entropy, cfg, step, stage. The optimizer state has been stripped — these are
inference weights.
Intended use
- Reconstruction / compression of video into a rate-scalable, invertible latent.
- A frozen latent backbone for downstream video understanding (detection, tracking) on tokens.
- A native generation substrate: synthesize video directly as MPT-VC latents, then decode.
Out of scope. MPT-VC has a fixed-size latent and does not adapt its rate to content, so it is not rate-distortion competitive with mature handcrafted codecs (HEVC/VVC) on smooth, high-frame-rate sequences. Long-range temporal context is, by design, the job of a model built on the latent, not of the compressor.
How to use
Inference (encode/decode) needs the Rust range coder built once. The model card weights work with the CLIs in the code repo.
git clone https://github.com/k-kolomeitsev/mpt-vc
cd mpt-vc/codec
# dependencies (PyTorch + ffmpeg assumed present)
pip install -r requirements.txt
# build the mandatory native range coder (bit-exact arithmetic coder; needs a Rust toolchain)
cd range_coder_rs && cargo build --release && cd ..
# download the weights from this HF repo, e.g. via huggingface_hub:
# huggingface_hub.hf_hub_download(repo_id="kkolomeitsev/mpt-vc", filename="mpt-vc-290h.pt")
Encode a video → .mptvc
python encode_cli.py \
--input source.mp4 \
--ckpt mpt-vc-290h.pt \
--output clip.mptvc
Decode .mptvc → video (with rate adaptation)
--k-max truncates the Matryoshka prefix: 0 = preview, 1 = medium, omit = full.
# full quality (all 3 levels)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output full.mp4
# medium (L_0 + L_1)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output medium.mp4 --k-max 1
# preview (L_0 only)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output preview.mp4 --k-max 0
Add --fp32 to either CLI to force full-precision inference (default is bf16 on RTX 40/50/Hopper).
Results
Matryoshka rate–distortion (held-out, 32 unseen clips)
The single bitstream trades quality for rate monotonically; the training-to-storage rate gap stays below 1%.
| Truncation | bpp | PSNR-Y |
|---|---|---|
preview (L_0) |
0.011 | 27.85 dB |
medium (L_0+L_1) |
0.103 | 32.76 dB |
full (L_0+L_1+L_2) |
0.351 | 40.26 dB |
UVG — MPT-VC vs. NVIDIA Cosmos (CV8×8×8)
Full UVG benchmark (16 sequences, native 4K, native frame rate), identical 864×480 tiling,
PSNR measured in-tensor. MPT-VC reconstructs at higher Y-PSNR on every sequence.
| MPT-VC Y | Cosmos Y | ΔY | |
|---|---|---|---|
| Mean (16 seq.) | 38.07 dB | 36.35 dB | +1.72 dB |
Per-sequence margin ranges from +0.60 dB (FlowerPan) to +4.28 dB (YachtRide); high-motion sequences (Jockey, ReadySetGo) do not reduce the advantage. Full per-sequence table is in the paper.
Throughput & memory (one 4K clip, 32 frames, 25 tiles, single GPU)
| Path | Time | PSNR-Y |
|---|---|---|
| MPT-VC, full (with entropy coding) | 105 s | 33.26 |
| MPT-VC, latent-direct (no entropy) | 17 s | 33.26 |
| Cosmos CV8×8×8 | 34 s | 30.98 |
On the matched encode–decode basis MPT-VC is ~2× faster. Reconstruction is bit-identical with and without the entropy coder. A single 4K window through the Cosmos encoder requests ~94 GiB; MPT-VC's windowed-local attention tiles at ~4.7 GiB per tile.
Training data
Trained on approximately 290 hours of video.
Limitations
- Fixed-rate latent; not RD-competitive with HEVC/VVC on easy, smooth content (see Intended use).
- Native-resolution evaluation uses overlapping tiling, which adds overlap bits and removes cross-tile context.
Citation
@misc{kolomeitsev2026mptvc,
title = {MPT-VC: A Rate-Scalable Video Tokenizer with Windowed Axial RoPE and Scene-Aware Positional Encoding},
author = {Konstantin Kolomeitsev},
year = {2026},
howpublished = {\url{https://github.com/k-kolomeitsev/mpt-vc}}
}
License
Apache-2.0. See the code repository for the full text.