MPT-VC — Matryoshka Phase-on-Torus Video Codec

MPT-VC is a neural video tokenizer built almost entirely from deterministic operators: windowed axial 3D rotary position embeddings (RoPE), finite scalar quantization (FSQ), and a bit-exact arithmetic entropy coder. It compresses video into a single rate-ordered bitstream that truncates from a coarse preview to full resolution, and whose latent decodes back to pixels — so the compressed artifact is both a storage format and an information-complete, AI-friendly latent.

This card ships the trained weights only. All code (encoder/decoder, entropy coder, CLIs, tests) lives in the GitHub repository: https://github.com/k-kolomeitsev/mpt-vc


Highlights

  • Rate-scalable Matryoshka bitstream. One .mptvc file is read as a prefix: L_0 → preview, L_0+L_1 → medium, L_0+L_1+L_2 → full. No re-encode for an ABR ladder.
  • Windowed axial 3D RoPE. Relative RoPE inside local axial attention → every pyramid level is translation-equivariant and resolution-flexible; cost is linear in token count.
  • Scene-aware phase-jump. A per-scene virtual temporal offset decorrelates attention across shot cuts geometrically — a differentiable, mask-free attention gate.
  • Bit-exact entropy coding. The rate minimized in training equals the stored bytes to within 1%.
  • Invertible latent. The latent reconstructs the source video, and is also a native generation target (compact, discrete, collapse-free FSQ codebook).
  • Compact. ~5.0M parameters total.
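The relative-position property behind the RoPE and phase-jump bullets can be illustrated with a minimal 1D sketch. It is not the repository's implementation: the model uses windowed axial 3D RoPE, while this toy version rotates a single axis, and the helper names (`rope_rotate`, `score`, `virtual_time`) and the way θ_jump is added per scene are assumptions made for illustration only.

```python
import numpy as np

def rope_rotate(x, pos, theta_base=10000.0):
    """Rotate consecutive channel pairs of a 1D vector x by pos * freq_i."""
    d = x.shape[-1]
    freqs = theta_base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def score(q, k, pos_q, pos_k):
    """Attention logit between a query and a key after rotary encoding."""
    return float(rope_rotate(q, pos_q) @ rope_rotate(k, pos_k))

def virtual_time(t, scene_id, theta_jump=10000.0):
    """Hypothetical phase-jump: shift the temporal index by a per-scene offset."""
    return t + scene_id * theta_jump

rng = np.random.default_rng(0)
q, k = rng.standard_normal(16), rng.standard_normal(16)

# Shifting both positions by the same amount leaves the logit unchanged,
# which is the translation-equivariance the bullet refers to.
same_offset = np.isclose(score(q, k, 3, 7), score(q, k, 103, 107))

# Tokens in the same scene keep their relative phase under the offset;
# tokens in different scenes see a very large relative offset instead.
within_scene = score(q, k, virtual_time(3, 2), virtual_time(7, 2))
across_cut = score(q, k, virtual_time(3, 0), virtual_time(7, 1))
```

Because the logit depends only on the position difference, a constant per-scene shift is invisible inside a scene and only changes logits for pairs that straddle a cut, with no attention mask involved.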

Model details

Architecture            Matryoshka Laplacian pyramid, K=3 levels, local axial 3D-RoPE attention
Parameters              5,039,641 total (backbone 4,354,201 + entropy/hyperprior 685,440)
Patch / window          p=8, p_t=4, chunk T=32 (1.33 s @ 24 fps)
Channels / FSQ levels   C=(64, 96, 128), L=(8, 8, 10)
Canonical tile          480 × 864 @ 24 fps, RGB
RoPE                    θ_base = 10000, scene-cut θ_jump = 10000
Entropy model           Ballé scale hyperprior, σ_floor = 0.3, hyper_channels = 32
Bitstream               .mptvc format v2 (rate-ordered prefix, CRC32)
Training                progressive curriculum (per-level → joint), bf16 main graph, fp32 entropy/RoPE paths
Provenance              joint fine-tune step 1,350,000, stage joint

The checkpoint embeds the exact training Config (resolution, pyramid, FSQ, σ_floor, …). The CLIs rehydrate it from the checkpoint so the range coder stays bit-exact — you do not pass any model hyperparameters by hand.
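To make the FSQ entry in the table concrete, here is a simplified sketch of finite scalar quantization. It assumes that L=(8, 8, 10) gives the number of scalar levels per channel at each of the three pyramid levels; the function `fsq_quantize` is hypothetical and uses a plain tanh-and-round scheme rather than whatever bounding and rounding the repository actually implements.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Map each scalar in z to one of `levels` uniformly spaced values in [-1, 1].

    Returns (indices, dequantized): integer codes in [0, levels - 1] and the
    corresponding grid values. Simplified illustration, not the trained model.
    """
    bounded = np.tanh(z)                       # squash to (-1, 1)
    scaled = (bounded + 1.0) / 2.0 * (levels - 1)
    indices = np.round(scaled).astype(np.int64)
    dequantized = indices / (levels - 1) * 2.0 - 1.0
    return indices, dequantized

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 128)) * 3.0        # 4 tokens, 128 channels (assumed shape)
idx, zq = fsq_quantize(z, levels=10)
```

Every channel always lands on one of a fixed set of grid points, so there is no learned codebook that could collapse; the integer codes are what an entropy coder would then compress.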

Files

File            Size      SHA-256
mpt-vc-290h.pt  20.36 MB  96970b029fd3c938f1014b70a1d9bc0d13a903e992f2461089a4d227657d5ea2

mpt-vc-290h.pt is an inference-only checkpoint: a PyTorch dict with keys model, entropy, cfg, step, stage. The optimizer state has been stripped.
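A downloaded checkpoint can be checked against the SHA-256 listed above before it is loaded. The sketch below uses only the Python standard library; the helper names `sha256_of` and `verify_checkpoint` are illustrative and not part of the repository.

```python
import hashlib

EXPECTED_SHA256 = "96970b029fd3c938f1014b70a1d9bc0d13a903e992f2461089a4d227657d5ea2"

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Raise ValueError if the file's digest differs from the published one."""
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(f"SHA-256 mismatch for {path}: got {actual}")
    return actual

# Usage (path is wherever the weights were downloaded to):
# verify_checkpoint("mpt-vc-290h.pt")
```

After verification, `torch.load("mpt-vc-290h.pt", map_location="cpu")` should yield the dict described above. Since it embeds a Config object, recent PyTorch versions may require `weights_only=False`; that is an assumption, and the repository's CLIs handle loading for you.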


Intended use

  • Reconstruction / compression of video into a rate-scalable, invertible latent.
  • A frozen latent backbone for downstream video understanding (detection, tracking) on tokens.
  • A native generation substrate: synthesize video directly as MPT-VC latents, then decode.

Out of scope. MPT-VC has a fixed-size latent and does not adapt its rate to content, so it is not rate-distortion competitive with mature handcrafted codecs (HEVC/VVC) on smooth, high-frame-rate sequences. Long-range temporal context is, by design, the job of a model built on the latent, not of the compressor.


How to use

Inference (encode/decode) requires the Rust range coder, which only has to be built once. The weights from this model card are used with the CLIs in the code repository.

git clone https://github.com/k-kolomeitsev/mpt-vc
cd mpt-vc/codec

# dependencies (PyTorch + ffmpeg assumed present)
pip install -r requirements.txt

# build the mandatory native range coder (bit-exact arithmetic coder; needs a Rust toolchain)
cd range_coder_rs && cargo build --release && cd ..

# download the weights from this HF repo, e.g. via huggingface_hub:
#   huggingface_hub.hf_hub_download(repo_id="kkolomeitsev/mpt-vc", filename="mpt-vc-290h.pt")

Encode a video → .mptvc

python encode_cli.py \
    --input source.mp4 \
    --ckpt  mpt-vc-290h.pt \
    --output clip.mptvc

Decode .mptvc → video (with rate adaptation)

--k-max truncates the Matryoshka prefix: 0 = preview, 1 = medium, omit = full.

# full quality (all 3 levels)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output full.mp4

# medium (L_0 + L_1)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output medium.mp4 --k-max 1

# preview (L_0 only)
python decode_cli.py --input clip.mptvc --ckpt mpt-vc-290h.pt --output preview.mp4 --k-max 0

Add --fp32 to either CLI to force full-precision inference (default is bf16 on RTX 40/50/Hopper).
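The three decode commands above can be scripted to produce a whole quality ladder from one bitstream. This is a small wrapper sketch that uses only the flags shown in this card (`--input`, `--ckpt`, `--output`, `--k-max`, `--fp32`); the function names and the output file names are illustrative, and it assumes it is run from the directory that contains decode_cli.py.

```python
import subprocess
import sys

# Tier name -> value for --k-max (None means the flag is omitted: full quality).
LADDER = {"preview": 0, "medium": 1, "full": None}

def decode_command(bitstream, ckpt, output, k_max=None, fp32=False):
    """Build the argument list for one decode_cli.py invocation."""
    cmd = [sys.executable, "decode_cli.py",
           "--input", bitstream, "--ckpt", ckpt, "--output", output]
    if k_max is not None:
        cmd += ["--k-max", str(k_max)]
    if fp32:
        cmd.append("--fp32")
    return cmd

def decode_ladder(bitstream, ckpt, run=False):
    """Return (and optionally execute) one decode command per ladder tier."""
    commands = {}
    for name, k_max in LADDER.items():
        commands[name] = decode_command(bitstream, ckpt, f"{name}.mp4", k_max)
        if run:
            subprocess.run(commands[name], check=True)
    return commands

# Dry run: only builds the commands, nothing is executed.
commands = decode_ladder("clip.mptvc", "mpt-vc-290h.pt")
```

Passing `run=True` executes each command in turn and stops at the first failure because of `check=True`.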


Results

Matryoshka rate–distortion (held-out, 32 unseen clips)

The single bitstream trades quality for rate monotonically; the training-to-storage rate gap stays below 1%.

Truncation          bpp    PSNR-Y
preview (L_0)       0.011  27.85 dB
medium (L_0+L_1)    0.103  32.76 dB
full (L_0+L_1+L_2)  0.351  40.26 dB
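For a rough sense of scale, the bpp figures can be converted to bitrates. The sketch assumes that bpp means bits per pixel per frame and that the clips are at the canonical 480 × 864 tile at 24 fps; neither assumption is stated for the held-out set, so treat the numbers as indicative only.

```python
def bitrate_bps(bpp, width=864, height=480, fps=24):
    """Bits per second implied by a bits-per-pixel-per-frame figure."""
    return bpp * width * height * fps

# (truncation, bpp, PSNR-Y in dB) copied from the table above.
RD_POINTS = [("preview", 0.011, 27.85), ("medium", 0.103, 32.76), ("full", 0.351, 40.26)]

for name, bpp, psnr in RD_POINTS:
    mbps = bitrate_bps(bpp) / 1e6
    print(f"{name:8s} {mbps:.3f} Mbit/s  {psnr:.2f} dB")
# preview is about 0.109 Mbit/s, medium about 1.025, full about 3.494
```

Under these assumptions the full stream costs roughly 32 times the preview's bitrate for a gain of 12.41 dB in PSNR-Y.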

UVG — MPT-VC vs. NVIDIA Cosmos (CV8×8×8)

Full UVG benchmark (16 sequences, native 4K, native frame rate), identical 864×480 tiling, PSNR measured in-tensor. MPT-VC reconstructs at higher Y-PSNR on every sequence.

                MPT-VC Y  Cosmos Y  ΔY
Mean (16 seq.)  38.07 dB  36.35 dB  +1.72 dB

Per-sequence margin ranges from +0.60 dB (FlowerPan) to +4.28 dB (YachtRide); high-motion sequences (Jockey, ReadySetGo) do not reduce the advantage. Full per-sequence table is in the paper.

Throughput & memory (one 4K clip, 32 frames, 25 tiles, single GPU)

Path                                Time   PSNR-Y
MPT-VC, full (with entropy coding)  105 s  33.26 dB
MPT-VC, latent-direct (no entropy)  17 s   33.26 dB
Cosmos CV8×8×8                      34 s   30.98 dB

On the matched encode–decode basis (latent-direct, 17 s vs. 34 s), MPT-VC is ~2× faster. Reconstruction is bit-identical with and without the entropy coder. A single 4K window through the Cosmos encoder requests ~94 GiB; MPT-VC's windowed-local attention tiles at ~4.7 GiB per tile.
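The 25-tile figure is consistent with covering a 3840 × 2160 frame with overlapping 864 × 480 tiles. The sketch below shows one way to lay out such a grid with evenly spaced origins; the repository's actual tiling and blending strategy is not described in this card, so the layout and the helper names (`tile_origins`, `tile_grid`) are assumptions.

```python
import math

def tile_origins(length, tile):
    """Evenly spaced start offsets so tiles of size `tile` cover [0, length)."""
    if length <= tile:
        return [0]
    n = math.ceil(length / tile)          # fewest tiles that can cover the axis
    step = (length - tile) / (n - 1)      # first tile starts at 0, last ends at length
    return [round(i * step) for i in range(n)]

def tile_grid(width, height, tile_w=864, tile_h=480):
    """Top-left (x, y) corner of every tile, in row-major order."""
    return [(x, y)
            for y in tile_origins(height, tile_h)
            for x in tile_origins(width, tile_w)]

grid = tile_grid(3840, 2160)              # assumed 4K frame size
coded_pixels = len(grid) * 864 * 480
overhead = coded_pixels / (3840 * 2160)   # ratio of coded to source pixels
```

With this layout the grid is 5 × 5, neighbouring tiles overlap by 120 pixels horizontally and 60 vertically, and the codec processes 1.25 times as many pixels as the frame contains.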


Training data

Trained on approximately 290 hours of video.


Limitations

  • Fixed-rate latent; not RD-competitive with HEVC/VVC on easy, smooth content (see Intended use).
  • Native-resolution evaluation uses overlapping tiling, which adds overlap bits and removes cross-tile context.

Citation

@misc{kolomeitsev2026mptvc,
  title         = {MPT-VC: A Rate-Scalable Video Tokenizer with Windowed Axial RoPE and Scene-Aware Positional Encoding},
  author        = {Konstantin Kolomeitsev},
  year          = {2026},
  howpublished  = {\url{https://github.com/k-kolomeitsev/mpt-vc}}
}

License

Apache-2.0. See the code repository for the full text.
