ObjectForesight-DiT (EPIC-KITCHENS-100)

📄 Paper (arXiv:2601.05237) · 📦 Dataset: raivn/ObjectForesight-EPIC · 🛠️ Code: RustinS/ObjectForesight

The main model from ObjectForesight, a 3D object-centric dynamics model that predicts H=8 future 6-DoF object poses from a single egocentric observation (scene point cloud + the object's recent pose context). This is the DiT (diffusion-transformer) variant trained on EPIC-KITCHENS-100.

Model

PoserV1 = PTv3 scene encoder (PointTransformer V3 / Sonata, 50.6M) + DiT diffusion temporal head (132.7M) → 183.25M params.


Component	Details
Encoder	PTv3, `embed_dim=768`, `in_channels=6` (camera-xyz ⊕ object-centric-xyz), `attn_obj` pooling, voxel grid `0.005 m`
Temporal head	DiT, 12 layers / 768-d / 12 heads, `adaln_zero` conditioning, cosine β-schedule, v-prediction, `T=1000`, 50 DDIM steps
Input	scene point cloud `[N,3]` (depth-lifted, voxel-downsampled to ~4096 pts) + `context_len=3` frames of `[t(3), rot6d(6)]` + bbox + object-in-camera pose
Output	`[H=8, 9]` future poses, `[t_x, t_y, t_z, rot6d(6)]` per frame; 6D rotation → SO(3) via Gram-Schmidt
Training data	raivn/ObjectForesight-EPIC, `frame_skips=0`, IoU-drop filtering
Checkpoint	epoch 134 / step 22k; batch 128; AdamW, cosine LR `2e-4`→`1e-5`, warmup 500, wd 0.01
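
The temporal-head row above names a cosine β-schedule with v-prediction. As a rough illustration only (not the repository's code), the sketch below builds a cosine `alpha_bar` schedule in its commonly used form and shows how a v-prediction relates to the clean sample and the noise. The offset `s=0.008`, all function names, and the evenly spaced DDIM timesteps are assumptions; the values actually used are defined in the code repository.

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..T-1 (assumed cosine form)."""
    x = np.arange(T + 1) / T
    f = np.cos((x + s) / (1 + s) * np.pi / 2) ** 2
    return np.clip(f[1:] / f[0], 1e-8, 1.0)

def add_noise(x0, eps, ab):
    """Forward process: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    return np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

def v_target(x0, eps, ab):
    """Quantity a v-prediction network regresses."""
    return np.sqrt(ab) * eps - np.sqrt(1 - ab) * x0

def x0_eps_from_v(x_t, v, ab):
    """Recover the clean sample and the noise from a v-prediction."""
    x0 = np.sqrt(ab) * x_t - np.sqrt(1 - ab) * v
    eps = np.sqrt(1 - ab) * x_t + np.sqrt(ab) * v
    return x0, eps

alpha_bar = cosine_alpha_bar()
# 50 evenly spaced timesteps from T-1 down to 0, as a DDIM sampler might visit them
ddim_steps = np.linspace(999, 0, 50).round().astype(int)
```

With v-prediction, the same network output yields both the clean-pose estimate and the noise estimate, which is what a DDIM update needs at each of the 50 steps.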

Reported metrics (EPIC-KITCHENS-100, from the paper)

ADE/FDE = average/final translation displacement error (m, ↓); ARE/FRE = average/final rotation error (°, ↓); DES/RES = error slope over the horizon (↓).

Model	ADE	FDE	DES	ARE	FRE	RES
ObjectForesight-DiT (this model)	0.019	0.035	0.005	7.98°	13.93°	1.86°
ObjectForesight-AR (baseline)	0.067	0.074	0.002	9.48°	12.58°	0.93°
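
For readers who want to compute comparable numbers on their own predictions, the metric definitions above can be sketched as follows. This is an illustrative reading, not the paper's evaluation code: the function names are hypothetical, rotation error is taken as the geodesic angle between rotation matrices, and the slope is assumed to be a least-squares line fit of per-step error against the step index.

```python
import numpy as np

def translation_errors(pred_t, gt_t):
    """Per-step Euclidean error for [H, 3] translations (same unit as the input)."""
    return np.linalg.norm(pred_t - gt_t, axis=-1)

def rotation_errors_deg(pred_R, gt_R):
    """Per-step geodesic angle in degrees between [H, 3, 3] rotation matrices."""
    rel = np.einsum("hij,hkj->hik", pred_R, gt_R)  # pred @ gt^T for each step
    cos = (np.trace(rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def summarize(err):
    """Average, final, and slope (assumed least-squares fit) of an error curve."""
    steps = np.arange(len(err))
    slope = np.polyfit(steps, err, 1)[0]
    return err.mean(), err[-1], slope
```

Applying `summarize` to the translation errors gives (ADE, FDE, DES) for one trajectory, and applying it to the rotation errors gives (ARE, FRE, RES); dataset-level numbers would then be averages over trajectories.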

Files

File	Size	Description
`model.safetensors`	0.73 GB	Inference weights, pickle-free. 183.25M params, fp32.
`best.pt`	0.73 GB	Same weights as a torch checkpoint (`state_dict` + training metrics) for the repo's loader.
`config.yaml`	n/a	Exact architecture + data-preprocessing recipe that defines this model.
`architecture.png`	n/a	Model diagram.
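
The listed file size follows from the parameter count, since fp32 stores four bytes per parameter. A quick check, assuming decimal gigabytes and ignoring the small file header:

```python
params = 183.25e6        # parameter count from the table above
bytes_per_param = 4      # fp32
size_gb = params * bytes_per_param / 1e9
print(f"{size_gb:.2f} GB")   # prints 0.73 GB
```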

Usage

This is a weights-only release; the model definition lives in RustinS/ObjectForesight. It needs CUDA-compiled deps for the PTv3 encoder:

pip install spconv-cu124          # pick the build that matches your CUDA version (here 12.4)
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.8.0+cu124.html   # adjust the torch/CUDA tag to your installed torch build
pip install flash-attn --no-build-isolation   # optional; falls back to SDPA if absent

Load the weights into PoserV1 (built from config.yaml):

import torch
from safetensors.torch import load_file
# from the RustinS/ObjectForesight repo:
from src.models.poser_v1.builder import build_poser_v1
from src.utils.config_adapter import apply_config_adapter   # builds the model cfg

# model_cfg is not defined in this snippet: it holds the encoder + temporal settings
# from config.yaml, prepared with apply_config_adapter (see the repo for the exact call)
model = build_poser_v1(**model_cfg)
sd = load_file("model.safetensors")          # raw inference weights
model.load_state_dict(sd, strict=False)      # only the tied `dit.*` alias is reported missing
model.eval().cuda()

# one observation -> 8 future 6-DoF poses
# batch: one batch from the dataset loader
# ctx_9d: 9-D context poses for that batch, not defined here; presumably the batch's
#         context_init_9d field (an assumption, check the repo)
with torch.no_grad():
    cond = model.condition_from_batch(batch)
    future = model.sample(cond["scene_pcd"], cond["context_vec"],
                          T_cam_anchor_obj=cond["T_cam_anchor_obj"],
                          steps=50, sampler="ddim", ctx_tokens_9d=ctx_9d)   # -> [B, 8, 9]
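
Each predicted pose is a 9-vector `[t_x, t_y, t_z, rot6d(6)]`, and the rotation part becomes a matrix in SO(3) through Gram-Schmidt. The sketch below shows one way to do that conversion with NumPy. It is not taken from the repository: the function names are hypothetical, and it assumes the six rotation values are two 3-vectors that become the first two columns of the rotation matrix. If the repository uses a different layout (rows instead of columns, for instance), the final stacking step would change.

```python
import numpy as np

def rot6d_to_matrix(rot6d):
    """Map [..., 6] vectors to [..., 3, 3] rotation matrices via Gram-Schmidt."""
    a1, a2 = rot6d[..., :3], rot6d[..., 3:6]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # remove the component of a2 along b1, then normalize
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)  # third axis completes a right-handed frame
    return np.stack([b1, b2, b3], axis=-1)  # basis vectors as columns (assumed)

def poses_to_matrices(future):
    """Turn [..., 9] poses into [..., 4, 4] homogeneous transforms."""
    t, r6 = future[..., :3], future[..., 3:9]
    T = np.zeros(future.shape[:-1] + (4, 4), dtype=future.dtype)
    T[..., :3, :3] = rot6d_to_matrix(r6)
    T[..., :3, 3] = t
    T[..., 3, 3] = 1.0
    return T
```

A `[B, 8, 9]` tensor from the sampler could be passed in as `poses_to_matrices(future.cpu().numpy())`, giving a `[B, 8, 4, 4]` array.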

Inputs come from the companion dataset raivn/ObjectForesight-EPIC. Its bundled loader (SceneSequenceDataset) produces batches with the fields the model expects: scene_pcd, context_init_9d, context_bbox_norm, and context_T_cam_anchor_obj.
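
The scene point cloud is described above as depth-lifted and voxel-downsampled to roughly 4096 points on a 0.005 m grid. To give a feel for what that preprocessing involves, here is a minimal NumPy sketch. It is not the bundled loader: the pinhole intrinsics `fx, fy, cx, cy`, the function names, the keep-one-point-per-voxel rule, and the random cap at 4096 points are all assumptions made for illustration.

```python
import numpy as np

def lift_depth(depth, fx, fy, cx, cy):
    """Back-project an [H, W] metric depth map to [N, 3] camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0  # skip pixels without a depth reading
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

def voxel_downsample(points, voxel=0.005, max_points=4096, seed=0):
    """Keep one point per occupied voxel, then cap the count by random sampling."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    kept = points[np.sort(first)]
    if len(kept) > max_points:
        rng = np.random.default_rng(seed)
        kept = kept[rng.choice(len(kept), max_points, replace=False)]
    return kept
```

For real inference, the bundled SceneSequenceDataset remains the reference, since the model was trained on its exact preprocessing.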

best.pt loads via the repo's own utilities with no missing/unexpected keys:

from src.models.poser_v1.utils.checkpoint import resolve_and_load_state_dict
sd, _ = resolve_and_load_state_dict("best.pt", map_location="cpu", prefer_ema=False)
model.load_state_dict(sd, strict=False)
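
Because both loading snippets pass `strict=False`, key mismatches are not raised as errors. `load_state_dict` returns the missing and unexpected keys, so they can be inspected explicitly. The toy module below stands in for PoserV1 purely for illustration; with the real model the same two attributes apply.

```python
import torch
from torch import nn

toy = nn.Linear(4, 2)                    # stand-in for the real model
toy_sd = {"weight": torch.zeros(2, 4)}   # state dict that deliberately lacks "bias"

result = toy.load_state_dict(toy_sd, strict=False)
print("missing:", result.missing_keys)
print("unexpected:", result.unexpected_keys)
```

Checking these two lists after loading is a cheap way to confirm that nothing other than the expected alias was skipped.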

License & attribution

Released under CC BY-NC 4.0, inherited from EPIC-KITCHENS-100 (the model is trained on derivatives of that data). Non-commercial research use only. You must cite ObjectForesight and EPIC-KITCHENS-100 and comply with the EPIC-KITCHENS terms. Do not use this model to identify or infer private information about individuals depicted in the source video.

Citation

@article{soraki2026objectforesight,
  title   = {ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos},
  author  = {Soraki, Rustin and Bharadhwaj, Homanga and Farhadi, Ali and Mottaghi, Roozbeh},
  journal = {arXiv preprint arXiv:2601.05237},
  year    = {2026}
}
@article{damen2022rescaling,
  title   = {Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100},
  author  = {Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Furnari, Antonino and
             Kazakos, Evangelos and Ma, Jian and Moltisanti, Davide and Munro, Jonathan and
             Perrett, Toby and Price, Will and Wray, Michael},
  journal = {International Journal of Computer Vision (IJCV)},
  year    = {2022}
}

Built with (please also cite): PointTransformer V3 / Sonata · EPIC-KITCHENS-100. See the code repository for full references.

