ObjectForesight-DiT (EPIC-KITCHENS-100)
📄 Paper (arXiv:2601.05237) · 📦 Dataset: raivn/ObjectForesight-EPIC · 🛠️ Code: RustinS/ObjectForesight
This is the main model from ObjectForesight, a 3D object-centric dynamics model that predicts H=8 future 6-DoF object poses from a single egocentric observation (a scene point cloud plus the object's recent pose context). This checkpoint is the DiT (diffusion transformer) variant trained on EPIC-KITCHENS-100.
Model
PoserV1 = PTv3 scene encoder (PointTransformer V3 / Sonata, 50.6M) + DiT diffusion temporal head (132.7M) → 183.25M params.
| Component | Details |
|---|---|
| Encoder | PTv3, embed_dim=768, in_channels=6 (camera-xyz ⊕ object-centric-xyz), attn_obj pooling, voxel grid 0.005 m |
| Temporal head | DiT, 12 layers / 768-d / 12 heads, adaln_zero conditioning, cosine β-schedule, v-prediction, T=1000, 50 DDIM steps |
| Input | scene point cloud [N,3] (depth-lifted, voxel-downsampled to ~4096 pts) + context_len=3 frames of [t(3), rot6d(6)] + bbox + object-in-camera pose |
| Output | [H=8, 9] future poses, [t_x, t_y, t_z, rot6d(6)] per frame; 6D rotation → SO(3) via Gram-Schmidt |
| Training data | raivn/ObjectForesight-EPIC, frame_skips=0, IoU-drop filtering |
| Checkpoint | epoch 134 / step 22k; batch 128; AdamW, cosine LR 2e-4→1e-5, warmup 500, wd 0.01 |
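The temporal-head row above names a cosine β-schedule, v-prediction, T=1000 training steps, and 50 DDIM sampling steps. The following is a minimal NumPy sketch of how those pieces usually fit together, assuming the standard cosine schedule (offset `s=0.008`, β clipped at 0.999), the standard v-parameterization, and evenly spaced DDIM timesteps. The function names are hypothetical, and the repository's exact schedule and timestep spacing may differ.

```python
import numpy as np

T = 1000  # number of training diffusion steps, as listed in the table


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps-1 (cosine schedule)."""
    x = np.arange(num_steps + 1) / num_steps
    f = np.cos((x + s) / (1.0 + s) * np.pi / 2.0) ** 2
    betas = np.clip(1.0 - f[1:] / f[:-1], 0.0, 0.999)  # per-step noise variance
    return np.cumprod(1.0 - betas)


alpha_bar = cosine_alpha_bar(T)


def v_target(x0: np.ndarray, eps: np.ndarray, t: int) -> np.ndarray:
    """v-prediction training target: v = sqrt(ab) * eps - sqrt(1 - ab) * x0."""
    a, s = np.sqrt(alpha_bar[t]), np.sqrt(1.0 - alpha_bar[t])
    return a * eps - s * x0


def x0_from_v(x_t: np.ndarray, v: np.ndarray, t: int) -> np.ndarray:
    """Recover the clean sample from a noisy sample and a predicted v."""
    a, s = np.sqrt(alpha_bar[t]), np.sqrt(1.0 - alpha_bar[t])
    return a * x_t - s * v


# 50 evenly spaced DDIM timesteps, ordered from most to least noisy (assumed spacing)
ddim_timesteps = np.linspace(T - 1, 0, 50).round().astype(int)

# Round-trip check on a toy [H=8, 9] pose tensor
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 9))
eps = rng.normal(size=(8, 9))
t = 500
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
x0_rec = x0_from_v(x_t, v_target(x0, eps, t), t)
```

In this parameterization the network output `v` determines both the clean sample and the noise, which is why `x0_rec` matches `x0` up to floating-point error when the exact target is plugged in.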
Reported metrics (EPIC-KITCHENS-100, from the paper)
ADE/FDE = average/final translation displacement error (m, ↓); ARE/FRE = average/final rotation error (°, ↓); DES/RES = error slope over the horizon (↓).
| Model | ADE | FDE | DES | ARE | FRE | RES |
|---|---|---|---|---|---|---|
| ObjectForesight-DiT (this model) | 0.019 | 0.035 | 0.005 | 7.98° | 13.93° | 1.86° |
| ObjectForesight-AR (baseline) | 0.067 | 0.074 | 0.002 | 9.48° | 12.58° | 0.93° |
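To make the metric names concrete, here is a small NumPy sketch of how such numbers could be computed for one predicted trajectory. It assumes ADE/FDE are the mean and last-step Euclidean translation errors, ARE/FRE are the mean and last-step geodesic rotation angles, and DES/RES are least-squares slopes of the per-step error against the step index. The slope definition in particular is an assumption, and all function names are hypothetical; the paper's evaluation code is the authority.

```python
import numpy as np


def translation_errors(pred_t: np.ndarray, gt_t: np.ndarray) -> np.ndarray:
    """Per-step Euclidean error (meters) for [H, 3] translations."""
    return np.linalg.norm(pred_t - gt_t, axis=-1)


def rotation_errors_deg(pred_R: np.ndarray, gt_R: np.ndarray) -> np.ndarray:
    """Per-step geodesic angle (degrees) between [H, 3, 3] rotation matrices."""
    rel = np.einsum("hji,hjk->hik", pred_R, gt_R)  # pred_R^T @ gt_R per step
    cos = (np.trace(rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def summarize(per_step: np.ndarray) -> dict:
    """Average, final, and (assumed) least-squares slope of a per-step error curve."""
    slope = np.polyfit(np.arange(len(per_step)), per_step, deg=1)[0]
    return {"average": float(per_step.mean()),
            "final": float(per_step[-1]),
            "slope": float(slope)}


def rot_z(deg: float) -> np.ndarray:
    """Rotation about the z-axis by `deg` degrees (toy data helper)."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy trajectory: the prediction drifts 1 cm and 2 degrees further per step
H = 8
gt_t = np.zeros((H, 3))
pred_t = np.stack([[0.01 * (k + 1), 0.0, 0.0] for k in range(H)])
gt_R = np.stack([np.eye(3)] * H)
pred_R = np.stack([rot_z(2.0 * (k + 1)) for k in range(H)])

trans = summarize(translation_errors(pred_t, gt_t))        # ADE / FDE / DES
rot = summarize(rotation_errors_deg(pred_R, gt_R))         # ARE / FRE / RES
```

On this toy input the translation error grows linearly from 0.01 m to 0.08 m, so the average is 0.045 m, the final error is 0.08 m, and the slope is 0.01 m per step.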
Files
| File | Size | Description |
|---|---|---|
| model.safetensors | 0.73 GB | Inference weights, pickle-free. 183.25M params, fp32. |
| best.pt | 0.73 GB | Same weights as a torch checkpoint (state_dict + training metrics) for the repo's loader. |
| config.yaml | n/a | Exact architecture + data-preprocessing recipe that defines this model. |
| architecture.png | n/a | Model diagram. |
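As a quick sanity check, the listed weight-file size follows from the parameter count and precision. The sketch below assumes 4 bytes per fp32 parameter, decimal gigabytes (10^9 bytes), and negligible header overhead.

```python
PARAMS = 183.25e6      # total parameter count reported for PoserV1
BYTES_PER_PARAM = 4    # fp32 stores each parameter in 4 bytes

size_bytes = PARAMS * BYTES_PER_PARAM
size_gb = size_bytes / 1e9       # decimal gigabytes
size_gib = size_bytes / 2**30    # binary gibibytes, for comparison

print(f"{size_gb:.2f} GB ({size_gib:.2f} GiB)")
```

The result rounds to 0.73 GB, consistent with the sizes listed for model.safetensors and best.pt.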
Usage
This is a weights-only release; the model definition lives in RustinS/ObjectForesight. It needs CUDA-compiled deps for the PTv3 encoder:
pip install spconv-cu124 # match your CUDA (e.g. 12.4)
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.8.0+cu124.html
pip install flash-attn --no-build-isolation # optional; falls back to SDPA if absent
Load the weights into PoserV1 (built from config.yaml):
import torch
from safetensors.torch import load_file
# from objectforesight repo:
from src.models.poser_v1.builder import build_poser_v1
from src.utils.config_adapter import apply_config_adapter # builds the model cfg
model = build_poser_v1(**model_cfg) # model_cfg from config.yaml (encoder + temporal)
sd = load_file("model.safetensors") # raw inference weights
model.load_state_dict(sd, strict=False) # only the tied `dit.*` alias is reported missing
model.eval().cuda()
# one observation -> 8 future 6-DoF poses
# `batch` comes from the dataset loader; `ctx_9d` is not defined in this snippet
# (presumably the 9-D context poses, e.g. `context_init_9d` from the batch)
with torch.no_grad():
    cond = model.condition_from_batch(batch)
    future = model.sample(cond["scene_pcd"], cond["context_vec"],
                          T_cam_anchor_obj=cond["T_cam_anchor_obj"],
                          steps=50, sampler="ddim", ctx_tokens_9d=ctx_9d)  # -> [B, 8, 9]
Inputs come from the companion dataset raivn/ObjectForesight-EPIC. Its bundled loader (SceneSequenceDataset) produces batches in the format the model expects, with the keys scene_pcd, context_init_9d, context_bbox_norm, and context_T_cam_anchor_obj.
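Each predicted pose is a 9-vector of translation plus a 6D rotation, which is mapped to SO(3) via Gram-Schmidt. The sketch below shows one way to turn an array of such poses into 4×4 homogeneous transforms. It uses NumPy rather than torch for brevity and assumes the six rotation values are the first two columns of the rotation matrix, concatenated. The repository may use a different convention (for example rows), and both function names are hypothetical.

```python
import numpy as np


def rot6d_to_matrix(rot6d: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Map [..., 6] 6D rotations to [..., 3, 3] rotation matrices via Gram-Schmidt.

    Assumption: the 6 values are the first two columns of R, concatenated.
    """
    a1, a2 = rot6d[..., :3], rot6d[..., 3:]
    b1 = a1 / (np.linalg.norm(a1, axis=-1, keepdims=True) + eps)
    # Remove the component of a2 along b1, then normalize
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / (np.linalg.norm(a2_orth, axis=-1, keepdims=True) + eps)
    b3 = np.cross(b1, b2)  # third axis completes a right-handed frame
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 become the columns


def poses_to_matrices(poses_9d: np.ndarray) -> np.ndarray:
    """Convert [..., 9] poses (t_x, t_y, t_z, rot6d) to [..., 4, 4] transforms."""
    poses_9d = np.asarray(poses_9d, dtype=np.float64)
    out = np.zeros(poses_9d.shape[:-1] + (4, 4))
    out[..., :3, :3] = rot6d_to_matrix(poses_9d[..., 3:])
    out[..., :3, 3] = poses_9d[..., :3]
    out[..., 3, 3] = 1.0
    return out


# Toy stand-in for one predicted trajectory of shape [H=8, 9]
rng = np.random.default_rng(0)
future_np = rng.normal(size=(8, 9))
transforms = poses_to_matrices(future_np)  # [8, 4, 4]
rotations = transforms[:, :3, :3]
```

With real model output, a torch tensor would first be moved to the CPU and converted with `.numpy()`. Because Gram-Schmidt orthonormalizes whatever the network emits, the resulting rotation blocks are valid rotations even when the raw 6D values are not exactly orthonormal.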
best.pt loads via the repo's own utilities with no missing/unexpected keys:
from src.models.poser_v1.utils.checkpoint import resolve_and_load_state_dict
sd, _ = resolve_and_load_state_dict("best.pt", map_location="cpu", prefer_ema=False)
model.load_state_dict(sd, strict=False)
License & attribution
Released under CC BY-NC 4.0, inherited from EPIC-KITCHENS-100 (the model is trained on derivatives of that data). Non-commercial research use only. You must cite ObjectForesight and EPIC-KITCHENS-100 and comply with the EPIC-KITCHENS terms. Do not use this model to identify or infer private information about individuals depicted in the source video.
Citation
@article{soraki2026objectforesight,
title = {ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos},
author = {Soraki, Rustin and Bharadhwaj, Homanga and Farhadi, Ali and Mottaghi, Roozbeh},
journal = {arXiv preprint arXiv:2601.05237},
year = {2026}
}
@article{damen2022rescaling,
title = {Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100},
author = {Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Furnari, Antonino and
Kazakos, Evangelos and Ma, Jian and Moltisanti, Davide and Munro, Jonathan and
Perrett, Toby and Price, Will and Wray, Michael},
journal = {International Journal of Computer Vision (IJCV)},
year = {2022}
}
Built with (please also cite): PointTransformer V3 / Sonata · EPIC-KITCHENS-100. See the code repository for full references.
