Video2Sim

Video2Sim is a Docker-based pipeline for converting raw video (or sensor logs) into simulation-ready 3D assets using a sequence of modular components.

News

  • 12/25/2025: Pipeline ready for public use. It currently builds on Segment Anything Model 3, Depth Anything 3, and HoloScene (see Citations).

Requirements

  • NVIDIA GPU with CUDA support
  • 200 GB of available disk space
  • Docker Engine
  • A Hugging Face access token with permission to pull from:
    • https://huggingface.co/facebook/sam3
  • VRAM considerations
    • The primary VRAM bottleneck in the current pipeline is DA3.
    • As a reference point, 380 frames required 80 GB of VRAM.
    • The other processes did not exceed 16 GB of VRAM usage in my tests.
    • Because of this, I recommend renting a cloud GPU to run DA3, then copying the resulting file (data/input/custom/<SCENE_NAME>/transforms.json) back to a local machine (or a smaller GPU instance) to complete training; a sketch of this split workflow follows below.
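
A minimal sketch of that split workflow, assuming SSH access to the cloud instance and the default repository layout (the cloud-gpu host name and remote checkout path are placeholders):

    # On the cloud GPU: run the first two stages
    docker compose up --build preprocessor
    docker compose up --build da3

    # Back on the local machine: pull the DA3 output into the same layout
    scp cloud-gpu:~/Video2Sim/data/input/custom/<SCENE_NAME>/transforms.json \
        data/input/custom/<SCENE_NAME>/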

Pipeline Overview

Input

Record a horizontal (landscape) video of the scene while orbiting around it. I recommend researching proper video/image capture techniques for 3D reconstruction (slow, steady motion with high overlap between viewpoints).

Preprocessor

Takes a video or ROS bag and outputs images.
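
Frame extraction runs inside the container, but it is conceptually equivalent to an ffmpeg call like the one below (the input filename, output pattern, and FPS value are illustrative; the module reads its settings from .env):

    # Extract frames at 2 fps from the input video (values illustrative)
    mkdir -p data/input/frames
    ffmpeg -i data/input/video.mp4 -vf fps=2 data/input/frames/frame_%05d.png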

DA3 (Depth Anything 3)

Reads extracted frames and produces transforms.json, along with additional supporting files. The JSON contains camera intrinsics, extrinsics, and per-frame camera poses.
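
To sanity-check the output, transforms.json can be inspected with jq. The field names below assume the common instant-ngp-style layout (a shared intrinsics block plus a frames array); verify them against your actual file:

    # Count reconstructed frames and print the shared intrinsics
    # (assumes instant-ngp-style keys: fl_x, fl_y, cx, cy)
    jq '.frames | length' data/input/custom/<SCENE_NAME>/transforms.json
    jq '{fl_x, fl_y, cx, cy}' data/input/custom/<SCENE_NAME>/transforms.json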

SAM 3 (Segment Anything Model 3)

Reads extracted frames and produces consistent per‑frame instance masks. Prompts are defined in prompts.txt.
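
A plausible prompts.txt, assuming the common one-text-prompt-per-line format (the object names and file location are examples; check the sam3 module for the expected path):

    # Write one concept per line (format and path assumed)
    cat > prompts.txt <<'EOF'
    chair
    table
    monitor
    EOF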

HoloScene

Uses extracted frames, poses, masks, and Marigold‑generated priors to reconstruct the 3D scene and export final assets.

Running the Pipeline

  1. Place input files
    Place the video file into data/input.

  2. Configure environment
    Fill out .env (scene name, frame-extraction FPS, etc.); a hypothetical example is sketched after this list.

  3. Configure HoloScene Configs
    Fill out the configuration files in modules/holoscene/confs/ (base.conf, post.conf, tex.conf).

  4. Build and run each module in sequence

    docker compose up --build preprocessor
    docker compose up --build da3
    docker compose up --build sam3
    docker compose up --build holoscene
  5. Retrieve results
    Results will be generated into data/output.
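
As referenced in step 2, a hypothetical .env. The variable names here are placeholders, not the repository's actual keys; use whatever your checkout's .env template defines:

    # Illustrative .env (variable names are placeholders)
    SCENE_NAME=my_scene
    EXTRACTION_FPS=2
    HF_TOKEN=hf_xxxxxxxxxxxxxxxx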

Note: To keep a container alive for debugging, temporarily set its command in docker-compose.yml to ["tail", "-f", "/dev/null"].
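
With that override in place, you can start the stalled service detached and open a shell inside it (sam3 is just an example service name):

    # Start the container in the background, then attach
    docker compose up -d sam3
    docker compose exec sam3 bash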

Citations

@article{depthanything3,
      title         = {Depth Anything 3: Recovering the Visual Space from Any Views},
      author        = {Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
      journal       = {arXiv preprint arXiv:2511.10647},
      year          = {2025}
}
@misc{carion2025sam3segmentconcepts,
      title         = {SAM 3: Segment Anything with Concepts},
      author        = {Nicolas Carion and Laura Gustafson and Yuan-Ting Hu and Shoubhik Debnath and Ronghang Hu and Didac Suris and Chaitanya Ryali and Kalyan Vasudev Alwala and Haitham Khedr and Andrew Huang and Jie Lei and Tengyu Ma and Baishan Guo and Arpit Kalla and Markus Marks and Joseph Greer and Meng Wang and Peize Sun and Roman Rädle and Triantafyllos Afouras and Effrosyni Mavroudi and Katherine Xu and Tsung-Han Wu and Yu Zhou and Liliane Momeni and Rishi Hazra and Shuangrui Ding and Sagar Vaze and Francois Porcher and Feng Li and Siyuan Li and Aishwarya Kamath and Ho Kei Cheng and Piotr Dollár and Nikhila Ravi and Kate Saenko and Pengchuan Zhang and Christoph Feichtenhofer},
      year          = {2025},
      eprint        = {2511.16719},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CV},
      url           = {https://arxiv.org/abs/2511.16719},
}
@misc{xia2025holoscene,
      title         = {HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video},
      author        = {Hongchi Xia and Chih-Hao Lin and Hao-Yu Hsu and Quentin Leboutet and Katelyn Gao and Michael Paulitsch and Benjamin Ummenhofer and Shenlong Wang},
      year          = {2025},
      eprint        = {2510.05560},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CV},
      url           = {https://arxiv.org/abs/2510.05560},
}
