This is a PyTorch/GPU implementation of the paper Video Occupancy Models. The code includes three ways of quantizing the input video frames: `vae`, which uses a VQ-VAE; `dino`, which uses quantized DINO; and `musik`, which uses quantized multi-step inverse dynamics.
If you find this code useful, please cite:
```bibtex
@Article{VideoOccupancyModels2024,
author = {Manan Tomar and Philippe Hansen-Estruch and Philip Bachman and Alex Lamb and John Langford and Matthew E. Taylor and Sergey Levine},
journal = {arXiv:2407.09533},
title = {Video Occupancy Models},
year = {2024},
}
```
The main packages are listed in the requirements.txt file. This code has been tested in a virtual environment with Python 3.8 and the package versions listed in the requirements file.
The following table provides the pre-trained model checkpoints and datasets used in the paper:
| | Cheetah | Walker |
|---|---|---|
| VQ-VAE fine-tuned model checkpoint | download | download |
| DINO latent datasets | link | |
| VQ-VAE latent datasets | link | link |
You will need to download the contents of this folder and place them one directory above this repo. The folder contains model descriptions for using a VQ-VAE model from the taming-transformers codebase.
Run train_vq_vae_voc.py to train a VOC model on stored VQ-VAE latents. To train both the VQ-VAE and the VOC model on pixel data, run train_pixel_vq_vae_voc.py. To create your own latents by training a VQ-VAE on a custom dataset, use the collect_latents() and train_vq_latents() methods in save_vq_codes.py.
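The core operation behind storing VQ-VAE latents is a nearest-neighbour codebook lookup. The sketch below illustrates the idea only; the function names and sizes are ours, not the repo's actual API:

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) continuous encoder outputs.
    codebook: (K, D) learned VQ-VAE embedding table.
    Returns:  (N,) integer codes -- the discrete latents saved to disk.
    """
    dists = torch.cdist(features, codebook)  # (N, K) pairwise L2 distances
    return dists.argmin(dim=1)

torch.manual_seed(0)
codebook = torch.randn(16, 8)          # toy codebook: K=16 codes, D=8 dims
features = codebook[[3, 7, 7]] + 0.01  # features lying near codes 3, 7, 7
codes = quantize(features, codebook)
print(codes.tolist())                  # [3, 7, 7]
```

The stored integer codes are what the VOC model is later trained on, which is much cheaper than training against raw pixels.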
We use a quantized version of DINO from BEiT-v2. You will need to download this DINO model file and place it one directory above this repo.
Run train_vq_dino_voc.py to train a VOC model on stored DINO latents. Again, to create your own latents by running quantized DINO on a custom dataset, use the collect_latents() method in save_dino_codes.py.
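The overall pattern of latent collection, pushing frames through a frozen tokenizer in batches and saving the resulting codes, can be sketched as follows. The encoder and all names here are placeholders, not the repo's API:

```python
import torch

@torch.no_grad()  # the tokenizer stays frozen while codes are collected
def collect_codes(encoder, frames: torch.Tensor, batch_size: int = 32) -> torch.Tensor:
    """Encode a stack of frames in batches and concatenate the codes."""
    chunks = []
    for i in range(0, len(frames), batch_size):
        chunks.append(encoder(frames[i:i + batch_size]))
    return torch.cat(chunks)

torch.manual_seed(0)
# Stand-in "quantized tokenizer": argmax over a random linear projection.
proj = torch.randn(3 * 32 * 32, 16)
encoder = lambda x: (x.flatten(1) @ proj).argmax(dim=1)

frames = torch.rand(100, 3, 32, 32)   # 100 toy RGB frames
codes = collect_codes(encoder, frames)
print(codes.shape)                    # torch.Size([100])
```

Collecting codes once up front means the (expensive) tokenizer only ever runs in inference mode, and VOC training afterwards touches only small integer tensors.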
If action data is also available, we use a quantized multi-step inverse kinematics (MUSIK) objective to train the representation.
Run train_vq_musik_voc.py to train a VOC model along with the MUSIK objective on pixel data.
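As a rough illustration of a multi-step inverse objective (a sketch of the idea only; the actual MUSIK loss and architecture are defined in the repo): from the representations of frame t and frame t+k, predict the action taken at step t.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseHead(nn.Module):
    """Toy head predicting the action at time t from z_t, z_{t+k}, and k."""
    def __init__(self, feat_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Linear(2 * feat_dim + 1, n_actions)

    def forward(self, z_t, z_tk, k):
        x = torch.cat([z_t, z_tk, k.float().unsqueeze(-1)], dim=-1)
        return self.net(x)  # action logits

torch.manual_seed(0)
head = InverseHead(feat_dim=8, n_actions=4)
z_t, z_tk = torch.randn(5, 8), torch.randn(5, 8)  # toy frame embeddings
k = torch.randint(1, 10, (5,))                    # random step gaps
actions = torch.randint(0, 4, (5,))               # actions taken at time t
logits = head(z_t, z_tk, k)
loss = F.cross_entropy(logits, actions)           # inverse-dynamics loss
print(logits.shape)                               # torch.Size([5, 4])
```

Because predicting the action over multiple step gaps forces the representation to keep only action-relevant information, the learned features can then be quantized and modeled by the VOC in the same way as the VQ-VAE and DINO codes.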