mini-VLA is a minimal, beginner-friendly Vision-Language-Action (VLA) model designed to show how modern robot policies can fuse images, text instructions, and robot states to produce continuous actions.
Important
It is far from production-ready! The purpose? Education. Optimizations? Yes, I am working on them; check out the other branches.
This project intentionally keeps the codebase small (~150 LOC for the core model) so that:
- beginners can understand the complete VLA training and execution pipeline
- researchers can rapidly prototype new ideas on top of it
- students can learn diffusion-based action generation without heavy dependencies
Tip
I recommend reading the following blogs before diving into the mini-VLA implementation.
This project is not meant to be state-of-the-art; instead, it provides a clear, hackable template for understanding VLA design.
The mini-VLA model core is spread across four files:
- models/encoders.py contains the encoders for images, text, and robot states
- models/fusion.py combines the vision, language, and state embeddings with a simple MLP (not ideal, but simple and it works okay)
- models/diffusion_head.py generates actions with a diffusion policy
- models/vla_diffusion_policy.py ties everything together
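To make the data flow concrete, here is a rough sketch of how these pieces might fit together. The class names, dimensions, and layer choices below are illustrative assumptions, not the repository's exact code; see the files above for the real implementation.

```python
import torch
import torch.nn as nn

class MiniVLASketch(nn.Module):
    """Toy stand-in for models/vla_diffusion_policy.py (names/dims are assumptions)."""

    def __init__(self, img_dim=128, txt_dim=128, state_dim=39, act_dim=4, hidden=256):
        super().__init__()
        # models/encoders.py: one encoder per modality
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(img_dim), nn.ReLU())
        self.text_encoder = nn.EmbeddingBag(1000, txt_dim)   # toy tokenized-text encoder
        self.state_encoder = nn.Linear(state_dim, hidden)    # proprioceptive robot state
        # models/fusion.py: plain MLP over the concatenated embeddings
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # models/diffusion_head.py: predicts the noise added to the action at step t
        self.diffusion_head = nn.Sequential(
            nn.Linear(hidden + act_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))

    def denoise(self, noisy_action, t, image, tokens, state):
        ctx = self.fusion(torch.cat([self.image_encoder(image),
                                     self.text_encoder(tokens),
                                     self.state_encoder(state)], dim=-1))
        return self.diffusion_head(torch.cat([ctx, noisy_action, t], dim=-1))
```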
Additionally, I provide scripts: scripts/collect_data.py collects demonstrations with an expert policy, scripts/train.py trains the VLA on the collected data, and scripts/test.py evaluates the VLA-Diffusion Policy (and optionally saves videos).
Create (or activate) a conda environment
```bash
conda create --name mini-vla python=3.10
conda activate mini-vla
```
Clone the mini-VLA project
```bash
git clone https://github.com/keivalya/mini-vla.git
cd mini-vla
```
Install dependencies
```bash
pip install -r requirements.txt
```
This gathers trajectories with an expert Meta-World policy and saves them as a .npz dataset.
```bash
python -m scripts.collect_data \
    --env-name push-v3 \
    --camera-name corner \
    --episodes 100 \
    --max-steps 100 \
    --output-path data/metaworld_push_bc.npz
```
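To sanity-check what was saved, you can simply list the arrays in the archive. The snippet below makes no assumptions about key names; it prints whatever collect_data.py actually stored.

```python
import numpy as np

# Inspect the collected dataset and report the shape/dtype of every array.
data = np.load("data/metaworld_push_bc.npz")
for key in data.files:
    print(key, data[key].shape, data[key].dtype)
```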
Train a small vision-language diffusion policy on your collected dataset.
```bash
python -m scripts.train \
    --dataset-path data/metaworld_push_bc.npz \
    --epochs 50 \
    --batch-size 64 \
    --save-path checkpoints/model.pt \
    --device cpu
```
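The actual training loop lives in scripts/train.py. For reference, the standard diffusion-policy objective it is built around looks roughly like the sketch below; the step count, noise schedule, and the `denoise` interface are assumptions carried over from the model sketch above, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

T = 100                                            # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(policy, image, tokens, state, action):
    """One DDPM-style training step: corrupt the expert action, predict the noise."""
    b = action.shape[0]
    t = torch.randint(0, T, (b,))                  # random diffusion step per sample
    noise = torch.randn_like(action)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy_action = a_bar.sqrt() * action + (1.0 - a_bar).sqrt() * noise
    pred = policy.denoise(noisy_action, t.float().unsqueeze(-1) / T, image, tokens, state)
    return F.mse_loss(pred, noise)                 # the head learns to predict the added noise
```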
Run the trained VLA inside the Meta-World MT1 environment.
```bash
python -m scripts.test \
    --checkpoint checkpoints/model.pt \
    --env-name push-v3 \
    --episodes 5 \
    --max-steps 150 \
    --instruction "push the object to the goal" \
    --device cpu \
    --save-video \
    --video-dir videos
```
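At test time the policy turns pure noise back into an action. A stripped-down version of that reverse (sampling) loop, using the same assumed schedule and `denoise` interface as the training sketch above, looks like this:

```python
import torch

T = 100                                            # must match the training schedule
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample_action(policy, image, tokens, state, act_dim=4):
    """Reverse diffusion: start from Gaussian noise and denoise step by step."""
    a = torch.randn(1, act_dim)
    for t in reversed(range(T)):
        t_in = torch.full((1, 1), t / T)
        eps = policy.denoise(a, t_in, image, tokens, state)        # predicted noise
        alpha, a_bar = 1.0 - betas[t], alphas_cumprod[t]
        a = (a - (1.0 - alpha) / (1.0 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)          # keep stochasticity until the end
    return a.clamp(-1.0, 1.0)                      # Meta-World actions live in [-1, 1]
```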
Planning to:
- support multiple tasks (MT10, or maybe MT50, let's see how far I can scale it)
- add larger vision/text backbones (CLIP, SigLIP, ViT) without losing simplicity
- accept arbitrary text instructions during inference
