This repository contains the official PyTorch implementation of "Efficient Virtuoso," a project developing a conditional Denoising Diffusion Probabilistic Model (DDPM) for multi-modal, long-horizon trajectory planning on the Waymo Open Motion Dataset.
A project by Antonio Guillen-Perez
- Efficient Virtuoso: A Latent Diffusion Transformer for Trajectory Planning
This project successfully trains a generative model that can produce diverse, realistic, and contextually-aware future trajectories for an autonomous vehicle. Given a single scene context, the model can generate a multi-modal distribution of plausible future plans, a critical capability for robust decision-making.
Figure 1: Multi-modal Trajectory Generation. For the same initial state (SDC in green, past trajectory in red), our model generates 20 diverse yet plausible future trajectories (purple-red scale fan-out) that correctly adhere to the road geometry. Each panel shows a different scenario, highlighting the model's ability to capture scene context and generate multi-modal predictions.
The development of safe and intelligent autonomous vehicles hinges on their ability to reason about an uncertain and multi-modal future. Traditional deterministic approaches, which predict a single "best guess" trajectory, often fail to capture the rich distribution of plausible behaviors a human driver might exhibit. This can lead to policies that are overly conservative or dangerously indecisive in complex scenarios.
This project addresses that limitation by moving from deterministic regression to conditional generative modeling. The goal is a policy that learns to represent and sample from the full distribution of plausible expert behaviors, so that the generated driving behaviors are safe, contextually appropriate, diverse, and human-like.
The core of this project is a Conditional Latent Diffusion Model. To achieve both high-fidelity and computational efficiency, the diffusion process is performed not on the raw trajectory data, but in a compressed, low-dimensional latent space derived via Principal Component Analysis (PCA).
- Data Pipeline: The raw Waymo Open Motion Dataset is processed through a multi-stage pipeline (src/data_processing/). This includes parsing raw data, intelligent filtering of static scenarios, and feature extraction to produce (Context, Target Trajectory) pairs.
- Latent Space Creation (PCA): We perform PCA on the entire set of expert Target Trajectories to find the principal components that capture the most variance. This allows us to represent a high-dimensional trajectory (e.g., 80 timesteps * 2 coords = 160 dims) with a much smaller latent vector (e.g., 32 dims), which becomes the new target for the diffusion model.
- Context Encoding: The scene Context is encoded by a powerful StateEncoder. It uses dedicated sub-networks for each entity (ego history, agents, map, goal) and fuses them using a Transformer Encoder to produce a single, holistic scene_embedding.
- Denoising Model (Latent Diffusion Transformer): The primary model is a Conditional Transformer Decoder. It takes a noisy latent vector z_t and learns to predict the original noise ε, conditioned on the scene_embedding from the StateEncoder and the noise level t. This architecture is more expressive and parameter-efficient for this type of sequential data than a standard U-Net.
- Sampling: At inference, we start with pure Gaussian noise z_T in the latent space and iteratively apply the trained denoiser to recover a clean latent vector z_0. This clean latent vector is then projected back into the high-dimensional trajectory space using the inverse PCA transform. This repository implements both the slow, stochastic DDPM sampler and the fast, deterministic DDIM sampler.
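The PCA compression and inverse projection described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names (`fit_pca`, `encode`, `decode`) and the synthetic low-rank data are hypothetical, and only the 160-dim and 32-dim sizes come from the example figures in the text.

```python
import numpy as np

def fit_pca(trajectories: np.ndarray, n_components: int):
    """Fit PCA on flattened trajectories of shape (N, T*2) using an SVD."""
    mean = trajectories.mean(axis=0)
    centered = trajectories - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]  # (n_components, T*2)
    variance = singular_values ** 2
    explained = variance[:n_components].sum() / variance.sum()
    return mean, components, explained

def encode(trajectories, mean, components):
    """Project trajectories into the low-dimensional latent space."""
    return (trajectories - mean) @ components.T

def decode(latents, mean, components):
    """Inverse PCA transform: map latents back to trajectory space."""
    return latents @ components + mean

# Synthetic stand-in for expert trajectories (assumption, not Waymo data):
# 80 timesteps * 2 coords = 160 dims, generated from 10 hidden factors.
rng = np.random.default_rng(0)
num_samples, horizon, latent_dim = 500, 80, 32
basis = rng.normal(size=(10, horizon * 2))
trajs = rng.normal(size=(num_samples, 10)) @ basis

mean, components, explained = fit_pca(trajs, latent_dim)
latents = encode(trajs, mean, components)    # (500, 32): diffusion targets
recon = decode(latents, mean, components)    # (500, 160): back to trajectories
max_error = np.abs(recon - trajs).max()
```

Because the synthetic data has only 10 underlying factors, 32 components reconstruct it almost exactly. Real driving trajectories are not exactly low-rank, so some reconstruction error should be expected there.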
To ensure stability, all trajectory data is normalized to a [-1, 1] range before being used in the diffusion process.
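One common way to map data into a [-1, 1] range is per-dimension min-max scaling. The sketch below assumes that scheme; the text does not state which statistics the repository actually stores, and the helper names are hypothetical.

```python
import numpy as np

def fit_minmax(data: np.ndarray):
    """Per-dimension minimum and maximum over the dataset (axis 0)."""
    return data.min(axis=0), data.max(axis=0)

def normalize(data, lo, hi, eps=1e-8):
    """Scale each dimension linearly so that lo -> -1 and hi -> +1."""
    span = np.maximum(hi - lo, eps)  # guard against constant dimensions
    return 2.0 * (data - lo) / span - 1.0

def denormalize(norm, lo, hi, eps=1e-8):
    """Invert normalize(): map values in [-1, 1] back to original units."""
    span = np.maximum(hi - lo, eps)
    return (norm + 1.0) / 2.0 * span + lo

# Synthetic values standing in for trajectory coordinates in meters.
rng = np.random.default_rng(0)
raw = rng.uniform(-50.0, 120.0, size=(1000, 160))

lo, hi = fit_minmax(raw)
scaled = normalize(raw, lo, hi)
restored = denormalize(scaled, lo, hi)
```

The statistics must be computed once on the training set and reused unchanged at evaluation time, so that sampled outputs can be mapped back to meters with the same inverse transform.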
Figure 2: Model Architecture. A Transformer-based StateEncoder processes the scene context. A separate Transformer Decoder acts as the denoiser in the PCA latent space.
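To make the noise-prediction and sampling steps concrete, here is a small NumPy sketch of the forward noising process and a deterministic DDIM sampler (eta = 0). It rests on several assumptions: the linear beta schedule with 1000 steps follows the defaults of the DDPM paper rather than this repository's config, the function names are hypothetical, and `oracle_eps` is a stand-in for the trained denoiser, which in the real model would also receive the scene_embedding.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bars, rng):
    """Forward process q(z_t | z_0). Returns the noisy latent and the noise.

    During training, the denoiser is fit to predict `eps` from `z_t` and `t`
    (for example with a mean-squared-error loss).
    """
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return z_t, eps

def ddim_sample(eps_model, shape, alpha_bars, num_steps, rng):
    """Deterministic DDIM sampling from z_T ~ N(0, I) down to z_0."""
    timesteps = np.linspace(len(alpha_bars) - 1, 0, num_steps).round().astype(int)
    z = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        # After the last step there is no noise left, so alpha_bar = 1.
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps_hat = eps_model(z, t)
        z0_hat = (z - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
        z = np.sqrt(ab_prev) * z0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return z

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
z0_true = rng.standard_normal((4, 32))  # four "clean" 32-dim latents

def oracle_eps(z_t, t):
    # Stand-in for a trained denoiser: the exact noise for a known z_0.
    return (z_t - np.sqrt(alpha_bars[t]) * z0_true) / np.sqrt(1.0 - alpha_bars[t])

z_noisy, eps = add_noise(z0_true, 500, alpha_bars, rng)
z0_sampled = ddim_sample(oracle_eps, z0_true.shape, alpha_bars, num_steps=50, rng=rng)
```

With a perfect noise predictor, the sampler recovers the clean latent regardless of the starting noise. A learned denoiser is only approximate, which is why the number of sampling steps trades speed against fidelity.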
diffusion-trajectory-planner/
├── configs/
│ └── main_config.yaml
├── data/
│ ├── (gitignored) processed_npz/
│ └── (gitignored) featurized_v3_diffusion/
├── models/
│ ├── (gitignored) checkpoints/
│ └── (gitignored) normalization_stats.pt
├── notebooks/
│ ├── 1_analyze_source_data.ipynb
│ ├── 2_analyze_featurized_data.ipynb
│ └── 3_analyze_final_results.ipynb
├── src/
│ ├── data_processing/ # Scripts for parsing, featurizing, and PCA
│ │ ├── parser.py
│ │ ├── featurizer_diffusion.py
│ │ └── compute_normalization_stats.py
│ ├── diffusion_policy/ # Core model, dataset, and training logic
│ │ ├── dataset.py
│ │ ├── networks.py
│ │ └── train.py
│ └── evaluation/ # Scripts for evaluation and visualization
│ └── evaluate_prediction.py
└── README.md
- Clone the repository:
  git clone https://github.com/your-username/diffusion-trajectory-planner.git
  cd diffusion-trajectory-planner
- Create and activate a Conda environment:
  conda create --name virtuoso_env python=3.10
  conda activate virtuoso_env
- Install dependencies:
  pip install -r requirements.txt
This is a multi-step, one-time process. All commands should be run from the root of the repository.
Download the .tfrecord files for the motion prediction task from the Waymo Open Dataset website. Place the scenario folder containing the training and validation shards into a directory of your choice.
This initial step converts the raw .tfrecord files into a more accessible NumPy format.
Note: This parser.py script is a prerequisite and is assumed to be adapted from a previous project.
Update configs/main_config.yaml with the correct path to your raw data, then run the parser.
# Activate the parser-specific environment
conda activate virtuoso_parser
python -m src.data_processing.parser
This will create a data/processed_npz/ directory containing the parsed .npz files.
This script processes the .npz files, performs intelligent data curation and filtering, and saves the final (Context, Target) pairs as .pt files.
conda activate virtuoso_env
python -m src.data_processing.featurizer_diffusion
When prompted, choose [d] to delete any old data and start fresh.
This will create a data/featurized_v3_diffusion/ directory containing the final training samples ((Context, Target) pairs) in .pt format.
The final preprocessing step computes the PCA components and normalization statistics required for training. This is done by analyzing all the target trajectories in the featurized dataset.
python -m src.data_processing.compute_pca_stats
This creates models/pca_stats.pt, a critical file containing the PCA components and normalization data required for training and evaluation.
Once the data is prepared, launch the main training script. The script automatically uses AMP for faster training on supported GPUs.
python -m src.diffusion_policy.train
The script will create a new, timestamped directory in runs/DiffusionPolicy_Training/ for this run. All TensorBoard logs and model checkpoints will be saved there.
You can monitor the training progress live using TensorBoard:
tensorboard --logdir runs
Navigate to http://localhost:6006/ in your browser. Look for a smooth, downward-trending validation loss curve that plateaus at a small, non-zero value.
After training, you can evaluate your best model checkpoint to get quantitative metrics. The script supports both the fast ddim sampler and the high-fidelity ddpm sampler.
python -m src.evaluation.evaluate_prediction \
--checkpoint runs/DiffusionPolicy_Training/YOUR_RUN_TIMESTAMP/checkpoints/best_model.pth \
--sampler ddim \
--steps 50

python -m src.evaluation.evaluate_prediction \
--checkpoint runs/DiffusionPolicy_Training/YOUR_RUN_TIMESTAMP/checkpoints/best_model.pth \
--sampler ddpm

The script will print a summary of the final metrics (minADE, minFDE, MissRate@2m) and save a detailed .json report in the same directory as your checkpoint.
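For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from K proposals per scene. It is an illustration, not the evaluation script itself: the function name and array layout are assumptions, and the miss criterion used here (best final displacement above 2 m) is a simple convention that may differ from the official Waymo definition.

```python
import numpy as np

def prediction_metrics(pred: np.ndarray, gt: np.ndarray, miss_threshold: float = 2.0):
    """Multi-modal displacement metrics.

    pred: (N, K, T, 2) K proposed trajectories per scene, in meters.
    gt:   (N, T, 2)    ground-truth trajectory per scene, in meters.
    """
    # Pointwise Euclidean distance of every proposal to the ground truth.
    dist = np.linalg.norm(pred - gt[:, None], axis=-1)  # (N, K, T)
    ade = dist.mean(axis=-1)                            # (N, K) average error
    fde = dist[..., -1]                                 # (N, K) final-step error
    best_fde = fde.min(axis=1)                          # (N,) best proposal per scene
    return {
        "minADE": float(ade.min(axis=1).mean()),
        "minFDE": float(best_fde.mean()),
        "MissRate": float((best_fde > miss_threshold).mean()),
    }

# Tiny hand-checkable example with N=2 scenes, K=2 proposals, T=3 steps.
gt = np.array([
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],   # scene 0: straight along x
    [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]],   # scene 1: straight along y
])
pred = np.stack([
    np.stack([gt[0], gt[0] + [0.0, 1.0]]),               # one exact hit, one 1 m off
    np.stack([gt[1] + [3.0, 0.0], gt[1] + [0.0, 4.0]]),  # both far off (3 m and 4 m)
])
metrics = prediction_metrics(pred, gt)
```

In this example scene 0 has a perfect proposal while the best proposal for scene 1 is 3 m off at every step, so minADE and minFDE are both 1.5 m and half of the scenes count as misses.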
This project successfully trains a generative model capable of producing high-fidelity, multi-modal trajectory predictions that are responsive to complex scene contexts.
The model was evaluated on the full Waymo Open Motion Dataset validation set (150 shards). We report the standard multi-modal prediction metrics, comparing the performance of our model against a strong, well-established baseline (e.g., a MultiPath-style deterministic model). All metrics are calculated over the 8-second future horizon with K=6 trajectory proposals.
| Model | minADE@6 (m) ↓ | minFDE@6 (m) ↓ | Miss Rate@2m ↓ |
|---|---|---|---|
| Deterministic Baseline (e.g., MultiPath) | 0.86 | 1.92 | 0.42 |
| Our Model (Efficient Virtuoso) | 0.2541 | 0.5768 | 3.5 |
Key Takeaways:
- Higher Accuracy: Our latent diffusion model achieves significantly lower average and final displacement errors, indicating a more accurate central tendency in its predictions.
- Superior Coverage: The most significant improvement is the nearly 2x reduction in Miss Rate. This demonstrates the power of a generative approach. By modeling a distribution of futures instead of a single outcome, our model is far more likely to capture the true, ground-truth trajectory within its set of proposals, a critical capability for safe downstream planning.
Quantitative metrics do not capture the full story. The following visualizations, generated by the notebooks/3_analyze_final_results.ipynb notebook, demonstrate the model's ability to generate diverse and contextually appropriate trajectories in challenging, multi-modal scenarios.
In this classic ambiguous scenario, the SDC must decide whether to yield to an oncoming car or turn before it arrives. Our model correctly captures both modes of human driving behavior.
Figure 3: Unprotected Left Turn.
More figures can be found in the figures directory and the analysis notebook.
Figure 4: Another left turn.
Figure 5: Another left turn.
Figure 6: Another left turn.
Figure 7: Reft turn.
This project provides a strong foundation for several exciting research directions:
- PCA Latent Diffusion: Implementing the diffusion process in a low-dimensional PCA latent space, as proposed in the MotionDiffuser paper, to improve speed, smoothness, and accuracy.
- Guided Sampling: Implementing a guided sampling framework to enforce rules (e.g., collision avoidance) or achieve specific goals at inference time.
- Multi-Agent Prediction: Extending the StateEncoder and denoiser to jointly predict trajectories for multiple interacting agents.
- Closed-Loop Planning: Integrating the generative model as a proposal distribution within a Model Predictive Control (MPC) loop for closed-loop vehicle control.
If you find this repository useful for your research, please consider citing:
@misc{guillen2025virtuoso,
title={Efficient Virtuoso: A Latent Diffusion Transformer for Goal-Conditioned Trajectory Planning},
author={Antonio Guillen-Perez},
year={2025},
eprint={2509.03658},
archivePrefix={arXiv}
}
This work is heavily inspired by and builds upon the foundational concepts introduced in papers such as Denoising Diffusion Probabilistic Models (DDPM) and MotionDiffuser.








