A 3D diffusion model trained to generate Minecraft schematics from natural language prompts. It uses a 3D U-Net architecture with cross-attention, conditioned on text embeddings from OpenAI's CLIP model.
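As a rough, illustrative sketch of that conditioning path (not the exact implementation in this repository), a 3D U-Net block can attend over CLIP text-token embeddings via cross-attention along these lines; the class name, head count, and the 512-dimensional / 77-token CLIP text shape are assumptions:

```python
import torch
import torch.nn as nn

class CrossAttnBlock3D(nn.Module):
    """Illustrative sketch: a 3D conv block whose voxel features attend to CLIP text tokens."""
    def __init__(self, channels: int, text_dim: int = 512, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, channels)
        # Cross-attention: voxel features are the queries, CLIP text embeddings the keys/values.
        self.attn = nn.MultiheadAttention(channels, heads, kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) voxel features; text_emb: (B, T, text_dim) CLIP token embeddings
        h = self.norm(torch.relu(self.conv(x)))
        b, c, d, hgt, w = h.shape
        q = h.flatten(2).transpose(1, 2)                # (B, D*H*W, C)
        attn_out, _ = self.attn(q, text_emb, text_emb)  # attend over the text tokens
        return h + attn_out.transpose(1, 2).reshape(b, c, d, hgt, w)

# Example shapes only: a 16^3 feature volume conditioned on a 77-token CLIP embedding.
block = CrossAttnBlock3D(channels=128)
x = torch.randn(2, 128, 16, 16, 16)
text = torch.randn(2, 77, 512)
print(block(x, text).shape)  # torch.Size([2, 128, 16, 16, 16])
```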
This project has recently undergone a significant upgrade. The original models, trained on a consumer laptop GPU, have been moved to a legacy directory. A new set of models has been retrained from scratch on an RTX 4090 instance using a more powerful architecture. It is highly recommended to use the new V2 models.
V2 models (recommended):

- Location: `models/retrain/`
- Training Hardware: NVIDIA RTX 4090
- Architecture: Wider U-Net (`base_c=128`) with more learning capacity.
- Training Process: More stable training due to a larger batch size (`BATCH_SIZE=24`).
- Expected Quality: These models produce significantly better results, with more logical structures, finer details, and fewer visual artifacts. For best results, use a checkpoint from a later epoch (e.g., `schematic_diffusion_epoch_80.pth` or higher).
Legacy (V1) models:

- Location: `models/legacy/`
- Training Hardware: NVIDIA RTX 4070 Laptop GPU
- Architecture: Standard U-Net (`base_c=64`); see the configuration sketch after this list.
- Expected Quality: Functional, but results can be mediocre. Structures may lack coherence and detail compared to the V2 models.
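For orientation, here is a minimal sketch of how these two hyperparameters relate to model width. It is illustrative only and not a copy of the repository's `config.py`; apart from `base_c` and `BATCH_SIZE`, every name and value below is an assumption.

```python
# Illustrative only -- not a copy of config.py. base_c and BATCH_SIZE are the
# hyperparameters named in this README; everything else here is an assumption.
base_c = 128                                 # V2 models (the legacy models used base_c = 64)
BATCH_SIZE = 24                              # V2 training batch size

# A wider base channel width fans out into wider U-Net stages, e.g.:
channels = [base_c, base_c * 2, base_c * 4]  # 128 -> 256 -> 512 (V2) vs. 64 -> 128 -> 256 (V1)
print(channels)
```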
- A Windows or Linux machine with an NVIDIA GPU (16GB+ VRAM recommended for retraining).
- NVIDIA drivers and CUDA toolkit compatible with PyTorch.
- Miniconda or Anaconda installed (recommended).
First, clone the repository and set up the Conda environment.

```bash
# Clone this repository
git clone https://github.com/KHROTU/schematic-diffusion.git
cd schematic-diffusion
# Create and activate the Conda environment (recommended)
conda create --name schematic-diffusion python=3.10
conda activate schematic-diffusion
# Install PyTorch & CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
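# Optional sanity check (not part of the original instructions): confirm that the
# installed PyTorch build can actually see your GPU before continuing.
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"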
# Install remaining dependencies
pip install -r requirements.txt
```

When you first clone the repository, it should look like this:

```
+-- data
| +-- 0_raw_downloads
| +-- 2_named_schematics
| +-- 3_litematics_to_convert
| +-- 4_processed_tensors
| L-- 1_id_to_name.txt
+-- litematic_converter
| +-- converter.py
| +-- converter_server.py
| L-- Python Converter Bridge-1.0.user.js
+-- .gitignore
+-- 01a_triage_litematics.py
+-- 01_rename_files.py
+-- 02_generate_labels.py
+-- config.py
+-- generate.py
+-- preprocess_all_data.py
+-- README
+-- requirements.txt
L-- train_diffusion.py
```

When you finish training, it should look like this:

```
+-- data
| +-- 0_raw_downloads/
| +-- 2_named_schematics/
| +-- 3_litematics_to_convert/
| +-- 4_processed_tensors/
| +-- 1_id_to_name.txt
| L-- 5_labels.json
+-- litematic_converter
| +-- converter.py
| +-- converter_server.py
| L-- Python Converter Bridge-1.0.user.js
+-- models
| +-- legacy
| | +-- schematic_diffusion_epoch_5.pth
| | +-- schematic_diffusion_epoch_10.pth
| | +-- ...
| | L-- schematic_diffusion_final.pth
| L-- retrain
| +-- schematic_diffusion_epoch_5.pth
| +-- schematic_diffusion_epoch_10.pth
| +-- ...
| L-- schematic_diffusion_final.pth
+-- .gitignore
+-- 01a_triage_litematics.py
+-- 01_rename_files.py
+-- 02_generate_labels.py
+-- config.py
+-- generate.py
+-- preprocess_all_data.py
+-- README
+-- requirements.txt
L-- train_diffusion.py
```

The model was trained on a large dataset of schematics from the web. Due to the size of the dataset, it is not included in this repository.
- Download the Schematic Dataset: Download the `Schematics.zip` file containing ~120,000 raw schematic files.
  - Link: MediaFire (thank you u/cbreauxgaming)
  - Unzip this file and place its contents into the `data/0_raw_downloads/` directory.
- Download the ID-to-Name Mapping: This file maps the numeric filenames to their original names.
For more context on the dataset (you may not want to use this particular one for ethical reasons), you can read the original Reddit post, specifically this thread.
Run the following scripts in order; running them out of order is untested and not recommended.

```bash
# Ensure your conda environment is active
conda activate schematic-diffusion
# 1. Rename files from IDs to human-readable names
python 01_rename_files.py
# 2. Separate .litematic files for conversion
python 01a_triage_litematics.py
# 3. Convert .litematic files to .schem (this will take a while)
# This requires the Tampermonkey script (litematic_converter\Python Converter Bridge-1.0.user.js)
# to be installed and active in your browser.
cd litematic_converter
python converter.py
cd ..
# 4. Generate the final labels.json file from the processed files
python 02_generate_labels.py
# 5. Convert all schematics into PyTorch tensors (this will also take a while, the output is ~80GB)
python preprocess_all_data.py
```

After this step, the `data/4_processed_tensors/` directory will be filled with your training-ready dataset.
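If you want to spot-check the output before committing to a long training run, something along these lines can help. It assumes `preprocess_all_data.py` writes torch-loadable files directly under `data/4_processed_tensors/`, which may not match the exact on-disk layout:

```python
# Hypothetical spot-check -- assumes the preprocessing step writes torch-loadable
# files (e.g. tensors or dicts of tensors) directly into data/4_processed_tensors/.
from pathlib import Path
import torch

files = sorted(p for p in Path("data/4_processed_tensors").iterdir() if p.is_file())
print(f"{len(files)} preprocessed files found")
sample = torch.load(files[0], map_location="cpu")
print(type(sample), getattr(sample, "shape", None))
```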
```bash
# Start the training process
python train_diffusion.py
```

- The script will print the average loss after each epoch. You should see this value decrease over time.
- The `train_diffusion.py` script is now configured for the V2 model architecture (`base_c=128`) and a larger batch size (`BATCH_SIZE=24`), targeting high-performance GPUs.
- Model checkpoints will be saved every 5 epochs to the `models/retrain/` directory by default; a sketch of this cadence follows the list.
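The following is a minimal sketch of that cadence, not the actual `train_diffusion.py` loop; only the per-epoch loss print, the 5-epoch save interval, the `models/retrain/` directory, and the checkpoint filename pattern come from this README, while the function signature and loss value are stand-ins:

```python
# Illustrative cadence only -- not the real training loop from train_diffusion.py.
import os
import torch
import torch.nn as nn

def run_training(model: nn.Module, num_epochs: int, save_dir: str = "models/retrain") -> None:
    os.makedirs(save_dir, exist_ok=True)
    for epoch in range(1, num_epochs + 1):
        avg_loss = 1.0 / epoch  # stand-in for the real average diffusion loss
        print(f"Epoch {epoch}: average loss {avg_loss:.4f}")
        if epoch % 5 == 0:      # checkpoints are written every 5 epochs
            torch.save(model.state_dict(),
                       os.path.join(save_dir, f"schematic_diffusion_epoch_{epoch}.pth"))
    torch.save(model.state_dict(), os.path.join(save_dir, "schematic_diffusion_final.pth"))
```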
Once the model is trained, you can generate new schematics using the generation script.
```bash
python generate.py
```

- You can modify the prompt and other parameters directly in the `generate.py` script.
- Important: Make sure to update the `MODEL_PATH` variable to point to one of the new, high-quality models from the `retrain` directory (or pick the newest checkpoint programmatically, as sketched after this list). For example:

```python
# In generate.py
MODEL_PATH = "models/retrain/schematic_diffusion_epoch_95.pth"
```
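Rather than hard-coding a specific epoch, you could compute the newest checkpoint path before assigning `MODEL_PATH`. This small helper is a sketch, not part of `generate.py`; it relies only on the filename pattern shown in the directory trees above:

```python
# Hypothetical helper: pick the latest epoch checkpoint in models/retrain/.
from pathlib import Path

ckpts = sorted(
    Path("models/retrain").glob("schematic_diffusion_epoch_*.pth"),
    key=lambda p: int(p.stem.rsplit("_", 1)[-1]),
)
MODEL_PATH = str(ckpts[-1]) if ckpts else "models/retrain/schematic_diffusion_final.pth"
print(MODEL_PATH)
```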
The current V2 models are a huge leap in visual and structural quality. However, their ability to follow specific stylistic prompts (e.g., "gothic," "modern") is limited by the dataset's short and generic labels (e.g., "cool tower").
The next major goal is to create a "Version 3.0" model that combines the powerful V2 visual engine with precise language control. This will be achieved by generating a new, high-quality set of labels for our existing schematic data.
- Automated Rendering: A script will be developed to automatically load each of the 11,000+ schematics and render 2D images from several key angles (e.g., isometric, front-facing).
- Multi-modal AI Description: These rendered images will be fed to a multi-modal LLM, which will be prompted to act as an expert architect and provide a rich, descriptive caption for each schematic, identifying its style, materials, and key features.
- Retraining with Enriched Labels: The V2 model architecture will be retrained from scratch using the same visual data but with this new, high-quality set of text labels.
The resulting model should be capable of understanding both complex architectural concepts and nuanced stylistic language, marking the next major evolution of this project.