
schematic-diffusion

A 3D diffusion model trained to generate Minecraft schematics from natural language prompts. It uses a 3D U-Net architecture with cross-attention, conditioned on text embeddings from OpenAI's CLIP model.
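
The conditioning follows the usual CLIP-text-embedding pattern: the prompt is tokenized, encoded into a sequence of embeddings, and each U-Net block attends to that sequence via cross-attention. The snippet below is a minimal sketch of that pattern; the specific CLIP variant and tensor shapes are assumptions, not a copy of this repo's code.

# Minimal sketch of CLIP text conditioning (illustrative; the CLIP variant
# and the model's internal API are assumptions, not this repo's exact code)
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = ["a small medieval watchtower"]
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(**tokens).last_hidden_state   # (batch, 77, 512)

# Inside the 3D U-Net, cross-attention uses the flattened voxel features as
# queries and text_emb as keys/values, so every denoising step "reads" the prompt.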

Project Status & Model Checkpoints

This project has recently undergone a significant upgrade. The original models, trained on a consumer laptop GPU, have been moved to a legacy directory. A new set of models has been retrained from scratch on an RTX 4090 instance using a more powerful architecture. It is highly recommended to use the new V2 models.

V2 Models (Recommended)

  • Location: models/retrain/
  • Training Hardware: NVIDIA RTX 4090
  • Architecture: Wider U-Net (base_c=128) with more learning capacity.
  • Training Process: More stable training thanks to a larger batch size (BATCH_SIZE=24); the key values are sketched below.
  • Expected Quality: These models produce significantly better results, with more logical structures, finer details, and fewer visual artifacts. For best results, use a checkpoint from a later epoch (e.g., schematic_diffusion_epoch_80.pth or higher).

V1 Models (Legacy)

  • Location: models/legacy/
  • Training Hardware: NVIDIA RTX 4070 Laptop GPU
  • Architecture: Standard U-Net (base_c=64).
  • Expected Quality: Functional, but results can be mediocre. Structures may lack coherence and detail compared to the V2 models.
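
The two generations differ mainly in a couple of configuration values. The sketch below is hypothetical: the real definitions live in config.py and train_diffusion.py, and only base_c and BATCH_SIZE are stated in this README; the rest of the comments are illustrative.

# Hypothetical configuration sketch -- not a copy of config.py
base_c = 128                        # U-Net width: 128 for V2, 64 for V1
BATCH_SIZE = 24                     # V2 batch size; V1 used a smaller batch on laptop hardware
CHECKPOINT_DIR = "models/retrain/"  # V2 checkpoints are written here every 5 epochs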

How to Train the Model

Prerequisites

  • A Windows or Linux machine with an NVIDIA GPU (16GB+ VRAM recommended for retraining).
  • NVIDIA drivers and CUDA toolkit compatible with PyTorch.
  • Miniconda or Anaconda installed (recommended).

Step 1: Setup the Environment

First, clone the repository and set up the Conda environment.

# Clone this repository
git clone https://github.com/KHROTU/schematic-diffusion.git
cd schematic-diffusion

# Create and activate the Conda environment (recommended)
conda create --name schematic-diffusion python=3.10
conda activate schematic-diffusion

# Install PyTorch & CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# Install remaining dependencies
pip install -r requirements.txt
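
Before moving on, it is worth confirming that PyTorch actually sees the GPU:

# Sanity check: should print the PyTorch version, True, and your GPU name
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no GPU')"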

When you first clone the repository, it should look like this:

├── data
│   ├── 0_raw_downloads
│   ├── 2_named_schematics
│   ├── 3_litematics_to_convert
│   ├── 4_processed_tensors
│   └── 1_id_to_name.txt
├── litematic_converter
│   ├── converter.py
│   ├── converter_server.py
│   └── Python Converter Bridge-1.0.user.js
├── .gitignore
├── 01a_triage_litematics.py
├── 01_rename_files.py
├── 02_generate_labels.py
├── config.py
├── generate.py
├── preprocess_all_data.py
├── README
├── requirements.txt
└── train_diffusion.py

When you finish training, it should look like this:

├── data
│   ├── 0_raw_downloads/
│   ├── 2_named_schematics/
│   ├── 3_litematics_to_convert/
│   ├── 4_processed_tensors/
│   ├── 1_id_to_name.txt
│   └── 5_labels.json
├── litematic_converter
│   ├── converter.py
│   ├── converter_server.py
│   └── Python Converter Bridge-1.0.user.js
├── models
│   ├── legacy
│   │   ├── schematic_diffusion_epoch_5.pth
│   │   ├── schematic_diffusion_epoch_10.pth
│   │   ├── ...
│   │   └── schematic_diffusion_final.pth
│   └── retrain
│       ├── schematic_diffusion_epoch_5.pth
│       ├── schematic_diffusion_epoch_10.pth
│       ├── ...
│       └── schematic_diffusion_final.pth
├── .gitignore
├── 01a_triage_litematics.py
├── 01_rename_files.py
├── 02_generate_labels.py
├── config.py
├── generate.py
├── preprocess_all_data.py
├── README
├── requirements.txt
└── train_diffusion.py

Step 2: Download the Dataset

The model was trained on a large dataset of schematics from the web. Due to the size of the dataset, it is not included in this repository.

  1. Download the Schematic Dataset: Grab the Schematics.zip archive, which contains ~120,000 raw schematic files.

    • Link: MediaFire (thank you u/cbreauxgaming)
    • Unzip this file and place its contents into the data/0_raw_downloads/ directory.
  2. Download the ID-to-Name Mapping: This file maps the numeric filenames to their original names.

    • Link: mclo.gs, 2
    • Place this file in the data/ directory and name it 1_id_to_name.txt.

For more context on the dataset (you may prefer not to use this particular one for ethical reasons), see the original Reddit post, specifically this thread.
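
Before preprocessing, a quick check that the files landed where the pipeline expects them can save time. The paths simply mirror the directory layout shown above:

# Optional sanity check for the dataset layout
from pathlib import Path

raw_files = list(Path("data/0_raw_downloads").glob("*"))
print(f"{len(raw_files)} raw schematic files")     # roughly 120,000 expected
print(Path("data/1_id_to_name.txt").exists())      # should print True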

Step 3: Preprocess the Data

Run the following scripts in order; running them out of order is untested and not recommended.

# Ensure your conda environment is active
conda activate schematic-diffusion

# 1. Rename files from IDs to human-readable names
python 01_rename_files.py

# 2. Separate .litematic files for conversion
python 01a_triage_litematics.py

# 3. Convert .litematic files to .schem (this will take a while)
# This requires the Tampermonkey script (litematic_converter\Python Converter Bridge-1.0.user.js)
# to be installed and active in your browser.
cd litematic_converter
python converter.py
cd ..

# 4. Generate the final labels.json file from the processed files
python 02_generate_labels.py

# 5. Convert all schematics into PyTorch tensors (this will also take a while, the output is ~80GB)
python preprocess_all_data.py

After this step, the data/4_processed_tensors/ directory will be filled with your training-ready dataset.
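
If you want to spot-check the output before training, one of the processed files can be loaded directly. This assumes each file is torch.load-able; the exact tensor layout is defined by preprocess_all_data.py, not by this sketch.

# Spot-check a single processed sample
from pathlib import Path
import torch

sample_path = next(Path("data/4_processed_tensors").glob("*"))
sample = torch.load(sample_path, map_location="cpu")
print(sample_path.name, type(sample))
if torch.is_tensor(sample):
    print(sample.shape, sample.dtype)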

Step 4: Train the Model

# Start the training process
python train_diffusion.py
  • The script will print the average loss after each epoch. You should see this value decrease over time; the illustrative training-step sketch below shows what this loss measures.
  • The train_diffusion.py script is now configured for the V2 model architecture (base_c=128) and a larger batch size (BATCH_SIZE=24), targeting high-performance GPUs.
  • Model checkpoints will be saved every 5 epochs to the models/retrain/ directory by default.
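
For intuition, here is an illustrative DDPM-style training step (not the repo's actual train_diffusion.py code): the model is asked to predict the noise that was added to a clean voxel grid at a random timestep, and the loss is the mean squared error between the predicted and true noise.

# Illustrative DDPM-style training step -- not the repo's actual code
import torch
import torch.nn.functional as F

def training_step(model, x0, text_emb, alphas_cumprod):
    # x0: clean voxel tensor (B, C, D, H, W); text_emb: CLIP prompt embeddings
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward (noising) process
    pred_noise = model(x_t, t, text_emb)                    # U-Net predicts the added noise
    return F.mse_loss(pred_noise, noise)                    # per-batch loss; the epoch average of something like this is printed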

Step 5: Generate Schematics

Once the model is trained, you can generate new schematics using the generation script.

python generate.py
  • You can modify the prompt and other parameters directly in the generate.py script; an illustrative sketch of the sampling loop appears below.

  • Important: Make sure to update the MODEL_PATH variable to point to one of the new, high-quality models from the retrain directory. For example:

    # In generate.py
    MODEL_PATH = "models/retrain/schematic_diffusion_epoch_95.pth" 
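
Under the hood, generation is a reverse-diffusion loop: start from pure noise and iteratively denoise while conditioning on the prompt embedding. The sketch below shows the standard DDPM sampling procedure for orientation only; it is not a copy of generate.py.

# Illustrative DDPM sampling loop -- not a copy of generate.py
import torch

@torch.no_grad()
def sample(model, text_emb, shape, betas):
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=text_emb.device)              # start from pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=x.device, dtype=torch.long)
        pred_noise = model(x, t_batch, text_emb)                 # conditioned denoising step
        a, a_bar = alphas[t], alphas_cumprod[t]
        x = (x - (1 - a) / (1 - a_bar).sqrt() * pred_noise) / a.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)        # re-inject sampling noise
    return x                                                     # denoised voxel grid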

Future Development: The Path to Version 3.0

The current V2 models are a huge leap in visual and structural quality. However, their ability to follow specific stylistic prompts (e.g., "gothic," "modern") is limited by the dataset's short and generic labels (e.g., "cool tower").

The next major goal is to create a "Version 3.0" model that combines the powerful V2 visual engine with precise language control. This will be achieved by generating a new, high-quality set of labels for our existing schematic data.

The Roadmap

  1. Automated Rendering: A script will be developed to automatically load each of the 11,000+ schematics and render 2D images from several key angles (e.g., isometric, front-facing).

  2. Multi-modal AI Description: These rendered images will be fed to a multi-modal LLM. The LLM will be prompted to act as an expert architect and provide a rich, descriptive caption for each schematic, identifying its style, materials, and key features.

  3. Retraining with Enriched Labels: The V2 model architecture will be retrained from scratch using the same visual data but this new, high-quality set of text labels.

The resulting model should be capable of understanding both complex architectural concepts and nuanced stylistic language, marking the next major evolution of this project.
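
A rough sketch of how steps 1 and 2 of the roadmap could be wired together follows. Every helper here is a placeholder for tooling that does not exist in this repo yet, so only the overall flow is meaningful.

# Hypothetical V3 labeling pipeline -- all helpers are placeholders
import json
from pathlib import Path

ANGLES = ["isometric", "front", "side"]

def render_views(schem_path, angles):
    raise NotImplementedError("roadmap step 1: automated rendering")

def describe_with_llm(images):
    raise NotImplementedError("roadmap step 2: multi-modal captioning")

def build_enriched_labels(schematic_dir, out_path):
    labels = {}
    for schem in Path(schematic_dir).glob("*.schem"):
        images = render_views(schem, ANGLES)             # 2D renders from several angles
        labels[schem.stem] = describe_with_llm(images)   # rich architectural caption
    Path(out_path).write_text(json.dumps(labels, indent=2))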
