This repository contains experiments with diffusion-based language models for code generation and infilling tasks using the Open-dLLM framework.
Evaluation of the fredzzp/open-dcoder-0.5B diffusion language model on standard code infilling benchmarks:
- HumanEval-Infill: Fill-in-the-Middle (FIM) with functional correctness testing
- SantaCoder-FIM: Code infilling with exact match evaluation
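For orientation, both tasks are fill-in-the-middle: the model sees a prefix and a suffix and must generate the missing middle. The sketch below contrasts the two scoring styles; the whitespace-stripping normalization is an illustrative assumption, not necessarily what the harness does.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """SantaCoder-FIM-style scoring: the generated middle must equal the
    reference (here, up to surrounding whitespace -- a simplification)."""
    return prediction.strip() == reference.strip()

# A FIM task supplies a prefix and a suffix; the model generates the middle.
prefix = "def add(a, b):\n    return "
suffix = "\n"
reference_middle = "a + b"

# Exact match fails on "b + a" even though it is functionally correct;
# HumanEval-Infill instead executes unit tests, so "b + a" would pass there.
print(exact_match("a + b", reference_middle))
print(exact_match("b + a", reference_middle))
```

This is why the two benchmarks can rank models differently: exact match rewards reproducing the reference verbatim, while Pass@1 rewards any functionally correct completion.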
Results: both benchmarks run to completion, with scores close to the oracle reference numbers (see below).
```
Diffusion-Language-Models/
├── README.md                      # This file
├── open-dllm-experiments/         # Open-dLLM diffusion model experiments
│   ├── gpu_setup.sh               # Automated GPU environment setup
│   ├── run_evaluations.sh         # Run both benchmarks
│   ├── GPU_DEPLOY.md              # Complete deployment guide
│   ├── QUICKSTART.md              # Quick start instructions
│   ├── README.md                  # Detailed documentation
│   ├── results/                   # Evaluation results
│   └── visualizations/            # Plots and graphs
├── small-model-experiments/       # Autoregressive model comparisons
│   ├── setup_env.sh               # Environment setup
│   ├── run_benchmarks.sh          # Benchmark runner
│   ├── eval_model.py              # Unified evaluation script
│   ├── results/                   # Results for each model
│   └── README.md                  # Documentation
├── ensemble-experiments/          # Ensemble combining both approaches
│   ├── ensemble_eval.py           # Main ensemble evaluation script
│   ├── perplexity_calculator.py   # Perplexity-based selection
│   ├── evaluate_metrics.py        # Metrics computation
│   ├── setup_env.sh               # Environment setup
│   ├── run_experiments.sh         # Automated runner
│   ├── QUICKSTART.md              # Quick start guide
│   ├── results/                   # Ensemble results
│   └── README.md                  # Full documentation
└── Open-dLLM/                     # Cloned during setup (not tracked)
```
✅ EVALUATION COMPLETE - Both Benchmarks Passed
| Benchmark | Metric | Result | Expected | Status |
|---|---|---|---|---|
| HumanEval-Infill | Pass@1 | 76.48% | ~77.4% (oracle) | ✅ Near oracle |
| SantaCoder-FIM | Exact Match | 55.99% | ~56.4% (oracle) | ✅ On par |
We compared Open-dLLM (0.5B) against state-of-the-art autoregressive baselines.
| Metric | Open-dLLM (Diffusion) | Qwen 2.5 Coder 0.5B | Ensemble (Diff + Qwen) | Qwen 2.5 Coder 1.5B | DeepSeek Coder 1.3B | StarCoder2 3B |
|---|---|---|---|---|---|---|
| HumanEval-Infill (Pass@1) | 76.48% | 74.15% | 80.54% | 80.25% | 79.48% | 75.61% |
| SantaCoder-FIM (Exact Match) | 55.99% | 64.91% | 60.69% | 59.54% | 57.91% | 56.66% |
Key Findings:
- Superior Functional Correctness: Open-dLLM (76.48%) outperforms the similarly sized Qwen 2.5 Coder 0.5B (74.15%) on HumanEval-Infill.
- Scaling Laws: Larger models like Qwen 1.5B (80.25%) still hold an advantage, but the 0.5B diffusion model punches above its weight.
- Global Context: The diffusion process allows for better bidirectional context modeling, leading to higher functional accuracy even if exact match scores are lower.
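To make the bidirectional-context point concrete, here is a toy unmasking loop in the spirit of masked-diffusion decoding. It is an illustrative sketch only: the `dummy_scorer` and token strings are invented, and the real model uses the p2 sampling algorithm over its own vocabulary.

```python
MASK = "<mask>"

def toy_unmask(tokens, score_fn, steps):
    """Toy parallel decoding: each step fills the masked position the
    scorer is most confident about. Every decision conditions on both
    left AND right context, unlike left-to-right decoding."""
    tokens = list(tokens)
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        best = max(masked, key=lambda i: score_fn(tokens, i)[1])
        tokens[best] = score_fn(tokens, best)[0]
    return tokens

def dummy_scorer(tokens, i):
    """Stand-in for the model: proposes a token and a confidence based
    on the neighbors on both sides of position i."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return f"fill({left},{right})", 1.0

print(toy_unmask(["a", MASK, "c"], dummy_scorer, steps=4))
# -> ['a', 'fill(a,c)', 'c']
```

The filled token depends on the suffix as well as the prefix, which is exactly the property that helps on infilling tasks where the right context constrains the answer.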
View Results:
- Results Summary - Complete analysis
- Comparison Report - Diffusion vs. Autoregressive models
- Wandb Dashboard - Live metrics
Follow these steps to reproduce the evaluation results:
Note: For setup details specific to the small-model experiments (Qwen, DeepSeek, StarCoder2), please refer to the Small Model Experiments README.
- Linux with NVIDIA GPU (16GB+ VRAM recommended, tested on L4 with 24GB)
- CUDA 12.1+ installed
- Sufficient disk space in your project directory (not home directory)
- Python 3.10 or 3.11 (Python 3.13 has compatibility issues)
```bash
git clone https://github.com/msritian/Diffusion-Language-Models.git
cd Diffusion-Language-Models
git checkout feature/open-dllm-experiments
cd open-dllm-experiments
```

Important: If your root partition has limited space, configure conda to use your project directory:
```bash
# Point conda at the project directory (recommended if the root partition is small)
export CONDA_PKGS_DIRS=/path/to/your/project/conda-pkgs
export CONDA_ENVS_PATH=/path/to/your/project/conda-envs
export TMPDIR=/path/to/your/project/tmp
mkdir -p $CONDA_PKGS_DIRS $CONDA_ENVS_PATH $TMPDIR

# Make these permanent
echo "export CONDA_PKGS_DIRS=/path/to/your/project/conda-pkgs" >> ~/.bashrc
echo "export CONDA_ENVS_PATH=/path/to/your/project/conda-envs" >> ~/.bashrc
echo "export TMPDIR=/path/to/your/project/tmp" >> ~/.bashrc
```
```bash
# Create a Python 3.10 environment
conda create -p /path/to/your/project/open-dllm-env python=3.10 -y
conda activate /path/to/your/project/open-dllm-env

# Clone the Open-dLLM framework
git clone https://github.com/pengzhangzhi/Open-dLLM.git
cd Open-dLLM

# Install build dependencies
pip install ninja
```
```bash
# Install PyTorch with CUDA 12.1
pip install torch==2.5.0 --index-url https://download.pytorch.org/whl/cu121

# Install core ML libraries (quote version specifiers so the shell
# does not treat ">" as a redirect)
pip install --upgrade --no-cache-dir \
  tensordict torchdata "triton>=3.1.0" \
  transformers==4.54.1 accelerate datasets peft hf-transfer \
  codetiming hydra-core pandas "pyarrow>=15.0.0" pylatexenc \
  wandb liger-kernel==0.5.8 \
  pytest yapf py-spy pre-commit ruff packaging einops

# Install the Open-dLLM package
pip install -e .

# Install evaluation packages (without caching, to save space)
pip install --no-cache-dir rouge-score sqlitedict word2number
pip install -e lm-evaluation-harness -e human-eval-infilling
```

Note: flash-attention can be skipped if it fails to compile; the model works without it.
To avoid "No space left on device" errors, point cache directories at your project path:

```bash
export HF_HOME=/path/to/your/project/huggingface-cache
export TRITON_CACHE_DIR=/path/to/your/project/triton-cache
export PIP_CACHE_DIR=/path/to/your/project/pip-cache
mkdir -p $HF_HOME $TRITON_CACHE_DIR $PIP_CACHE_DIR

# Make permanent
echo "export HF_HOME=/path/to/your/project/huggingface-cache" >> ~/.bashrc
echo "export TRITON_CACHE_DIR=/path/to/your/project/triton-cache" >> ~/.bashrc
echo "export PIP_CACHE_DIR=/path/to/your/project/pip-cache" >> ~/.bashrc
```

Download the HumanEval-Infilling benchmark data:

```bash
cd human-eval-infilling/data
wget https://github.com/openai/human-eval-infilling/raw/master/data/HumanEval-SingleLineInfilling.jsonl.gz

# Verify the download: should print 1033
gunzip -c HumanEval-SingleLineInfilling.jsonl.gz | wc -l
```
```bash
cd ../..
cd eval/eval_infill

# SantaCoder-FIM evaluation
python eval_infill.py \
  --model_path fredzzp/open-dcoder-0.5B \
  --task santacoder-fim \
  --temperature 0.6 \
  --steps 64 \
  --alg p2 \
  --batch_size 4  # Adjust based on your GPU memory
```

Time: ~30-45 minutes on an NVIDIA L4.
```bash
# HumanEval-Infill evaluation
python eval_infill.py \
  --model_path fredzzp/open-dcoder-0.5B \
  --task humaneval_infill \
  --temperature 0.6 \
  --steps 64 \
  --alg p2 \
  --batch_size 4
```

Time: ~40 minutes on an NVIDIA L4.
Metrics are computed and saved automatically after each evaluation. To re-run the HumanEval-Infill functional-correctness check manually:

```bash
python ../../human-eval-infilling/human_eval_infilling/evaluate_functional_correctness.py \
  infill_results/humaneval_infill/open-dcoder-0.5B/0.6/humaneval_infill_results_*.jsonl \
  --benchmark_name=single-line
```
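The Pass@1 numbers reported above come from the standard unbiased pass@k estimator introduced with HumanEval. A self-contained sketch of that formula, where `n` is the number of samples per task and `c` the number that pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failing samples than k draws: a draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task (n = 1), pass@1 reduces to correct/total:
print(pass_at_k(1, 1, 1))   # 1.0
print(pass_at_k(1, 0, 1))   # 0.0
print(pass_at_k(10, 5, 1))  # 0.5
```

The benchmark-level score is this quantity averaged over all tasks.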
```bash
# Set your Wandb API key
export WANDB_API_KEY="your-wandb-api-key"

# The evaluation scripts automatically log to Wandb if the key is set.
# Or log results manually:
python -c "
import wandb
import json
wandb.init(project='eval-infill-dllm', name='your-run-name')
with open('infill_results/.../eval_results.json') as f:
    results = json.load(f)
wandb.log(results['results'])
wandb.finish()
"
```

NEW! We've created an ensemble approach that combines Open-dLLM (diffusion) with Qwen 2.5 Coder 0.5B (autoregressive), selecting whichever output has the lower perplexity.
```bash
cd ensemble-experiments
bash setup_env.sh        # Set up the environment (~10-15 min)
bash run_experiments.sh  # Run the ensemble evaluation (~30-60 min)
```

How it works:

- Dual Generation: Generate completions from both models
- Perplexity Evaluation: Calculate perplexity for each completion
- Smart Selection: Choose the output with lower perplexity
- Evaluation: Benchmark on both HumanEval-Infill and SantaCoder-FIM
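The selection step reduces to comparing average negative log-likelihood per token. A minimal sketch of the idea (the hard-coded log-probabilities below are stand-ins; the real pipeline, presumably in perplexity_calculator.py, scores each completion with a language model):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def select_by_perplexity(candidates):
    """Pick the completion the scoring model found least surprising."""
    return min(candidates, key=lambda c: perplexity(c["logprobs"]))

# Stand-in log-probs for two candidate completions of the same infill slot.
diffusion_out = {"model": "open-dcoder-0.5B", "text": "a + b",
                 "logprobs": [-0.2, -0.4, -0.1]}
autoreg_out = {"model": "qwen2.5-coder-0.5B", "text": "a - b",
               "logprobs": [-0.9, -1.2, -0.7]}
print(select_by_perplexity([diffusion_out, autoreg_out])["text"])  # a + b
```

Lower perplexity means the completion is, on average, more probable per token under the scoring model, which makes the selection criterion easy to inspect and debug.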
- ✅ No Training Required: Pure inference-time ensemble
- ✅ Complementary Strengths: Leverages both diffusion and autoregressive approaches
- ✅ Interpretable Selection: Perplexity provides a clear selection criterion
- ✅ Automated Evaluation: Complete pipeline with metrics computation
- ✅ Wandb Integration: Track experiments and visualize results
When the two models' errors are complementary, the ensemble can match or exceed the better individual model; in our runs it did so on HumanEval-Infill but fell short of Qwen 0.5B on SantaCoder-FIM exact match.
| Benchmark | Metric | Score |
|---|---|---|
| HumanEval-Infill | Pass@1 | 80.54% |
| SantaCoder-FIM | Exact Match | 60.69% |
Analysis:
- HumanEval-Infill: The ensemble (80.54%) significantly outperforms both individual models (Open-dLLM: 76.48%, Qwen: 74.15%) and even beats the larger Qwen 1.5B (80.25%).
- SantaCoder-FIM: The ensemble (60.69%) improves over Open-dLLM (55.99%) but does not beat Qwen 0.5B (64.91%) on exact match.
- ensemble-experiments/README.md: Full documentation
- ensemble-experiments/QUICKSTART.md: Quick start guide
- Configuration: Edit `run_experiments.sh` for custom settings
Out of GPU memory? Reduce the batch size:

```bash
python eval_infill.py ... --batch_size 2  # or even 1
```

"No space left on device" during pip install? Use the --no-cache-dir flag:

```bash
pip install --no-cache-dir package-name
```

flash-attention fails to compile? Skip it - the model works without flash-attention, just slightly slower.

Wandb prompting for login? Use an environment variable instead:

```bash
export WANDB_API_KEY="your-key"
```

To customize evaluation parameters:

```bash
export TEMPERATURE=0.8  # Sampling temperature (default: 0.6)
export STEPS=128        # Diffusion steps (default: 64)
export BATCH_SIZE=16    # Batch size (default: 4)
```

Resources:

- Open-dLLM Repository: https://github.com/pengzhangzhi/Open-dLLM
- Model (Hugging Face): https://huggingface.co/fredzzp/open-dcoder-0.5B
- Project Blog: Open-dLLM Notion
- HumanEval Benchmark: https://github.com/openai/human-eval
- SantaCoder Dataset: https://huggingface.co/datasets/bigcode/santacoder-fim-task
This is an experimental repository. To contribute:

- Create a feature branch from `feature/open-dllm-experiments`
- Make your changes
- Submit a pull request
See LICENSE file for details.
This project builds on the Open-dLLM framework by Pengzhangzhi et al. and uses the fredzzp/open-dcoder-0.5B model for code generation experiments.