This repository contains experiments with diffusion-based language models for code generation and infilling tasks using the Open-dLLM framework.
Evaluation of the fredzzp/open-dcoder-0.5B diffusion language model on standard code infilling benchmarks:
- HumanEval-Infill: Fill-in-the-Middle (FIM) with functional correctness testing
- SantaCoder-FIM: Code infilling with exact match evaluation
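For orientation, both tasks are fill-in-the-middle: the model sees a prefix and a suffix and must generate the missing middle. The sketch below contrasts the two scoring styles; the whitespace-stripping normalization is an illustrative assumption, not necessarily what the harness does.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """SantaCoder-FIM-style scoring: the generated middle must equal the
    reference (here, up to surrounding whitespace -- a simplification)."""
    return prediction.strip() == reference.strip()

# A FIM task supplies a prefix and a suffix; the model generates the middle.
prefix = "def add(a, b):\n    return "
suffix = "\n"
reference_middle = "a + b"

# Exact match fails on "b + a" even though it is functionally correct;
# HumanEval-Infill instead executes unit tests, so "b + a" would pass there.
print(exact_match("a + b", reference_middle))
print(exact_match("b + a", reference_middle))
```

This is why the two benchmarks can rank models differently: exact match rewards reproducing the reference verbatim, while Pass@1 rewards any functionally correct completion.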
Results: both benchmarks run to completion, with scores close to the oracle reference numbers (see below).
```
Diffusion-Language-Models/
├── README.md                      # This file
├── open-dllm-experiments/         # Open-dLLM diffusion model experiments
│   ├── gpu_setup.sh               # Automated GPU environment setup
│   ├── run_evaluations.sh         # Run both benchmarks
│   ├── GPU_DEPLOY.md              # Complete deployment guide
│   ├── QUICKSTART.md              # Quick start instructions
│   ├── README.md                  # Detailed documentation
│   ├── results/                   # Evaluation results
│   └── visualizations/            # Plots and graphs
├── small-model-experiments/       # Autoregressive model comparisons
│   ├── setup_env.sh               # Environment setup
│   ├── run_benchmarks.sh          # Benchmark runner
│   ├── eval_model.py              # Unified evaluation script
│   ├── results/                   # Results for each model
│   └── README.md                  # Documentation
├── ensemble-experiments/          # Ensemble combining both approaches
│   ├── ensemble_eval.py           # Main ensemble evaluation script
│   ├── perplexity_calculator.py   # Perplexity-based selection
│   ├── evaluate_metrics.py        # Metrics computation
│   ├── setup_env.sh               # Environment setup
│   ├── run_experiments.sh         # Automated runner
│   ├── QUICKSTART.md              # Quick start guide
│   ├── results/                   # Ensemble results
│   └── README.md                  # Full documentation
└── Open-dLLM/                     # Cloned during setup (not tracked)
```
✅ EVALUATION COMPLETE - Both Benchmarks Passed
| Benchmark | Metric | Result | Expected | Status |
|---|---|---|---|---|
| HumanEval-Infill | Pass@1 | 76.48% | ~77.4% (oracle) | ✅ Near oracle |
| SantaCoder-FIM | Exact Match | 55.99% | ~56.4% (oracle) | ✅ On par |
We compared Open-dLLM (0.5B) against state-of-the-art autoregressive baselines.
| Metric | Open-dLLM (Diffusion) | Qwen 2.5 Coder 0.5B | Ensemble (Diff + Qwen) | Qwen 2.5 Coder 1.5B | DeepSeek Coder 1.3B | StarCoder2 3B |
|---|---|---|---|---|---|---|
| HumanEval-Infill (Pass@1) | 76.48% | 74.15% | 80.54% | 80.25% | 79.48% | 75.61% |
| SantaCoder-FIM (Exact Match) | 55.99% | 64.91% | 60.69% | 59.54% | 57.91% | 56.66% |
Key Findings:
- Superior Functional Correctness: Open-dLLM (76.48%) outperforms the similarly sized Qwen 2.5 Coder 0.5B (74.15%) on HumanEval-Infill.
- Scaling Laws: Larger models like Qwen 1.5B (80.25%) still hold an advantage, but the 0.5B diffusion model punches above its weight.
- Global Context: The diffusion process allows for better bidirectional context modeling, leading to higher functional accuracy even if exact match scores are lower.
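To make the bidirectional-context point concrete, here is a toy unmasking loop in the spirit of masked-diffusion decoding. It is an illustrative sketch only: the `dummy_scorer` and token strings are invented, and the real model uses the p2 sampling algorithm over its own vocabulary.

```python
MASK = "<mask>"

def toy_unmask(tokens, score_fn, steps):
    """Toy parallel decoding: each step fills the masked position the
    scorer is most confident about. Every decision conditions on both
    left AND right context, unlike left-to-right decoding."""
    tokens = list(tokens)
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        best = max(masked, key=lambda i: score_fn(tokens, i)[1])
        tokens[best] = score_fn(tokens, best)[0]
    return tokens

def dummy_scorer(tokens, i):
    """Stand-in for the model: proposes a token and a confidence based
    on the neighbors on both sides of position i."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return f"fill({left},{right})", 1.0

print(toy_unmask(["a", MASK, "c"], dummy_scorer, steps=4))
# -> ['a', 'fill(a,c)', 'c']
```

The filled token depends on the suffix as well as the prefix, which is exactly the property that helps on infilling tasks where the right context constrains the answer.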
View Results:
- Results Summary - Complete analysis
- Comparison Report - Diffusion vs. Autoregressive models
- Wandb Dashboard - Live metrics
Follow these steps to reproduce the evaluation results:
Note: For setup details specific to the small-model experiments (Qwen, DeepSeek, StarCoder2), please refer to the Small Model Experiments README.
- Linux with NVIDIA GPU (16GB+ VRAM recommended, tested on L4 with 24GB)
- CUDA 12.1+ installed
- Sufficient disk space in your project directory (not home directory)
- Python 3.10 or 3.11 (Python 3.13 has compatibility issues)
```bash
git clone https://github.com/msritian/Diffusion-Language-Models.git
cd Diffusion-Language-Models
git checkout feature/open-dllm-experiments
cd open-dllm-experiments
```

Important: If your root partition has limited space, configure conda to use your project directory:
```bash
# Point conda at the project directory (recommended if the root partition is small)
export CONDA_PKGS_DIRS=/path/to/your/project/conda-pkgs
export CONDA_ENVS_PATH=/path/to/your/project/conda-envs
export TMPDIR=/path/to/your/project/tmp
mkdir -p $CONDA_PKGS_DIRS $CONDA_ENVS_PATH $TMPDIR

# Make these permanent
echo "export CONDA_PKGS_DIRS=/path/to/your/project/conda-pkgs" >> ~/.bashrc
echo "export CONDA_ENVS_PATH=/path/to/your/project/conda-envs" >> ~/.bashrc
echo "export TMPDIR=/path/to/your/project/tmp" >> ~/.bashrc
```
```bash
# Create a Python 3.10 environment
conda create -p /path/to/your/project/open-dllm-env python=3.10 -y
conda activate /path/to/your/project/open-dllm-env

# Clone the Open-dLLM framework
git clone https://github.com/pengzhangzhi/Open-dLLM.git
cd Open-dLLM

# Install build dependencies
pip install ninja
```
```bash
# Install PyTorch with CUDA 12.1
pip install torch==2.5.0 --index-url https://download.pytorch.org/whl/cu121

# Install core ML libraries (quote version specifiers so the shell
# does not treat ">" as a redirect)
pip install --upgrade --no-cache-dir \
  tensordict torchdata "triton>=3.1.0" \
  transformers==4.54.1 accelerate datasets peft hf-transfer \
  codetiming hydra-core pandas "pyarrow>=15.0.0" pylatexenc \
  wandb liger-kernel==0.5.8 \
  pytest yapf py-spy pre-commit ruff packaging einops

# Install the Open-dLLM package
pip install -e .

# Install evaluation packages (without caching, to save space)
pip install --no-cache-dir rouge-score sqlitedict word2number
pip install -e lm-evaluation-harness -e human-eval-infilling
```

Note: flash-attention can be skipped if it fails to compile; the model works without it.
To avoid "No space left on device" errors, point cache directories at your project path:

```bash
export HF_HOME=/path/to/your/project/huggingface-cache
export TRITON_CACHE_DIR=/path/to/your/project/triton-cache
export PIP_CACHE_DIR=/path/to/your/project/pip-cache
mkdir -p $HF_HOME $TRITON_CACHE_DIR $PIP_CACHE_DIR

# Make permanent
echo "export HF_HOME=/path/to/your/project/huggingface-cache" >> ~/.bashrc
echo "export TRITON_CACHE_DIR=/path/to/your/project/triton-cache" >> ~/.bashrc
echo "export PIP_CACHE_DIR=/path/to/your/project/pip-cache" >> ~/.bashrc
```

Download the HumanEval-Infilling benchmark data:

```bash
cd human-eval-infilling/data
wget https://github.com/openai/human-eval-infilling/raw/master/data/HumanEval-SingleLineInfilling.jsonl.gz

# Verify the download: should print 1033
gunzip -c HumanEval-SingleLineInfilling.jsonl.gz | wc -l
```
```bash
cd ../..
cd eval/eval_infill

# SantaCoder-FIM evaluation
python eval_infill.py \
  --model_path fredzzp/open-dcoder-0.5B \
  --task santacoder-fim \
  --temperature 0.6 \
  --steps 64 \
  --alg p2 \
  --batch_size 4  # Adjust based on your GPU memory
```

Time: ~30-45 minutes on an NVIDIA L4.
```bash
# HumanEval-Infill evaluation
python eval_infill.py \
  --model_path fredzzp/open-dcoder-0.5B \
  --task humaneval_infill \
  --temperature 0.6 \
  --steps 64 \
  --alg p2 \
  --batch_size 4
```

Time: ~40 minutes on an NVIDIA L4.
Metrics are computed and saved automatically after each evaluation. To re-run the HumanEval-Infill functional-correctness check manually:

```bash
python ../../human-eval-infilling/human_eval_infilling/evaluate_functional_correctness.py \
  infill_results/humaneval_infill/open-dcoder-0.5B/0.6/humaneval_infill_results_*.jsonl \
  --benchmark_name=single-line
```
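The Pass@1 numbers reported above come from the standard unbiased pass@k estimator introduced with HumanEval. A self-contained sketch of that formula, where `n` is the number of samples per task and `c` the number that pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failing samples than k draws: a draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task (n = 1), pass@1 reduces to correct/total:
print(pass_at_k(1, 1, 1))   # 1.0
print(pass_at_k(1, 0, 1))   # 0.0
print(pass_at_k(10, 5, 1))  # 0.5
```

The benchmark-level score is this quantity averaged over all tasks.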
```bash
# Set your Wandb API key
export WANDB_API_KEY="your-wandb-api-key"

# The evaluation scripts automatically log to Wandb if the key is set.
# Or log results manually:
python -c "
import wandb
import json
wandb.init(project='eval-infill-dllm', name='your-run-name')
with open('infill_results/.../eval_results.json') as f:
    results = json.load(f)
wandb.log(results['results'])
wandb.finish()
"
```

NEW! We've created an ensemble approach that combines Open-dLLM (diffusion) with Qwen 2.5 Coder 0.5B (autoregressive), selecting whichever output has the lower perplexity.
```bash
cd ensemble-experiments
bash setup_env.sh        # Set up the environment (~10-15 min)
bash run_experiments.sh  # Run the ensemble evaluation (~30-60 min)
```

How it works:

- Dual Generation: Generate completions from both models
- Perplexity Evaluation: Calculate perplexity for each completion
- Smart Selection: Choose the output with lower perplexity
- Evaluation: Benchmark on both HumanEval-Infill and SantaCoder-FIM
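The selection step reduces to comparing average negative log-likelihood per token. A minimal sketch of the idea (the hard-coded log-probabilities below are stand-ins; the real pipeline, presumably in perplexity_calculator.py, scores each completion with a language model):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def select_by_perplexity(candidates):
    """Pick the completion the scoring model found least surprising."""
    return min(candidates, key=lambda c: perplexity(c["logprobs"]))

# Stand-in log-probs for two candidate completions of the same infill slot.
diffusion_out = {"model": "open-dcoder-0.5B", "text": "a + b",
                 "logprobs": [-0.2, -0.4, -0.1]}
autoreg_out = {"model": "qwen2.5-coder-0.5B", "text": "a - b",
               "logprobs": [-0.9, -1.2, -0.7]}
print(select_by_perplexity([diffusion_out, autoreg_out])["text"])  # a + b
```

Lower perplexity means the completion is, on average, more probable per token under the scoring model, which makes the selection criterion easy to inspect and debug.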
- ✅ No Training Required: Pure inference-time ensemble
- ✅ Complementary Strengths: Leverages both diffusion and autoregressive approaches
- ✅ Interpretable Selection: Perplexity provides a clear selection criterion
- ✅ Automated Evaluation: Complete pipeline with metrics computation
- ✅ Wandb Integration: Track experiments and visualize results
When the two models' errors are complementary, the ensemble can match or exceed the better individual model; in our runs it did so on HumanEval-Infill but fell short of Qwen 0.5B on SantaCoder-FIM exact match.
| Benchmark | Metric | Score |
|---|---|---|
| HumanEval-Infill | Pass@1 | 80.54% |
| SantaCoder-FIM | Exact Match | 60.69% |
Analysis:
- HumanEval-Infill: The ensemble (80.54%) significantly outperforms both individual models (Open-dLLM: 76.48%, Qwen: 74.15%) and even beats the larger Qwen 1.5B (80.25%).
- SantaCoder-FIM: The ensemble (60.69%) improves over Open-dLLM (55.99%) but does not beat Qwen 0.5B (64.91%) on exact match.
- ensemble-experiments/README.md: Full documentation
- ensemble-experiments/QUICKSTART.md: Quick start guide
- Configuration: Edit `run_experiments.sh` for custom settings
Out of GPU memory? Reduce the batch size:

```bash
python eval_infill.py ... --batch_size 2  # or even 1
```

"No space left on device" during pip install? Use the --no-cache-dir flag:

```bash
pip install --no-cache-dir package-name
```

flash-attention fails to compile? Skip it - the model works without flash-attention, just slightly slower.

Wandb prompting for login? Use an environment variable instead:

```bash
export WANDB_API_KEY="your-key"
```

To customize evaluation parameters:

```bash
export TEMPERATURE=0.8  # Sampling temperature (default: 0.6)
export STEPS=128        # Diffusion steps (default: 64)
export BATCH_SIZE=16    # Batch size (default: 4)
```

Resources:

- Open-dLLM Repository: https://github.com/pengzhangzhi/Open-dLLM
- Model (Hugging Face): https://huggingface.co/fredzzp/open-dcoder-0.5B
- Project Blog: Open-dLLM Notion
- HumanEval Benchmark: https://github.com/openai/human-eval
- SantaCoder Dataset: https://huggingface.co/datasets/bigcode/santacoder-fim-task
This is an experimental repository. To contribute:

- Create a feature branch from `feature/open-dllm-experiments`
- Make your changes
- Submit a pull request
See LICENSE file for details.
This project builds on the Open-dLLM framework by Pengzhangzhi et al. and uses the fredzzp/open-dcoder-0.5B model for code generation experiments.