Diffusion Language Models - Open-dLLM Experimentation

This repository contains experiments with diffusion-based language models for code generation and infilling tasks using the Open-dLLM framework.

🎯 Project Overview

Evaluation of the fredzzp/open-dcoder-0.5B diffusion language model on standard code infilling benchmarks:

  • HumanEval-Infill: Fill-in-the-Middle (FIM) with functional correctness testing
  • SantaCoder-FIM: Code infilling with exact match evaluation
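For reference, Fill-in-the-Middle evaluation gives the model the code before and after a masked span and asks it to generate the span. A minimal sketch of assembling such a prompt, using illustrative sentinel strings (each model defines its own special tokens, so check the tokenizer's vocabulary before reusing this):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle (PSM) infilling prompt.

    The sentinel strings below are illustrative placeholders; real
    models use their own special tokens for prefix/suffix/middle.
    """
    return f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"

# The model is expected to generate the masked middle, e.g. "return a + b"
prompt = build_fim_prompt("def add(a, b):\n    ", "\n")
```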

Results: Both benchmarks completed successfully, with scores within about one point of the reported oracle numbers (see below).

πŸ“ Repository Structure

Diffusion-Language-Models/
├── README.md                     # This file
├── open-dllm-experiments/        # Open-dLLM diffusion model experiments
│   ├── gpu_setup.sh             # Automated GPU environment setup
│   ├── run_evaluations.sh       # Run both benchmarks
│   ├── GPU_DEPLOY.md            # Complete deployment guide
│   ├── QUICKSTART.md            # Quick start instructions
│   ├── README.md                # Detailed documentation
│   ├── results/                 # Evaluation results
│   └── visualizations/          # Plots and graphs
├── small-model-experiments/      # Autoregressive model comparisons
│   ├── setup_env.sh             # Environment setup
│   ├── run_benchmarks.sh        # Benchmark runner
│   ├── eval_model.py            # Unified evaluation script
│   ├── results/                 # Results for each model
│   └── README.md                # Documentation
├── ensemble-experiments/         # 🆕 Ensemble model combining both approaches
│   ├── ensemble_eval.py         # Main ensemble evaluation script
│   ├── perplexity_calculator.py # Perplexity-based selection
│   ├── evaluate_metrics.py      # Metrics computation
│   ├── setup_env.sh             # Environment setup
│   ├── run_experiments.sh       # Automated runner
│   ├── QUICKSTART.md            # Quick start guide
│   ├── results/                 # Ensemble results
│   └── README.md                # Full documentation
└── Open-dLLM/                   # Cloned during setup (not tracked)

📊 Results

✅ EVALUATION COMPLETE - Both Benchmarks Passed

| Benchmark        | Metric      | Result | Expected        | Status         |
|------------------|-------------|--------|-----------------|----------------|
| HumanEval-Infill | Pass@1      | 76.48% | ~77.4% (oracle) | ✅ Exceptional |
| SantaCoder-FIM   | Exact Match | 55.99% | ~56.4% (oracle) | ✅ On par      |
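For context, Pass@1 is the fraction of problems for which a sampled completion passes the unit tests. With n samples per problem and c of them correct, the standard unbiased pass@k estimator can be sketched as follows (the evaluation harness's own implementation may differ in details):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable product."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct solution
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With 10 samples and 5 correct, pass@1 reduces to c/n = 0.5
print(pass_at_k(10, 5, 1))
```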

πŸ† Comparative Analysis: Diffusion vs. Autoregressive Models

We compared Open-dLLM (0.5B) against state-of-the-art autoregressive baselines.

| Metric                       | Open-dLLM (Diffusion) | Qwen 2.5 Coder 0.5B | Ensemble (Diff + Qwen) | Qwen 2.5 Coder 1.5B | DeepSeek Coder 1.3B | StarCoder2 3B |
|------------------------------|-----------------------|---------------------|------------------------|---------------------|---------------------|---------------|
| HumanEval-Infill (Pass@1)    | 76.48%                | 74.15%              | 80.54%                 | 80.25%              | 79.48%              | 75.61%        |
| SantaCoder-FIM (Exact Match) | 55.99%                | 64.91%              | 60.69%                 | 59.54%              | 57.91%              | 56.66%        |

Key Findings:

  1. Superior Functional Correctness: Open-dLLM (76.48%) outperforms the similarly sized Qwen 2.5 Coder 0.5B (74.15%) on HumanEval-Infill.
  2. Scaling Laws: Larger models like Qwen 1.5B (80.25%) still hold an advantage, but the 0.5B diffusion model punches above its weight.
  3. Global Context: The diffusion process allows for better bidirectional context modeling, leading to higher functional accuracy even if exact match scores are lower.


🛠️ Reproduction Guide

Follow these steps to reproduce the evaluation results:

Note

For specific setup details regarding the small model experiments (Qwen, DeepSeek, StarCoder2), please refer to the Small Model Experiments README.

Prerequisites

  • Linux with NVIDIA GPU (16GB+ VRAM recommended, tested on L4 with 24GB)
  • CUDA 12.1+ installed
  • Sufficient disk space in your project directory (not home directory)
  • Python 3.10 or 3.11 (Python 3.13 has compatibility issues)

Step 1: Clone Repository

git clone https://github.com/msritian/Diffusion-Language-Models.git
cd Diffusion-Language-Models
git checkout feature/open-dllm-experiments
cd open-dllm-experiments

Step 2: Set Up Environment

Important: If your root partition has limited space, set up conda to use your project directory:

# Set conda to use project directory (recommended if root partition is limited)
export CONDA_PKGS_DIRS=/path/to/your/project/conda-pkgs
export CONDA_ENVS_PATH=/path/to/your/project/conda-envs
export TMPDIR=/path/to/your/project/tmp
mkdir -p $CONDA_PKGS_DIRS $CONDA_ENVS_PATH $TMPDIR

# Make these permanent
echo "export CONDA_PKGS_DIRS=/path/to/your/project/conda-pkgs" >> ~/.bashrc
echo "export CONDA_ENVS_PATH=/path/to/your/project/conda-envs" >> ~/.bashrc
echo "export TMPDIR=/path/to/your/project/tmp" >> ~/.bashrc

# Create Python 3.10 environment
conda create -p /path/to/your/project/open-dllm-env python=3.10 -y
conda activate /path/to/your/project/open-dllm-env

Step 3: Clone Open-dLLM Repository

git clone https://github.com/pengzhangzhi/Open-dLLM.git
cd Open-dLLM

Step 4: Install Dependencies

# Install system dependencies
pip install ninja

# Install PyTorch with CUDA 12.1
pip install torch==2.5.0 --index-url https://download.pytorch.org/whl/cu121

# Install core ML libraries
pip install --upgrade --no-cache-dir \
  tensordict torchdata "triton>=3.1.0" \
  transformers==4.54.1 accelerate datasets peft hf-transfer \
  codetiming hydra-core pandas "pyarrow>=15.0.0" pylatexenc \
  wandb liger-kernel==0.5.8 \
  pytest yapf py-spy pre-commit ruff packaging einops

# Install Open-dLLM package
pip install -e .

# Install evaluation packages (without caching to save space)
pip install --no-cache-dir rouge-score sqlitedict word2number
pip install -e lm-evaluation-harness human-eval-infilling

Note: Flash-attention can be skipped if it fails to compile - the model works without it.

Step 5: Configure Cache Directories

To avoid "No space left on device" errors:

# Set cache directories to your project path
export HF_HOME=/path/to/your/project/huggingface-cache
export TRITON_CACHE_DIR=/path/to/your/project/triton-cache
export PIP_CACHE_DIR=/path/to/your/project/pip-cache
mkdir -p $HF_HOME $TRITON_CACHE_DIR $PIP_CACHE_DIR

# Make permanent
echo "export HF_HOME=/path/to/your/project/huggingface-cache" >> ~/.bashrc
echo "export TRITON_CACHE_DIR=/path/to/your/project/triton-cache" >> ~/.bashrc
echo "export PIP_CACHE_DIR=/path/to/your/project/pip-cache" >> ~/.bashrc

Step 6: Download HumanEval-Infill Dataset

cd human-eval-infilling/data
wget https://github.com/openai/human-eval-infilling/raw/master/data/HumanEval-SingleLineInfilling.jsonl.gz

# Verify download
gunzip -c HumanEval-SingleLineInfilling.jsonl.gz | wc -l  # Should show 1033
cd ../..

Step 7: Run Evaluations

SantaCoder-FIM (Auto-downloads from HuggingFace)

cd eval/eval_infill
python eval_infill.py \
  --model_path fredzzp/open-dcoder-0.5B \
  --task santacoder-fim \
  --temperature 0.6 \
  --steps 64 \
  --alg p2 \
  --batch_size 4  # Adjust based on your GPU memory

Time: ~30-45 minutes on NVIDIA L4

HumanEval-Infill

python eval_infill.py \
  --model_path fredzzp/open-dcoder-0.5B \
  --task humaneval_infill \
  --temperature 0.6 \
  --steps 64 \
  --alg p2 \
  --batch_size 4

Time: ~40 minutes on NVIDIA L4

Step 8: Compute Metrics

After each evaluation, compute the final metrics:

SantaCoder-FIM

Metrics are automatically computed and saved.
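For reference, exact match simply compares each generated infill against the ground-truth span. A minimal sketch of one common convention (comparison after stripping surrounding whitespace; the benchmark's own normalization may differ):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Compare prediction and reference after stripping surrounding
    whitespace. This is one common convention; the benchmark's own
    scorer may normalize differently."""
    return prediction.strip() == reference.strip()

def exact_match_rate(pairs) -> float:
    """Fraction of (prediction, reference) pairs that match exactly."""
    return sum(exact_match(p, r) for p, r in pairs) / len(pairs)
```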

HumanEval-Infill

# Run evaluation script
python ../../human-eval-infilling/human_eval_infilling/evaluate_functional_correctness.py \
  infill_results/humaneval_infill/open-dcoder-0.5B/0.6/humaneval_infill_results_*.jsonl \
  --benchmark_name=single-line

Step 9: (Optional) Log to Wandb

# Set your Wandb API key
export WANDB_API_KEY="your-wandb-api-key"

# The evaluation scripts automatically log to Wandb if the key is set
# Or log results manually:
python -c "
import wandb
import json

wandb.init(project='eval-infill-dllm', name='your-run-name')
with open('infill_results/.../eval_results.json') as f:
    results = json.load(f)
wandb.log(results['results'])
wandb.finish()
"

🤝 Ensemble Experiments

NEW! We've created an ensemble approach that combines Open-dLLM (diffusion) with Qwen 2.5 Coder 0.5B (autoregressive), selecting the output with lower perplexity.

Quick Start

cd ensemble-experiments
bash setup_env.sh        # Setup environment (~10-15 min)
bash run_experiments.sh  # Run ensemble evaluation (~30-60 min)

How It Works

  1. Dual Generation: Generate completions from both models
  2. Perplexity Evaluation: Calculate perplexity for each completion
  3. Smart Selection: Choose the output with lower perplexity
  4. Evaluation: Benchmark on both HumanEval-Infill and SantaCoder-FIM
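The selection step above can be sketched as follows, assuming per-token log-probabilities are available for each completion (the actual perplexity_calculator.py implementation may differ in details):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a completion under a scoring model:
    exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_completion(candidates: list[tuple[str, list[float]]]) -> str:
    """Pick the candidate with the lowest perplexity.
    Each candidate pairs generated text with its token log-probs."""
    return min(candidates, key=lambda c: perplexity(c[1]))[0]

# The diffusion and autoregressive outputs would each be one candidate.
best = select_completion([
    ("return a + b", [-0.1, -0.2, -0.1]),  # confident completion
    ("pass",         [-3.0]),              # unlikely completion
])
```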

Key Features

  • ✅ No Training Required: Pure inference-time ensemble
  • ✅ Complementary Strengths: Leverages both diffusion and autoregressive approaches
  • ✅ Interpretable Selection: Perplexity provides a clear selection criterion
  • ✅ Automated Evaluation: Complete pipeline with metrics computation
  • ✅ Wandb Integration: Track experiments and visualize results

Expected Performance

The ensemble typically achieves performance at least as good as the better individual model, with potential for improvement on challenging cases where models complement each other.

πŸ† Results

| Benchmark        | Metric      | Score  |
|------------------|-------------|--------|
| HumanEval-Infill | Pass@1      | 80.54% |
| SantaCoder-FIM   | Exact Match | 60.69% |

Analysis:

  • HumanEval-Infill: The ensemble (80.54%) significantly outperforms both individual models (Open-dLLM: 76.48%, Qwen: 74.15%) and even beats the larger Qwen 1.5B (80.25%).
  • SantaCoder-FIM: The ensemble (60.69%) improves over Open-dLLM (55.99%) but does not beat Qwen 0.5B (64.91%) on exact match.


🔧 Troubleshooting

CUDA Out of Memory

Reduce batch size:

python eval_infill.py ... --batch_size 2  # or even 1

Package Installation Fails

Use --no-cache-dir flag:

pip install --no-cache-dir package-name

Flash-Attention Build Fails

Skip it - the model works without flash-attention, just slightly slower.

Wandb Login Issues

Use environment variable instead:

export WANDB_API_KEY="your-key"

⚙️ Configuration Options

Customize evaluation parameters:

export TEMPERATURE=0.8       # Sampling temperature (default: 0.6)
export STEPS=128             # Diffusion steps (default: 64)
export BATCH_SIZE=16         # Batch size (default: 4)


🤝 Contributing

This is an experimental repository. To contribute:

  1. Create a feature branch from feature/open-dllm-experiments
  2. Make your changes
  3. Submit a pull request

πŸ“ License

See LICENSE file for details.

🙏 Acknowledgments

This project builds on the Open-dLLM framework by Pengzhangzhi et al. and uses the fredzzp/open-dcoder-0.5B model for code generation experiments.

About

A project investigating the practical differences between diffusion-based and autoregressive foundation models, and improving the generative capabilities of dLLMs.
