SPES (SParse Expert Sync) is a memory-efficient decentralized training framework for pretraining Mixture-of-Experts (MoE) LLMs across geographically distributed GPU nodes.
Unlike conventional paradigms that demand high-bandwidth interconnects, SPES enables collaborative MoE pretraining in which nodes operate semi-independently.
| Feature | Description |
|---|---|
| 🌐 Decentralized Training | Operates without high-speed cross-node interconnects. Each node functions as an independent training unit with local DDP. |
| 💾 Memory Efficiency | Nodes only maintain gradients/optimizer states for their local subset of experts, drastically reducing memory footprint. |
| ⚡ Sparse Sync | Utilizes a lightweight gRPC parameter server to synchronize only trained parameters periodically. |
| 🔀 Smart Merging | Implements intelligent weighted merging with a decaying alpha schedule to ensure stable convergence during knowledge transfer. |
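The decaying-alpha merge described above can be illustrated with a small sketch. The schedule, function names, and default values here are illustrative assumptions, not SPES's actual API:

```python
import numpy as np

def decayed_alpha(sync_round, alpha_init=0.9, decay=0.95, alpha_min=0.1):
    """Hypothetical schedule: alpha shrinks each sync round, so later
    merges weight the incoming remote parameters less aggressively."""
    return max(alpha_min, alpha_init * decay ** sync_round)

def merge(local, remote, sync_round):
    """Weighted average of local parameters and freshly pulled remote ones."""
    alpha = decayed_alpha(sync_round)
    return alpha * remote + (1.0 - alpha) * local

local = np.ones(4)    # locally trained parameter tensor
remote = np.zeros(4)  # parameter tensor pulled from the server
early = merge(local, remote, sync_round=0)   # mostly remote
late = merge(local, remote, sync_round=50)   # mostly local
```

The decay keeps early sync rounds aggressive (fast knowledge transfer) while late rounds stay gentle, which is one common way to stabilize convergence when periodically merging divergent replicas.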
- Release Training Code
- Release pretrained model checkpoints & training logs
- Add detailed documentation for training and evaluation scripts
- Python: >= 3.10
- CUDA: >= 12.1 (Tested on 12.4)
- PyTorch: 2.5.1
- Hardware: NVIDIA GPUs (Tested on A100/A800/L40S)
```bash
# 1. Clone the repository
git clone https://github.com/zjr2000/SPES.git
cd SPES

# 2. Install PyTorch (Adjust CUDA version if necessary)
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# 3. Install SPES and core dependencies
pip install -e '.[all]'

# 4. Install gRPC components
pip install grpcio==1.73.1 grpcio-tools==1.73.1 protobuf==6.31.0
```

To run benchmarks using the LM Evaluation Harness:
```bash
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install "lm_eval[hf]"
```

SPES utilizes tokenized numpy memmap files (.npy) for high-performance data loading.
Convert your .jsonl or .parquet files using the provided script:
```bash
python data_process_scripts/tokenize_data.py \
    --file_glob "/path/to/your/data/*.jsonl" \
    --tokenizer_name_or_path "Qwen/Qwen2.5-0.5B" \
    --output_prefix "/path/to/output/tokenized_" \
    --text_field "text" \
    --processes 8 \
    --batch_size 500 \
    --max_shard_bytes 4294967296 \
    --dtype "uint32"
```

Create a manifest file for the training configuration:

```bash
bash data_process_scripts/list_processed_files.sh /path/to/tokenized/data /path/to/output/file_list.txt
```

Point your YAML configuration file (in configs/) to file_list.txt.
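A tokenized shard produced this way is a flat stream of token IDs, so it can be inspected with a plain numpy memmap. The sketch below uses a tiny synthetic shard rather than real data; the helper names are illustrative, not part of SPES:

```python
import os
import tempfile

import numpy as np

def open_shard(path, dtype="uint32"):
    """Memory-map a tokenized shard; nothing is read until it is sliced."""
    return np.memmap(path, dtype=dtype, mode="r")

def get_window(tokens, start, seq_len):
    """Slice one fixed-length training window; only this slice hits disk."""
    return np.asarray(tokens[start:start + seq_len], dtype=np.int64)

# Demo with a small synthetic shard of 1000 sequential "token IDs":
path = os.path.join(tempfile.gettempdir(), "demo_shard.npy")
np.arange(1000, dtype=np.uint32).tofile(path)

tokens = open_shard(path)
window = get_window(tokens, start=128, seq_len=8)
```

Because the file is memory-mapped, many data-loader workers can share one shard without each paging the whole file into RAM.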
SPES uses a Client-Server architecture:
- Parameter Server: Manages expert synchronization.
- Training Clients: Independent nodes performing local training.
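Under this layout, the set of experts a node owns follows from its peer ID. The contiguous-block mapping below is a plausible sketch, not necessarily SPES's actual assignment logic:

```python
def local_expert_ids(peer_id, num_train_experts_per_node):
    """Contiguous block of global expert indices owned by one node."""
    start = peer_id * num_train_experts_per_node
    return list(range(start, start + num_train_experts_per_node))

# With num_peers=4 and 2 experts per node, the 8 experts split across
# nodes as [0,1], [2,3], [4,5], [6,7]:
assignments = {peer: local_expert_ids(peer, 2) for peer in range(4)}
```

Each node only allocates gradients and optimizer states for its own block, which is where the memory savings come from.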
Key SPES parameters in your YAML config:
```yaml
using_spes: true
spes_config:
  num_peers: 4                    # Total training nodes
  peer_id: 0                      # Current node ID (0-indexed)
  num_train_experts_per_node: 2   # Local experts per node
  sync_steps: 100                 # Sync frequency
  server_addr: 127.0.0.1:50051    # Parameter Server Address
```

1. Start Parameter Server
```bash
bash run_scripts/run_parameter_server.sh
```

2. Start Training Clients (On each node)

```bash
# Example: Launching on Node 1
bash run_scripts/run_single_node.sh 1

# Optional: Resume from checkpoint
bash run_scripts/run_single_node.sh 0 --resume
```

For SLURM or other schedulers where RANK, MASTER_ADDR, and NPROC_PER_NODE are set automatically:

```bash
bash run_scripts/run_cluster.sh
```

This script automatically handles server startup on Rank 0 and isolates DDP to the local node.
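On the client side, the `sync_steps` schedule reduces to a modulus check inside the training loop. The sketch below is a stand-in for SPES's actual gRPC client; the `push_experts`/`pull_experts` callback names are hypothetical:

```python
def train(num_steps, sync_steps, push_experts, pull_experts, train_step):
    """Local training loop that syncs only the local experts' parameters
    with the parameter server every `sync_steps` steps."""
    for step in range(1, num_steps + 1):
        train_step(step)           # normal local DDP training step
        if step % sync_steps == 0:
            push_experts(step)     # upload locally trained experts
            pull_experts(step)     # fetch and merge remote experts

# Count how often one node would sync over 1000 steps with sync_steps=100:
syncs = []
train(1000, 100, syncs.append, lambda s: None, lambda s: None)
```

Because only the local experts' parameters cross the network, and only every `sync_steps` steps, the scheme tolerates the low cross-node bandwidth that decentralized training assumes.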
Convert the sharded FSDP checkpoints to HuggingFace format:
```bash
# Syntax: <RUN_DIR> <SAVE_STEP> <MODEL_SIZE>
bash eval_scripts/convet_model_to_hf_unshard.sh output/spes_moe_3b_9b/node0 10000 A3B-9B
```

Evaluate using lm-evaluation-harness:

```bash
bash eval_scripts/eval_full.sh <MODEL_PATH> <MODEL_NAME>
```

This project stands on the shoulders of giants. We explicitly thank the following projects and teams:
- OLMo (Allen Institute for AI): Our codebase is built upon the excellent modeling, training, and inference code provided by the Ai2 team.
- MegaBlocks (Databricks): We utilize MegaBlocks for efficient "dropless" Mixture-of-Experts (MoE) training and sparse operations.
- LM Evaluation Harness (EleutherAI): Used for our few-shot evaluation framework and benchmarking.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.