Defragmentation Scheduling with Deep Reinforcement Learning in Shared GPU Clusters

DRR is a defragmentation scheduler for shared GPU clusters. It mitigates GPU fragmentation arising from GPU sharing, diverse jobs, and asynchronous lifecycles, improving resource utilization under dynamic scheduling.

Overview of DRR

Getting Started

Environment Version

  • Python 3.10

Install dependencies

pip install -r requirements.txt

Run

For a 64-node cluster simulation:

python simulator.py --num-node 64 --interarrival-time 8 --scheduler DRR \
                    --init_dim 3584 --action_space 64 --lr_actor 0.04 --lr_critic 0.02 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.04 \
                    --use_attn True \
                    --use_advantage_adjustment 0.6 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5

# Other baseline schedulers
python simulator.py --num-node 64 --interarrival-time 8 --scheduler ElasticFlow
python simulator.py --num-node 64 --interarrival-time 8 --scheduler "R&P"
python simulator.py --num-node 64 --interarrival-time 8 --scheduler FGD
python simulator.py --num-node 64 --interarrival-time 8 --scheduler Hops
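The baseline runs above differ only in the `--scheduler` value, so they can be generated in a loop. The following is a minimal sketch, not part of the repository: `baseline_command` and `BASELINES` are hypothetical names, and only the flags shown in the commands above are used. It also shows why `"R&P"` is quoted: `&` is a shell control operator, and `shlex.join` adds the quoting automatically.

```python
import shlex

# Baseline scheduler names exactly as they appear in the commands above.
BASELINES = ["ElasticFlow", "R&P", "FGD", "Hops"]


def baseline_command(scheduler, num_node=64, interarrival_time=8):
    """Return the simulator argv for one baseline run (hypothetical helper)."""
    return [
        "python", "simulator.py",
        "--num-node", str(num_node),
        "--interarrival-time", str(interarrival_time),
        "--scheduler", scheduler,
    ]


if __name__ == "__main__":
    for name in BASELINES:
        # shlex.join quotes any argument containing shell metacharacters,
        # so R&P is printed in quoted form and is safe to paste into a shell.
        print(shlex.join(baseline_command(name)))
```

Passing the returned list to `subprocess.run` instead of printing it would avoid the shell entirely, in which case no quoting is needed.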

For a 32-node cluster simulation:

python simulator.py --num-node 32 --interarrival-time 16 --scheduler DRR \
                    --init_dim 1792 --action_space 32 --lr_actor 0.03 --lr_critic 0.02 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.01 \
                    --use_attn True \
                    --use_advantage_adjustment 0.4 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5

For a 128-node cluster simulation:

python simulator.py --num-node 128 --interarrival-time 3.8 --scheduler DRR \
                    --init_dim 7168 --action_space 128 --lr_actor 0.06 --lr_critic 0.04 \
                    --use_imitation True --imitation_loss_weight 0.1 \
                    --use_dynamic_entropy True --beta0 0.06 \
                    --use_attn True \
                    --use_advantage_adjustment 0.1 \
                    --use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5
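Across the three DRR commands, `--action_space` equals the node count and `--init_dim` equals 56 times the node count (1792, 3584, 7168). The sketch below records the per-size settings and checks that pattern. The factor 56 is an observation from these three commands only; the README does not state how `--init_dim` is derived, so treat it as an assumption for any other cluster size. `DRR_CONFIGS` and `scaling_mismatches` are hypothetical names, not part of the repository.

```python
# DRR settings copied from the three commands above, keyed by node count.
DRR_CONFIGS = {
    32: {"interarrival_time": 16, "init_dim": 1792, "action_space": 32,
         "lr_actor": 0.03, "lr_critic": 0.02, "beta0": 0.01,
         "use_advantage_adjustment": 0.4},
    64: {"interarrival_time": 8, "init_dim": 3584, "action_space": 64,
         "lr_actor": 0.04, "lr_critic": 0.02, "beta0": 0.04,
         "use_advantage_adjustment": 0.6},
    128: {"interarrival_time": 3.8, "init_dim": 7168, "action_space": 128,
          "lr_actor": 0.06, "lr_critic": 0.04, "beta0": 0.06,
          "use_advantage_adjustment": 0.1},
}


def scaling_mismatches(configs, dim_per_node=56):
    """Return node counts whose dimensions break the observed pattern.

    dim_per_node=56 is inferred from the listed commands (assumption).
    """
    bad = []
    for num_node, cfg in configs.items():
        if cfg["action_space"] != num_node:
            bad.append(num_node)
        elif cfg["init_dim"] != dim_per_node * num_node:
            bad.append(num_node)
    return bad


if __name__ == "__main__":
    # An empty list means all three listed configurations follow the pattern.
    print(scaling_mismatches(DRR_CONFIGS))
```

The learning rates, `beta0`, and the advantage-adjustment value do not follow a simple rule across sizes, so they are kept as a lookup table rather than computed.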

Project Structure

Defrag
├── cluster.py                  # Cluster environment implementation
├── clusterdata                 # Cluster trace data and preprocessing scripts
│   ├── cluster-trace-gpu-v2020
│   │   └── trace.txt
│   ├── filtered_traces.csv
│   ├── mypreprocess.ipynb
│   ├── sampled_traces.csv
│   ├── share_0.2_traces.csv
│   └── share_0.6_traces.csv
├── imgs
│   └── overview.jpg
├── job.py                      # Job representation
├── policy                      # Scheduling policies
│   ├── __init__.py
│   ├── drr.py
│   ├── elasticflow.py
│   ├── fgd.py
│   ├── gpupacking.py
│   ├── hops.py
│   └── policy.py
├── README.md
├── requirements.txt
├── simulator.py                # Main simulation script
└── utils.py                    # Utility functions

Reference