This is an example repository you can use to learn Purdue's HPC resources.
SLURM is a cluster management and job-scheduling utility used to submit jobs to the cluster. A job submission includes specifications about:
- The actual code to run.
- Constraints for:
  a. Walltime: the maximum runtime of the job.
  b. Resources: GPU (or, generally, any accelerator) count/type.
  c. Account: the allocation against which the job's usage is charged.
  d. Node type: the class of hardware the job should run on.
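These constraints map directly onto `#SBATCH` directives in a job script, which we cover in detail later; as a preview, the concrete values below are illustrative:

```bash
#SBATCH --time=04:00:00       # walltime: maximum runtime
#SBATCH --gres=gpu:1          # resources: one GPU
#SBATCH --account=<account>   # account: allocation to charge
#SBATCH --partition=gpu       # node type: queue/partition to run on
```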
Once we provide the above specifications, our job is added to a queue. It is then run in weighted-priority order, calculated from your past usage, requested resources, and walltime.
The goal of this workshop is to understand and utilize HPC via SLURM.
Start by logging in to `<username>@scholar.rcac.purdue.edu` / `<username>@queues.cs.purdue.edu` -- you can log in with your career account password and Duo 2FA.
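For example, from your local terminal:

```bash
ssh <username>@scholar.rcac.purdue.edu
```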
You should now have access to a shell that you can use to run commands remotely on the Linux system.
Protips:
- Append `ServerAliveInterval 60` to `~/.ssh/config` to prevent frozen sessions after long idle periods (see the sketch after this list).
- `curl https://raw.githubusercontent.com/dylanaraps/neofetch/master/neofetch | bash` runs neofetch!
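A minimal `~/.ssh/config` entry might look like this; the `scholar` host alias is illustrative, and `ServerAliveInterval` is the part that keeps idle sessions from freezing:

```
Host scholar
    HostName scholar.rcac.purdue.edu
    User <username>
    ServerAliveInterval 60   # send a keepalive probe every 60 seconds
```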
Using 2FA every time gets annoying, especially if you log in and out often. Instead, create an SSH key that you can use to securely authenticate yourself to the various clusters.
Once you create a keypair, append the public key to `~/.ssh/authorized_keys`. If you add your SSH keys to GitHub, one easy way of doing this is to run `curl https://github.com/<username>.keys >> ~/.ssh/authorized_keys` on the cluster. From your local session, you can also run `ssh-copy-id <username>@<cluster-address>` and it should automatically update `~/.ssh/authorized_keys` for you.
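Putting it together, a typical flow (assuming you don't already have a keypair) looks like:

```bash
# on your local machine: generate an ed25519 keypair
ssh-keygen -t ed25519

# push the public key into the cluster's ~/.ssh/authorized_keys
ssh-copy-id <username>@scholar.rcac.purdue.edu
```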
Next, we want to configure an interactive job to explore the actual runtime environment and test the workflow before running it non-interactively.
Scholar does this better. There are two ways to access an interactive runtime:

- GPU endpoint: log in to `gpu.scholar.rcac.purdue.edu`.
- Run `sinteractive -A <account>`.
You can find the list of available queues on the CS servers homepage.
Once you decide on a queue, run:

```
srun --partition=<queue-name> --gres=gpu:<n> --pty /bin/bash -i
```

`--pty /bin/bash -i` runs bash interactively; it is effectively what `sinteractive` does.
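For instance, to request a single GPU on the `gpu` queue (a queue name that appears in the `qlist` output later in this walkthrough):

```bash
srun --partition=gpu --gres=gpu:1 --pty /bin/bash -i
```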
Once we have the compute, our next objective is to set up the runtime dependencies. Let's start by checking whether we have Python:
```
jsetpal@scholar-fe06:~ $ python --version
Python 2.7.5
```

```
mc17 151 $ python --version
-bash: python: command not found
```

This is because dependencies are configured using `module`:
```
module load cuda anaconda
module load cudnn  # only on scholar
```

This should be all that you need for the current experiment. You can:
- Find the loaded modules with `module list`.
- Reset modules using `module purge`.
- List all available modules using `module av`.
- Obtain an extended description using `module spider <name/version>`.
You can install an updated Python instance using conda:

```
conda create -n <name> python=<version>
conda activate <name>
```
```
(/home/jsetpal/.conda/envs/cent7/2020.11-py38/lint) jsetpal@scholar-fe06:~ $ python --version
Python 3.11.5
```

Optionally, install uv to drastically reduce package installation time.
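One way to get uv, assuming `pip` is available in your active environment:

```bash
pip install uv
```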
Finally, we can clone the repository and install package dependencies:
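First, a sketch of the clone step (the URL and directory below are placeholders for this repository's details):

```bash
git clone <repository-url>
cd <repository-name>
```

Then, install the dependencies: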
```
uv pip install -r requirements.txt  # if you installed uv
pip install -r requirements.txt     # if you didn't install uv
```

We are ready to begin the training run!
The final step is to create a bash script that:
- Lists SLURM constraints: add `#SBATCH --<constraint>=<value>` lines.
- Loads required modules and dependencies.
- Carries out the training run.
You can find an example at `scripts/sbatch.sh`.
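For reference, here is a minimal sketch of what such a script might contain; the account, partition, job name, and training entry point are placeholders, so defer to `scripts/sbatch.sh` for the actual values:

```bash
#!/bin/bash
#SBATCH --account=<account>    # placeholder: account to charge
#SBATCH --partition=gpu        # placeholder: queue to submit to
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=04:00:00        # walltime limit
#SBATCH --job-name=train       # placeholder job name

# load required modules and dependencies
module load cuda anaconda
conda activate <name>

# carry out the training run (placeholder entry point)
python train.py
```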
We can run this script using `sbatch scripts/sbatch.sh`!
You can check `squeue` for the status of your job:
```
(/home/jsetpal/.conda/envs/cent7/2020.11-py38/lint) jsetpal@scholar-fe06:~ $ squeue
 JOBID   USER     ACCOUNT  NAME       NODES  CPUS  TIME_LIMIT  ST  TIME
 000001  jsetpal  gpu      sbatch.sh  1      1     4:00:00     R   0:10
 000002  jsetpal  gpu      sbatch.sh  1      1     4:00:00     PD  0:00
```

Here, the first job is running (`R`), while the second is pending (`PD`), awaiting free resources.
On Scholar, you can check the available resources per account using `qlist`:
```
(/home/jsetpal/.conda/envs/cent7/2020.11-py38/lint) jsetpal@scholar-fe06:~/git/lint $ qlist

                    Current Number of Cores                  Node
Account        Total    Queue      Run     Free  Max Walltime  Type
============== =================================  ============ ======
debug             32        0        0       32      00:30:00  A,B,G,H
gpu              196        0       16      180      04:00:00  G,H
gpu-mig          128        0        0      128      04:00:00  H
long             128        0        0      128    3-00:00:00  A
scholar          576        2      208      368      04:00:00  A,B
```

You can use `srun` to log into your currently running jobs as well:
```
srun --jobid=000001 --pty /usr/bin/bash -i
```

You can cancel a queued or running job using `scancel <jobid>`.
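For example, to cancel the pending job from the `squeue` output above:

```bash
scancel 000002
```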
Crucially, this is a non-interactive training setup. This means you can update constants and rerun training without worrying about keeping your local machine on during the runs.
While it is possible to port-forward Jupyter Notebooks and use them for training, this isn't recommended practice because it lacks reproducibility.
Instead, it is recommended to use module-driven development, using:
- Cookiecutter Data Science for package-based structure for machine learning code.
- DVC for data versioning and pipelining.
- MLflow for experiment tracking.