This repository demonstrates how to run batch jobs on the LRZ AI Systems, with an add-on section for executing R batch jobs using custom container images.
To run batch jobs on the LRZ AI Systems, you need:
- A valid LRZ account with access to the LRZ AI Systems.
- A connection to the Munich Scientific Network (MWN).
If you don’t yet have access, refer to the official "Access and Getting Started" guide.
Recommended setup: Use VS Code for its SSH and Git integration. However, you may use any IDE or terminal-based workflow you prefer.
Log in to the LRZ AI Systems and enter your password when prompted:
- Via terminal:

  ```bash
  ssh login.ai.lrz.de -l <YOUR-LRZ-ACCOUNT>
  ```

- Via VS Code (recommended for ease of use): with the Remote - SSH extension, open the command palette and choose "Connect to Host..." → `login.ai.lrz.de`.
Optionally, clone this repository to follow along step by step:
```bash
git clone https://github.com/leofhp/lrzAIdemo.git
```

Alternatively, you can follow the instructions independently and adjust paths and filenames as needed.
A SLURM batch job is specified through a script like `demo_py.sbatch`, which might look as follows:

```bash
#!/bin/bash
#SBATCH -p lrz-cpu
#SBATCH --qos=cpu
#SBATCH --nodelist=cpu-002
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=2-00:00:00
#SBATCH --job-name=demo1
#SBATCH --output=lrzAIdemo/slurm/slurm-%j.out
#SBATCH --error=lrzAIdemo/slurm/slurm-%j.err
cd "$SLURM_SUBMIT_DIR"
python lrzAIdemo/demo_py.py
```
- `-p lrz-cpu`: Selects the CPU-only partition. For GPU jobs, see the available partitions in this guide.
- `--qos=cpu`: Specifies the Quality of Service level, which is required on LRZ. For CPU jobs, use `cpu`.
- `--nodelist`: Specifies a specific node (optional).
  - As of June 2025, the `lrz-cpu` partition includes:
    - `cpu-001` to `cpu-006`: Intel Xeon Gold 6148
    - `cpu-007`: Intel Xeon E7-4850
    - `cpu-008` to `cpu-012`: AMD EPYC 7642
  - To check node availability, you can either run `sinfo -Nel -p lrz-cpu` or use the included helper script `lrzAIdemo/lrz_cpu_availability.sh` for a structured overview of available cores and free memory per node (a sketch of a similar check appears after this list).
  - Avoid `cpu-007` for compute-intensive jobs -- it tends to be significantly slower.
- `--cpus-per-task=1`: Requests one CPU core. Increase this only if your script is parallelized.
- `--mem=8G`: Allocates 8 GB of RAM.
- `--time=2-00:00:00`: Sets the maximum runtime to 2 days. There's no harm in requesting the full 2-day limit.
- `--job-name=demo1`: Assigns a name to your job for easier tracking.
- `--output` and `--error`: Define separate log files for standard output and errors. Change these paths if you didn't clone the repository.
- `--mail-user=<YOUR-EMAIL@EXAMPLE.COM>`: Specifies an email address to receive notifications (optional).
- `--mail-type=BEGIN,END,FAIL`: Specifies the types of events for which email notifications are sent (optional, see the snippet after this list).
  - `BEGIN`: email when the job begins.
  - `END`: email when the job finishes successfully.
  - `FAIL`: email if the job fails.
  - `ALL`: email on all job state changes.
  - `TIME_LIMIT`: email when the job exceeds its time limit.
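If you want email notifications, the two optional directives are simply added to the `#SBATCH` header of the batch script; the address is a placeholder:

```bash
#SBATCH --mail-user=<YOUR-EMAIL@EXAMPLE.COM>
#SBATCH --mail-type=BEGIN,END,FAIL
```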
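The helper script `lrz_cpu_availability.sh` shipped with this repository is not reproduced here. As a rough illustration of how such an overview can be built on top of `sinfo`'s format strings, a minimal sketch might look like this (the column choice is an assumption, not the actual script):

```bash
#!/bin/bash
# Sketch: per-node CPU and memory availability in the lrz-cpu partition.
# %N = node name, %C = CPUs allocated/idle/other/total, %e = free memory (MB), %t = node state
printf "%-10s %-18s %-14s %s\n" "NODE" "CPUS(A/I/O/T)" "FREE_MEM(MB)" "STATE"
sinfo -N -h -p lrz-cpu -o "%N %C %e %t" | \
    awk '{ printf "%-10s %-18s %-14s %s\n", $1, $2, $3, $4 }'
```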
To submit the job:
```bash
sbatch lrzAIdemo/demo_py.sbatch
```

Monitor it with:

```bash
squeue -u $USER
```

This shows:
- Job ID
- Partition name
- Job name
- Your LRZ account name
- Job state (see list of states)
- Runtime and node assignment
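To keep this overview on screen and refreshing automatically, `watch` (available on most Linux login nodes) can wrap the call:

```bash
# Refresh the job overview every 30 seconds; stop with Ctrl+C
watch -n 30 squeue -u $USER
```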
If you used the example `demo_py.py` script, the output file (e.g. `slurm/slurm-123456.out`) will show:
- Time remaining
- Current memory usage
- Iteration info
You can monitor the output live using:
```bash
tail -f lrzAIdemo/slurm/slurm-<JOBID>.out
```

Unlike Python, R is not pre-installed on the LRZ AI Systems. However, you can easily run R scripts using containers based on images from the Rocker project. This section walks through:
- Creating a container with R and required packages.
- Running an R batch job using that container.
The container typically only needs to be set up once for a given project and can be reused. For background and advanced usage, see the official guide.
Create a folder to store your container images:
```bash
mkdir -p ~/lrzAIdemo/containers
cd ~/lrzAIdemo/containers
```

Start an interactive session on a CPU node:

```bash
srun -p lrz-cpu -q cpu --mem=32G --pty bash
```

Import a Rocker container image. The `ml-verse` variant includes common data science packages and is recommended. Note that these images are large (> 6 GB) and importing them will take several minutes:

```bash
enroot import docker://rocker/ml-verse:latest
```

Create and start the container:
```bash
enroot create --name r-custom rocker+ml-verse+latest.sqsh
enroot start r-custom bash
```

Start R inside the container and install any additional packages:
```r
R
> install.packages(c("CICI", "this.path"))  # Replace with your actual requirements
> q()
```
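If you prefer a non-interactive setup, the same installation can usually be done in one line from the shell inside the container; the package list is just the example from above:

```bash
# Non-interactive alternative, run from the container's shell
Rscript -e 'install.packages(c("CICI", "this.path"), repos = "https://cloud.r-project.org")'
```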
Exit the container and export the modified image, which will take several minutes:

```bash
exit
enroot export --output r-custom-final.sqsh r-custom
```

This will save the finalized container as `r-custom-final.sqsh` in your containers folder.
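Before leaving the interactive session, it can be worth confirming that the export produced a file of plausible size (several GB):

```bash
ls -lh ~/lrzAIdemo/containers/r-custom-final.sqsh
```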
Leave the interactive session and return to your home directory:
```bash
exit
cd
```

You can now run R scripts using the customized container. In your batch script (e.g. `demo_r.sbatch`), specify the image using `--container-image`:

```bash
#!/bin/bash
#SBATCH -p lrz-cpu
#SBATCH --qos=cpu
#SBATCH --nodelist=cpu-003
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=2-00:00:00
#SBATCH --job-name=demo2
#SBATCH --output=lrzAIdemo/slurm/slurm-%j.out
#SBATCH --error=lrzAIdemo/slurm/slurm-%j.err
#SBATCH --container-image=<YOUR-ABSOLUTE-PATH>/lrzAIdemo/containers/r-custom-final.sqsh
cd "$SLURM_SUBMIT_DIR"
Rscript lrzAIdemo/demo_r.R
```

This script will:
- Run `demo_r.R` using 4 CPU cores.
- Log output and error messages to the `slurm` directory.
- Write results to `demo_r_results/results.csv`.
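Note that `--container-image` expects an absolute path. If you are unsure what to substitute for `<YOUR-ABSOLUTE-PATH>`, the login node can resolve it for you:

```bash
# Prints the absolute path of the exported image
realpath ~/lrzAIdemo/containers/r-custom-final.sqsh
```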
Submit the job as usual:
```bash
sbatch lrzAIdemo/demo_r.sbatch
```

Monitor it using:

```bash
squeue -u $USER
```
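Once the job has finished, a quick look at the results file confirms that everything worked; the path below assumes the example script writes inside the repository folder:

```bash
# Adjust the path if your script writes its results elsewhere
head lrzAIdemo/demo_r_results/results.csv
```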