Demo: LRZ AI Systems Batch Jobs

This repository demonstrates how to run batch jobs on the LRZ AI Systems, with an add-on section for executing R batch jobs using custom container images.

Getting started

Prerequisites

To run batch jobs on the LRZ AI Systems, you need:

  • A valid LRZ account with access to the LRZ AI Systems.
  • A connection to the Munich Scientific Network (MWN).

If you don’t yet have access, refer to the official "Access and Getting Started" guide.

Recommended setup: Use VS Code for its SSH and Git integration. However, you may use any IDE or terminal-based workflow you prefer.

Connecting to LRZ

Log in to the LRZ AI Systems and enter your password when prompted:

  • Via terminal:

    ssh login.ai.lrz.de -l <YOUR-LRZ-ACCOUNT>
  • Via VS Code (recommended for ease of use):
    With the Remote - SSH extension, open the Command Palette and choose
    "Remote-SSH: Connect to Host..." → login.ai.lrz.de

Optionally, clone this repository to follow along step by step:

git clone https://github.com/leofhp/lrzAIdemo.git

Alternatively, you can follow the instructions independently and adjust paths and filenames as needed.


Running a Batch Job

A SLURM batch job is specified through a script like demo_py.sbatch, which might look as follows:

#!/bin/bash
#SBATCH -p lrz-cpu
#SBATCH --qos=cpu
#SBATCH --nodelist=cpu-002
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=2-00:00:00
#SBATCH --job-name=demo1
#SBATCH --output=lrzAIdemo/slurm/slurm-%j.out
#SBATCH --error=lrzAIdemo/slurm/slurm-%j.err

cd "$SLURM_SUBMIT_DIR"
python lrzAIdemo/demo_py.py

Explanation of Key Directives

  • -p lrz-cpu: Selects the CPU-only partition. For GPU jobs, see the available partitions in this guide.

  • --qos=cpu: Specifies the Quality of Service level, which is required on LRZ. For CPU jobs, use cpu.

  • --nodelist: Pins the job to a particular node (optional).

    • As of June 2025, the lrz-cpu partition includes:
      • cpu-001 -- cpu-006: Intel Xeon Gold 6148
      • cpu-007: Intel Xeon E7-4850
      • cpu-008 -- cpu-012: AMD EPYC 7642
    • To check node availability, you can either run:

      sinfo -Nel -p lrz-cpu

      or use the included helper script lrz_cpu_availability.sh for a structured overview of available cores and free memory per node:

      lrzAIdemo/lrz_cpu_availability.sh
    • Avoid cpu-007 for compute-intensive jobs -- it tends to be significantly slower.
  • --cpus-per-task=1: Requests one CPU core. Increase this only if your script is parallelized.

  • --mem=8G: Allocates 8 GB of RAM.

  • --time=2-00:00:00: Sets the maximum runtime to 2 days. There's no harm in requesting the full 2-day limit.

  • --job-name=demo1: Assigns a name to your job for easier tracking.

  • --output and --error: Define separate log files for standard output and errors. Change these paths if you didn't clone the repository.

  • --mail-user=<YOUR-EMAIL@EXAMPLE.COM>: Specifies an email address to receive notifications (optional; see the example after this list).

  • --mail-type=BEGIN,END,FAIL: Specifies the types of events for which email notifications are sent (optional).

    • BEGIN: email when the job begins.
    • END: email when the job finishes successfully.
    • FAIL: email if the job fails.
    • ALL: email on all job state changes.
    • TIME_LIMIT: email when the job reaches its time limit.
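
For example, to be notified when a job starts, finishes, or fails, add the following two lines to the #SBATCH header of your script:

#SBATCH --mail-user=<YOUR-EMAIL@EXAMPLE.COM>
#SBATCH --mail-type=BEGIN,END,FAIL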

Submitting the job

To submit the job:

sbatch lrzAIdemo/demo_py.sbatch
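
sbatch confirms the submission and prints the assigned job ID, which also appears in the log filenames defined above:

Submitted batch job 123456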

Monitor it with:

squeue -u $USER

This shows:

  • Job ID
  • Partition name
  • Job name
  • Your LRZ account name
  • Job state (see list of states)
  • Runtime and node assignment
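
If a job was submitted by mistake or is no longer needed, cancel it by its ID (scancel -u $USER cancels all of your jobs at once):

scancel <JOBID>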

Viewing Logs

If you used the example demo_py.py script, the output file (e.g. lrzAIdemo/slurm/slurm-123456.out) will show:

  • Time remaining
  • Current memory usage
  • Iteration info

You can monitor the output live using:

tail -f lrzAIdemo/slurm/slurm-<JOBID>.out
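
The error log defined via --error can be followed the same way:

tail -f lrzAIdemo/slurm/slurm-<JOBID>.err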

Running R Batch Jobs Using Containers

Unlike Python, R is not pre-installed on the LRZ AI Systems. However, you can easily run R scripts using containers based on images from the Rocker project. This section walks through:

  • Creating a container with R and required packages.
  • Running an R batch job using that container.

The container typically only needs to be set up once for a given project and can be reused. For background and advanced usage, see the official guide.

Step 1: Create a Custom Container with R

Create a folder to store your container images:

mkdir -p ~/lrzAIdemo/containers
cd ~/lrzAIdemo/containers

Start an interactive session on a CPU node:

srun -p lrz-cpu -q cpu --mem=32G --pty bash

Import a Rocker container image. The ml-verse variant includes common data science packages and is recommended. Note that these images are large (> 6 GB) and importing them will take several minutes:

enroot import docker://rocker/ml-verse:latest

Create and start the container:

enroot create --name r-custom rocker+ml-verse+latest.sqsh
enroot start r-custom bash

Start R inside the container and install any additional packages:

R
> install.packages(c("CICI", "this.path")) # Replace with your actual requirements
> q()
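
Alternatively, the same installation works non-interactively from the container's shell; passing repos explicitly avoids the interactive mirror prompt:

Rscript -e 'install.packages(c("CICI", "this.path"), repos = "https://cloud.r-project.org")'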

Exit the container and export the modified image, which will take several minutes:

exit
enroot export --output r-custom-final.sqsh r-custom

This will save the finalized container as r-custom-final.sqsh in your containers folder.
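
Before using the image in batch jobs, a quick sanity check that the added packages load is worthwhile (reusing the r-custom container created above):

enroot start r-custom Rscript -e 'library(CICI); library(this.path)'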

Leave the interactive session and return to your home directory:

exit
cd

Step 2: Run an R Batch Job with Your Container

You can now run R scripts using the customized container. In your batch script (e.g. demo_r.sbatch), specify the image using --container-image:

#!/bin/bash
#SBATCH -p lrz-cpu
#SBATCH --qos=cpu
#SBATCH --nodelist=cpu-003
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=2-00:00:00
#SBATCH --job-name=demo2
#SBATCH --output=lrzAIdemo/slurm/slurm-%j.out
#SBATCH --error=lrzAIdemo/slurm/slurm-%j.err
#SBATCH --container-image=<YOUR-ABSOLUTE-PATH>/lrzAIdemo/containers/r-custom-final.sqsh

cd "$SLURM_SUBMIT_DIR"
Rscript lrzAIdemo/demo_r.R

This script will:

  • Run demo_r.R using 4 CPU cores.
  • Log output and error messages to the slurm directory.
  • Write results to demo_r_results/results.csv.
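
Note that R does not automatically restrict itself to the allocated cores: parallel::detectCores() reports all cores of the node, not your share. A common pattern is to size worker pools from Sys.getenv("SLURM_CPUS_PER_TASK"), which SLURM sets to the value requested via --cpus-per-task.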

Submit the job as usual:

sbatch lrzAIdemo/demo_r.sbatch

Monitor it using:

squeue -u $USER
