Advanced Topics
For container images frequently used across many Slurm jobs, it might
be beneficial to save a squashfs file on a shared filesystem (Lustre,
NFS, etc.) and then use --container-image to point to this file in
all your jobs:
$ srun --container-save=/lustre/containers/pytorch.sqsh --container-image=nvcr.io/nvidia/pytorch:25.09-py3 true
$ srun --container-image=/lustre/containers/pytorch.sqsh bash -c 'echo ${NVIDIA_PYTORCH_VERSION}'
25.09
The efficiency of this approach compared to always pulling from the registry depends on many variables:
- The speed of the shared filesystem on your cluster.
- The speed of your connection to the registry.
- Whether container layers are already cached locally by enroot (in
ENROOT_CACHE_PATH).
You can also transfer this squashfs file to a different cluster to reuse it.
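For example, using rsync (the remote hostname and destination path here are placeholders):
$ rsync -avP /lustre/containers/pytorch.sqsh other-cluster:/lustre/containers/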
When creating reusable squashfs files, the compression settings can significantly impact both file size and import performance.
For temporary squashfs images (for example, when not using enroot load), disabling compression is the recommended approach; it is generally set as the cluster default in /etc/enroot/enroot.conf:
ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates
For images stored on shared filesystems that will be reused many times, it is recommended to use compression:
$ ENROOT_SQUASH_OPTIONS="-comp zstd -Xcompression-level 3 -b 1M" \
srun --container-save=/lustre/containers/compressed.sqsh --container-image=nvcr.io/nvidia/pytorch:25.09-py3 true
Size Comparison:
$ ls -lh /lustre/containers/
-rw-r--r-- 1 user group 7.1G Oct 15 10:37 compressed.sqsh
-rw-r--r-- 1 user group 18G Oct 15 10:35 uncompressed.sqsh
Despite requiring CPU cycles to decompress, compressed images are often faster to import because the bottleneck is I/O, not the CPU: a smaller image simply means less I/O to perform.
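To check which variant imports faster on a given cluster, you can time a trivial job step against each file (reusing the files from the listing above; timings will also include scheduler overhead):
$ time srun --container-image=/lustre/containers/compressed.sqsh true
$ time srun --container-image=/lustre/containers/uncompressed.sqsh true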
If you store squashfs files on a Lustre filesystem, it is often useful to configure Lustre File Striping on the target folder to increase performance for reads and writes.
For example, to stripe across all available OSTs, run this command before populating the folder with squashfs files:
$ lfs setstripe -c -1 /lustre/containers
For large multi-node jobs, the application will likely start executing at a different time on each node, as it depends on the state of the local enroot cache. This can cause issues for applications that try to establish connections across all ranks and only retry for a short period of time.
For this case, it is recommended to pre-create the container filesystem on each node using a separate job step, before running the application:
$ srun --container-image=nvcr.io/nvidia/pytorch:25.09-py3 --container-name=pytorch true
$ srun --container-name=pytorch python train.py
This ensures all ranks start at roughly the same time.
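In a batch script, the same two-step pattern looks like this (a minimal sketch; the resource flags and script name are placeholders):
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8

# Step 1: import the image and create the container filesystem once per node
srun --ntasks-per-node=1 --container-image=nvcr.io/nvidia/pytorch:25.09-py3 --container-name=pytorch true
# Step 2: run the application; all ranks join the pre-created container
srun --container-name=pytorch python train.py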
Pyxis automatically performs special handling for PyTorch containers:
- Detection: checks for the PYTORCH_VERSION environment variable in the container
- Remapping:
  - SLURM_PROCID → RANK (global rank)
  - SLURM_LOCALID → LOCAL_RANK (local rank on the node)
- Combined with the enroot hook:
  - MASTER_ADDR - set by the 50-slurm-pytorch.sh hook
  - MASTER_PORT - set by the 50-slurm-pytorch.sh hook
This allows PyTorch distributed training to work seamlessly with Slurm without any user configuration.
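For example, the remapped variables can be inspected from inside the container (node and task counts here are arbitrary):
$ srun -N2 --ntasks-per-node=2 --container-image=nvcr.io/nvidia/pytorch:25.09-py3 bash -c 'echo "RANK=${RANK} LOCAL_RANK=${LOCAL_RANK} MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT}"'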
When running multi-task jobs (e.g., srun -n 8):
- The first task creates all namespaces and starts the container (as a result, the container image is imported only once per node).
- Subsequent tasks join the existing namespaces using setns(2)
- Namespace file descriptors are preserved in /proc/PID/ns/
- All tasks share the same filesystem view and mounts
This is different from running srun -n 8 enroot start ubuntu, which
would create 8 separate container instances on one node.
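One way to observe this sharing is to print each task's mount namespace identifier; all tasks on a node should report the same value (the image here is arbitrary):
$ srun -N1 -n4 --container-image=ubuntu readlink /proc/self/ns/mnt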
Like enroot, pyxis uses a seccomp filter to enable unprivileged
package installation inside containers. This allows users to run
commands like apt-get install or dnf install without requiring
subordinate user and group ids (subuid/subgid) or additional
privileges.
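For example, an unprivileged user can install a package inside the container (assuming root remapping is enabled; the image and package are arbitrary):
$ srun --container-image=ubuntu:24.04 bash -c 'apt-get update && apt-get install -y curl'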
When pyxis starts a container with user namespace remapping enabled, it installs a seccomp filter that intercepts some system calls and returns success without actually performing the operation.
Intercepted System Calls:
- File ownership: chown, lchown, fchown, fchownat
- User/group IDs: setuid, setgid, setreuid, setregid, setresuid, setresgid
- Group list: setgroups
- Filesystem IDs: setfsuid, setfsgid
When any of these syscalls are invoked inside the container, the seccomp filter:
- Intercepts the call before it reaches the kernel
- Returns success (0) to the calling program
- Does not actually perform the operation
This approach allows package managers to complete their installation routines, which typically attempt to chown files and call setuid.
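You can observe this no-op behavior directly: chown reports success, but the file ownership does not actually change (the image is arbitrary):
$ srun --container-image=ubuntu bash -c 'touch /tmp/f && chown nobody:nogroup /tmp/f && ls -l /tmp/f'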
Pyxis creates and manages the following namespaces (via enroot):
- User namespace:
  - Remaps the user to appear as UID 0 (root) inside the container (see the example after this list)
  - Controlled with --container-remap-root / --no-container-remap-root
- Mount namespace:
  - Isolates the filesystem view: the container sees its own root filesystem and mounts
  - All tasks in the same job on a node share the same mount namespace
- Cgroup namespace:
  - The container sees its own cgroup tree
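For example, with root remapping enabled, an unprivileged user appears as root inside the container:
$ srun --container-image=ubuntu --container-remap-root id -u
0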
Pyxis deliberately does not leverage these namespaces:
- Network namespace:
  - Containers share the host's network stack, as applications need direct access to RDMA devices
  - No port isolation: privileged ports are still restricted
  - No port mapping (docker run -p) is needed or supported
- PID namespace:
  - Containers share the host's PID namespace
  - Removes the need to handle the special case of PID 1 (init)
  - Host processes are visible inside the container (see the example after this list)
- IPC namespace:
  - Containers share the host's IPC namespace
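For example, counting the PID entries under /proc from inside the container shows the full host process count, not just the container's own processes (the image is arbitrary):
$ srun --container-image=ubuntu bash -c 'ls -d /proc/[0-9]* | wc -l'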