Advanced Topics

Reusable squashfs files

For container images frequently used across many Slurm jobs, it might be beneficial to save a squashfs file on a shared filesystem (Lustre, NFS, etc.) and then use --container-image to point to this file in all your jobs:

$ srun --container-save=/lustre/containers/pytorch.sqsh --container-image=nvcr.io/nvidia/pytorch:25.09-py3 true
$ srun --container-image=/lustre/containers/pytorch.sqsh bash -c 'echo ${NVIDIA_PYTORCH_VERSION}'
25.09

The efficiency of this approach compared to always pulling from the registry depends on many variables:

  • The speed of your cluster's shared filesystem.
  • The speed of your connection to the registry.
  • Whether container layers are already cached locally by enroot (in ENROOT_CACHE_PATH).

You can also transfer this squashfs file to a different cluster to reuse it.
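
For example, a batch script can then point at the saved file on any cluster that mounts the filesystem; the dataset mount and train.py below are placeholders for your own paths:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Import the container from the shared squashfs file instead of the registry.
srun --container-image=/lustre/containers/pytorch.sqsh \
     --container-mounts=/lustre/datasets:/datasets \
     python train.py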

Compressing squashfs files

When creating reusable squashfs files, the compression settings can significantly impact both file size and import performance.

For temporary squashfs images, for example when the file is not meant to be reused with enroot load, disabling compression is recommended; this is generally set as the cluster default in /etc/enroot/enroot.conf:

ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates
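
To see which options your cluster currently uses, you can inspect the enroot configuration directly:

$ grep ENROOT_SQUASH_OPTIONS /etc/enroot/enroot.conf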

For images stored on shared filesystems that will be reused many times, it is recommended to use compression:

$ ENROOT_SQUASH_OPTIONS="-comp zstd -Xcompression-level 3 -b 1M" \
  srun --container-save=/lustre/containers/compressed.sqsh --container-image=nvcr.io/nvidia/pytorch:25.09-py3 true

Size Comparison:

$ ls -lh /lustre/containers/
-rw-r--r-- 1 user group  7.1G Oct 15 10:37 compressed.sqsh
-rw-r--r-- 1 user group   18G Oct 15 10:35 uncompressed.sqsh

Despite the CPU cycles needed for decompression, compressed images are often faster to import: the bottleneck is usually I/O rather than CPU, and a smaller image simply means less data to read from the filesystem.
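
If in doubt, you can time both variants on your own cluster; the file names below match the example above:

$ time srun --container-image=/lustre/containers/uncompressed.sqsh true
$ time srun --container-image=/lustre/containers/compressed.sqsh true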

Lustre tuning

If you store squashfs files on a Lustre filesystem, it is often useful to configure Lustre File Striping on the target folder to increase performance for reads and writes.

For example, to stripe across all available OSTs, run this command before populating the folder with squashfs files:

$ lfs setstripe -c -1 /lustre/containers
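
You can then verify the layout of the folder with lfs getstripe:

$ lfs getstripe /lustre/containers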

Synchronizing application startup for multi-node jobs

For large multi-node jobs, the application will likely start executing at a different time on each node, as it depends on the state of the local enroot cache. This can cause issues for applications that try to establish connections across all ranks and only retry for a short period of time.

For this case, it is recommended to pre-create the container filesystem on each node using a separate job step, before running the application:

$ srun --container-image=nvcr.io/nvidia/pytorch:25.09-py3 --container-name=pytorch true
$ srun --container-name=pytorch python train.py

This ensures all ranks start at roughly the same time.
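
In a batch script, the same pattern might look like the following sketch; the job geometry and train.py are placeholders:

#!/bin/bash
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=8

# Step 1: import the image and create the named container on every node.
srun --container-image=nvcr.io/nvidia/pytorch:25.09-py3 --container-name=pytorch true
# Step 2: all ranks start from the already-created container.
srun --container-name=pytorch python train.py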

PyTorch Environment Variable Remapping

Pyxis automatically performs special handling for PyTorch containers:

Detection: pyxis checks for the PYTORCH_VERSION environment variable in the container

Remapping:

  • SLURM_PROCID → RANK (global rank)
  • SLURM_LOCALID → LOCAL_RANK (local rank on the node)

Combined with the enroot hook:

  • MASTER_ADDR - Set by 50-slurm-pytorch.sh hook
  • MASTER_PORT - Set by 50-slurm-pytorch.sh hook

This allows PyTorch distributed training to work seamlessly with Slurm without any user configuration.
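
You can inspect the remapped variables without launching a real training run; note that MASTER_ADDR and MASTER_PORT are only populated if the 50-slurm-pytorch.sh hook is enabled on your cluster:

$ srun -N2 --ntasks-per-node=2 --container-image=nvcr.io/nvidia/pytorch:25.09-py3 \
  bash -c 'echo "RANK=${RANK} LOCAL_RANK=${LOCAL_RANK} MASTER=${MASTER_ADDR}:${MASTER_PORT}"'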

Multi-Task Container Coordination

When running multi-task jobs (e.g., srun -n 8):

  1. The first task creates all namespaces and starts the container (as a result, the container image is imported only once per node).
  2. Subsequent tasks join the existing namespaces using setns(2).
  3. Namespace file descriptors are preserved in /proc/PID/ns/.
  4. All tasks share the same filesystem view and mounts.

This is different from running srun -n 8 enroot start ubuntu, which would create 8 separate container instances on one node.
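
One way to observe the shared namespaces is to print each task's mount namespace identifier; with pyxis, all tasks on a node should report the same value:

$ srun -N1 -n4 --container-image=ubuntu:24.04 readlink /proc/self/ns/mnt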

Seccomp Filter

Like enroot, pyxis uses a seccomp filter to enable unprivileged package installation inside containers. This allows users to run commands like apt-get install or dnf install without requiring subordinate user and group IDs (subuid/subgid) or additional privileges.

When pyxis starts a container with user namespace remapping enabled, it installs a seccomp filter that intercepts some system calls and returns success without actually performing the operation.

Intercepted System Calls:

  • File ownership: chown, lchown, fchown, fchownat
  • User/group IDs: setuid, setgid, setreuid, setregid, setresuid, setresgid
  • Group list: setgroups
  • Filesystem IDs: setfsuid, setfsgid

When any of these syscalls are invoked inside the container, the seccomp filter:

  1. Intercepts the call before it reaches the kernel
  2. Returns success (0) to the calling program
  3. Does not actually perform the operation

This approach allows package managers to complete their installation routines, which typically try to chown files and switch user or group IDs.
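
As a small illustration (assuming root remapping is enabled), a chown inside the container reports success while the file ownership remains unchanged:

$ srun --container-image=ubuntu:24.04 --container-remap-root \
  bash -c 'touch /tmp/f && chown 1234:1234 /tmp/f && echo "chown succeeded" && ls -ln /tmp/f'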

Linux Namespaces Used

Pyxis creates and manages the following namespaces (via enroot):

User Namespace

  • Remaps the user to appear as UID 0 (root) inside the container
  • Controlled with --container-remap-root / --no-container-remap-root
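
For example, with remapping enabled the current user is reported as UID 0 inside the container:

$ srun --container-image=ubuntu:24.04 --container-remap-root id -u
0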

Mount Namespace

  • Isolates the filesystem view: the container sees its own root filesystem and mounts
  • All tasks in the same job on a node share the same mount namespace

Cgroup Namespace

  • Container sees its own cgroup tree

Linux Namespaces NOT Used

Pyxis deliberately does not leverage these namespaces:

Network Namespace

  • Containers share the host's network stack, as applications need direct access to RDMA devices.
  • No port isolation; privileged ports are still restricted.
  • No port mapping (docker run -p) is needed or supported.
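
Because there is no network namespace, listing the interfaces from inside the container shows the host's interfaces:

$ srun --container-image=ubuntu:24.04 cat /proc/net/dev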

PID Namespace

  • Containers share the host's PID namespace
  • Removes the need to handle the special case of PID 1 (init)
  • Host processes are visible inside the container

IPC Namespace

  • Containers share the host's IPC namespace