GPU Workloads

The GPU cluster is the primary compute surface of PRESCIENT. Users SSH to a host, allocate GPUs through SLURM, and run their workloads.

Hosts and partitions

Host                          | Role    | Notes
core-hgx.prescient.internal   | Compute | H100 x 8
edge-lmd-1.prescient.internal | Compute | L40s x 8

You pick a host explicitly via SSH, and your jobs run only on that host.

SLURM basics

Interactive allocation

The most common entry point — get a shell with GPUs attached:

srun --gres=gpu:1 --time=2:00:00 --pty bash

Flag                                 | Meaning
--gres=gpu:N                         | Allocate N GPUs (subject to your account limits).
--time=HH:MM:SS or --time=D-HH:MM:SS | Wall-clock limit.
--pty bash                           | Start an interactive shell.
--qos=long                           | Override the default QoS; only some QoS values are allowed for your account.

When you exit the shell, the allocation ends. To survive a disconnect, run tmux before srun, or use salloc + srun --jobid=... (see below).
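
The two --time formats in the table can be generated from a plain hour count. The helper below is a minimal sketch; slurm_time is a hypothetical name, not one of the cluster's helper commands, and it handles whole hours only.

```shell
# Hypothetical helper: convert whole hours into a Slurm --time value.
# Prints HH:MM:SS for less than a day, D-HH:MM:SS otherwise.
slurm_time() {
  local hours="$1"
  local days=$((hours / 24))
  local rem=$((hours % 24))
  if [ "$days" -gt 0 ]; then
    printf '%d-%02d:00:00\n' "$days" "$rem"
  else
    printf '%02d:00:00\n' "$rem"
  fi
}
```

For example, srun --gres=gpu:1 --time="$(slurm_time 36)" --pty bash would request a limit of one day and twelve hours, still subject to your QoS.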

Batch submission

sbatch my-job.sbatch

where my-job.sbatch contains:

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:2
#SBATCH --time=12:00:00
#SBATCH --output=logs/%j.out

python train.py
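
The script above writes its log to logs/%j.out. Slurm does not create a missing output directory, so a job submitted without logs/ in place typically fails without leaving a log. The wrapper below is a sketch under that assumption; submit_job is a hypothetical name.

```shell
# Hypothetical wrapper: create the log directory, then submit the job.
submit_job() {
  mkdir -p logs || return 1
  # --parsable prints only the job ID (followed by ";cluster" on
  # multi-cluster sites, which cut strips off).
  sbatch --parsable "$@" | cut -d ';' -f 1
}
```

Usage: jobid=$(submit_job my-job.sbatch), after which logs/$jobid.out is where the output should appear.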

Inspect and manage jobs

myjobs                  # alias: squeue --me
sacct -X --starttime today
scancel <jobid>
gpu-status              # cluster-wide snapshot helper
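
To check on one specific job rather than scanning the full listing, sacct can be narrowed to a single field. This is a small sketch using standard sacct options; job_state is a hypothetical name, not an alias shipped on the cluster.

```shell
# Hypothetical helper: print the state of one job (e.g. RUNNING, COMPLETED).
job_state() {
  # -X: allocation only (no steps), -n: no header, -P: pipe-delimited output.
  sacct -X -n -P -j "$1" --format=State | head -n 1
}
```

Usage: job_state <jobid>.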

Surviving disconnects

Two reliable patterns:

  1. tmux + srun. Start tmux on the host first, then run srun --pty bash inside the tmux session. Detach with Ctrl-b d. The allocation continues.
  2. salloc + reattach. salloc --gres=gpu:1 --time=8:00:00 returns a job ID; srun --jobid=<id> --pty bash joins it. Disconnect-safe.

Never wrap long-running services (e.g. a vLLM server) in a bare srun over SSH without tmux — a dropped connection kills the allocation.
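
A sketch of the salloc pattern follows. It assumes stock Slurm behavior, where plain salloc starts a shell and releases the allocation when that shell exits; the --no-shell option makes salloc return as soon as the allocation is granted, leaving the job alive until its time limit or an scancel. The function names are hypothetical, and the parsing relies on salloc's usual "Granted job allocation <id>" message.

```shell
# Extract the job ID from salloc's "Granted job allocation <id>" message.
granted_jobid() {
  sed -n 's/.*Granted job allocation \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: request a held allocation and print its job ID.
# salloc writes its status messages to stderr, hence the 2>&1.
hold_allocation() {
  salloc --no-shell "$@" 2>&1 | granted_jobid
}
```

Usage: jobid=$(hold_allocation --gres=gpu:1 --time=8:00:00), then srun --jobid="$jobid" --pty bash to attach. Because the allocation no longer ends with your shell, release it with scancel "$jobid" when you are done.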

Storage

PRESCIENT uses tiered storage. Pick the right tier for the data lifetime — putting hot data on /home is the most common cause of "why is my training slow."

Path | Where | Backed up | Shared | Use for
/home/$USER | NFS from gpu-c | | | code, configs, important results, papers
/cache/users/$USER | Local ZFS on each host | | | Python venvs, working datasets, model outputs
/scratch/jobs/$SLURM_JOB_ID | Local ZFS, set as $TMPDIR | ✗ (auto-deleted) | | intermediate files within a single job
/scratch/users/$USER | Local ZFS | ✗ (manual) | | persistent fast scratch, your responsibility
/shared/hf-cache | NFS from gpu-c | | | community HuggingFace cache
/cache/hf | Local on each host | | | per-host fast HF cache

Storage rules of thumb

  • The tiers run from slow but safe to fast but at-risk as you move down the table. /home is NFS-served over 1 GbE; do not load multi-gigabyte model weights from it on the hot path.
  • Per-user dirs auto-create on first login and at job-prolog time. You do not need to ask an admin.
  • Quotas apply on /home by Slurm account; ask the operators if you need yours increased.
  • /cache and /scratch have no redundancy. A disk failure means data loss on that pool. Keep authoritative copies on /home or upstream.
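
One way to follow these rules inside a job is to copy inputs from /home to the job's local scratch once, then read them from there. The function below is a minimal sketch; stage_in is a hypothetical name, and it relies on $TMPDIR pointing at /scratch/jobs/$SLURM_JOB_ID as described in the table.

```shell
# Hypothetical helper: copy a file or directory into the job's $TMPDIR
# and print the local path. Falls back to /tmp outside a job.
stage_in() {
  local src="$1"
  local dest="${TMPDIR:-/tmp}/$(basename "$src")"
  cp -R "$src" "$dest" || return 1
  printf '%s\n' "$dest"
}
```

Usage in a batch script: data=$(stage_in /home/$USER/datasets/mydata), then pass "$data" to the training command. The path /home/$USER/datasets/mydata is only an illustration. Copy results you want to keep back to /home before the job ends, because the job scratch directory is deleted automatically.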

HuggingFace cache

Two environment variables are pre-set in your shell:

HF_HOME=/shared/hf-cache       # default; writes go here
HF_HUB_CACHE=/cache/hf          # checked first for cache hits; falls through

The HuggingFace library checks HF_HUB_CACHE for an existing model before downloading. If found locally, it loads fast. If not, the download lands in the shared cache so the next person benefits too.

For gated or private models, point HF_HOME at your own home directory:

HF_HOME=~/.cache/huggingface vllm serve <private-model>

Writes to /shared/hf-cache require membership in the shared-cache group; ask the operators to add you if you need write access.
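
To see which cache actually holds a given model, you can look for its directory in both locations. The sketch below assumes the standard Hugging Face hub layout, where a repository org/name is stored as models--org--name, and it assumes the shared copy sits under $HF_HOME/hub (the library default when no explicit hub cache is set). Both assumptions are worth verifying on the cluster; hf_cache_locate is a hypothetical name.

```shell
# Hypothetical helper: print the first cache directory that holds a model.
# Checks the per-host cache first, then the shared cache.
hf_cache_locate() {
  local repo="$1"
  local dir
  dir="models--$(printf '%s' "$repo" | sed 's|/|--|g')"
  local root
  for root in "${HF_HUB_CACHE:-/cache/hf}" "${HF_HOME:-/shared/hf-cache}/hub"; do
    if [ -d "$root/$dir" ]; then
      printf '%s\n' "$root/$dir"
      return 0
    fi
  done
  return 1
}
```

Usage: hf_cache_locate <org>/<model> prints the matching path, or returns non-zero if the model is in neither cache.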

Containers (Enroot + Pyxis)

Containers are the preferred way to run anything with a heavy dependency stack. Enroot is rootless (no Docker daemon, no setuid), and Pyxis hooks it into SLURM so containers run inside your allocation.

Run a container in an allocation

srun --gres=gpu:1 --time=2:00:00 \
     --container-image=docker.io/vllm/vllm-openai:latest \
     --pty bash

--container-image accepts:

  • Docker Hub references (docker.io/...).
  • HuggingFace, NVIDIA NGC, or other OCI registries (nvcr.io/...).
  • A pre-imported squashfs file (faster repeat launches).

Pre-import once, reuse fast

enroot import docker://docker.io/vllm/vllm-openai:latest
# produces vllm+vllm-openai+latest.sqsh in $PWD

srun --gres=gpu:1 --container-image=./vllm+vllm-openai+latest.sqsh --pty bash

Imports land under /cache/users/$USER/enroot/ — local fast storage, not NFS.
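
If you launch the same image often, it can help to import it only when the squashfs file is missing and to choose the output path explicitly with enroot import's -o option. The sketch below does that; ensure_sqsh is a hypothetical name, and the target path in the usage note is an assumption based on the directory mentioned above.

```shell
# Hypothetical helper: import an image once, reuse the .sqsh afterwards.
# Prints the path of the squashfs file on success.
ensure_sqsh() {
  local uri="$1"
  local out="$2"
  if [ ! -f "$out" ]; then
    mkdir -p "$(dirname "$out")" || return 1
    # Send enroot's progress output to stderr so stdout carries only the path.
    enroot import -o "$out" "$uri" >&2 || return 1
  fi
  printf '%s\n' "$out"
}
```

Usage: img=$(ensure_sqsh docker://docker.io/vllm/vllm-openai:latest /cache/users/$USER/enroot/vllm-openai.sqsh), then srun --gres=gpu:1 --container-image="$img" --pty bash.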

Common recipe — serve vLLM and tunnel to your laptop

On the cluster:

tmux
srun --gres=gpu:1 --time=8:00:00 \
     --container-image=docker.io/vllm/vllm-openai:latest \
     --pty bash
# inside the container:
vllm serve <model> --port 8000

On your laptop:

ssh -L 8000:localhost:8000 <name.#>@gpu-a.prescient.internal
# now hit http://localhost:8000 in your browser / curl
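
Model loading can take a while, so the tunnel may be up before the server answers. The sketch below polls from the laptop until the endpoint responds. It assumes the server is vLLM's OpenAI-compatible API, which lists models at /v1/models; wait_for_server is a hypothetical name.

```shell
# Hypothetical helper: poll a URL until it responds or the attempts run out.
# Arguments: URL, number of attempts, delay between attempts in seconds.
wait_for_server() {
  local url="${1:-http://localhost:8000/v1/models}"
  local tries="${2:-30}"
  local delay="${3:-2}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors as well as refused connections.
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Usage: wait_for_server && curl http://localhost:8000/v1/models.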

Python without containers

Prefer containers, but if you must use a venv:

  • Place it under /cache/users/$USER/venvs/<name> (fast, local).
  • Never place a venv on /home — NFS lookups for every import slow startup to a crawl.
  • uv is installed system-wide at /usr/local/bin/uv and resolves much faster than pip for ML stacks.

uv venv /cache/users/$USER/venvs/myenv
source /cache/users/$USER/venvs/myenv/bin/activate
uv pip install vllm

What you cannot do

Documented here so it doesn't surprise you:

  • No sudo. All elevated needs are met via containers. If you genuinely need something at the host level, open an issue — do not ask for sudo.
  • No SSH without an active allocation (except for operators). The PAM stack uses pam_slurm_adopt — if your srun ends, your shell session ends too. This is by design: it guarantees GPU isolation across users.
  • No system-wide CUDA toolkit. Get CUDA from your container image or pinned Python wheels.

Helper commands

Command    | What it does
gpu-status | Cluster-wide snapshot: node state, GPU counts, queued jobs.
myquota    | Your storage usage across /home, /cache, /scratch.
myjobs     | Your current Slurm jobs.
myhistory  | Your jobs over the last week.
gpus-here  | nvidia-smi on the current host (not just your allocation).

Most of these are aliases defined in /etc/profile.d/gpu-cluster-aliases.sh.