GPU Workloads¶

The GPU cluster is the primary compute surface of PRESCIENT. Users SSH to a host, allocate GPUs through SLURM, and run their workloads.

Hosts and partitions¶

Host	Role	Notes
`core-hgx.prescient.internal`	Compute	H100 x 8
`edge-lmd-1.prescient.internal`	Compute	L40S x 8

You pick a host explicitly via SSH, and your jobs run only on that host.

SLURM basics¶

Interactive allocation¶

The most common entry point — get a shell with GPUs attached:

srun --gres=gpu:1 --time=2:00:00 --pty bash

Flag	Meaning
`--gres=gpu:N`	Allocate N GPUs (subject to your account limits).
`--time=HH:MM:SS` or `--time=D-HH:MM:SS`	Wall-clock limit.
`--pty bash`	Start an interactive shell.
`--qos=long`	Override the default QoS; only some QoS levels are enabled for your account.

When you exit the shell, the allocation ends. To survive a disconnect, run tmux before srun, or use salloc + srun --jobid=... (see below).
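The flags above compose into a small wrapper. The following is a minimal sketch, not a site-provided tool: the function name `gpu_shell` and the `DRY_RUN` variable are hypothetical, and the defaults simply mirror the example command. With `DRY_RUN=1` it prints the command instead of running it, which helps when checking what would be requested.

```shell
# Hypothetical helper: gpu_shell [NUM_GPUS] [TIME_LIMIT]
# Defaults mirror the example above: 1 GPU for 2 hours.
gpu_shell() {
    gpus="${1:-1}"
    limit="${2:-2:00:00}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: show the command without submitting anything.
        echo "srun --gres=gpu:${gpus} --time=${limit} --pty bash"
    else
        srun --gres="gpu:${gpus}" --time="${limit}" --pty bash
    fi
}
```

For example, `DRY_RUN=1 gpu_shell 4 1-00:00:00` shows the request for four GPUs with a one-day limit.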

Batch submission¶

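mkdir -p logs            # Slurm does not create the output directory for you
sbatch my-job.sbatch

A minimal my-job.sbatch:

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:2            # two GPUs
#SBATCH --time=12:00:00         # wall-clock limit
#SBATCH --output=logs/%j.out    # %j expands to the job ID; logs/ must already exist

python train.py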
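The job ID is needed to find the log file and to cancel the job. One way to capture it at submission time is `sbatch --parsable`, which prints only the job ID (optionally followed by `;cluster`). The sketch below is illustrative: the helper name `submit` is hypothetical, and it creates `logs/` only because the example script writes its output there.

```shell
# Hypothetical helper: submit JOB_SCRIPT [SBATCH_ARGS...]
# Prints the numeric job ID on success.
submit() {
    # The output directory must exist before the job starts.
    mkdir -p logs || return 1
    jobid="$(sbatch --parsable "$@")" || return 1
    # Strip an optional ";cluster" suffix from the --parsable output.
    echo "${jobid%%;*}"
}
```

A typical use would be `jobid=$(submit my-job.sbatch)` followed by `tail -f "logs/${jobid}.out"` once the job has started.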

Inspect and manage jobs¶

myjobs                  # alias: squeue --me
sacct -X --starttime today
scancel <jobid>
gpu-status              # cluster-wide snapshot helper
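For scripted workflows it can be useful to block until a job has left the queue. A minimal sketch, assuming that `squeue -h -j <jobid>` prints nothing once the job is no longer pending or running; `wait_for_job` is a hypothetical name, not one of the cluster's aliases.

```shell
# Hypothetical helper: wait_for_job JOBID [POLL_SECONDS]
# Returns once squeue no longer lists the job.
wait_for_job() {
    # -h suppresses the header, so empty output means "not in the queue".
    while [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]; do
        sleep "${2:-30}"
    done
}
```

Afterwards, `sacct -X -j <jobid>` reports how the job ended.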

Surviving disconnects¶

Two reliable patterns:

tmux + srun. Start tmux on the host first, then run srun --pty bash inside the tmux session. Detach with Ctrl-b d. The allocation continues.
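salloc + reattach. salloc --no-shell --gres=gpu:1 --time=8:00:00 grants an allocation, prints its job ID, and returns without holding a shell open; srun --jobid=<id> --pty bash joins it. A dropped connection ends only that srun step, and the allocation runs until its time limit or until you scancel <id>.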

Never wrap long-running services (e.g. a vLLM server) in a bare srun over SSH without tmux — a dropped connection kills the allocation.
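The salloc pattern can be scripted so the job ID does not have to be copied by hand. This is a sketch under two assumptions: `--no-shell` is used so that salloc returns as soon as the allocation is granted, and salloc reports `salloc: Granted job allocation <id>` on stderr. The function name `hold_gpus` is hypothetical.

```shell
# Hypothetical helper: hold_gpus [NUM_GPUS] [TIME_LIMIT]
# Requests an allocation without a shell and prints its job ID.
hold_gpus() {
    # salloc writes its status lines to stderr, so merge them into the
    # pipe and keep only the ID from the "Granted" line.
    salloc --no-shell --gres="gpu:${1:-1}" --time="${2:-8:00:00}" 2>&1 |
        sed -n 's/.*Granted job allocation \([0-9][0-9]*\).*/\1/p'
}
```

Usage would look like `jobid=$(hold_gpus 1 8:00:00)`, then `srun --jobid="$jobid" --pty bash` to attach, and `scancel "$jobid"` when finished so the GPUs are released before the time limit.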

Storage¶

PRESCIENT uses tiered storage. Pick the right tier for the data lifetime — putting hot data on /home is the most common cause of "why is my training slow."

Path	Where	Backed up	Shared	Use for
`/home/$USER`	NFS from `gpu-c`	✓	✓	code, configs, important results, papers
`/cache/users/$USER`	Local ZFS on each host	✗	✗	Python venvs, working datasets, model outputs
`/scratch/jobs/$SLURM_JOB_ID`	Local ZFS, set as `$TMPDIR`	✗ (auto-deleted)	✗	intermediate files within a single job
`/scratch/users/$USER`	Local ZFS	✗ (manual)	✗	persistent fast scratch, your responsibility
`/shared/hf-cache`	NFS from `gpu-c`	✓	✓	community HuggingFace cache
`/cache/hf`	Local on each host	✗	✗	per-host fast HF cache

Storage rules of thumb¶

Tiers run from slow but safe to fast but at-risk as you move down the table. /home is NFS-served over 1 GbE; do not load multi-gigabyte model weights from it on the hot path.
Per-user dirs auto-create on first login and at job-prolog time. You do not need to ask an admin.
Quotas apply on /home by Slurm account; ask the operators if you need yours increased.
/cache and /scratch have no redundancy. A disk failure means data loss on that pool. Keep authoritative copies on /home or upstream.
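In a job script, these rules usually translate into copying inputs to local scratch at the start and copying results back at the end. The helpers below are a minimal sketch of that pattern; `stage_in` and `stage_out` are hypothetical names, and the fallback to `/tmp` only exists so the functions also work outside a job, where `$TMPDIR` may be unset.

```shell
# Hypothetical helper: stage_in SRC_DIR
# Copies SRC_DIR into $TMPDIR and prints the path of the local copy.
stage_in() {
    src="$1"
    dst="${TMPDIR:-/tmp}/$(basename "$src")"
    mkdir -p "$dst" && cp -R "$src"/. "$dst"/ && echo "$dst"
}

# Hypothetical helper: stage_out LOCAL_DIR DEST_DIR
# Copies results from fast local storage back to a safe location.
stage_out() {
    mkdir -p "$2" && cp -R "$1"/. "$2"/
}
```

Inside a batch script this might read `data=$(stage_in /home/$USER/datasets/small)` before training and `stage_out "$TMPDIR/results" /home/$USER/results/$SLURM_JOB_ID` afterwards; the dataset and results paths are placeholders.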

HuggingFace cache¶

Two environment variables are pre-set in your shell:

HF_HOME=/shared/hf-cache       # default; writes go here
HF_HUB_CACHE=/cache/hf          # checked first for cache hits; falls through

The HuggingFace library checks HF_HUB_CACHE for an existing model before downloading. If found locally, it loads fast. If not, the download lands in the shared cache so the next person benefits too.

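For gated or private models, keep the download out of the shared locations by pointing the cache at your own home directory. HF_HUB_CACHE is pre-set and, in stock huggingface_hub, takes precedence over HF_HOME for model files, so override both:

HF_HOME=~/.cache/huggingface HF_HUB_CACHE=~/.cache/huggingface/hub vllm serve <private-model>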

Writes to /shared/hf-cache require membership in the shared-cache group; ask the operators to add you if you need write access.
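Before pulling a large model it can help to confirm which directory the library will use for model files under the current environment. The sketch below assumes the precedence documented for stock huggingface_hub: `HF_HUB_CACHE` if set, otherwise `$HF_HOME/hub`, otherwise `huggingface/hub` under `$XDG_CACHE_HOME` or `~/.cache`. It is simplified (deprecated variable aliases are ignored), and `hf_cache_dir` is a hypothetical name.

```shell
# Hypothetical helper: prints the directory huggingface_hub would use
# for model files, following its documented environment precedence.
hf_cache_dir() {
    if [ -n "${HF_HUB_CACHE:-}" ]; then
        echo "$HF_HUB_CACHE"
    elif [ -n "${HF_HOME:-}" ]; then
        echo "$HF_HOME/hub"
    else
        echo "${XDG_CACHE_HOME:-$HOME/.cache}/huggingface/hub"
    fi
}
```

Running `hf_cache_dir` in a fresh login shell, and again with any overrides you plan to pass to a command, shows where a download would land in each case.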

Containers (Enroot + Pyxis)¶

Containers are the preferred way to run anything with a heavy dependency stack. Enroot is rootless (no Docker daemon, no setuid), and Pyxis hooks it into SLURM so containers run inside your allocation.

Run a container in an allocation¶

srun --gres=gpu:1 --time=2:00:00 \
     --container-image=docker.io/vllm/vllm-openai:latest \
     --pty bash

--container-image accepts:

Docker Hub references (docker.io/...).
HuggingFace, NVIDIA NGC, or other OCI registries (nvcr.io/...).
A pre-imported squashfs file (faster repeat launches).

Pre-import once, reuse fast¶

enroot import docker://docker.io/vllm/vllm-openai:latest
# produces vllm+vllm-openai+latest.sqsh in $PWD

srun --gres=gpu:1 --container-image=./vllm+vllm-openai+latest.sqsh --pty bash

Imports land under /cache/users/$USER/enroot/ — local fast storage, not NFS.
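To avoid re-importing an image that is already on disk, the import can be made conditional. A minimal sketch, assuming `enroot import -o <file> <uri>` to choose the output file; the helper name `ensure_image` and the file name used in the usage note are hypothetical.

```shell
# Hypothetical helper: ensure_image URI OUTPUT_SQSH
# Imports the image only if the squashfs file does not exist yet.
ensure_image() {
    uri="$1"
    out="$2"
    if [ -f "$out" ]; then
        echo "reusing $out" >&2
    else
        enroot import -o "$out" "$uri"
    fi
}
```

For example, `ensure_image docker://docker.io/vllm/vllm-openai:latest /cache/users/$USER/enroot/vllm.sqsh` followed by `srun --gres=gpu:1 --container-image=/cache/users/$USER/enroot/vllm.sqsh --pty bash`. Because `:latest` can change upstream, delete the file when you want a fresh import.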

Common recipe — serve vLLM and tunnel to your laptop¶

On the cluster:

tmux
srun --gres=gpu:1 --time=8:00:00 \
     --container-image=docker.io/vllm/vllm-openai:latest \
     --pty bash
# inside the container:
vllm serve <model> --port 8000

On your laptop:

ssh -L 8000:localhost:8000 <name.#>@<host>.prescient.internal   # <host>: the host where vllm is running
# now hit http://localhost:8000 in your browser / curl
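Model loading can take a while, so the tunnel may be up before the server answers. The sketch below polls an HTTP endpoint until it responds; it assumes the vLLM OpenAI-compatible server, which exposes `/v1/models`. The function name `wait_for_server` and the retry defaults are hypothetical.

```shell
# Hypothetical helper: wait_for_server URL [TRIES]
# Polls URL every 2 seconds; returns 0 on the first successful response,
# 1 if all attempts fail.
wait_for_server() {
    url="$1"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # -f makes curl return non-zero on HTTP errors as well.
        if curl -fsS "$url" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 2
    done
    return 1
}
```

On the laptop, with the tunnel open: `wait_for_server http://localhost:8000/v1/models && curl http://localhost:8000/v1/models`.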

Python without containers¶

Prefer containers, but if you must use a venv:

Place it under /cache/users/$USER/venvs/<name> (fast, local).
Never place a venv on /home — NFS lookups for every import slow startup to a crawl.
uv is installed system-wide at /usr/local/bin/uv and resolves much faster than pip for ML stacks.

uv venv /cache/users/$USER/venvs/myenv
source /cache/users/$USER/venvs/myenv/bin/activate
uv pip install vllm

What you cannot do¶

Documented here so it doesn't surprise you:

No sudo. All elevated needs are met via containers. If you genuinely need something at the host level, open an issue — do not ask for sudo.
No SSH without an active allocation (except for operators). The PAM stack uses pam_slurm_adopt — if your srun ends, your shell session ends too. This is by design: it guarantees GPU isolation across users.
No system-wide CUDA toolkit. Get CUDA from your container image or pinned Python wheels.

Helper commands¶

Command	What it does
`gpu-status`	Cluster-wide snapshot — node state, GPU counts, queued jobs.
`myquota`	Your storage usage across `/home`, `/cache`, `/scratch`.
`myjobs`	Your current Slurm jobs.
`myhistory`	Your jobs over the last week.
`gpus-here`	`nvidia-smi` on the current host (not just your allocation).

Most of these are aliases defined in /etc/profile.d/gpu-cluster-aliases.sh.