GPU Workloads
The GPU cluster is the primary compute surface of PRESCIENT. Users SSH to a host, allocate GPUs through SLURM, and run their workloads.
Hosts and partitions
| Host | Role | Notes |
|---|---|---|
| core-hgx.prescient.internal | Compute | H100 x 8 |
| edge-lmd-1.prescient.internal | Compute | L40s x 8 |
You pick a host explicitly via SSH, and your jobs run only on that host.
SLURM basics
Interactive allocation
The most common entry point is an interactive shell with GPUs attached, requested with srun and the flags below:
| Flag | Meaning |
|---|---|
| --gres=gpu:N | Allocate N GPUs (subject to your account limits). |
| --time=HH:MM:SS or --time=D-HH:MM:SS | Wall-clock limit. |
| --pty bash | Give me an interactive shell. |
| --qos=long | Override default QoS; only some are allowed for your account. |
When you exit the shell, the allocation ends. To survive a disconnect, run
tmux before srun, or use salloc + srun --jobid=... (see below).
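As a minimal sketch, the flags above combine into a single srun command. The helper name gpu_shell_cmd and its defaults (1 GPU, 2 hours) are illustrative assumptions, not part of the cluster tooling; the function only prints the command so you can review it before running it inside tmux.

```shell
# Hypothetical helper: print an interactive srun command built from the flags above.
# Arguments: GPU count (default 1) and wall-clock limit (default 2:00:00).
gpu_shell_cmd() {
    local gpus="${1:-1}"
    local limit="${2:-2:00:00}"
    printf 'srun --gres=gpu:%s --time=%s --pty bash\n' "$gpus" "$limit"
}

# Print a request for 2 GPUs for 4 hours.
gpu_shell_cmd 2 4:00:00
```

Add --qos=long to the printed command only if your account is allowed to use that QoS.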
Batch submission
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:2
#SBATCH --time=12:00:00
#SBATCH --output=logs/%j.out
python train.py
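A sketch of submitting the script above, assuming it is saved as train.sbatch (a hypothetical file name). Slurm does not create the directory named in --output, so the sketch creates logs/ first. The sbatch call is guarded so the snippet is harmless on a machine without Slurm.

```shell
# Slurm will not create the directory used by --output=logs/%j.out.
mkdir -p logs

# train.sbatch is a hypothetical file name for the batch script above.
script="train.sbatch"
if command -v sbatch >/dev/null 2>&1 && [ -f "$script" ]; then
    sbatch "$script"    # prints "Submitted batch job <jobid>"
else
    echo "would run: sbatch $script"
fi
```

Once the job starts, its output appears in logs/<jobid>.out, because %j expands to the job ID.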
Inspect and manage jobs
myjobs # alias: squeue --me
sacct -X --starttime today
scancel <jobid>
gpu-status # cluster-wide snapshot helper
Surviving disconnects
Two reliable patterns:
- tmux + srun. Start tmux on the host first, then run srun --pty bash inside the tmux session. Detach with Ctrl-b d. The allocation continues.
- salloc + reattach. salloc --gres=gpu:1 --time=8:00:00 returns a job ID; srun --jobid=<id> --pty bash joins it. Disconnect-safe.
Never wrap long-running services (e.g. a vLLM server) in a bare srun over
SSH without tmux — a dropped connection kills the allocation.
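The salloc pattern can be sketched as follows. salloc reports the allocation in a line of the form "salloc: Granted job allocation <id>". The parse_jobid helper and the sample job ID 48213 are illustrative assumptions, and --no-shell (which makes salloc return once the allocation is granted) is a standard Slurm flag rather than something this page specifies.

```shell
# Hypothetical helper: pull the job ID out of salloc's grant message.
parse_jobid() {
    sed -n 's/.*Granted job allocation \([0-9][0-9]*\).*/\1/p'
}

# On the cluster you would capture the real message (salloc normally logs to stderr):
#   jobid=$(salloc --no-shell --gres=gpu:1 --time=8:00:00 2>&1 | parse_jobid)
# Here a sample message stands in for it so the parsing can be checked anywhere.
sample='salloc: Granted job allocation 48213'
jobid=$(printf '%s\n' "$sample" | parse_jobid)

# Attach a shell to the allocation; rerun this after a disconnect to rejoin.
echo "srun --jobid=$jobid --pty bash"
```

The allocation keeps running until its time limit or until you release it with scancel <jobid>.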
Storage
PRESCIENT uses tiered storage. Pick the right tier for the data lifetime —
putting hot data on /home is the most common cause of "why is my training
slow."
| Path | Where | Backed up | Shared | Use for |
|---|---|---|---|---|
| /home/$USER | NFS from gpu-c | ✓ | ✓ | code, configs, important results, papers |
| /cache/users/$USER | Local ZFS on each host | ✗ | ✗ | Python venvs, working datasets, model outputs |
| /scratch/jobs/$SLURM_JOB_ID | Local ZFS, set as $TMPDIR | ✗ (auto-deleted) | ✗ | intermediate files within a single job |
| /scratch/users/$USER | Local ZFS | ✗ (manual) | ✗ | persistent fast scratch, your responsibility |
| /shared/hf-cache | NFS from gpu-c | ✓ | ✓ | community HuggingFace cache |
| /cache/hf | Local on each host | ✗ | ✗ | per-host fast HF cache |
Storage rules of thumb
- Slow but safe → fast but at-risk as you move down the list.
- /home is NFS-served over 1 GbE; do not load multi-gigabyte model weights from it on the hot path.
- Per-user dirs auto-create on first login and at job-prolog time. You do not need to ask an admin.
- Quotas apply on /home by Slurm account; ask the operators if you need yours increased.
- /cache and /scratch have no redundancy. A disk failure means data loss on that pool. Keep authoritative copies on /home or upstream.
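One way to apply these rules inside a job is to copy inputs from /home to the job's local scratch before reading them repeatedly. The sketch below is an assumption about workflow, not a cluster-provided tool: stage_in is a hypothetical helper, and it falls back to /tmp when $TMPDIR is unset so it can be tried outside a job.

```shell
# Hypothetical helper: copy a file or directory into fast local scratch
# and print the staged path. Inside a job, $TMPDIR is /scratch/jobs/$SLURM_JOB_ID.
stage_in() {
    local src="$1"
    local dest="${TMPDIR:-/tmp}/staged"
    mkdir -p "$dest"
    cp -a "$src" "$dest/"
    printf '%s/%s\n' "$dest" "$(basename "$src")"
}

# Example with a throwaway directory standing in for a dataset under /home.
demo_src=$(mktemp -d)
echo "sample" > "$demo_src/data.txt"
staged=$(stage_in "$demo_src")
ls "$staged"
```

Copy anything you want to keep back to /home before the job ends, since the per-job scratch directory is auto-deleted.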
HuggingFace cache
Two environment variables are pre-set in your shell:
HF_HOME=/shared/hf-cache # default; writes go here
HF_HUB_CACHE=/cache/hf # checked first for cache hits; falls through
The HuggingFace library checks HF_HUB_CACHE for an existing model before
downloading. If found locally, it loads fast. If not, the download lands in
the shared cache so the next person benefits too.
For gated or private models, point HF_HOME at a directory inside your own home directory instead of the shared cache.
Writes to /shared/hf-cache require membership in the shared-cache group;
ask the operators to add you if you need write access.
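A minimal sketch of the override for gated or private models. The directory $HOME/.cache/huggingface is the library's own default location and is used here as an assumption; any path under /home/$USER would do. Because an explicitly set HF_HUB_CACHE takes precedence over HF_HOME in huggingface_hub, the sketch redirects both variables.

```shell
# Keep tokens and private model files out of the shared and per-host caches.
# The path is an assumption; pick any directory under your home.
export HF_HOME="$HOME/.cache/huggingface"
# An explicit HF_HUB_CACHE overrides HF_HOME for model files, so repoint it too.
export HF_HUB_CACHE="$HF_HOME/hub"

echo "HF_HOME=$HF_HOME"
echo "HF_HUB_CACHE=$HF_HUB_CACHE"
```

Loading weights from /home is slower than the local caches, so reserve this setup for models that cannot be shared.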
Containers (Enroot + Pyxis)
Containers are the preferred way to run anything with a heavy dependency stack. Enroot is rootless (no Docker daemon, no setuid), and Pyxis hooks it into SLURM so containers run inside your allocation.
Run a container in an allocation
--container-image accepts:
- Docker Hub references (docker.io/...).
- HuggingFace, NVIDIA NGC, or other OCI registries (nvcr.io/...).
- A pre-imported squashfs file (faster repeat launches).
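As a sketch, a container launch adds --container-image to an ordinary srun request. The helper name container_shell_cmd is hypothetical, and the image reference is the one this page uses in its vLLM recipe; the function prints the command rather than running it. Upstream Pyxis documents registry references in the form REGISTRY#IMAGE, so if the slash form is rejected on your host, that spelling may be worth trying.

```shell
# Hypothetical helper: print an srun command that starts a shell in a container.
# Arguments: image reference or .sqsh path, then GPU count (default 1).
container_shell_cmd() {
    local image="$1"
    local gpus="${2:-1}"
    printf 'srun --gres=gpu:%s --container-image=%s --pty bash\n' "$gpus" "$image"
}

# A registry reference, then a pre-imported squashfs file with 2 GPUs.
container_shell_cmd docker.io/vllm/vllm-openai:latest
container_shell_cmd ./vllm+vllm-openai+latest.sqsh 2
```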
Pre-import once, reuse fast
enroot import docker://docker.io/vllm/vllm-openai:latest
# produces vllm+vllm-openai+latest.sqsh in $PWD
srun --gres=gpu:1 --container-image=./vllm+vllm-openai+latest.sqsh --pty bash
Imports land under /cache/users/$USER/enroot/ — local fast storage, not NFS.
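The import-once pattern can be sketched as a check-then-import helper. The function names are hypothetical; the file naming mirrors the example above, where / and : in the image name become +. enroot import accepts -o to choose the output file, and the import is only printed, not run, when enroot is not installed.

```shell
# Hypothetical helper: derive the .sqsh file name for an image reference,
# e.g. vllm/vllm-openai:latest -> vllm+vllm-openai+latest.sqsh
sqsh_name() {
    local ref="${1#docker.io/}"    # drop a leading docker.io/ if present
    printf '%s.sqsh\n' "$(printf '%s' "$ref" | tr '/:' '++')"
}

# Import only when the squashfs file is not already present in $PWD.
import_once() {
    local ref="$1" out
    out=$(sqsh_name "$ref")
    if [ -f "$out" ]; then
        echo "reusing $out"
    elif command -v enroot >/dev/null 2>&1; then
        enroot import -o "$out" "docker://$ref"
    else
        echo "would run: enroot import -o $out docker://$ref"
    fi
}

import_once docker.io/vllm/vllm-openai:latest
```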
Common recipe: serve vLLM and tunnel to your laptop
On the cluster:
tmux
srun --gres=gpu:1 --time=8:00:00 \
--container-image=docker.io/vllm/vllm-openai:latest \
--pty bash
# inside the container:
vllm serve <model> --port 8000
On your laptop:
# connect to the same host where the vLLM job is running
ssh -L 8000:localhost:8000 <name.#>@gpu-a.prescient.internal
# now hit http://localhost:8000 in your browser / curl
Python without containers
Prefer containers, but if you must use a venv:
- Place it under /cache/users/$USER/venvs/<name> (fast, local).
- Never place a venv on /home: NFS lookups for every import slow startup to a crawl.
- uv is installed system-wide at /usr/local/bin/uv and resolves much faster than pip for ML stacks.
uv venv /cache/users/$USER/venvs/myenv
source /cache/users/$USER/venvs/myenv/bin/activate
uv pip install vllm
What you cannot do
Documented here so it doesn't surprise you:
- No sudo. All elevated needs are met via containers. If you genuinely need something at the host level, open an issue; do not ask for sudo.
- No SSH without an active allocation (except for operators). The PAM stack uses pam_slurm_adopt: if your srun ends, your shell session ends too. This is by design: it guarantees GPU isolation across users.
- No system-wide CUDA toolkit. Get CUDA from your container image or pinned Python wheels.
Helper commands
| Command | What it does |
|---|---|
| gpu-status | Cluster-wide snapshot: node state, GPU counts, queued jobs. |
| myquota | Your storage usage across /home, /cache, /scratch. |
| myjobs | Your current Slurm jobs. |
| myhistory | Your jobs over the last week. |
| gpus-here | nvidia-smi on the current host (not just your allocation). |
Most of these are aliases defined in
/etc/profile.d/gpu-cluster-aliases.sh.