Help & FAQ¶
If something is broken or confusing, this page is where to start. If your question isn't here, see How to ask for help below.
VPN issues¶
tailscale up opens the browser but the SSO login rejects me¶
- Check that your OSU name.# is correct.
- Confirm your account has been provisioned. If you can't log in to the account portal either, that's the underlying issue.
- The TOTP code is 6 digits and rotates every 30 seconds; resync your authenticator if codes are repeatedly rejected.
I'm connected to the VPN but ping gpu-a.prescient.internal fails¶
- Run tailscale status and confirm you see yourself as "active".
- DNS for *.prescient.internal is pushed automatically when you join the VPN. If queries still fail, restart the Tailscale client.
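The two checks above can be run in order with a small helper. This is a minimal sketch, not a site-provided tool: check_vpn_host is a hypothetical name, it assumes a Linux laptop where getent is available, and it assumes tailscale status exits non-zero when the client is stopped or logged out.

```shell
# Sketch: check the VPN link first, then name resolution for one host.
# check_vpn_host is a hypothetical helper name.
check_vpn_host() {
    local host="$1"
    if ! command -v tailscale >/dev/null 2>&1; then
        echo "tailscale client not found on PATH"
        return 2
    fi
    # Assumption: a stopped or logged-out client makes this exit non-zero.
    if ! tailscale status >/dev/null 2>&1; then
        echo "tailscale is not connected; run: tailscale up"
        return 1
    fi
    # getent asks the system resolver, so it tests DNS rather than ICMP.
    if getent hosts "$host" >/dev/null; then
        echo "DNS ok: $host resolves"
    else
        echo "DNS failed for $host; restart the Tailscale client"
        return 1
    fi
}
```

Usage: check_vpn_host gpu-a.prescient.internal. A DNS failure with a healthy link points at the second bullet; a link failure points at the first.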
I get logged out immediately on SSH¶
You have no active Slurm allocation on that host and you don't have
operator-level access. This is intentional (pam_slurm_adopt). Allocate
first:
# from your laptop:
ssh <name.#>@gpu-a.prescient.internal "srun --gres=gpu:0 --time=1:00:00 --pty true &"
# then SSH again — your session adopts into the allocation
In practice: open a tmux session on the host during your first allocation, and keep using that session.
id <name.#> returns nothing on a compute host¶
The host can't reach the directory service. Usually transient — try again in a minute. If it persists, file a ticket (see below).
GPU and Slurm issues¶
srun says "Invalid qos specification"¶
You requested a QoS your account is not allowed to use. Ask the operators
which QoS levels your account permits, or pick the default by omitting
--qos=.
nvidia-smi inside my allocation shows fewer GPUs than I requested¶
- Confirm with scontrol show job <jobid> that your Gres= line matches what you asked for.
- If it does, file a ticket: this indicates a cgroup config bug, not a user error.
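To put a number next to the scontrol output, you can count the devices your processes can actually see. This is a minimal sketch: count_gpus is a hypothetical helper, and it relies on nvidia-smi -L printing one line per visible GPU that starts with "GPU <index>:".

```shell
# Count GPUs listed on stdin in `nvidia-smi -L` format
# (one "GPU <index>: ..." line per visible device).
count_gpus() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    grep -c '^GPU [0-9]' || true
}

# Inside an allocation you might run (SLURM_JOB_ID is set by Slurm there):
#   scontrol show job "$SLURM_JOB_ID" | grep -i gres
#   nvidia-smi -L | count_gpus
```

If the count is lower than the GPU count in the scontrol output, include both outputs in your ticket.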
nvidia-smi outside an allocation shows all GPUs¶
That's expected on gpus-here / direct nvidia-smi. Isolation only kicks in
inside an allocation, where the cgroup hides unallocated devices from your
processes.
My job was killed for "exceeding time limit"¶
You hit your QoS MaxWall. Re-submit with --qos=long if your account
allows it, or split the work into checkpoint-resumable chunks.
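One way to make work checkpoint-resumable is to record each finished chunk in a marker file, so a resubmitted job skips what already completed. The sketch below is illustrative only: run_chunks, process_chunk, and the marker-file layout are hypothetical, and your own checkpoint format may look quite different.

```shell
# Sketch: run numbered chunks and leave a marker file after each success,
# so a later job with the same state directory resumes where this one stopped.
run_chunks() {
    local state_dir="$1" total="$2" i=1
    mkdir -p "$state_dir"
    while [ "$i" -le "$total" ]; do
        if [ ! -e "$state_dir/chunk_$i.done" ]; then
            process_chunk "$i" || return 1      # stop on the first failure
            touch "$state_dir/chunk_$i.done"    # mark only after success
        fi
        i=$((i + 1))
    done
    return 0
}

# Placeholder for the real work; replace with your own command.
process_chunk() {
    echo "processing chunk $1"
}
```

Keep the state directory somewhere that survives the job, and size each chunk so it finishes well within your QoS MaxWall.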
Container issues¶
enroot import runs out of space¶
The Enroot cache lives at /cache/users/$USER/enroot/. That's on per-host
fast local storage with no quota but finite size. Clean up old .sqsh files
you don't reuse.
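To find cleanup candidates, you can list .sqsh files that have not been modified recently and review them before deleting anything. This is a minimal sketch: list_old_sqsh is a hypothetical helper, the 30-day threshold is arbitrary, and modification time is only a rough proxy for "not reused".

```shell
# List .sqsh files under a directory that were last modified more than
# N days ago (default 30). Prints paths only; nothing is deleted.
list_old_sqsh() {
    local dir="$1" days="${2:-30}"
    find "$dir" -type f -name '*.sqsh' -mtime +"$days" -print
}

# Example:
#   list_old_sqsh "/cache/users/$USER/enroot" 30
```

Once you have checked the list, remove the files you no longer need with rm.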
srun --container-image=docker.io/... fails to pull¶
- Some registries rate-limit anonymous pulls. Pre-import once with enroot import (which can authenticate), then run from the local .sqsh.
- Check the image exists for your CPU architecture (most are linux/amd64).
A container can't see GPUs¶
- Confirm you allocated GPUs (--gres=gpu:N). Pyxis only injects what Slurm granted you.
- Confirm the container image has the NVIDIA Container Toolkit's expected runtime; most ML images do.
Storage issues¶
I'm out of /home quota¶
Run myquota to confirm. Move bulky working files to /cache/users/$USER
(which has no quota). For genuine quota increases, your PI can request one.
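Before moving anything, it helps to see which entries under /home take the most space. The helper below is a sketch with a hypothetical name, largest_entries; it uses only find, du, sort, and head, and includes hidden directories because caches often live there.

```shell
# Print the N largest top-level entries of a directory (default: $HOME, 10),
# as "<size in KiB><TAB><path>", largest first. Hidden entries are included.
largest_entries() {
    local dir="${1:-$HOME}" n="${2:-10}"
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + 2>/dev/null \
        | sort -rn \
        | head -n "$n"
}

# Example:
#   largest_entries "$HOME" 5
```

Entries that are bulky but regenerable are the natural candidates to move to /cache/users/$USER.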
My venv is slow to import¶
It's probably on /home, where NFS lookups for many small .py files
dominate import time. Recreate it under /cache/users/$USER/venvs/.
I lost data on /cache or /scratch¶
These tiers are not backed up. A disk failure on the local pool will
take everything with it. Always keep an authoritative copy on /home or
upstream (Git, HuggingFace, S3) for anything you can't regenerate.
How to ask for help¶
Before opening a ticket, gather:
- The exact command you ran.
- The exact error message (paste, don't paraphrase).
- Hostname (hostname), date/time, and your name.#.
- Job ID if applicable (squeue --me).
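Most of the items above can be collected in one go. This is a minimal sketch: ticket_report is a hypothetical helper, not a site-provided command, and it assumes your login name is your name.#. You still need to paste the exact error message yourself.

```shell
# Sketch: print the context a ticket needs for a given command line.
ticket_report() {
    local cmd="$1"
    echo "Command:  $cmd"
    echo "Hostname: $(hostname 2>/dev/null || uname -n)"
    echo "Date:     $(date '+%Y-%m-%d %H:%M:%S %Z')"
    echo "User:     ${USER:-$(id -un)}"
    if command -v squeue >/dev/null 2>&1; then
        echo "Jobs:"
        squeue --me || true     # lists your own jobs with their job IDs
    else
        echo "Jobs:     (squeue not available on this host)"
    fi
}

# Example:
#   ticket_report "srun --gres=gpu:1 nvidia-smi" > ticket.txt
```

Run it on the host where the problem occurred, so the hostname and job list match the failure.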
Status and incidents¶
TODO: contact information for users