Help & FAQ¶

If something is broken or confusing, this page is where to start. If your question isn't here, see How to ask for help below.

VPN issues¶

`tailscale up` opens the browser but the SSO login rejects me¶

Check that your OSU name.# is correct.
Confirm your account has been provisioned — if you can't log in to the account portal either, that's the underlying issue.
TOTP codes are 6 digits and rotate every 30 seconds. If codes are repeatedly rejected, check that your device clock is accurate and resync your authenticator.

I'm connected to the VPN but `ping slurm-head.prescient.internal` fails¶

Run tailscale status and confirm that your own device is listed as "active".
DNS for *.prescient.internal is pushed automatically when you join the VPN. If queries still fail, restart the Tailscale client.

I get logged out immediately when I SSH to a compute node¶

Compute nodes (core-hgx, edge-lmd-1) reject SSH unless you already hold an active allocation on them. This is intentional (pam_slurm_adopt), and it's why you submit jobs from slurm-head rather than logging into a GPU node.

For normal work, SSH to slurm-head.prescient.internal — no allocation is needed there — and run jobs with srun / sbatch. See Slurm Cluster.
The one time you connect to a compute node directly is to tunnel to a service running inside your own job; that works because the allocation grants you SSH for its lifetime. See Reach a service from your laptop.
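As a rough illustration of that tunnel, the sketch below prints the ssh port-forward command so you can inspect it before running it. The helper name tunnel_cmd and port 8888 are hypothetical, and it assumes the service listens on localhost inside your job and that the node name resolves as-is over the VPN. The linked page remains the authoritative procedure.

```shell
# Print the ssh command that forwards a local port to the same port on a
# compute node. Hypothetical helper: node and port are whatever your job uses.
tunnel_cmd() {
    node=$1
    port=$2
    # -N: run no remote command, only forward
    # -L: local_port:target_host:target_port (target is seen from the node)
    echo "ssh -N -L ${port}:localhost:${port} ${node}"
}

# Example: a service on port 8888 inside a job running on core-hgx.
tunnel_cmd core-hgx 8888
```

To find which node your job landed on, squeue --me -o "%i %N" lists each job ID with its node list. The forward only works while the allocation is active, for the reason described above.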

`id <name.#>` returns nothing on a cluster host¶

The host can't reach the directory service. This is usually transient, so try again in a minute. If it persists, file a ticket (see below).

GPU and Slurm issues¶

`nvidia-smi` inside my allocation shows fewer GPUs than I requested¶

Confirm with scontrol show job <jobid> that your Gres= line matches what you asked for.
If it does, file a ticket — this indicates a cgroup config bug, not a user error.
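One way to make the comparison concrete is to count the devices your processes can actually see and set that number next to what Slurm reports. This is a minimal sketch: count_gpus is a hypothetical helper, and it relies only on nvidia-smi -L printing one line per visible GPU.

```shell
# Count GPUs from `nvidia-smi -L` output, which prints one line per visible
# device, each beginning with "GPU <index>:".
count_gpus() {
    grep -c '^GPU '
}

# Only attempt the live check where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "visible GPUs: $(nvidia-smi -L | count_gpus)"
fi
```

Run it inside the allocation, then compare the count with the output of scontrol show job "$SLURM_JOB_ID" | grep -i gres. Including both outputs in the ticket saves a round trip.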

`nvidia-smi` outside an allocation shows all GPUs¶

That's expected when you run gpus-here or nvidia-smi directly on a host, outside any job. Isolation only applies inside an allocation, where the job's cgroup hides the GPUs you were not allocated from your processes.

My job was killed for "exceeding time limit"¶

You hit the wall-clock limit you requested with --time. Re-submit with a longer --time (up to the maximum allowed), or split the work into checkpoint-resumable chunks.
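The "checkpoint-resumable chunks" idea can be sketched in plain shell: record progress after each completed chunk, and start from the recorded step on the next run. The function name run_chunks, the state file, and the step count are all hypothetical placeholders, and the --time value is only an example.

```shell
#!/bin/sh
#SBATCH --time=01:00:00   # example value; stay within your partition's limit

# Run chunks `step+1 .. total`, resuming from the step stored in state_file.
run_chunks() {
    state_file=$1
    total=$2
    step=0
    if [ -f "$state_file" ]; then
        step=$(cat "$state_file")
        step=${step:-0}          # treat an empty state file as "not started"
    fi
    while [ "$step" -lt "$total" ]; do
        # ... one chunk of real work goes here ...
        step=$((step + 1))
        # Record progress only after the chunk has finished.
        echo "$step" > "$state_file"
    done
    echo "completed $step of $total"
}
```

In a batch script you would call something like run_chunks "$HOME/myjob.state" 100 after the header. If the job is killed at the time limit, resubmitting the same script picks up at the first unfinished chunk. Keep the state file on storage that survives the job.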

Container issues¶

`enroot import` runs out of space¶

The Enroot cache lives at /cache/users/$USER/enroot/. That's on per-host fast local storage with no quota but finite size. Clean up old .sqsh files you don't reuse.
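To find candidates for cleanup, a sketch along these lines lists images that have not been modified recently. The helper name old_sqsh and the 30-day default are assumptions, not site policy. It only lists files, so you can review the output before deleting anything.

```shell
# List .sqsh files under a directory that were last modified more than
# N days ago (default 30). Hypothetical helper; it deletes nothing.
old_sqsh() {
    dir=$1
    days=${2:-30}
    find "$dir" -type f -name '*.sqsh' -mtime +"$days"
}
```

For example, old_sqsh "/cache/users/$USER/enroot" 30 prints the stale images, and du -sh "/cache/users/$USER/enroot" shows how much space the cache uses in total. Remove the files you no longer need with rm once you have checked the list.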

`srun --container-image=docker.io/...` fails to pull¶

Some registries rate-limit anonymous pulls. Pre-import once with enroot import (which can authenticate), then run from the local .sqsh.
Check the image exists for your CPU architecture (most are linux/amd64).

A container can't see GPUs¶

Confirm you allocated GPUs (--gres=gpu:N). Pyxis only injects the devices Slurm granted you.
Confirm the container image has the NVIDIA Container Toolkit's expected runtime — most ML images do.

Storage issues¶

I'm out of `/home` quota¶

Run myquota to confirm. Move bulky working files to /cache/users/$USER (which has no quota). If you genuinely need more space on /home, your PI can request a quota increase.

My venv is slow to import¶

It's probably on /home. Recreate it under /cache/users/$USER/venvs/:

uv venv /cache/users/$USER/venvs/myenv

/home is served over NFS, and lookups for many small .py files dominate import time. /cache is local disk, so the same imports are much faster there.

I lost data on `/cache` or `/scratch`¶

These tiers are not backed up. A disk failure on the local pool will take everything with it. Always keep an authoritative copy on /home or upstream (Git, HuggingFace, S3) for anything you can't regenerate.

How to ask for help¶

Before opening a ticket, gather:

The exact command you ran.
The exact error message (paste, don't paraphrase).
Hostname (hostname), date/time, and your name.#.
Job ID if applicable (squeue --me).
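The checklist above can be collected in one go with a small helper like the one below. It is a convenience sketch, not an official tool: the name ticket_info is hypothetical, and it still leaves you to paste the exact command and error message yourself.

```shell
# Print the context a ticket needs: host, timestamp, user, and current jobs.
ticket_info() {
    echo "host: $(hostname)"
    echo "time: $(date '+%Y-%m-%d %H:%M:%S %Z')"
    echo "user: ${USER:-$(id -un)}"
    # squeue only exists on cluster hosts; skip it quietly elsewhere.
    if command -v squeue >/dev/null 2>&1; then
        echo "jobs:"
        squeue --me
    fi
}
```

Run ticket_info on the host where the problem occurred and paste its output into the ticket along with the failing command and its full error text.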

Status and incidents¶

TODO: contact information for users
