Help & FAQ

If something is broken or confusing, this page is where to start. If your question isn't here, see How to ask for help below.

VPN issues

tailscale up opens the browser but the SSO login rejects me

  • Check that your OSU name.# is correct.
  • Confirm your account has been provisioned — if you can't log in to the account portal either, that's the underlying issue.
  • The TOTP code is 6 digits and rotates every 30 seconds. If codes are repeatedly rejected, check that your device clock is accurate and resync your authenticator.

I'm connected to the VPN but ping gpu-a.prescient.internal fails

  • Run tailscale status and confirm you see yourself as "active".
  • DNS for *.prescient.internal is pushed automatically when you join the VPN. If queries still fail, restart the Tailscale client.
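
The two checks above can be combined into one small script. This is a minimal sketch, assuming a Linux laptop (getent is not available on macOS); resolve_host and check_vpn are hypothetical helper names, not installed tools.

```shell
# Sketch: confirm the VPN is up and that a cluster hostname resolves.
# resolve_host and check_vpn are hypothetical helpers; getent is Linux-specific.
resolve_host() {
    # Print the first address the system resolver returns; fail if there is none.
    getent hosts "$1" | awk 'NR == 1 { print $1 }' | grep .
}

check_vpn() {
    host="$1"
    if ! command -v tailscale >/dev/null 2>&1; then
        echo "tailscale client not found on PATH"
        return 1
    fi
    # Show the peer list so you can check your own entry.
    tailscale status || return 1
    if addr=$(resolve_host "$host"); then
        echo "$host resolves to $addr"
    else
        echo "$host does not resolve; restart the Tailscale client and retry"
        return 1
    fi
}

# Usage: check_vpn gpu-a.prescient.internal
```

If the hostname resolves but ping still fails, the problem is likely not DNS, and the output of both commands is worth including in a ticket.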

I get logged out immediately on SSH

You have no active Slurm allocation on that host and you don't have operator-level access. This is intentional: pam_slurm_adopt closes SSH sessions from users who have no running job on the host. Allocate first:

# from your laptop:
ssh <name.#>@gpu-a.prescient.internal "srun --gres=gpu:0 --time=1:00:00 --pty true &"
# then SSH again — your session adopts into the allocation

In practice: open a tmux session on the host during your first allocation, and keep working in that session.

id <name.#> returns nothing on a compute host

The host can't reach the directory service. Usually transient — try again in a minute. If it persists, file a ticket (see below).

GPU and Slurm issues

srun says "Invalid qos specification"

You requested a QoS your account is not allowed to use. Ask the operators which QoS levels your account permits, or pick the default by omitting --qos=.

nvidia-smi inside my allocation shows fewer GPUs than I requested

  • Confirm with scontrol show job <jobid> that your Gres= line matches what you asked for.
  • If it does, file a ticket — this indicates a cgroup config bug, not a user error.
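
A quick way to compare the two numbers before filing a ticket is sketched below. It assumes the request shows up as "gres/gpu=N" in the scontrol output; the exact field differs between Slurm versions and site configurations, so adjust the pattern to what your Gres= line actually looks like. Both function names are hypothetical.

```shell
# Count the devices listed by `nvidia-smi -L` (one "GPU N: ..." line per device).
count_visible_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Extract the GPU count from `scontrol show job` text.
# Assumption: the count appears as "gres/gpu=N"; typed requests such as
# "gres/gpu:a100=N" are not handled by this sketch.
requested_gpus() {
    sed -n 's/.*gres\/gpu=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage inside an allocation:
#   want=$(scontrol show job "$SLURM_JOB_ID" | requested_gpus)
#   have=$(nvidia-smi -L | count_visible_gpus)
#   [ "$want" = "$have" ] || echo "mismatch: requested $want, visible $have"
```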

nvidia-smi outside an allocation shows all GPUs

That's expected for gpus-here and for nvidia-smi run directly on the host. Isolation only applies inside an allocation, where the job's cgroup hides unallocated devices from your processes.

My job was killed for "exceeding time limit"

You hit MaxWall, the maximum wall-clock time your QoS allows. Re-submit with --qos=long if your account allows it, or split the work into checkpoint-resumable chunks.
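
One way to make work checkpoint-resumable is to record each finished step in a state file, so a re-submitted job continues where the previous one stopped. The sketch below illustrates the pattern only; run_step, run_resumable, and the state file path are hypothetical, and real training code would usually checkpoint through its own framework.

```shell
# Sketch of a resumable loop. run_step stands in for one chunk of real work.
run_step() {
    echo "processing step $1"
}

run_resumable() {
    state_file="$1"
    total_steps="$2"
    next=1
    # Resume after the last completed step if a state file already exists.
    if [ -f "$state_file" ]; then
        next=$(( $(cat "$state_file") + 1 ))
    fi
    while [ "$next" -le "$total_steps" ]; do
        run_step "$next" || return 1
        echo "$next" > "$state_file"   # record progress only after success
        next=$(( next + 1 ))
    done
}

# Usage in a batch script:
#   run_resumable "/cache/users/$USER/myjob.state" 100
```

Because /cache is not backed up, a state file kept there should point at results you can regenerate or that also exist elsewhere.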

Container issues

enroot import runs out of space

The Enroot cache lives at /cache/users/$USER/enroot/. It sits on fast per-host local storage, which has no quota but can still fill up. Clean up old .sqsh files you don't reuse.
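
To find cleanup candidates, you can list .sqsh files that have not been modified for a while. This is a sketch: stale_sqsh is a hypothetical helper, and the 30-day default is an arbitrary illustration, not a site policy.

```shell
# List .sqsh files not modified in the last N days, for review before deletion.
stale_sqsh() {
    dir="$1"
    days="${2:-30}"
    find "$dir" -type f -name '*.sqsh' -mtime +"$days" -print
}

# Usage:
#   df -h "/cache/users/$USER"                 # how full is the local disk?
#   stale_sqsh "/cache/users/$USER/enroot"     # review the candidates
#   rm "/cache/users/$USER/enroot/<file>.sqsh" # delete the ones you don't need
```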

srun --container-image=docker.io/... fails to pull

  • Some registries rate-limit anonymous pulls. Pre-import once with enroot import (which can authenticate), then run from the local .sqsh.
  • Check the image exists for your CPU architecture (most are linux/amd64).
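
The pre-import workflow might look like the sketch below. Enroot addresses registry images as docker://REGISTRY#IMAGE:TAG, so a small helper (hypothetical, named to_enroot_uri here) converts the familiar registry/path:tag form. The image name and output path are examples only, and the credentials location mentioned in the comments is the usual default, which your site may override.

```shell
# Convert "registry/path:tag" into Enroot's docker://REGISTRY#IMAGE:TAG form.
# Assumes the reference starts with a registry host such as docker.io.
to_enroot_uri() {
    ref="$1"
    registry="${ref%%/*}"
    image="${ref#*/}"
    echo "docker://${registry}#${image}"
}

# Usage (example image and path):
#   uname -m    # check your architecture; x86_64 corresponds to linux/amd64
#   enroot import -o "/cache/users/$USER/enroot/ubuntu.sqsh" \
#       "$(to_enroot_uri docker.io/library/ubuntu:22.04)"
#   srun --container-image="/cache/users/$USER/enroot/ubuntu.sqsh" --pty bash
#
# For authenticated pulls, Enroot typically reads netrc-style entries from
# ~/.config/enroot/.credentials.
```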

A container can't see GPUs

  • Confirm you allocated GPUs (--gres=gpu:N). Pyxis only injects what Slurm granted you.
  • Confirm the container image has the NVIDIA Container Toolkit's expected runtime — most ML images do.

Storage issues

I'm out of /home quota

Run myquota to confirm. Move bulky working files to /cache/users/$USER (which has no quota). For genuine quota increases, your PI can request one.

My venv is slow to import

It's probably on /home. Recreate it under /cache/users/$USER/venvs/:

uv venv /cache/users/$USER/venvs/myenv

NFS lookups for many small .py files dominate import time.

I lost data on /cache or /scratch

These tiers are not backed up. A disk failure on the local pool will take everything with it. Always keep an authoritative copy on /home or upstream (Git, HuggingFace, S3) for anything you can't regenerate.

How to ask for help

Before opening a ticket, gather:

  1. The exact command you ran.
  2. The exact error message (paste, don't paraphrase).
  3. Hostname (hostname), date/time, and your name.#.
  4. Job ID if applicable (squeue --me).
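
Most of this list can be collected in one go. The helper below is a sketch: ticket_info is a hypothetical function, not an installed site tool, and it reports your local username, which may differ from your name.# on a laptop.

```shell
# Print the details listed above in a form that can be pasted into a ticket.
ticket_info() {
    echo "Host:    $(hostname)"
    echo "Date:    $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
    echo "User:    $(id -un)"
    echo "Command: $*"
    # List your jobs only where Slurm is available.
    if command -v squeue >/dev/null 2>&1; then
        echo "Jobs:"
        squeue --me
    fi
}

# Usage: pass the command that failed as arguments, then paste the exact
# error message underneath the output.
#   ticket_info srun --gres=gpu:1 --pty bash
```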

Status and incidents

TODO: contact information for users