
Slurm: Your Cluster's Traffic Controller

info

If you're only working on the BASIC Lab server, Slurm might not be necessary yet. However, if you plan to use NCHC resources, then learning Slurm is a must. All NCHC clusters are managed through Slurm.

Introduction

So you've finally gotten access to NCHC's fancy compute cluster. No more waiting three days for slow, outdated GPUs to finish training that model. You're ready for the big leagues!

You SSH in, navigate to your directory, and type the command you've been using on your local machine for months: python my_awesome_script.py

And then... someone taps you on the shoulder (virtually or literally) and says, "Oh no, you can't just run things directly here. You need to submit it through Slurm."

Slurm? What's a Slurm?

This is THE moment. The transition from "I run things on my own computer" to "I run things on a shared cluster" is where most of us first encounter job schedulers. And if you're anything like me, your first reaction was probably a mix of confusion and mild annoyance. On your laptop or lab server, you just... run things. Why is this so complicated?

Here's the thing: when you're sharing a cluster with dozens (or hundreds) of other people, you can't just run things whenever you want. If everyone did that, it would be absolute chaos – imagine 50 people all trying to grab the same 32 cores at once. Someone's job would crash, someone else would get nothing, and the whole system would become a nightmare.

That's why Slurm exists. And once you get over the initial learning curve (which, I'll be honest, can be a bit steep), you'll actually appreciate what it does for you.

What Actually Is Slurm?

Okay, let's start with the basics. Slurm stands for "Simple Linux Utility for Resource Management". And yes, I know what you're thinking: calling anything in HPC "simple" is a bold choice. But compared to manually coordinating who gets to use which compute nodes when, it actually is pretty simple.

Think of Slurm as an air traffic controller for your computing cluster.

Imagine an airport where dozens of planes need to take off and land, but there are only a few runways. The air traffic controller makes sure everyone gets their turn safely, decides which planes have priority, coordinates timing so nobody crashes into each other, and keeps track of everything happening at once.

That's exactly what Slurm does for your cluster. You (and everyone else) come in saying:

"Hey, I need 16 CPU cores, 64GB of RAM, and about 3 hours of compute time."

Slurm looks at what's currently available, what's in the queue, and figures out when it can safely "land" your job on the cluster. Maybe you get resources immediately, or maybe Slurm tells you, "You're number 5 in line, estimated wait time: 47 minutes."

More technically speaking, Slurm is a workload manager and job scheduler for Linux clusters. It does three main things:

  • Allocates resources: It decides who gets which compute nodes, how many cores, how much memory, which GPUs, etc.
  • Monitors jobs: It keeps track of what's running, what's waiting, what failed, and what finished successfully (and occasionally what crashed spectacularly).
  • Manages the queue: It maintains a fair system where everyone gets their turn based on priority, availability, and a bunch of policies your system administrator configured (probably while drinking coffee at 3 AM).

The beauty of Slurm is that it sits between you and the actual hardware. Instead of you having to SSH into different nodes, check what's available[1], hope nobody else jumps on the same node, and manually juggle resources, Slurm handles all of that. You just tell it what you need, and it handles the logistics.

Why Bother with Slurm?

Alright, so you know what Slurm is. But you might still be thinking, "This sounds like extra work. Why can't I just run my code like I did?"

  • Fair Sharing: Remember when someone took up all the GPUs[2]? Without Slurm, that's just... allowed. First come, first served. With Slurm, there are rules. Fair-share policies ensure everyone gets a reasonable slice of the computing pie. If you've been hogging resources, your priority drops. If you've barely used anything, you get bumped up.
  • Efficiency: Here's a dirty secret about compute clusters without job schedulers: they're often terribly underutilized. Someone reserves a node "just in case," but their job only uses 30% of the resources. Another person's job finishes early, but the node sits idle because nobody knows it's available. Slurm is like a Tetris master for resources. It packs jobs efficiently, fills in gaps when shorter jobs can squeeze in, and immediately makes resources available when jobs finish. A well-managed Slurm cluster can hit 80-90% utilization, compared to maybe 40-50% for a free-for-all system. That means your job gets scheduled faster, the cluster does more science per dollar, and your advisor stops complaining about the compute budget. Win-win-win.
  • Resource Tracking: Ever wonder who's been using all the GPUs? With Slurm, you can check. Want to see how much compute time your group used last month? Slurm tracks that. Need to prove to your advisor that yes, you really did run those 500 experiments? Slurm has receipts.
  • Sanity Preservation: Picture your life without Slurm: You SSH into basic1. Nope, all cores busy. You SSH into basic2. Also full. You SSH into basic3... you get the idea. Finally, you find basic17 has some free cores[3]. You start your job. Three hours later, someone else didn't check properly and starts their job on the same node. Your job crashes. You cry. With Slurm? You submit once. Slurm finds the resources. You go get coffee (or sleep, or actually work on your paper). You get an email when it's done. Your blood pressure stays normal.
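
If you're curious how the fair-share policy mentioned above is treating you, Slurm's sshare command can show your usage and priority factor (assuming accounting is enabled on your cluster, which it usually is whenever fair-share is in play):

> sshare -u $USER

The FairShare column is the number the scheduler folds into your job priority: the more you've used recently, the lower it gets.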

The Basic Workflow

Okay, enough theory. Let's talk about what actually happens when you want to run something on a Slurm cluster. The good news? It's really just four steps. The bad news? Each step has its own quirks. But don't worry, we'll keep it simple.

Step 1: Write Your Job Script

Instead of just running python my_script.py directly, you write a small bash script that tells Slurm:

  • What resources you need (cores, memory, GPUs, etc.)
  • How long you'll need them
  • Where to put the output
  • What command to actually run

Think of it like filling out a form before getting in line.

"Hi, I need 4 cores, 16GB RAM, and I'll be done in 2 hours. Oh, and run this Python script for me."

Here's what a super basic job script looks like:

#!/bin/bash
#SBATCH --job-name=my_awesome_job
#SBATCH --output=logs/result.out
#SBATCH --error=logs/result.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

python my_script.py

Those lines starting with #SBATCH are instructions for Slurm (not just comments, even though they look like them). The last line is your actual command.

important

Always start with #!/bin/bash on the very first line. This is called a "shebang" and it tells the system to run your script using bash. Without it, Slurm might fail to execute the job.

Step 2: Submit Your Job to the Queue

You submit your job script to Slurm using sbatch:

> sbatch my_job_script.sh

Slurm responds with something like:

Submitted batch job 123456

That number (123456) is your job ID. At this point, your job enters the queue. It might start immediately if resources are available, or it might wait if the cluster is busy.
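
One small convenience if you're scripting around sbatch: the --parsable flag makes it print just the job ID (no "Submitted batch job" text), which is handy for capturing the ID in a variable:

> JOBID=$(sbatch --parsable my_job_script.sh)
> echo "Submitted job $JOBID"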

Step 3: Wait (and Maybe Check on It)

Now you wait. But you're not just staring at a blank screen wondering what's happening. You can check your job's status:

> squeue -u $USER

This shows you something like:

JOBID    PARTITION    NAME              USER    ST   TIME  NODES
123456   gpu          my_awesome_job    you     R    0:15  1

That ST column shows the status:

  • PD = Pending (waiting in queue)
  • R = Running (woohoo!)
  • CG = Completing (almost done)

You can also check more details (e.g., resources and setup) about your specific job:

> scontrol show job 123456
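
If your job is still pending, you can also ask Slurm for its estimated start time. Take the estimate with a grain of salt (it often shows N/A until the backfill scheduler has planned your job, and it shifts as other jobs finish early or get cancelled):

> squeue --start -j 123456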

Step 4: Get Your Results

When your job finishes, Slurm writes the output to the file you specified (logs/result.out in our example). Any errors go to a separate error file (logs/result.err in our example).

> cat logs/result.out

And that's it! Your job ran, you got results, and you didn't have to manually babysit it or hunt for available nodes.

Getting Started: Your First Slurm Job

note

You can also refer to the NCHC documentation for more examples.

Here are the essential Slurm commands that'll become part of your daily vocabulary:

sbatch - "Submit a Job"

This is how you submit your job script to the queue.

> sbatch my_job.sh

Simple as that. Slurm gives you back a job ID, and your job enters the queue.

Quick tip: You can also override settings from your script directly on the command line:

> sbatch --cpus-per-task=8 --mem=32G my_job.sh

This is handy when you want to test different resource configurations without editing your script every time.

squeue - "Where's My Job?"

This is the command you'll probably use most often. It shows you what's currently in the queue.

> squeue -u $USER

Shows only your jobs. Without the -u flag, you'll see everyone's jobs, which can be... overwhelming on a busy cluster. Want more details? Add some formatting:

> squeue -u $USER --format="%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

Yeah, that's ugly. Most people create an alias for this in their shell's rc file (.bashrc, .zshrc, and so on).
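
For example, something like this (the alias name is just a suggestion):

alias sq='squeue -u $USER --format="%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"'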

warning

Don't use watch squeue or constantly run squeue in a loop! The squeue command itself needs to query the scheduler, and if everyone is hammering it every second, it can actually slow down the entire cluster. Check your job status occasionally, not obsessively. If you really need continuous monitoring, space it out to at least 30-60 seconds between checks.
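
If you genuinely need a live view, at least widen the interval:

> watch -n 60 "squeue -u $USER"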

scancel - "Oops, Wrong Script"

Made a mistake? Need to stop a job? This is your panic button.

> scancel 123456

Cancels job 123456. Simple. Want to cancel ALL your jobs? (Use with caution!)

> scancel -u $USER

warning

There's no "undo" button, so make sure you're canceling the right job!
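
Two other handy variants (both are standard scancel flags): cancel by job name, or cancel only the jobs of yours that are still waiting in the queue:

> scancel --name=my_awesome_job
> scancel --state=PENDING -u $USER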

sinfo - "See What's Available"

Want to know what resources are actually available on the cluster?

> sinfo

This shows you all the partitions (think of them as different "queues" for different types of jobs), how many nodes are in each, and their current state.

> sinfo -N -l

Shows more detailed info (-l) about individual nodes (-N) – useful when you're trying to figure out why your job is stuck in the queue.

For GPU information specifically:

> sinfo -o "%20P %5D %8G %N"

This shows which partitions have GPUs and how many. You can also check detailed GPU availability with:

> sinfo -p partition --Format=NodeList,Gres,GresUsed

(Replace partition with your actual partition name)

Real Example

Enough commands. Let's actually submit a deep learning job.

Step 1: Create your training script (train_model.py):
import torch
import torch.nn as nn
import time

print("=" * 50)
print("GPU Availability Check")
print("=" * 50)

# Check CUDA availability
if torch.cuda.is_available():
    print(f"✓ CUDA is available!")
    print(f"✓ Number of GPUs: {torch.cuda.device_count()}")
    print(f"✓ Current GPU: {torch.cuda.current_device()}")
    print(f"✓ GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("✗ CUDA is not available. Running on CPU.")

print("=" * 50)

# Simple model for demonstration
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

if torch.cuda.is_available():
    model = model.cuda()
    print("Model moved to GPU")

print(f"Model has {sum(p.numel() for p in model.parameters())} parameters")

# Simulate training
print("\nStarting training simulation...")
for epoch in range(5):
    print(f"Epoch {epoch+1}/5")
    time.sleep(2)  # Simulate some work

print("\n✓ Training complete!")
print("This is where your actual training would happen.")
Step 2: Create your Slurm job script (train.sh):
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --partition=dev

# Create logs directory if it doesn't exist
mkdir -p logs

echo "=========================================="
echo "Job started at: $(date)"
echo "Running on node: $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
echo "Job name: $SLURM_JOB_NAME"
echo "=========================================="

# Show GPU assignment
echo "Assigned GPU(s): $CUDA_VISIBLE_DEVICES"
nvidia-smi

# Activate your virtual environment if you have one
# 1. using conda (might require module loading)
# ml load miniconda3
# conda activate torch
# 2. using pyenv or uv
# source ~/venv/bin/activate

echo ""
echo "Starting training script..."
python train_model.py

echo ""
echo "=========================================="
echo "Job finished at: $(date)"
echo "=========================================="

Notice a few things:

  • %x gets replaced with the job name, %j with the job ID
  • We're saving logs to a logs/ directory to keep things organized
  • --gres=gpu:1 requests 1 GPU
  • We're using the partition named dev (your cluster might call it something different)
  • We print GPU assignment and run nvidia-smi to verify we got the GPU
Step 3: Submit it
> sbatch train.sh

You'll see:

Submitted batch job 73438
Step 4: Check on it:
> squeue -u $USER
Step 5: When it's done, check the output:
> cat logs/train_model_73438.out

You should see something like:

==========================================
Job started at: Sat Nov 01 10:30:45 2025
Running on node: gpu-node-03
Job ID: 73438
Job name: train_model
==========================================
Assigned GPU(s): 0

[nvidia-smi output showing GPU info]

Starting training script...
==================================================
GPU Availability Check
==================================================
✓ CUDA is available!
✓ Number of GPUs: 1
✓ Current GPU: 0
✓ GPU Name: NVIDIA A100-SXM4-40GB
✓ GPU Memory: 40.00 GB
==================================================
Model moved to GPU
Model has 203,530 parameters

Starting training simulation...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

✓ Training complete!
This is where your actual training would happen.
==========================================
Job finished at: Sat Nov 01 10:31:00 2025
==========================================

Understanding the Job Script Options

Let's break down what those #SBATCH lines actually mean:

  • --job-name: Give your job a memorable name (shows up in squeue)
  • --output: Where to save standard output (%x = job name, %j = job ID)
  • --error: Where to save error messages
  • --ntasks: Number of tasks (usually 1 unless you're doing MPI)
  • --cpus-per-task: How many CPU cores you need. 4 to 8 is a reasonable starting point for typical data processing; go up to 16 or 32 for heavier, more parallel workloads.
  • --gres=gpu:1: Request 1 GPU (use gpu:2 for 2 GPUs, etc.)
  • --partition: Which partition to use (e.g., dev, normal)

(Optional) Below are some optional settings that you might not need to worry about at first. You can leave them out when you're getting started. If these settings aren't specified, Slurm will use default values (though it's generally better to specify them explicitly once you know what your job needs).

  • --mem: Total memory your job needs (use M for megabytes, G for gigabytes, e.g., 16G or 32000M)
  • --time: Maximum runtime before Slurm kills your job (format: HH:MM:SS or DD-HH:MM:SS for longer jobs, e.g., 02:30:00 for 2.5 hours or 3-12:00:00 for 3 days and 12 hours)
tip

While these are technically optional, it's a good habit to always specify them. If you don't set a time limit and your job hangs, it could sit there wasting resources. And if you don't set memory, you might get way less (or way more) than you actually need, affecting your queue wait time.

Email Notifications

As mentioned before, you shouldn't run squeue obsessively. An alternative is to let Slurm email you instead.

#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your.email@is.cool.com

Mail types:

  • BEGIN - When job starts
  • END - When job finishes successfully
  • FAIL - When job fails
  • ALL - All of the above
tip

Use END, FAIL to avoid inbox spam from dozens of "job started" emails.

Interactive Jobs for Debugging

important

This is actually super important and often overlooked! Your code might work perfectly on your laptop or local machine, but then mysteriously fail when you submit it to the cluster. Why? Different OS versions, different library versions, different CUDA versions, different compilers - the list goes on. This is especially painful with packages that require compilation (flash-attn, PyTorch with custom CUDA extensions, or anything involving C++ bindings).

Instead of the frustrating cycle of "submit job → wait in queue → job fails → edit script → submit again → wait → fails again," use interactive jobs to debug directly on the cluster environment:

> srun --pty --cpus-per-task=4 --mem=16G --gres=gpu:1 --time=01:00:00 bash

This gives you an interactive shell on a compute node with resources allocated. You're now in the exact same environment where your batch jobs will run - same OS, same libraries, same CUDA drivers, everything. When done, just type exit.

Perfect for:

  • Testing if your code actually works in the cluster environment before submitting a long job
  • Debugging "it works on my machine but not on the cluster" issues
  • Installing packages in a virtual environment (some packages need to be compiled for the specific system)
  • Checking GPU compatibility issues
  • Verifying module loads and environment variables
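
Once you've landed on the node, a quick sanity check before launching anything long can save a lot of head-scratching. A minimal example (adjust the environment activation to whatever you actually use):

# What did the scheduler actually give me?
nvidia-smi
echo $CUDA_VISIBLE_DEVICES

# Does my environment see the GPU? (activate your venv/conda env first)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"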

Common Mistakes

We've all been there. You submit a job, feeling confident, and then... it fails spectacularly. Or worse, it sits in the queue for three days. Here are the classic mistakes everyone makes (so you can skip straight to the less common ones):

Requesting Way Too Many Resources

  • The mistake: You need 4 cores, but you request 128 "just to be safe."

  • Why it's bad: Slurm has to wait until 128 cores are available. Meanwhile, people requesting reasonable amounts are getting scheduled first. Your job sits in the queue forever while smaller jobs zoom past you.

  • The fix: Request what you actually need, plus maybe 20% buffer. Test with smaller resources first, then scale up if needed. Use sacct to check what your past jobs actually used:

    > sacct -j your_job_id --format=JobID,MaxRSS,MaxVMSize,CPUTime,Elapsed

    This shows you the actual memory and CPU your job used. It's often much less than you requested.
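
    Some clusters also install the seff helper (it ships as a Slurm contrib tool, so it may or may not be available on yours), which summarizes the same information in a friendlier way:

    > seff your_job_id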

Not Understanding Partitions/Queues

  • The mistake: You submit to the default partition, not realizing there's a special GPU partition, or a "short" queue that gets scheduled faster.
  • Why it's bad: You're waiting behind 50 other jobs when there's an empty express lane right next to you.
  • The fix: Run sinfo to see available partitions and check your cluster documentation. Most clusters have something along the lines of a dev/debug partition for short test jobs, a normal/batch partition for standard work, and a gpu partition for GPU nodes; the exact names and limits vary, so look them up before you submit.

Not Checking Your Job Actually Started Correctly

  • The mistake: You submit a job, see it's running (R status in squeue), assume everything is fine, and go home for the weekend.

  • Why it's bad: Your job is "running" but actually crashed in the first 30 seconds due to a typo. It's just sitting there doing nothing for 3 days.

  • The fix: After submitting, wait a minute or two, then check the output file:

    sbatch train_job.sh
    # Wait 2-3 minutes
    tail -f logs/train_model_123456.out
    # or
    tail -f logs/train_model_123456.err

    Make sure you see actual output, not just error messages. Press Ctrl+C to exit tail when you're satisfied. The tail -f command is great for watching files as they update in real-time, like a live feed of your job's progress.

important

Don't panic if you see messages in the *.err file! Despite the name, not everything in the error file is actually an error. Many programs print normal informational messages, warnings, and progress updates to stderr, which ends up in your *.err file. Meanwhile, your *.out file might be empty or only contain your explicit echo statements. Therefore, always check BOTH files - *.out AND *.err - to get the full picture of what your job is doing. The *.err file often contains the most useful information, even when everything is working perfectly fine.
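
Since tail -f can follow several files at once, you can keep an eye on both in a single terminal:

> tail -f logs/train_model_123456.out logs/train_model_123456.err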

Wrapper Script for Dynamic Job Submission

Here's a common frustration: Slurm job scripts don't take command-line arguments in the way you might expect. Anything you put after the script name does get passed to the script itself, but the #SBATCH directives can't see those arguments, so sbatch my_job.sh --learning-rate 0.001 won't let you vary resources or log paths, and it only reaches your Python code if the script explicitly forwards its arguments.

The problem: You want to run the same experiment with different hyperparameters, seeds, or datasets, but you don't want to manually edit your job script 20 times or create 20 different files.

The solution: Create a bash wrapper script that takes arguments and generates + submits the Slurm job for you!

Example

Let's say you want to run training jobs with different learning rates and batch sizes. Here's a wrapper script:

submit_training.sh
#!/bin/bash

# Check if correct number of arguments provided
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <learning_rate> <batch_size> <experiment_name>"
echo "Example: $0 0.001 32 exp1"
exit 1
fi

# Parse arguments
LR=$1
BATCH_SIZE=$2
EXP_NAME=$3

# Create a temporary job script
JOB_SCRIPT=$(mktemp /tmp/slurm_job_XXXXXX.sh)

# Write the Slurm job script dynamically
cat > $JOB_SCRIPT << EOF
#!/bin/bash
#SBATCH --job-name=train_${EXP_NAME}
#SBATCH --output=logs/train_${EXP_NAME}_%j.out
#SBATCH --error=logs/train_${EXP_NAME}_%j.err
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --partition=gpu

echo "=========================================="
echo "Experiment: ${EXP_NAME}"
echo "Learning Rate: ${LR}"
echo "Batch Size: ${BATCH_SIZE}"
echo "Job ID: \$SLURM_JOB_ID"
echo "=========================================="

module load python/3.10 cuda/11.8
source ~/venv/bin/activate

# Run training with specified parameters
python train.py \\
    --learning-rate ${LR} \\
    --batch-size ${BATCH_SIZE} \\
    --experiment-name ${EXP_NAME} \\
    --output-dir results/${EXP_NAME}

echo "Training completed at \$(date)"
EOF

# Make sure logs directory exists
mkdir -p logs
mkdir -p results/${EXP_NAME}

# Submit the job
echo "Submitting job for experiment: ${EXP_NAME}"
echo " Learning rate: ${LR}"
echo " Batch size: ${BATCH_SIZE}"
sbatch $JOB_SCRIPT

# Clean up the temporary script
rm $JOB_SCRIPT

echo "Job submitted successfully!"

Now you can easily submit jobs with different parameters:

> ./submit_training.sh 0.001 32 exp_lr001_bs32
> ./submit_training.sh 0.01 64 exp_lr01_bs64
> ./submit_training.sh 0.0001 128 exp_lr0001_bs128

Or even loop through multiple configurations:

for lr in 0.001 0.01 0.0001; do
    for bs in 32 64 128; do
        ./submit_training.sh $lr $bs exp_lr${lr}_bs${bs}
    done
done

With Slurm parameters as arguments

Here's a more sophisticated version that includes optional parameters:

submit_training_with_slurm_params.sh
#!/bin/bash

# Default values
CPUS=4
MEM="16G"
GPUS=1
TIME="04:00:00"
PARTITION="gpu"

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --lr)
            LR="$2"
            shift 2
            ;;
        --batch-size)
            BATCH_SIZE="$2"
            shift 2
            ;;
        --exp-name)
            EXP_NAME="$2"
            shift 2
            ;;
        --cpus)
            CPUS="$2"
            shift 2
            ;;
        --mem)
            MEM="$2"
            shift 2
            ;;
        --gpus)
            GPUS="$2"
            shift 2
            ;;
        --time)
            TIME="$2"
            shift 2
            ;;
        --partition)
            PARTITION="$2"
            shift 2
            ;;
        *)
            echo "Unknown option: $1"
            echo "Usage: $0 --lr <value> --batch-size <value> --exp-name <name> [options]"
            echo "Options:"
            echo "  --cpus <num>         Number of CPUs (default: 4)"
            echo "  --mem <size>         Memory (default: 16G)"
            echo "  --gpus <num>         Number of GPUs (default: 1)"
            echo "  --time <time>        Time limit (default: 04:00:00)"
            echo "  --partition <name>   Partition name (default: gpu)"
            exit 1
            ;;
    esac
done

# Check required arguments
if [ -z "$LR" ] || [ -z "$BATCH_SIZE" ] || [ -z "$EXP_NAME" ]; then
echo "Error: --lr, --batch-size, and --exp-name are required"
exit 1
fi

# Create temporary job script
JOB_SCRIPT=$(mktemp /tmp/slurm_job_XXXXXX.sh)

cat > $JOB_SCRIPT << EOF
#!/bin/bash
#SBATCH --job-name=train_${EXP_NAME}
#SBATCH --output=logs/train_${EXP_NAME}_%j.out
#SBATCH --error=logs/train_${EXP_NAME}_%j.err
#SBATCH --cpus-per-task=${CPUS}
#SBATCH --mem=${MEM}
#SBATCH --gres=gpu:${GPUS}
#SBATCH --time=${TIME}
#SBATCH --partition=${PARTITION}

echo "=========================================="
echo "Experiment: ${EXP_NAME}"
echo "Learning Rate: ${LR}"
echo "Batch Size: ${BATCH_SIZE}"
echo "CPUs: ${CPUS}, Memory: ${MEM}, GPUs: ${GPUS}"
echo "Job ID: \$SLURM_JOB_ID"
echo "=========================================="

module load python/3.10 cuda/11.8
source ~/venv/bin/activate

python train.py \\
    --learning-rate ${LR} \\
    --batch-size ${BATCH_SIZE} \\
    --experiment-name ${EXP_NAME} \\
    --output-dir results/${EXP_NAME}

echo "Training completed at \$(date)"
EOF

mkdir -p logs results/${EXP_NAME}

echo "Submitting job: ${EXP_NAME}"
echo " LR: ${LR}, Batch Size: ${BATCH_SIZE}"
echo " Resources: ${CPUS} CPUs, ${MEM} RAM, ${GPUS} GPUs"
echo " Time limit: ${TIME}, Partition: ${PARTITION}"

sbatch $JOB_SCRIPT
rm $JOB_SCRIPT

echo "Job submitted!"

To use this script, you can run:

# Basic usage
> ./submit_training_with_slurm_params.sh --lr 0.001 --batch-size 32 --exp-name my_exp

# With custom resources
> ./submit_training_with_slurm_params.sh \
--lr 0.001 \
--batch-size 32 \
--exp-name big_exp \
--cpus 8 \
--mem 32G \
--gpus 2 \
--time 12:00:00
note

The wrapper creates temporary job scripts that are deleted after submission. If you want to keep them for debugging, you can save them to a directory instead:

# Instead of mktemp and rm, use:
JOB_SCRIPT="job_scripts/train_${EXP_NAME}_$(date +%Y%m%d_%H%M%S).sh"
mkdir -p job_scripts
# ... write to $JOB_SCRIPT ...
# Don't delete it

Wrapping Up

So there you have it - Slurm in a nutshell (okay, maybe a large nutshell).

Yes, Slurm has a learning curve. Yes, you'll probably forget to include #!/bin/bash at least three more times. Yes, you'll submit a job and then immediately realize you used the wrong script. We've all been there. But here's the thing: once you get the hang of it, Slurm becomes invisible. You stop thinking about the scheduler and just think about your research. You submit jobs before leaving for the day and come back to results in the morning. You run 100 experiments in parallel without breaking a sweat. You stop SSH-ing into random nodes hoping to find free GPUs.

What's Next? Advanced Topics to Explore:

Once you're comfortable with the basics, here are some powerful features worth exploring:

  • Array jobs (--array) - Run the same script hundreds of times with different parameters in one submission (see the short sketch below)
  • Job dependencies (--dependency) - Chain jobs together so they run sequentially without manual intervention
  • Job history (sacct) - Analyze past jobs to optimize your resource requests
  • Job templates - Create reusable script templates for common workflows

These features can seriously streamline your workflow once you're ready for them!
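
To give you a taste of array jobs: the minimal sketch below (hypothetical script, and it assumes a logs/ directory already exists) asks Slurm to run the same script ten times. Each copy gets its own value of SLURM_ARRAY_TASK_ID, which you can use to pick a seed, a config file, or a dataset shard:

#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --output=logs/sweep_%A_%a.out
#SBATCH --error=logs/sweep_%A_%a.err
#SBATCH --array=0-9
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00

# Slurm sets SLURM_ARRAY_TASK_ID to 0..9, one value per task;
# here we simply reuse it as the random seed for otherwise identical runs.
python train.py --seed ${SLURM_ARRAY_TASK_ID}

One sbatch submission queues all ten tasks, and squeue lists them as jobid_0 through jobid_9.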

A few final thoughts:

  • Start simple. Get basic jobs working before you try fancy array jobs or dependencies.
  • Check your cluster's specific documentation. Every cluster is configured slightly differently.
  • Be a good cluster citizen. Don't request way more resources than you need. Cancel jobs you don't need anymore. Don't spam squeue every 2 seconds.
  • Your sysadmins are your friends. If something genuinely doesn't make sense, ask them (send them an email). They'd much rather answer questions than debug why the cluster is slow because someone is running watch squeue in an infinite loop.

Footnotes

  1. Fun fact: on your BASIC Lab server, you usually just see what's available right when you log in, no detective work required.

  2. Okay, full confession: I'm totally that person who occasionally hogs all the H100 GPUs. 😔

  3. By the time Prof. Shuai retires, we'll probably be up to basic100. The struggle is real.