Slurm Job Scheduling
The IST cluster uses Slurm to schedule jobs and enforce cluster policies, e.g. per-user limits on the number of jobs, run time, and resources (nodes, CPUs, GPUs), as well as scheduling policies.
Slurm Overview
Slurm is an open-source job scheduler that is responsible for allocating resources to each job, prioritizing users' jobs, and monitoring them.
To submit a job to Slurm, users need to define job settings such as account, partition, memory, and CPU cores. Slurm then allocates resources for the job and executes it. The partitions and the limits for each job are listed in the cluster policy section.
There are two ways to execute jobs (your scripts) via Slurm:
- sbatch: Submit your script to a partition.
ex. sbatch helloworld.sub
- srun: Run parallel jobs
We also provide a sinteractive command, which runs an interactive session and makes it easier to test programs.
Slurm Terminologies
- Account: An account in the Slurm system. One user can have many accounts.
- User: A user in the Linux system.
- Partition: A job queue. Jobs submitted by users are queued in a partition and executed in turn.
- Time: The time limit for a job.
- Hardware Specification: The amount of each resource, such as nodes, cores, GPUs, and memory.
- Quality of Service (QoS): A QoS limits the hardware resources and run time available to an account or a group of accounts.
Example: Hello World
In this example, we will run helloworld.py using sbatch and sinteractive.
## helloworld.py
#!/bin/python
# Parenthesized form runs under both Python 2 and Python 3.
print("Hello Worldddd")
View the accounts and partitions available to your user
[songpon@ist-frontend-001 ~]$ myassoc
Account QOS Def QOS Partition
---------- -------------------- --------- ----------
scads cpu
scads dgx
scads bash-cpu
scads bash-dgx
SBATCH
Define the resources in the sbatch script (helloworld.sub)
#!/bin/bash -l
#SBATCH --mem=50mb
#SBATCH --nodes=1
#SBATCH --partition=cpu
#SBATCH --account=scads
srun python helloworld.py
Submit the job to the queue
sbatch helloworld.sub
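The job settings mentioned earlier (account, partition, time, memory, cores) all map to sbatch options. The following is a minimal sketch, not part of the cluster's official examples: it writes a variant of the script with a job name, a custom output file, and a time limit, then submits it only if sbatch is installed. The file name helloworld-extended.sub, the job name, and the five-minute limit are assumptions chosen for illustration; check the cluster policy section for the limits that actually apply to your partition.

```shell
#!/bin/bash
# Hypothetical file name, used only for this illustration.
script=helloworld-extended.sub

# Write the batch script. The quoted 'EOF' keeps %j literal so that
# Slurm, not the shell, expands it to the job ID.
cat > "$script" <<'EOF'
#!/bin/bash -l
# Name shown in the queue listing (assumed value).
#SBATCH --job-name=helloworld
# Output file; %j is replaced by the job ID.
#SBATCH --output=helloworld-%j.out
# Wall-clock limit as HH:MM:SS (assumed to fit the partition's limit).
#SBATCH --time=00:05:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=50mb
#SBATCH --nodes=1
#SBATCH --partition=cpu
#SBATCH --account=scads

srun python helloworld.py
EOF

# Submit only where Slurm is available; otherwise just report the file.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$script"
else
    echo "sbatch not found; wrote $script without submitting" >&2
fi
```

Setting an explicit time limit is optional, but a short limit can help the scheduler fit a small job into gaps between larger ones.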
View the output
[songpon@ist-frontend-001 test-job]$ ls
hello.sub helloworld.py output slurm-1667.out test.sh
[songpon@ist-frontend-001 test-job]$ cat slurm-1667.out
Hello Worldddd
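The output file name follows from the job ID: when a script sets no --output option, Slurm writes to slurm-<jobid>.out in the submission directory, and sbatch normally reports the ID as "Submitted batch job <jobid>". The sketch below shows one way to derive the file name from that message. The helper names job_id_from_message and default_output_file are hypothetical, and the message is hard-coded with the job ID from the listing above so that the snippet runs without a cluster.

```shell
#!/bin/bash
# Extract the job ID from sbatch's "Submitted batch job <id>" message.
job_id_from_message() {
    # ${1##* } removes everything up to the last space, leaving the last word.
    printf '%s\n' "${1##* }"
}

# Build the default output file name used when no --output option is set.
default_output_file() {
    printf 'slurm-%s.out\n' "$1"
}

# On the cluster this would be: message=$(sbatch helloworld.sub)
message="Submitted batch job 1667"
job_id=$(job_id_from_message "$message")
out_file=$(default_output_file "$job_id")
echo "$out_file"   # prints slurm-1667.out

# List your own pending and running jobs, only where Slurm is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$(id -un)"
fi
```

Once the job no longer appears in the squeue listing, it has finished and the output file can be read with cat.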
SINTERACTIVE
Define the resources as sinteractive arguments, which are the same as the arguments for srun.
sinteractive -A scads -p bash-cpu --mem=1gb -c 1 -N 1
python helloworld.py