Submitting Jobs
OS HPC is an autoscaling cluster on Google Cloud managed by the Slurm Workload Manager.
Jobs may take up to five minutes to start if a node is not provisioned at the time of submission.
Slurm Account and QOS
When attending a hackathon, you must use the Slurm account and QOS designated for the event; the examples in this guide use the ornl account and the ornlq QOS.
When running batch jobs, your batch headers must include the following:
#SBATCH --account=ornl
#SBATCH --qos=ornlq
When running interactive jobs (e.g., on a V100 GPU):
srun --account=ornl --qos=ornlq --partition=v100-gpu --gres=gpu:1 --pty /bin/bash
Submit Batch Jobs
Batch jobs are useful when you have an application that can run unattended for long periods of time. You run a batch job by using the sbatch command with a batch file.
Writing a batch file
A batch file consists of two sections:
Batch header - Communicates settings to Slurm that specify your Slurm account, the compute partition to submit the job to, the number of tasks to run, the amount of resources (CPU, GPU, and memory), and the task affinity.
Shell script instructions
Below is a simple example batch file:
#!/bin/bash
#SBATCH --account=ornl
#SBATCH --qos=ornlq
#SBATCH --partition=v100-gpu
#SBATCH --job-name=example_job_name
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --output=serial_test_%j.log
hostname
The above batch file sets several options that dictate how the job will be executed.
--account=ornl indicates that resource usage for this job is logged against the "ornl" account.
--qos=ornlq indicates that you are requesting the "ornlq" quality-of-service.
--partition=v100-gpu requests that the job execute on the partition called "v100-gpu".
--job-name=example_job_name sets the job name.
--ntasks=1 advises the Slurm controller that job steps run within the allocation will launch a maximum of 1 task.
--ntasks-per-node=1 requests that 1 task per node be invoked when used by itself. When used with --ntasks, --ntasks-per-node is treated as the maximum count of tasks per node (see the sketch after this list).
--gres=gpu:1 indicates that 1 GPU is requested to execute this batch job.
--time=00:05:00 sets a total run time of 5 minutes for the job allocation.
--output=serial_test_%j.log creates a file containing the batch script’s stdout and stderr; %j expands to the job ID.
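For comparison, the header below is a minimal sketch of how --ntasks and --ntasks-per-node interact when more than one node is requested; the task counts and the ./my-application executable are illustrative assumptions, not values required by the cluster.
#!/bin/bash
#SBATCH --account=ornl
#SBATCH --qos=ornlq
#SBATCH --partition=v100-gpu
#SBATCH --job-name=multi_task_example
#SBATCH --ntasks=8              # at most 8 tasks across the whole allocation
#SBATCH --ntasks-per-node=4     # at most 4 of those tasks on any single node
#SBATCH --time=00:10:00
#SBATCH --output=multi_task_%j.log

# Launch all tasks as a single job step (hypothetical executable)
srun ./my-application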
SchedMD’s sbatch documentation provides a more complete description of the sbatch command line interface and the available options for specifying resource requirements and task affinity.
Submitting a batch job
Batch jobs are submitted using the sbatch command.
sbatch example.batch
Once you have submitted your batch job, you can check the status of your job with the squeue command. Since Fluid-Slurm-GCP is an autoscaling cluster, you may notice that your job is in a configuring (CF) state for some time before starting. This happens because compute nodes are created on the fly to meet compute resource demands. This process can take anywhere from 30 seconds to 3 minutes.
squeue
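While nodes are being provisioned, the output will resemble the sketch below; the job ID, user name, and node name are placeholders, and the exact columns depend on your squeue defaults.
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
 1234  v100-gpu example_ jdoe CF  0:15     1 oshpc-compute-0-0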
Interactive Jobs
The interactive workflows described here use a combination of salloc and srun command line interfaces. It is highly recommended that you read through SchedMD's salloc documentation and srun documentation to understand how to reserve and release compute resources in addition to specifying task affinity and other resource-task bindings.
For all interactive workflows, you should be aware that you are charged for each second of allocated compute resources. It is best practice to set a wall-time when allocating resources. This practice helps avoid situations where you will be billed for idle resources you have reserved.
Allocate and Execute Workflow
With Slurm, you can allocate compute resources that are reserved for your sole use. This is done using the salloc command. As an example, you can reserve exclusive access to one compute node on the v100-gpu partition for an hour:
salloc --account=ornl --qos=ornlq --partition=v100-gpu --gres=gpu:1 --time=1:00:00 -N1 --exclusive
Once resources are allocated, Slurm responds with a job ID. From here, you can execute commands on the compute resources using `srun`. srun is a command line interface for executing “job steps” in Slurm. You can specify how much of the allocated compute resources to use for each job step. For example, the srun command below uses 4 tasks to execute ./my-application.
srun -n4 ./my-application
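Within a single allocation, you can also launch several job steps back to back; the sketch below assumes hypothetical executables (./preprocess, ./my-application, ./postprocess) and splits the allocation differently for each step.
# Run several job steps inside one salloc allocation (hypothetical executables)
srun -n1 ./preprocess                    # step 1: a single task
srun -n4 --gres=gpu:1 ./my-application   # step 2: 4 tasks sharing 1 GPU
srun -n1 ./postprocess                   # step 3: a single task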
It is highly recommended that you familiarize yourself with Slurm’s salloc and srun command line tools so that you can make efficient use of your compute resources.
To release your allocation before the requested wall-time, you can use scancel
scancel <job-id>
After cancelling your job, or after the wall-clock limit is exceeded, Slurm will automatically delete compute nodes for you.
Interactive Shell Workflow
If your workflow requires graphics forwarding from compute resources, you can allocate resources as before using salloc, e.g.,
salloc --account=ornl --qos=ornlq --partition=v100-gpu --gres=gpu:1 --time=1:00:00 -N1 --exclusive
Once resources are allocated, you can launch a shell on the compute resources with X11 forwarding enabled.
srun -N1 --x11 --pty /bin/bash
Once you have finished your work, exit the shell and release your resources.
exit
scancel <job-id>
Monitoring Jobs and Resources
Checking Slurm job status
Slurm's squeue command can be used to keep track of jobs that have been submitted to the job queue.
squeue
You can use optional flags, such as --user and --partition, to filter results by the username or compute partition associated with each job.
squeue --user=USERNAME --partition=PARTITION
Slurm jobs have a status code associated with them that changes during the lifespan of the job. Common codes are listed below, followed by an example of displaying them explicitly.
CF | The job is in a configuring state. Typically this state is seen when autoscaling compute nodes are being provisioned to execute work.
PD | The job is in a pending state.
R | The job is in a running state.
CG | The job is in a completing state and the associated compute resources are being cleaned up.
(Resources) | Shown in the reason column when there are insufficient resources available to schedule your job at the moment.
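To display the state code directly for your own jobs, you can pass a format string to squeue; the column widths below are an arbitrary choice and USERNAME is a placeholder.
squeue --user=USERNAME --format="%.10i %.20j %.4t %.10M %R"
Here %i is the job ID, %j the job name, %t the compact state code (for example R, PD, or CF), %M the elapsed time, and %R the reason or node list.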
Checking Slurm compute node status
Slurm's sinfo command can be used to keep track of the compute nodes and partitions available for executing workloads.
sinfo
Compute nodes have a status code associated with them that change during the lifespan of each node. A few common state codes are shown below. A more detailed list can be found in SchedMD's documentation.
idle | The compute node is in an idle state and can receive work.
down | The compute node is in a down state and may need to be drained and returned to service. Downed nodes are also symptomatic of other issues on your cluster, such as insufficient quota or improperly configured machine blocks.
mixed | A portion of the compute node's resources have been allocated, but additional resources are still available for work.
allocated | The compute node is fully allocated.
Additionally, each state code may have a modifier appended, with the following meanings (an example of viewing them follows the list):
~ | The compute node is in a "cloud" state and will need to be provisioned before receiving work
# | The compute node is currently being provisioned (powering up)
% | The compute node is currently being deleted (powering down)
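To see these states and modifiers per node or per partition, you can ask sinfo for a node-oriented or formatted view; the format string below is one possible choice, not a required one.
sinfo --Node --long
sinfo --format="%P %a %D %t %N"
In the second command, %P is the partition, %a its availability, %D the node count, %t the compact node state (including modifiers such as idle~ or idle#), and %N the node list.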