Resource Management System

The OpenGridEngine queuing system is useful when you have a lot of tasks to execute and want to distribute them over a cluster of machines. For example, you might need to run hundreds of simulations/experiments with varying parameters. Using a queuing system in these situations has the following advantages:

  • Scheduling - allows you to schedule a virtually unlimited amount of work to be performed when resources become available. This means you can simply submit as many tasks (or jobs) as you like and let the queuing system handle executing them all.

  • Load Balancing - automatically distributes tasks across the cluster such that any one node doesn’t get overloaded compared to the rest.

  • Monitoring/Accounting - ability to monitor all submitted jobs and query which cluster nodes they’re running on, whether they’re finished, encountered an error, etc. Also allows querying job history to see which tasks were executed on a given date, by a given user, etc.


Queue:
Cluster queues are logical groups of machines that provide certain resources and on which jobs can execute. When a job is executed by OpenGridEngine, it is placed into a queue that satisfies all of the job's resource requirements.
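
You can query the queues yourself from any cluster node with the standard Grid Engine tools:

# List the names of all configured cluster queues
qconf -sql

# Show a per-queue summary of used and available slots
qstat -g c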

Here is a listing of the current Newton cluster queues:


Submitting Jobs

All jobs require at least one available slot on a node in the cluster to run. A slot is equivalent to one core managed by the parallel environment.

Submitting jobs is done using the qsub (batch jobs) or the qlogin (interactive jobs) command. Let's try submitting a simple job that runs the hostname command on a given cluster node:

Syntax:

qsub -q [queue] -w e -V -N [job_name] -l h_vmem=[memory] -l h_rt=[time] -l s_rt=[time] -pe shm [n_processors] -o [outputlogfile] -e [errorlogfile] [pathtoScript] [arg1] [arg2]

Example:

username@node:~$ qsub -V -b y -cwd -q applicate.q hostname
Your job 2585 ("hostname") has been submitted


  • The -V option to qsub states that the job should have the same environment variables as the shell executing qsub (recommended).

  • The -b option to qsub states whether the command being executed is a single binary executable (y) or a script (n). In this case the command hostname is a single binary, so we pass -b y.

  • The -cwd option to qsub tells OpenGridEngine that the job should be executed in the directory from which qsub was called.

  • The -q option specifies the queue in which the job should run (applicate.q here).

  • The last argument to qsub is the command to be executed (hostname in this case). The same options can also be embedded in a job script, as sketched below.
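
For anything more involved than a one-off command, the same options can be placed inside the script itself using #$ directives, the mechanism used by the longer examples further down this page. A minimal sketch, assuming the applicate.q queue from the example above (the script name myjob.sh is hypothetical):

#!/bin/bash
#$ -V
#$ -cwd
#$ -q applicate.q

# The job body: any shell commands can go here
hostname

Submit it with qsub myjob.sh; since the job is a script rather than a binary, -b y is not needed.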

Optional options:

  • -w e : verify the options and abort if there is an error

  • -N <jobname> : name of the job

  • -l h_vmem=<size> : amount of memory required (e.g. 3G or 3500M). NOTE: this is memory per processor slot, so if you ask for 2 processors the total memory will be 2 x the h_vmem value.

  • -l h_rt=hh:mm:ss : maximum (hard) run time (hours, minutes and seconds)

  • -l s_rt=hh:mm:ss : soft run time limit (hours, minutes and seconds). Remember to set both s_rt and h_rt.

  • -wd <dir> : set the working directory for this job

  • -j y|n : whether to merge the output and error log files

  • -m ea : send email when the job ends or aborts

  • -P <projectName> : set the job's project

  • -t <start>-<end>:<incr> : submit a job array with start index <start>, end index <end> and increment <incr>

  • -hold_jid <job id list> : start the current job/job array only after all jobs in the comma-separated list (which may also contain a job id pattern such as 2722*) have completed

  • -hold_jid_ad <job array id, pattern or name> : start each task of the current job array only after the corresponding task of the referenced job array has completed

For array jobs, the index numbers are exported to the job tasks via the environment variable $SGE_TASK_ID. The option arguments <start>, <end> and <incr> are available through the environment variables $SGE_TASK_FIRST, $SGE_TASK_LAST and $SGE_TASK_STEPSIZE.
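
Putting several of these options together, a submission might look like the sketch below. The script names (run_sim.sh, postproc.sh), their arguments and the resource values are hypothetical, chosen only to illustrate the flags:

# Submit a 4-slot job with a 1 hour hard limit, a 55 minute soft limit
# and explicit log files
qsub -q applicate.q -w e -V -N sim01 \
     -l h_vmem=2G -l h_rt=01:00:00 -l s_rt=00:55:00 \
     -pe shm 4 -o sim01.out -e sim01.err run_sim.sh input.cfg

# Submit a follow-up job that starts only once sim01 has completed
qsub -q applicate.q -V -N postproc -hold_jid sim01 postproc.sh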


Other Options

[-a date_time]                           request a start time
[-ac context_list]                       add context variable(s)
[-ar ar_id]                              bind job to advance reservation
[-A account_string]                      account string in accounting record
[-binding [env|pe|set] exp|lin|str]      binds job to processor cores
[-c n s m x]                             define type of checkpointing for job
           n           no checkpoint is performed.
           s           checkpoint when batch server is shut down.
           m           checkpoint at minimum CPU interval.
           x           checkpoint when job gets suspended.
           <interval>  checkpoint in the specified time interval.
[-ckpt ckpt-name]                        request checkpoint method
[-clear]                                 skip previous definitions for job
[-C directive_prefix]                    define command prefix for job script
[-dc simple_context_list]                delete context variable(s)
[-dl date_time]                          request a deadline initiation time
[-h]                                     place user hold on job
[-hard]                                  consider following requests "hard"
[-help]                                  print this help
[-i file_list]                           specify standard input stream file(s)
[-js job_share]                          share tree or functional job share
[-jsv jsv_url]                           job submission verification script to be used
[-masterq wc_queue_list]                 bind master task to queue(s)
[-notify]                                notify job before killing/suspending it
[-now y[es]|n[o]]                        start job immediately or not at all
[-p priority]                            define job's relative priority
[-R y[es]|n[o]]                          reservation desired
[-r y[es]|n[o]]                          define job as (not) restartable
[-sc context_list]                       set job context (replaces old context)
[-shell y[es]|n[o]]                      start command with or without wrapping <loginshell> -c
[-soft]                                  consider following requests as soft
[-sync y[es]|n[o]]                       wait for job to end and return exit code
[-S path_list]                           command interpreter to be used
[-verify]                                do not submit just verify
[-w e|w|n|v|p]                           verify mode (error|warning|none|just verify|poke) for jobs
[-@ file]                                read commandline input from file

Notice that the qsub command, when successful, will print the job number to stdout. You can use the job number to monitor the job’s status and progress within the queue as we’ll see in the next section.
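
If you want to use that job number in a script, qsub's -terse option makes it easy to capture, since it prints only the job id. A small sketch, reusing the hostname example from above:

# Capture the job id of a newly submitted job
JOBID=$(qsub -terse -V -b y -cwd -q applicate.q hostname)

# Query the job's status and details by its id
qstat -j "$JOBID"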


Example: how to submit a 16-core parallel run.

The example below defines a job request with the job name example. It allocates one full node (16 cores) for 24 hours to run 16 parallel MPI tasks via mpirun. The total vmem allocation for this job is about 16 x 500M = 8G.

#!/bin/bash -f
#$ -N example
#$ -l h_rt=24:00:00
#$ -pe mpi-fn 16
#$ -l h_vmem=500M
#$ -S /bin/bash
#$ -q ded-parallel.q
#$ -M username@met.no
#$ -m bae
#$ -o /home/username/OUT_$JOB_NAME.$JOB_ID
#$ -e /home/username/ERR_$JOB_NAME.$JOB_ID
#$ -R y
# ---------------------------

echo "Got $NSLOTS slots." 

module add your_software_module   # e.g. benchmark/IOR-2.10.3
module list

mpirun executable parameterfile


  • -N example - job name.

  • -l h_rt=24:00:00 - defines the job runtime. After 24 hours of runtime the job will be killed if it has not finished successfully before then.

  • -pe mpi-fn 16 - defines the parallel environment; the slot count must be a multiple of 16 in order to always allocate full nodes exclusively.

  • -q ded-parallel.q - defines the queue in which the job must run. Currently MET users are allowed to run in ded-parallel.q only.

  • -M username@met.no - user username will get email notifications.

  • -m bae - event notification: job begin, job abort and job end.

  • -o - specify the job stdout directory and file name.

  • -e - specify the job stderr directory and file name.

  • -R y - enable job reservation.

  • -cwd - place the output files (.e, .o) in the current working directory.

  • -r y|n - whether this job should be re-runnable (default y).
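
To run this, save the script to a file (the name example.sh below is hypothetical) and hand it to qsub. The #$ directives replace the command-line options, so no further flags are needed:

# Submit the job script; qsub responds with the assigned job number
qsub example.sh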


16 programs on 16 CPUs on the same node, parallel execution

The example below defines a job request with a runtime of 120 hours; the job allocates one full node (16 cores) to run 16 parallel tasks via the parallel utility.

#!/bin/bash
#$ -S /bin/bash
#$ -l h_rt=120:00:00
#$ -pe mpi 16
#$ -l h_vmem=500M
#$ -q applicate.q
#$ -M username@email
#$ -m abe

echo "Running on $NSLOTS CPUs"
set -x
cd /lustre/store/username/FAUNA/Snap
/usr/bin/parallel -j $NSLOTS ./runSnapTerada.sh -- SnapRun0[0]? > snapRun00.nohup

exit 0 


Array jobs: 2184 single-cpu programs started over the complete system

The example below defines a job-array request. The -t 1-2184 option will submit the job 2184 times, each time with a new $SGE_TASK_ID. It is up to the user to map the $SGE_TASK_ID to a useful unit of work; a sketch of one such mapping follows the example script below. All tasks can run in parallel on all available machines. As of the time of writing, this is limited to 200 parallel instances. Note that -t has to start at a number >= 1.

#!/bin/bash
#$ -S /bin/bash
#$ -l h_rt=24:00:00
#$ -q applicate.q
#$ -l h_vmem=500M
#$ -t 1-2184
#$ -o /home/username/OUT_$JOB_NAME.$JOB_ID.$HOSTNAME.$TASK_ID
#$ -e /home/username/ERR_$JOB_NAME.$JOB_ID.$HOSTNAME.$TASK_ID

echo "Got $NSLOTS slots for job $SGE_TASK_ID."

./runSnapAllForecasts.pl $SGE_TASK_ID
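
A common pattern is to map $SGE_TASK_ID to one line of a parameter file. The sketch below assumes a hypothetical file params.txt with one parameter set per line and a program ./my_program that accepts those parameters:

#!/bin/bash
#$ -S /bin/bash
#$ -q applicate.q
#$ -t 1-2184

# Pick the parameter line matching this task's index (line numbers are
# 1-based, which matches the requirement that -t starts at 1)
PARAMS=$(sed -n "${SGE_TASK_ID}p" params.txt)

./my_program $PARAMS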


Interactive job:

The example below defines an interactive job request; as submitted here it requests a single slot with 10G of virtual memory in the applicate.q queue.



username@c6220ii-9t58002-bj-compute-ext:~# qlogin -pe mpi 1 -q applicate.q -l h_vmem=10G
local configuration c6220ii-9t58002-bj-compute-ext.met.no not defined - using global configuration
Your job 98 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 98 has been successfully scheduled.

Last login: Fri Jan 12 12:18:08 2018 from 157.249.157.82
username@c6220ii-7t58002-bj-compute-ext:~#
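
When you are finished, end the interactive session with exit (or Ctrl-D) so that the slot is released back to the queue.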