How to use the Batch system at UMD

 

Before starting

It helps to add the following directories to your PATH: /usr/torque/bin and /usr/torque/maui/bin.

For bash, add 'export PATH=$PATH:/usr/torque/bin:/usr/torque/maui/bin' to your .bashrc; for csh, add 'setenv PATH ${PATH}:/usr/torque/bin:/usr/torque/maui/bin' to your .cshrc.
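
Appending unconditionally makes PATH grow every time the startup file is sourced again. A minimal sketch for bash (or any POSIX shell) that adds each directory only once; the loop is an illustration and not part of the UMD setup:

```shell
# Append the Torque and Maui directories to PATH only if they are missing.
for dir in /usr/torque/bin /usr/torque/maui/bin; do
    case ":$PATH:" in
        *":$dir:"*) ;;                 # already on PATH, nothing to do
        *) PATH="$PATH:$dir" ;;        # not there yet, append it
    esac
done
export PATH
```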

How to Submit a job

You can submit jobs using the 'qsub' command from cetus.umd.edu. You must specify the queue name and the script/binary to be executed. Optionally, you can specify the resources your job requires (memory, number of CPUs, maximum run time, etc.).

When you submit a job with qsub, it goes to the queue. After some time, depending on the priority of the job (see below), the job is sent to a node for execution.

When the job finishes (for any reason), Torque saves the standard output and the standard error to files. You can specify the location of these files with the appropriate qsub parameters. By default they are saved in the same folder as your executable. Optionally, you can have the system send you an email when a job starts, finishes or aborts. The default job name is the name of the script/binary to be executed. You can change it with a qsub parameter.

Parameters:

-q Specifies the queue: use -q MilagroMC or -q MilagroAnalysis, depending on the type of your job. There is also a queue called "Short" with very high priority but a maximum run time of 30 minutes.
-j {eo,oe} Causes the standard error and standard output to be combined in one log file.
  • eo - standard output is added to standard error
  • oe - standard error is added to standard output
-m {n,a,b,e} Controls when mail is sent to the user
  • n - no email is sent
  • a - email is sent when the job aborts
  • b - email is sent when the job begins running
  • e - email is sent when the job ends
-M email Email address to send notification emails to. If not specified, email is sent to username@localhost
-N name Sets the job name to "name" instead of the name of the script file.
-o name Sets the standard output file to "name" instead of script_file_name.o$PBS_JOBID. $PBS_JOBID is an environment variable created by PBS that contains the PBS job identifier (e.g. 1234.cetus.umd.edu).
-e name Sets the standard error file to "name" instead of script_file_name.e$PBS_JOBID.
-l {mem, walltime} Sets maximum resources for the job. If the job exceeds these limits, it is killed.
  • mem - Maximum amount of physical memory used by the job
  • walltime - Maximum amount of real time during which the job can be in the running state (in seconds, or as [[HH:]MM:]SS)
-l nice Adjusts the process's execution priority: an integer between -20 (highest priority) and 19 (lowest priority)
-u username Specifies the name of the user submitting the job
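
As an illustration of how several of these parameters combine on one command line, here is a small sketch. The job name, log file name and email address are made-up placeholders; the queue and script names are the ones used elsewhere on this page. The command is printed first and only submitted where qsub is actually installed.

```shell
# Placeholder values: adjust the job name, log file and address to your own.
qsub_args="-q MilagroMC -N mc_test -j oe -o mc_test.log -m ae -M you@example.com -l mem=300MB,walltime=01:00:00 myMCscript.sh"

# Show the command that would be run.
echo "qsub $qsub_args"

# Submit only where qsub is available (that is, on cetus).
if command -v qsub >/dev/null 2>&1; then
    # Unquoted on purpose so the string splits into separate arguments.
    qsub $qsub_args
fi
```

Here -j oe merges standard error into the standard output file named by -o, and -m ae asks for mail when the job aborts or ends.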

All these parameters can also be passed to Torque from inside your script as preprocessing directives.

So

qsub -l mem=300MB -q MilagroMC myMCscript.sh

is equivalent to

qsub myMCscript.sh

provided you add the following lines at the beginning of your script:

#PBS -l mem=300MB

#PBS -q MilagroMC
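
Putting the directives together, a complete job script might look like the sketch below. The job name and the resource values are illustrative assumptions; PBS_O_WORKDIR is the variable Torque sets to the directory from which qsub was run. Because the #PBS lines are ordinary shell comments, the script also runs by hand outside the batch system.

```shell
#!/bin/bash
#PBS -q Short
#PBS -N hello_batch
#PBS -j oe
#PBS -l walltime=00:05:00
#PBS -l mem=300MB

# Start in the directory the job was submitted from;
# fall back to the current directory when run by hand.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# PBS_JOBID is only set when the script runs under Torque.
msg="Job ${PBS_JOBID:-interactive} running on $(uname -n)"
echo "$msg"
```

With this header, 'qsub hello_batch.sh' needs no further options (hello_batch.sh being a hypothetical file name for the script above).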

 

For more information on qsub visit:

Torque Job Submission

qsub man page

common qsub usage


For a working example of how to use the batch system, see the scripts folder in the g4sim distribution. The script to be queued is '_sendone.pl', and the script 'Submit' issues the 'qsub' command with the appropriate parameters. There is a loop in the 'Submit' script, so if the user issues

Submit 20 proton

the script sends 20 instances of _sendone.pl to the queue via qsub commands.
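
The idea behind such a loop can be sketched in a few lines of shell. This is not the actual 'Submit' script from g4sim: the function name submit_many, the job naming scheme and the PARTICLE and RUN variables passed with -v are assumptions made for the example. Where qsub is not installed, the function only prints the commands it would run.

```shell
# Hypothetical stand-in for 'Submit': queue COUNT copies of one job script.
submit_many() {
    count=$1
    particle=$2
    script=${3:-_sendone.pl}

    # Dry run (print the command) when qsub is not available.
    if command -v qsub >/dev/null 2>&1; then run=""; else run="echo"; fi

    i=1
    while [ "$i" -le "$count" ]; do
        # -v passes variables into the job's environment.
        $run qsub -q MilagroMC -N "${particle}_$i" \
             -v "PARTICLE=$particle,RUN=$i" "$script"
        i=$((i + 1))
    done
}
```

Calling 'submit_many 20 proton' would then correspond to 'Submit 20 proton' above.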



How to check job/queue status

You can check the current queue with 'qstat' from /usr/torque/bin on cetus. The columns of the 'qstat' output are Job Number, Job Name, User, Time running, Status, Queue. The Status can be Q(ueued), meaning the job is waiting for its turn to execute, R(unning) or E(xiting). For more information, see the qstat manual page.
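
When the queue is long, a quick summary per status can be easier to read than the full listing. The helper below is a sketch that assumes the column order given above (status in the fifth column) and two header lines at the top of the output; count_states is a name made up for this example.

```shell
# Tally jobs per status letter from default 'qstat' output read on stdin.
count_states() {
    # Skip the two header lines, then count the values of column 5.
    awk 'NR > 2 { n[$5]++ } END { for (s in n) print s, n[s] }'
}

# On cetus:
#   qstat | count_states | sort
```

The order in which awk prints the totals is not fixed, hence the 'sort' in the usage line.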

You can also use the utility 'checkjob' from /usr/torque/maui/bin (also on cetus) to get more information about a specific job:

ex. checkjob 11243

NOTE: Use just the job number with 'checkjob'. Do not append the host name to it.
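
If you only have the full identifier, for example as printed by qsub, the shell can strip the host part for you. A small sketch; the variable names are illustrative:

```shell
job_id="11243.cetus.umd.edu"       # full identifier, number plus host
job_number="${job_id%%.*}"         # drop everything from the first dot onward
echo "$job_number"

# Query Maui only where checkjob is installed (that is, on cetus).
if command -v checkjob >/dev/null 2>&1; then
    checkjob "$job_number"
fi
```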

Delete a job

To delete a job from the queue, you have to use 'qdel' from /usr/torque/bin on cetus.

Example

qdel 12197.cetus.umd.edu

qdel man page

qdel doesn't accept wildcards. To delete many jobs at once, use 'xpbs', the graphical front end from PBS.
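
For those who prefer the command line, one possible workaround is to collect the job identifiers from 'qstat' and hand them to 'qdel' one by one. This is only a sketch and not an official procedure: jobs_of_user is a made-up helper, and it assumes the default 'qstat' layout described above (two header lines, job identifier in column 1, user in column 3, user name not truncated).

```shell
# Print the identifiers of all jobs owned by one user, one per line,
# from default 'qstat' output read on stdin.
jobs_of_user() {
    awk -v user="$1" 'NR > 2 && $3 == user { print $1 }'
}

# On cetus, review the list first:
#   qstat | jobs_of_user "$USER"
# then delete every listed job:
#   for id in $(qstat | jobs_of_user "$USER"); do qdel "$id"; done
```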

 

Using the Graphical Frontend xpbs

On cetus.umd.edu, the folder /usr/pbs/bin contains 'xpbs', the graphical front end for Torque. It lets you manage many jobs at once.

For more information on how to use xpbs see here.

xpbs manual page

 

Scheduling

The batch system is composed of two parts: Torque and Maui. Torque is the resource manager; it keeps an eye on the available resources and is responsible for starting and killing jobs. However, it doesn't decide which job in the queue runs first. The queue does not work on a first-come, first-served basis. The job to be executed next is selected by the scheduler, Maui. Maui picks jobs from the queue and sends them to Torque for execution.

Each job has a priority, and the job with the highest priority is pulled from the queue for execution. The default starting priority is 1000 for the MilagroAnalysis queue, 100 for the MilagroMC queue and 10000 for the Short queue. The Short queue is for running short test jobs. This starting priority is then modified by how long the job has been queued and by the amount of CPU time the submitting user has used in the previous 4 hours.

Also note that the MilagroMC queue can use all the computers in the cluster, while the MilagroAnalysis queue cannot use the workstations (because analysis jobs consume more memory).

For information on Maui, see the Maui Scheduler Administrator's Guide

For more information on Maui commands see Appendix G. Commands Overview

 

vlasisva@umdgrb.umd.edu