Gracefully terminate Jobs that run into timeout
When a Job reaches its requested time limit (for example, the partition limit), it will be terminated, most likely not gracefully. Make sure at least one of the following conditions is true and choose the appropriate method:
- your application periodically saves some kind of checkpoint file to disk, from which your calculation can be resumed.
- your application can be gracefully terminated by a signal, for example a specific POSIX signal.
- your application can be gracefully terminated by checking for the existence of a file that instructs the program to come to an end.
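The signal-based and file-based variants can be combined. The following is a minimal sketch, assuming the application polls for a stop file: the sbatch option --signal asks Slurm to send SIGUSR1 to the batch shell some time before the limit, and a trap turns that signal into the stop file. The file name stop.request, the 300 second lead time, the helper names request_stop and run_with_graceful_stop, and ./my_app are placeholders chosen for illustration.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@300   # send SIGUSR1 to the batch shell about 300 s before the time limit

# Hypothetical stop file that the application is assumed to poll for.
STOP_FILE="${STOP_FILE:-stop.request}"
STOP_REQUESTED=0

# Signal handler: ask the application to finish by creating the stop file.
request_stop() {
    STOP_REQUESTED=1
    echo "Time limit approaching, asking the application to stop"
    touch "$STOP_FILE"
}

# Run a command in the background so that the shell can react to the signal,
# then return the exit code of the command.
run_with_graceful_stop() {
    local pid rc
    STOP_REQUESTED=0
    trap request_stop USR1
    "$@" &
    pid=$!
    wait "$pid"; rc=$?
    # A trapped signal interrupts wait; wait again for the real exit code.
    if (( STOP_REQUESTED )); then
        wait "$pid"; rc=$?
    fi
    trap - USR1
    return "$rc"
}

# Usage inside the job script (./my_app is a placeholder):
#   run_with_graceful_stop ./my_app
#   LASTEXITCODE=$?
```

The application is started in the background because bash only runs a trap handler after the current foreground command has finished; with a foreground application the handler would fire too late. If the application reacts to a POSIX signal directly, the handler could forward the signal with kill instead of creating a file.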
Automatically requeuing Jobs that time out
In some cases it might be desirable to automatically requeue a Job until it is fully finished.
Watch your Jobs that are set up this way! These Jobs might run forever, as there is no limit on the number of requeues. Recommendation: limit the maximum number of restarts by tracking SLURM_RESTART_COUNT.
# Make sure the variable $LASTEXITCODE has been saved, unless you use a different condition as shown below.
# Modify the next line to check for a specific exit code if needed; otherwise a value greater than zero typically means that the calculation is not done yet.
# This condition can also be totally different, like the existence or non-existence of a particular file indicating whether the calculation is finished.
if [[ $LASTEXITCODE -gt 0 ]]; then
    # Set limits for the maximum number of restarts and the minimum runtime (in seconds)
    SLURM_RESTART_COUNT_MAX=10
    SLURM_JOB_MIN_RUNTIME=60
    if (( $(date +%s) - SLURM_JOB_START_TIME < SLURM_JOB_MIN_RUNTIME )); then
        echo "SLURM_JOB_MIN_RUNTIME ($SLURM_JOB_MIN_RUNTIME) NOT reached! Not requeuing..."
    elif (( SLURM_RESTART_COUNT < SLURM_RESTART_COUNT_MAX )); then
        scontrol requeue "$SLURM_JOB_ID"
    else
        echo "SLURM_RESTART_COUNT_MAX ($SLURM_RESTART_COUNT_MAX) reached!"
    fi
fi
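The file-based condition mentioned in the comments could look roughly like this. It is a sketch under assumptions: the marker file calculation.done, the variable MAX_RESTARTS and the function requeue_decision are hypothetical names, and the application is assumed to create the marker file once it has fully finished.

```shell
#!/bin/bash
# Hypothetical marker file written by the application when it is fully done.
DONE_FILE="${DONE_FILE:-calculation.done}"
MAX_RESTARTS=10

# Print "finished", "requeue" or "limit" depending on the marker file
# and on how often Slurm has already restarted this Job.
requeue_decision() {
    # Treat SLURM_RESTART_COUNT as 0 when it is not set.
    local restarts="${SLURM_RESTART_COUNT:-0}"
    if [[ -e "$DONE_FILE" ]]; then
        echo finished
    elif (( restarts < MAX_RESTARTS )); then
        echo requeue
    else
        echo limit
    fi
}

# Only act when running inside a Slurm Job.
if [[ -n "${SLURM_JOB_ID:-}" && "$(requeue_decision)" == requeue ]]; then
    scontrol requeue "$SLURM_JOB_ID"
fi
```

Keeping the decision in a function that only prints a result makes it easy to try out on a login node without requeuing anything.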
Alternatively, Chain Jobs can be used, which have a predefined limit of repetitions.
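One common way to build such a chain is to submit a fixed number of Jobs up front, each depending on the previous one. The sketch below assumes that approach; the function name submit_chain and the script name job.sh are placeholders, and the setup described elsewhere for Chain Jobs may differ. It relies on the sbatch options --parsable (print only the job ID) and --dependency=afterany:<jobid> (start after the given Job has ended, regardless of how).

```shell
#!/bin/bash
# Submit the same job script a fixed number of times.
# Each link starts only after the previous one has ended.
submit_chain() {
    local script="$1" links="$2"
    local i jobid
    local dep=()
    for (( i = 1; i <= links; i++ )); do
        jobid=$(sbatch --parsable "${dep[@]}" "$script") || return 1
        jobid="${jobid%%;*}"   # --parsable may print "jobid;cluster"
        echo "link $i: job $jobid"
        dep=(--dependency="afterany:$jobid")
    done
}

# Usage (job.sh is a placeholder):
#   submit_chain job.sh 5
```

Because the number of links is fixed at submission time, the chain cannot run forever. The job script itself still has to decide at startup whether there is work left, for example by checking for a checkpoint or a marker file.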
Run cleanup actions before a Job is terminated by Slurm due to TIMEOUT
Sometimes it is necessary to perform postprocessing or cleanup actions at the end of a Job. If a Job runs into TIMEOUT, these actions won't be performed. A typical example is the use of local filesystem space, where files need to be copied back before the Job ends.
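A possible way to still run the cleanup is to let Slurm send a signal shortly before the time limit and to do the copy-back in a trap handler. The following is a minimal sketch under assumptions: the directory variables SCRATCH_DIR and RESULT_DIR, the function name cleanup, the 120 second lead time and ./my_app are placeholders; the lead time has to be long enough for the copy to complete.

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@120   # send SIGUSR1 to the batch shell about 120 s before the time limit

# Hypothetical locations: node-local working directory and final destination.
SCRATCH_DIR="${SCRATCH_DIR:-${TMPDIR:-/tmp}/job_${SLURM_JOB_ID:-0}}"
RESULT_DIR="${RESULT_DIR:-${SLURM_SUBMIT_DIR:-$PWD}/results}"

# Copy everything (including dotfiles) back and remove the local copy.
# Does nothing if the local directory is already gone.
cleanup() {
    [[ -d "$SCRATCH_DIR" ]] || return 0
    mkdir -p "$RESULT_DIR"
    cp -a "$SCRATCH_DIR"/. "$RESULT_DIR"/ && rm -rf "$SCRATCH_DIR"
}

# Timeout path: run the cleanup, then end the batch script.
trap 'cleanup; exit 1' USR1

# Usage inside the job script (./my_app is a placeholder):
#   mkdir -p "$SCRATCH_DIR"
#   ./my_app &      # background, so the trap can run while the application is active
#   wait $!
#   cleanup         # normal path: the application finished in time
```

The local copy is only removed if the copy-back succeeded, so a failed copy does not destroy the results.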