Automatically requeuing Jobs that time out
When a Job reaches the requested time limit (i.e. the partition limit), it will be terminated, most likely not gracefully. In some cases it might be desirable to automatically requeue the Job when this happens; however, there are prerequisites that need to be fulfilled. Make sure at least one of the following conditions is true:
- your application periodically saves some kind of checkpoint file to disk, from which your calculation can be resumed.
- your application can be gracefully terminated using some sort of signal (for example the presence of a file, or a specific POSIX signal).
Watch your Jobs that are set up this way! These Jobs might run forever, as there is no limit on the number of requeues. Recommendation: limit the maximum number of restarts by tracking SLURM_RESTART_COUNT.
Alternatively, Chain Jobs can be used, which have a predefined limit of repetitions.
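The recommendation above can be sketched as a batch script. This is a minimal sketch under several assumptions, not a definitive setup: the application name `./my_app`, its `--resume-from` option, the checkpoint file name, the cap `MAX_RESTARTS=3`, and the 120 second warning time are all hypothetical placeholders. It also assumes that your site permits requeuing and that the application has written a usable checkpoint before the signal arrives. The script asks Slurm to send SIGUSR1 to the batch shell shortly before the time limit and requeues the Job only while SLURM_RESTART_COUNT is below the cap.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --requeue               # mark the Job as eligible for requeuing
#SBATCH --signal=B:USR1@120     # send SIGUSR1 to the batch shell ~120 s before the time limit

MAX_RESTARTS=3                  # hypothetical cap on the number of requeues

# Succeeds (exit status 0) while the restart count is still below the cap.
may_requeue() {
    local count="${1:-0}" max="$2"
    [ "$count" -lt "$max" ]
}

# Signal handler: requeue the Job unless the cap has been reached.
on_timeout() {
    # SLURM_RESTART_COUNT is unset on the first run, so default to 0.
    if may_requeue "${SLURM_RESTART_COUNT:-0}" "$MAX_RESTARTS"; then
        echo "Time limit approaching, requeuing Job $SLURM_JOB_ID" >&2
        scontrol requeue "$SLURM_JOB_ID"
    else
        echo "Restart limit of $MAX_RESTARTS reached, not requeuing" >&2
    fi
}

# Only run the payload inside a Slurm allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    trap on_timeout USR1
    # Run the application in the background and wait for it:
    # bash only executes trap handlers once the foreground command returns.
    srun ./my_app --resume-from checkpoint.dat &   # hypothetical application
    wait
fi
```

Running the application in the background matters here: if `srun` were started in the foreground, bash would delay the handler until `srun` had already been killed at the time limit.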
Run cleanup actions before a Job is terminated by Slurm due to TIMEOUT
Sometimes it is necessary to perform some postprocessing or cleanup actions at the end of a Job. If a Job runs into TIMEOUT, these actions won't be performed. A typical example is the use of local filesystem space, where result files need to be copied back before the Job ends.
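One possible way to handle this is to have Slurm send a signal to the batch script some time before the limit and to run the cleanup in a signal handler. The following is a sketch under stated assumptions: the scratch location below `TMPDIR`, the results directory name, the application `./my_app` with its `--workdir` option, and the 300 second lead time are hypothetical and need to be adapted to your site and to how long the copy actually takes.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@300     # send SIGUSR1 to the batch shell ~300 s before the time limit

# Copy everything from the local scratch directory to the destination,
# then remove the scratch directory (only if the copy succeeded).
copy_back() {
    local src="$1" dest="$2"
    mkdir -p "$dest" && cp -r "$src"/. "$dest"/ && rm -rf "$src"
}

# Only run the payload inside a Slurm allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Assumed locations; local scratch paths differ between sites.
    SCRATCH="${TMPDIR:-/tmp}/job_${SLURM_JOB_ID}"
    RESULTS="${SLURM_SUBMIT_DIR}/results_${SLURM_JOB_ID}"
    mkdir -p "$SCRATCH"

    # On the early-warning signal: save the files, then leave the script.
    trap 'copy_back "$SCRATCH" "$RESULTS"; exit 1' USR1

    # Background + wait so that bash can run the trap while the application is still running.
    srun ./my_app --workdir "$SCRATCH" &   # hypothetical application
    wait

    # Normal completion: perform the same cleanup.
    copy_back "$SCRATCH" "$RESULTS"
fi
```

Note that the application may still be writing when the handler runs, so files copied at that point can be incomplete unless the application itself flushes its output regularly.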