Automatically requeuing Jobs that time out

When a Job reaches its requested time limit (or the partition limit), it is terminated, most likely not gracefully. In such cases it can be desirable to requeue the Job automatically, but there are prerequisites that need to be fulfilled. Make sure at least one of the following conditions is true:

  • your application periodically saves some kind of checkpoint file to disk, from which your calculation can be resumed.
  • your application can be terminated gracefully by some sort of signal (presence of a file, a specific POSIX signal); see the sketch below this list.
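What "gracefully terminated" means depends entirely on the application. Purely as an illustration, the following toy script stands in for such an application: it resumes from a (hypothetical) checkpoint.dat if one exists and writes a new one when it receives SIGINT.

#!/usr/bin/env bash
# Toy stand-in for an application with checkpoint/restart support (illustration only)

state=0
save_checkpoint() { echo "$state" > checkpoint.dat; exit 0; }
trap save_checkpoint INT        # on SIGINT: save the state and exit cleanly

# resume from an earlier checkpoint if one exists
[[ -f checkpoint.dat ]] && state=$(< checkpoint.dat)

while true; do
  state=$((state + 1))          # "work"
  sleep 10
done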


Watch your jobs that are set up this way! They could run forever, as there is no limit on the number of requeues. Recommendation: limit the maximum number of restarts by tracking SLURM_RESTART_COUNT.
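The restart counter can also be checked from outside the Job, for example with scontrol (the job id 123456 is just a placeholder; the exact output format may differ between Slurm versions):

# print how often job 123456 has been requeued so far
scontrol show job 123456 | grep -o 'Restarts=[0-9]*'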


Example using Slurm signal
#!/usr/bin/env bash
#SBATCH ...

# Ask Slurm to send SIGINT to the running job steps (not the batch shell itself) approx. 900 seconds
# before the time limit is reached; the signal may arrive up to 60 s earlier.
# Your application needs to act upon receiving this signal, otherwise nothing will happen
#SBATCH --signal=SIGINT@900

srun my_command
LASTEXITCODE=$?
echo "Exited with code ${LASTEXITCODE}"

# you may do some more things here, but they have to be finished within 840 seconds at the latest

# Set limits for maximum number of restarts
SLURM_RESTART_COUNT=${SLURM_RESTART_COUNT:-0}
SLURM_RESTART_COUNT_MAX=10

# most applications return a non-zero exit code when they did not finish regularly
if [[ $LASTEXITCODE -gt 0 ]]; then
  if [[ $SLURM_RESTART_COUNT -lt $SLURM_RESTART_COUNT_MAX ]]; then
    scontrol requeue $SLURM_JOB_ID
  else
    echo "SLURM_RESTART_COUNT_MAX ($SLURM_RESTART_COUNT_MAX) reached!"
  fi
fi
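If the batch script itself (rather than the job steps) should receive the signal, for example to create a stop file that the application polls for, the B: prefix of --signal can be used. A rough sketch under that assumption (my_command and the STOP file name are placeholders); afterwards the same requeue logic as above applies:

#!/usr/bin/env bash
#SBATCH ...

# B: delivers the signal to the batch shell only, approx. 900 seconds before the time limit
#SBATCH --signal=B:USR1@900

# on SIGUSR1, create a STOP file that the (hypothetical) application checks periodically
trap 'touch STOP' USR1

# run the step in the background, otherwise the trap only runs after srun has returned
srun my_command &
STEP_PID=$!

# wait returns early (with status > 128) if the trapped signal arrives while
# my_command is still running; in that case wait again for the real exit code
wait $STEP_PID
LASTEXITCODE=$?
if kill -0 $STEP_PID 2>/dev/null; then
  wait $STEP_PID
  LASTEXITCODE=$?
fi
echo "Exited with code ${LASTEXITCODE}"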
Example using timeout
#!/usr/bin/env bash
#SBATCH ...

# terminate my_command approx. 900 seconds before the Job runs into TIMEOUT
# (this relies on Slurm exporting SLURM_JOB_END_TIME to the batch script)
# look up "man timeout" if you need other options, e.g. a different termination signal
timeout $((SLURM_JOB_END_TIME - $(date +%s) - 900)) srun my_command
LASTEXITCODE=$?
echo "Exited with code ${LASTEXITCODE}"

# you may do some more things here, but they have to be finished within 840 seconds at the latest

# Set limits for maximum number of restarts
SLURM_RESTART_COUNT=${SLURM_RESTART_COUNT:-0}
SLURM_RESTART_COUNT_MAX=10

# most applications return a non-zero exit code when they did not finish regularly
if [[ $LASTEXITCODE -gt 0 ]]; then
  if [[ $SLURM_RESTART_COUNT -lt $SLURM_RESTART_COUNT_MAX ]]; then
    scontrol requeue $SLURM_JOB_ID
  else
    echo "SLURM_RESTART_COUNT_MAX ($SLURM_RESTART_COUNT_MAX) reached!"
  fi
fi
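Note that GNU timeout exits with status 124 when it had to stop the command, so the requeue condition above also fires in that case. If you prefer to requeue only runs that were stopped by timeout and treat other non-zero codes as real failures, the check could be refined like this (sketch):

# 124 = my_command was stopped by timeout shortly before the Job's TIMEOUT
if [[ $LASTEXITCODE -eq 124 ]]; then
  echo "stopped by timeout, requeueing"
  # scontrol requeue $SLURM_JOB_ID (with the restart counter check as above)
elif [[ $LASTEXITCODE -gt 0 ]]; then
  echo "my_command failed with exit code ${LASTEXITCODE}, not requeueing"
fi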


As an alternative, Chain Jobs can be used, which have a predefined number of repetitions.
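One common way to build such a chain is via job dependencies. A minimal sketch, assuming job.sh is a batch script like the ones above and five runs in total are wanted (afterany starts the next run regardless of how the previous one ended):

#!/usr/bin/env bash
# submit the first run, then four more that each start after the previous one has ended
JOBID=$(sbatch --parsable job.sh)
for _ in {2..5}; do
  JOBID=$(sbatch --parsable --dependency=afterany:${JOBID} job.sh)
done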

Run cleanup actions before a Job is terminated by Slurm due to TIMEOUT

Sometimes it is necessary to perform postprocessing or cleanup actions at the end of a Job. If the Job runs into TIMEOUT, these actions are not performed. A typical example is the use of node-local filesystem space, where result files need to be copied back before the Job ends.

Example performing cleanup of local filesystem using timeout
#!/usr/bin/env bash
#SBATCH ...

# Step 1: copy the input files to a Job-specific directory on the local filesystem
TMP=/tmp/${SLURM_JOB_ID}
mkdir -p $TMP
cp job.inp job.dat $TMP

# Step 2: run the calculation on the local filesystem
pushd $TMP
# use the signal or timeout method here (see above)
popd

# Step 3: copy the results back before the Job ends
mv -v $TMP/* .
rmdir $TMP
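Instead of an explicit Step 3, the copy-back could also be registered as an EXIT trap right after Step 1, so that it also runs when the script aborts early for reasons other than TIMEOUT. This does not help against the hard kill at TIMEOUT itself, which is why Step 2 still needs the signal or timeout method. A sketch under these assumptions:

# run the copy-back whenever the batch script exits, however Step 2 ended
trap 'mv -v $TMP/* "$SLURM_SUBMIT_DIR"/ && rmdir $TMP' EXIT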