Gracefully terminate Jobs that run into timeout

When a Job reaches its requested time limit (e.g. the partition limit), Slurm terminates it, most likely not gracefully. Make sure at least one of the following conditions is true and choose the matching method:

  • your application periodically saves a checkpoint file to disk, from which the calculation can be resumed.
  • your application can be gracefully terminated by a signal (for example a specific POSIX signal).
  • your application can be gracefully terminated by checking for the existence of a file that instructs the program to come to an end.
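
As a rough illustration of the first and third condition, the following sketch shows a loop that saves a checkpoint after every step, resumes from it, and stops early once a stop file appears. The function name run_steps, the checkpoint format (a single step number) and the file names are assumptions made for this example; a real application will have its own mechanism.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for an application with checkpointing and a stop file.
# run_steps CHECKPOINT STOPFILE MAX_STEPS  (all names are illustrative)
run_steps() {
    local checkpoint=$1 stopfile=$2 max_steps=$3
    local step=0

    # Resume from the last saved step if a checkpoint exists
    if [[ -f $checkpoint ]]; then
        step=$(<"$checkpoint")
    fi

    while (( step < max_steps )); do
        step=$(( step + 1 ))
        # ... one unit of real work would go here ...
        echo "$step" > "$checkpoint"   # save progress after every step

        # A stop file asks the program to come to an end
        if [[ -e $stopfile ]]; then
            echo "stopped after step ${step}"
            return 1                   # non-zero: calculation not finished yet
        fi
    done
    echo "completed all ${max_steps} steps"
}
```

Calling run_steps again with the same checkpoint file continues after the last saved step instead of starting from the beginning.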
Example using Slurm signal sent to the running process
#!/usr/bin/env bash
#SBATCH ...

# Send SIGINT to the running job steps approx. 900 seconds before the time limit is reached (Slurm may send it up to 60 seconds earlier)
# Your application needs to act upon receiving this signal, otherwise nothing will happen
#SBATCH --signal=SIGINT@900

srun my_command
LASTEXITCODE=$?
echo "Exited with code ${LASTEXITCODE}"

# Cleanup actions go here; they have to finish within 900 seconds at the latest.
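
The job script above only helps if the application itself reacts to SIGINT. What that could look like is sketched below for a program written in bash: the trap only records the request, and the work loop checks the flag after each step. The names stop_requested and work_loop are made up for this example; applications written in other languages need an equivalent signal handler.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for my_command: finishes the current step after SIGINT.
# stop_requested and work_loop are illustrative names.

stop_requested=0
trap 'stop_requested=1' INT   # only record the request; the loop reacts to it

# work_loop MAX_STEPS
work_loop() {
    local max_steps=$1 step
    for (( step = 1; step <= max_steps; step++ )); do
        # ... one unit of real work would go here ...
        if (( stop_requested )); then
            echo "interrupted after step ${step}"
            return 1          # non-zero: calculation not finished yet
        fi
    done
    echo "completed all ${max_steps} steps"
}
```

Checking the flag only between steps keeps each step atomic, so the program never stops in the middle of writing its results.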
Example using Slurm signal sent to the bash jobscript triggering a bash trap
#!/usr/bin/env bash
#SBATCH ...

# Send SIGUSR1 to the batch shell only (B:) approx. 900 seconds before the time limit is reached (Slurm may send it up to 60 seconds earlier)
# The job script needs to trap this signal (see below) in order to act upon it
#SBATCH --signal=B:SIGUSR1@900

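# Define a trap function; it is executed when the batch shell receives the signal
function trigger_termination() {
	# Ask the application to stop, then wait for it and keep its exit code
	touch STOPFILE
	wait "$LASTPID"
	LASTEXITCODE=$?
}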

# Activate the trap on a signal
trap trigger_termination SIGUSR1

# Sending the main program to the background is essential, because the trap will only be activated during wait!
# If the main program is done before TIMEOUT, the trap is never triggered
srun my_command &

LASTPID=$!
wait $LASTPID
LASTEXITCODE=${LASTEXITCODE:-$?}
echo "Exited with code ${LASTEXITCODE}" 

# Cleanup actions go here; they have to finish within 900 seconds at the latest.
Example using timeout
#!/usr/bin/env bash
#SBATCH ...

# Stop my_command approx. 900 seconds before the Job runs into TIMEOUT
# See "man timeout" if you need other options, such as a different termination signal
timeout $((SLURM_JOB_END_TIME - $(date +%s) - 900)) srun my_command
LASTEXITCODE=$?
echo "Exited with code ${LASTEXITCODE}"

# Cleanup actions go here; they have to finish within 900 seconds at the latest.
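
Two details of the timeout method are easy to miss. If the Job has less than 900 seconds left, the computed duration is zero or negative, and GNU timeout treats a duration of 0 as "no timeout" and does not accept a negative one. Also, GNU timeout exits with status 124 when it had to stop the command (unless --preserve-status is used). The helper functions below are a sketch with made-up names (remaining_budget, timed_out) that covers both cases.

```shell
#!/usr/bin/env bash
# Illustrative helpers for the timeout method; the function names are assumptions.

# remaining_budget END_TIME NOW MARGIN
# Prints the seconds left until MARGIN seconds before END_TIME, but at least 1.
remaining_budget() {
    local end=$1 now=$2 margin=$3
    local budget=$(( end - now - margin ))
    (( budget > 0 )) || budget=1
    echo "$budget"
}

# timed_out EXIT_CODE
# Succeeds if EXIT_CODE is the status GNU timeout uses when the limit was hit.
timed_out() {
    (( $1 == 124 ))
}
```

Inside a job script this could be used as timeout "$(remaining_budget "$SLURM_JOB_END_TIME" "$(date +%s)" 900)" srun my_command, followed by timed_out "$LASTEXITCODE" to tell a forced stop apart from an ordinary failure.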


Automatically requeuing Jobs that time out

In some cases it might be desirable to automatically requeue a Job until it is fully finished.

Watch Jobs that are set up this way! They might run forever, as there is no limit on the number of requeues. Recommendation: limit the maximum number of restarts by tracking SLURM_RESTART_COUNT.

# Make sure the variable $LASTEXITCODE has been set, unless you use a different condition as described below.
# Modify the next line to check for a specific exit code if needed; otherwise a value greater than zero typically means that the calculation is not done yet.
# The condition can also be entirely different, such as the existence or non-existence of a particular file indicating whether the calculation is finished.
if [[ $LASTEXITCODE -gt 0 ]]; then
  # Set limits for maximum number of restarts
  SLURM_RESTART_COUNT_MAX=10
  SLURM_JOB_MIN_RUNTIME=60
  if (( $(date +%s) - SLURM_JOB_START_TIME < SLURM_JOB_MIN_RUNTIME )); then
    echo "SLURM_JOB_MIN_RUNTIME ($SLURM_JOB_MIN_RUNTIME) NOT reached! Not requeueing..."
  elif (( SLURM_RESTART_COUNT < SLURM_RESTART_COUNT_MAX )); then
    scontrol requeue "$SLURM_JOB_ID"
  else
    echo "SLURM_RESTART_COUNT_MAX ($SLURM_RESTART_COUNT_MAX) reached!"
  fi
  fi
fi
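
The same decision can be put into a small function, which makes it possible to check the logic outside of a running Job. This is only a sketch: the function name should_requeue and its argument order are assumptions, and the values would come from $LASTEXITCODE, SLURM_RESTART_COUNT and SLURM_JOB_START_TIME in a real job script.

```shell
#!/usr/bin/env bash
# should_requeue EXIT_CODE RESTART_COUNT MAX_RESTARTS ELAPSED MIN_RUNTIME
# Succeeds only if the Job should be requeued (illustrative helper).
should_requeue() {
    local exit_code=$1 restarts=$2 max_restarts=$3 elapsed=$4 min_runtime=$5

    (( exit_code > 0 ))           || return 1   # calculation finished, nothing to do
    (( elapsed >= min_runtime ))  || return 1   # failed too quickly, likely a real error
    (( restarts < max_restarts )) || return 1   # restart limit reached
    return 0
}
```

In a job script the call could look like should_requeue "$LASTEXITCODE" "${SLURM_RESTART_COUNT:-0}" 10 "$(( $(date +%s) - SLURM_JOB_START_TIME ))" 60 && scontrol requeue "$SLURM_JOB_ID".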

As an alternative, Chain Jobs can be used, which have a predefined limit of repetitions.

Run cleanup actions before a Job is terminated by Slurm due to TIMEOUT

Sometimes it is necessary to perform postprocessing or cleanup actions at the end of a Job. If the Job runs into TIMEOUT, these actions won't be performed. A typical example is the use of local filesystem space, where result files need to be copied back before the Job ends.

Example performing cleanup of local filesystem using timeout
#!/usr/bin/env bash
#SBATCH ...

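# Step 1: copy the input files to the local filesystem
TMP=/tmp
cp job.inp job.dat "$TMP"

# Step 2: run the calculation in the local directory
pushd "$TMP"
# use the signal or timeout method here (see above)
popd

# Step 3: move the results back to the submit directory
mv -v "$TMP"/* .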
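
The three steps above can also be wrapped in one function that uses a private scratch directory instead of /tmp itself, so that only the Job's own files are copied back. This is a sketch under assumptions: the function name run_in_scratch is made up, mktemp -d is used to create the directory, and the whole working directory is staged in rather than individual input files.

```shell
#!/usr/bin/env bash
# run_in_scratch WORKDIR CMD [ARGS...]
# Illustrative helper: stage in, run CMD in a private scratch directory, stage out.
run_in_scratch() {
    local workdir=$1; shift
    local scratch rc=0

    scratch=$(mktemp -d) || return 1
    cp -r -- "$workdir"/. "$scratch"/     # Step 1: copy inputs to scratch
    ( cd "$scratch" && "$@" ) || rc=$?    # Step 2: run the command there
    cp -r -- "$scratch"/. "$workdir"/     # Step 3: copy everything back
    rm -rf -- "$scratch"
    return "$rc"                          # pass on the command's exit code
}
```

The command given to run_in_scratch still has to end before the Job runs into TIMEOUT, for example by using the signal or timeout method, otherwise Step 3 is never reached.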