- Consider adding a timeout while waiting for nodes to join, e.g.:
# Initial count of nodes that have joined so far
lines=$(uniq "$HOST_FILE_PATH" | wc -l)
timeout=0
while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
do
  # One iteration per second, so 240 iterations is 4 minutes
  timeout=$((timeout + 1))
  if [ "$timeout" -gt 240 ]; then
    echo "Not all nodes joined within 4 minutes. Terminating. Recommend rerun."
    exit 1
  fi
  log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, will check again in 1 second"
  sleep 1
  lines=$(uniq "$HOST_FILE_PATH" | wc -l)
done
Should a node fail during startup, the master and the other workers will spin until the overall job timeout kills the job. Limiting the join time gets rid of the stuck job sooner.
- For TCP, you'll want an appropriate set of flags. The last one is key here; without it, the MPI network can get a packet from an IP it doesn't expect, causing all kinds of problems:
--mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0
- A small modification so that the status of the job reflects whatever the application returned:
<user's logic>
RESULT_CODE=$?
sleep 2
log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
echo "$RESULT_CODE" > "$AWS_BATCH_EXIT_CODE_FILE"
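As a rough sketch of the same idea, the exit-code capture could be wrapped in a small helper so it is hard to skip by accident. `run_and_record` is a hypothetical name, not part of the original script, and the fallback path for `AWS_BATCH_EXIT_CODE_FILE` is an assumption made only so the snippet runs on its own:

```shell
# Fallback path is an assumption so the sketch is self-contained;
# in the real script AWS_BATCH_EXIT_CODE_FILE is expected to be set already.
AWS_BATCH_EXIT_CODE_FILE="${AWS_BATCH_EXIT_CODE_FILE:-/tmp/batch-exit-code}"

# Hypothetical helper: run the given command, write its exit status
# to the exit code file, and return that same status to the caller.
run_and_record() {
  "$@"
  rc=$?
  echo "$rc" > "$AWS_BATCH_EXIT_CODE_FILE"
  return "$rc"
}
```

With this in place, something like `run_and_record sh -c 'exit 3'` would leave `3` in the exit code file and also return 3 to the caller, so the job status follows the application rather than the wrapper script.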
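For the TCP flags in the second item, here is a minimal sketch of how they might be passed to Open MPI's `mpirun`. The variables `MPI_IFACE` and `NUM_PROCS`, the fallback host file path, and `./my_mpi_app` are placeholders assumed for illustration; the command is only printed, not executed:

```shell
# Interface to pin MPI TCP traffic to; eth0 is the value used in the flags above.
MPI_IFACE="${MPI_IFACE:-eth0}"

# TCP-related MCA flags from the issue, kept in one variable.
MPI_TCP_FLAGS="--mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include $MPI_IFACE"

# Dry run: build and print the command instead of executing it.
# NUM_PROCS, the fallback host file path, and ./my_mpi_app are placeholders.
# MPI_TCP_FLAGS is left unquoted on purpose so it splits into separate arguments.
MPI_CMD=$(echo mpirun --hostfile "${HOST_FILE_PATH:-/tmp/hostfile}" -np "${NUM_PROCS:-2}" $MPI_TCP_FLAGS ./my_mpi_app)
echo "$MPI_CMD"
```

Keeping the interface name in a variable makes it easier to adjust if the container's network device is not called `eth0`.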