Few trivial changes we've noted  #5

@c200chromebook

  1. Consider adding a timeout while nodes are joining, e.g.:
# $lines is assumed to be set earlier in the script.
timeout=0
while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
do
  # One iteration per second, so 240 iterations is roughly 4 minutes.
  timeout=$((timeout + 1))
  if [ "$timeout" -gt 240 ]; then
    echo "Not all nodes joined within 4 minutes. Terminating. Recommend rerun."
    exit 1
  fi
  log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, will check again in 1 second"
  sleep 1
  lines=$(uniq "$HOST_FILE_PATH" | wc -l)
done

Should a node fail during startup, the master and the other workers will spin until the overall job timeout kills the job. Limiting the join time gets rid of the stuck job much sooner.
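The same idea can be wrapped in a reusable function. This is only a sketch: `wait_for_nodes` and its parameters are hypothetical names, not part of the existing script, and it swaps `uniq` for `sort -u` so that duplicate host entries are collapsed even when they are not adjacent.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script): wait until the host file
# lists the expected number of distinct nodes, or give up after a deadline.
# Usage: wait_for_nodes HOST_FILE EXPECTED [MAX_WAIT_SECONDS] [POLL_SECONDS]
wait_for_nodes() {
  local host_file=$1 expected=$2 max_wait=${3:-240} interval=${4:-1}
  local waited=0 joined=0
  while :; do
    if [ -f "$host_file" ]; then
      # sort -u counts distinct lines; $(( )) strips any padding from wc.
      joined=$(( $(sort -u "$host_file" | wc -l) ))
    fi
    if [ "$joined" -ge "$expected" ]; then
      return 0
    fi
    if [ "$waited" -ge "$max_wait" ]; then
      echo "Only $joined of $expected nodes joined within ${max_wait}s." >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Assuming the variables from the snippet above, the call site would look something like `wait_for_nodes "$HOST_FILE_PATH" "$AWS_BATCH_JOB_NUM_NODES" 240 || exit 1`.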

  2. For TCP, you'll want an appropriate set of flags. The last one is key: without it, the MPI network can receive a packet from an IP it doesn't expect, which causes all kinds of problems:
--mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0

  3. A small modification so that the job status reflects whatever exit code the application returned:

  <user's logic>
  # Capture the exit status immediately, before any other command overwrites $?.
  RESULT_CODE=$?
  sleep 2
  log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
  echo "$RESULT_CODE" > "$AWS_BATCH_EXIT_CODE_FILE"
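As a rough illustration of how points 2 and 3 fit together, the application command can be run through a small wrapper that records its exit status. `run_and_record` is a hypothetical helper name, and the `mpirun` line in the comment assumes Open MPI is installed and uses `./my_app` as a placeholder for the user's program.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script): run a command, write its
# exit status to a file, and return that same status to the caller.
# Usage: run_and_record EXIT_CODE_FILE COMMAND [ARGS...]
run_and_record() {
  local exit_code_file=$1 result_code
  shift
  "$@"
  result_code=$?   # captured right away, before anything else resets $?
  echo "$result_code" > "$exit_code_file"
  return "$result_code"
}

# Example call with the TCP flags from point 2 (./my_app is a placeholder):
# run_and_record "$AWS_BATCH_EXIT_CODE_FILE" \
#   mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0 ./my_app
```

Returning the status as well as writing it lets the calling script decide whether to log, sleep, or shut down before exiting.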
