babysitter job to automatically kill hung jobs #8

@yaroslavvb

idea for babysitter job, cc @8enmann

job = ncluster.make_job(name=args.name,
                        run_name=f"{args.name}",
                        num_tasks=config.machines,
                        image_name=config.image_name,
                        instance_type=config.instance_type,
                        spot=not args.nospot,
                        skip_setup=args.skip_setup)

killer_task = ncluster.make_task()
killer_task.run(f'export AWS_ACCESS_KEY_ID={os.environ["AWS_ACCESS_KEY_ID"]}')
killer_task.run(f'export AWS_SECRET_ACCESS_KEY={os.environ["AWS_SECRET_ACCESS_KEY"]}')
killer_task.run(f'export AWS_DEFAULT_REGION={os.environ["AWS_DEFAULT_REGION"]}')
instance_names = ",".join(t.name for t in job.tasks)  # comma-separated task names
killer_task.run(f'python hung_job_killer.py --watchdir={job.logdir} --instances={instance_names}')

The hung_job_killer would check the --watchdir directory on a regular basis and kill all instances listed in --instances if that directory had no modifications for an hour.
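
A minimal sketch of what the polling loop in hung_job_killer.py could look like, assuming "no modifications" means that no file or directory under the watched directory has a recent mtime. The names latest_mtime, is_hung, babysit and kill_fn are hypothetical and not part of ncluster; kill_fn stands in for whatever termination logic ends up being used.

```python
import os
import time


def latest_mtime(watch_dir):
    """Return the newest modification time (epoch seconds) under watch_dir."""
    newest = os.path.getmtime(watch_dir)
    for root, dirs, files in os.walk(watch_dir):
        for name in dirs + files:
            try:
                newest = max(newest, os.path.getmtime(os.path.join(root, name)))
            except OSError:
                pass  # entry disappeared between listing and stat
    return newest


def is_hung(watch_dir, timeout_sec=3600, now=None):
    """True if nothing under watch_dir changed within the last timeout_sec."""
    now = time.time() if now is None else now
    return now - latest_mtime(watch_dir) > timeout_sec


def babysit(watch_dir, instance_names, kill_fn, timeout_sec=3600, poll_sec=60):
    """Poll watch_dir until it looks hung, then call kill_fn(instance_names)."""
    while not is_hung(watch_dir, timeout_sec):
        time.sleep(poll_sec)
    kill_fn(instance_names)
```

The --instances value would be split on commas before being passed in, e.g. babysit(args.watchdir, args.instances.split(","), kill_fn).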

Killing can be done with a subset of the logic from the ncluster command-line tool. Currently lookup_instances has special logic for exact_match, which kicks in when the fragment is wrapped in ''; this should probably be a keyword argument instead.
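
To illustrate the keyword-argument idea, here is a rough sketch of the matching rule in isolation. match_names is a hypothetical helper, not the actual lookup_instances code, and the instance names in the usage note are made up.

```python
def match_names(names, fragment, exact_match=False):
    """Select instance names by fragment.

    With exact_match=False a name matches if it contains the fragment;
    with exact_match=True only names equal to the fragment match.
    """
    if exact_match:
        return [n for n in names if n == fragment]
    return [n for n in names if fragment in n]
```

With names 'job.1' and 'job.10', the fragment 'job.1' selects both by default but only 'job.1' when exact_match=True, which is what the babysitter needs to avoid killing unrelated instances.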

def kill(fragment=''):
  instances = u.lookup_instances(fragment, valid_states=['running', 'stopped'])
  instances_to_kill = []
  for i in instances:
    state = i.state['Name']
    if LIMIT_TO_CURRENT_USER and i.key_name != u.get_keypair_name():
      print(f"Skipping instance launched with key {i.key_name}, use reallykill to kill")
      continue
    print(u.get_name(i), i.instance_type, i.key_name,
          state if state == 'stopped' else '')
    instances_to_kill.append(i)

  action = 'terminating'
  if not _check_instance_found(instances, fragment):
    return

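  ec2_client = u.get_ec2_client()
  # Confirm before terminating (a babysitter job would skip this prompt)
  answer = input(f"{len(instances_to_kill)} instances found, {action}? (y/N) ")
  if answer.lower() == "y":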
    instance_ids = [i.id for i in instances_to_kill]
    response = ec2_client.terminate_instances(InstanceIds=instance_ids)

    assert u.is_good_response(response), response
    print(f"{action}: success")
  else:
    print("Didn't get y, doing nothing")
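
For the babysitter, a non-interactive variant would be needed. The following is a sketch under assumptions: it takes a boto3 EC2 client as an argument, assumes instances carry a Name tag equal to the task name, and ignores pagination of describe_instances. kill_by_names is a hypothetical function, not part of ncluster.

```python
def kill_by_names(ec2_client, names):
    """Terminate running/stopped instances whose Name tag is in names.

    Returns the list of instance ids that were sent to terminate_instances.
    """
    if not names:
        return []
    response = ec2_client.describe_instances(Filters=[
        {"Name": "tag:Name", "Values": list(names)},  # exact tag match
        {"Name": "instance-state-name", "Values": ["running", "stopped"]},
    ])
    instance_ids = [inst["InstanceId"]
                    for reservation in response["Reservations"]
                    for inst in reservation["Instances"]]
    if instance_ids:  # terminate_instances rejects an empty id list
        ec2_client.terminate_instances(InstanceIds=instance_ids)
    return instance_ids
```

It could be called as kill_by_names(boto3.client("ec2"), names), relying on the AWS_* environment variables exported for the killer task.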
