Launching SmartRedis on the cluster #32

@Fantasy98

Hi!
I ran into an issue when launching the framework on a cluster.

Encountered issue:

When using a single GPU node on the JUWELS cluster, I encountered an issue with the slurm launcher:

  # Setting up "slurm" runtime...
   !! Failed: Less than 2 hosts found in environment! !!
  # Trying to setup LOCAL runtime instead...
  # Success!
  # Starting the Orchestrator...
  # Success!
  
  # Use this command to shutdown database if not terminated correctly:
  # $(smart dbcli) -h 127.0.0.1 -p 6557 shutdown
  
  # Configuration of runtime environment:
  #   Scheduler: local
  #   Hosts:     ['jwb0129']

Because the runtime falls back to the local scheduler, launching the models then fails with a runtime error:

  File "/p/scratch/deepwing/yuningw/04-Reproduce-SOD2D/examples/juwels_gpu_cyl_rl/ywsmf/runtime.py", line 230, in launch_models
    raise ValueError('srun launcher only supported for SLURM scheduler!')
ValueError: srun launcher only supported for SLURM scheduler!
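
To make the conflict concrete, here is a minimal sketch of what I think is happening, inferred only from the log and the traceback above. The function names `resolve_scheduler` and `check_run_command` are hypothetical and are not the actual code in runtime.py; the 2-host threshold and the error message are taken from the output above.

```python
def resolve_scheduler(requested: str, hosts: list[str]) -> str:
    """Hypothetical fallback: 'slurm' with fewer than 2 hosts becomes 'local'."""
    if requested == "slurm" and len(hosts) < 2:
        return "local"
    return requested


def check_run_command(run_command: str, scheduler: str) -> None:
    """Hypothetical check mirroring the ValueError seen in the traceback."""
    if run_command == "srun" and scheduler != "slurm":
        raise ValueError("srun launcher only supported for SLURM scheduler!")


# Single-node job: one host, launcher "slurm" and run_command "srun" in config.yml
scheduler = resolve_scheduler("slurm", ["jwb0129"])
try:
    check_run_command("srun", scheduler)
except ValueError as err:
    print(f"scheduler={scheduler}: {err}")
```

If this reading is right, the `run_command: "srun"` setting is never reconciled with the silent fallback to the local scheduler, which would explain why a single-node job always ends in this error.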

Background:

In config.yml:

      smartsim:
        n_dbs: 1
        network_interface: "ib0"
        run_command: "srun"
        launcher: "slurm"

In the sbatch script:

      #SBATCH -t 01:00:00
      #SBATCH -p develbooster
      #SBATCH --exclusive
      #SBATCH -N 1
      #SBATCH --ntasks-per-node=1
      #SBATCH --cpus-per-task=48
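
For reference, a small check like the following could be run inside the job to confirm how many nodes the allocation actually has. It is only a sketch: `SLURM_JOB_NUM_NODES` is the standard Slurm environment variable for the node count, while the helper name `slurm_node_count` is my own and not part of the framework.

```python
import os


def slurm_node_count(env=os.environ) -> int:
    """Return the node count of the current Slurm allocation, or 0 outside a job."""
    return int(env.get("SLURM_JOB_NUM_NODES", "0"))


print(f"Nodes in allocation: {slurm_node_count()}")
```

With `#SBATCH -N 1` as above, I would expect this to report a single node, which matches the single host listed in the log.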

My questions:

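  1. How can I solve this issue?
  2. Why does the slurm launcher require at least 2 hosts?
  3. Are there any recommended SBATCH settings for a single-node run?
  4. Is there anything else in my setup that could be improved?
    Please let me know your thoughts on this, thanks a lot!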
