Hi!
I ran into an issue when launching the framework on a cluster.
Encountered issue:
When using a single GPU node on the JUWELS cluster, I encountered an issue with the slurm launcher:
# Setting up "slurm" runtime...
!! Failed: Less than 2 hosts found in environment! !!
# Trying to setup LOCAL runtime instead...
# Success!
# Starting the Orchestrator...
# Success!
# Use this command to shutdown database if not terminated correctly:
# $(smart dbcli) -h 127.0.0.1 -p 6557 shutdown
# Configuration of runtime environment:
# Scheduler: local
# Hosts: ['jwb0129']
This then raises a runtime error, because the scheduler has been switched to local while the run command is still srun:
File "/p/scratch/deepwing/yuningw/04-Reproduce-SOD2D/examples/juwels_gpu_cyl_rl/ywsmf/runtime.py", line 230, in launch_models
raise ValueError('srun launcher only supported for SLURM scheduler!')
ValueError: srun launcher only supported for SLURM scheduler!
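To illustrate what I think is happening, here is a minimal sketch of the failure chain. It is not the framework's actual code: `select_scheduler`, `check_run_command`, and the `min_hosts` threshold are hypothetical names I made up, and the logic is only my assumption based on the log and traceback above.

```python
# Hypothetical reconstruction of the fallback behaviour seen in the log.
# None of these names come from the framework; they are illustrative only.

def select_scheduler(hosts, min_hosts=2):
    """Assumed rule: use SLURM only if at least `min_hosts` hosts are found."""
    return "slurm" if len(hosts) >= min_hosts else "local"


def check_run_command(run_command, scheduler):
    """Assumed check, mirroring the error message in the traceback."""
    if run_command == "srun" and scheduler != "slurm":
        raise ValueError("srun launcher only supported for SLURM scheduler!")


hosts = ["jwb0129"]                   # single node, as in the log
scheduler = select_scheduler(hosts)   # falls back to "local"

try:
    check_run_command("srun", scheduler)
    outcome = "ok"
except ValueError as err:
    outcome = str(err)

print(scheduler, "->", outcome)
```

If this guess is right, the config and the fallback contradict each other on a single node: the runtime silently becomes local, but `run_command: "srun"` is still applied afterwards.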
Background:
In config.yml:
smartsim:
  n_dbs: 1
  network_interface: "ib0"
  run_command: "srun"
  launcher: "slurm"
In the sbatch script:
#SBATCH -t 01:00:00
#SBATCH -p develbooster
#SBATCH --exclusive
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
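In case it helps with debugging, the allocation can be inspected from inside the job with a small script like the one below. This is only a sketch: `slurm_summary` is a helper name I made up, and it simply reads the standard environment variables that SLURM exports inside a job (a variable is reported as `None` if it is not set).

```python
import os


def slurm_summary(env=None):
    """Collect a few standard SLURM job variables; missing ones map to None."""
    env = os.environ if env is None else env
    keys = (
        "SLURM_JOB_ID",
        "SLURM_JOB_NUM_NODES",
        "SLURM_JOB_NODELIST",
        "SLURM_NTASKS_PER_NODE",
        "SLURM_CPUS_PER_TASK",
    )
    return {key: env.get(key) for key in keys}


if __name__ == "__main__":
    # Run inside the sbatch job to see what the allocation looks like.
    for key, value in slurm_summary().items():
        print(f"{key}={value}")
```

With the sbatch header above I would expect `SLURM_JOB_NUM_NODES` to be 1, which would match the single host `jwb0129` reported in the log.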
My questions:
- How can I solve this issue?
- Why are 2 hosts required?
- Are there any tricks for the SBATCH settings?
- Is there any room for improvement?
Please let me know your thoughts on this, thanks a lot!