
MPI/SRUN usage on a single cluster node #40

@Fantasy98

Description


Background: DRL training using 4 CFD environments

  • Using a compute node with 4 GPUs and 64 CPUs

  • Using the following SmartSim configuration:

    smartsim:
      n_dbs: 1
      network_interface: "lo"
      run_command: "mpirun"
      launcher: "local"
    

Issue encountered:

  • All of the CFD environments run on GPU device:0, which leads to inefficient use of the node's compute resources.

  • The rank_file (i.e., .env000.txt) for launching mpirun is:

    rank 0=alvis4-05 slot=1        
    

Clearly, there is no binding of GPU devices: the rank file only specifies host and CPU-slot placement, so every rank falls back to the default device.
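
For reference, an OpenMPI rank file carries no GPU field even when it lists one entry per CFD rank. A hypothetical four-rank layout for this node (slot numbers are illustrative) still says nothing about which GPU each rank should use:

    rank 0=alvis4-05 slot=0
    rank 1=alvis4-05 slot=16
    rank 2=alvis4-05 slot=32
    rank 3=alvis4-05 slot=48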

Suggestions:

  • Modify the local / slurm configuration of SmartFlow to support this use case. We may consider either of the following paths:
  1. Make srun usable on a single cluster node (a configuration sketch follows this list).
  2. Incorporate the GPU-related arguments in the rank_file.
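
For option 1, a minimal sketch of what the configuration could look like if SmartFlow launched through srun under a Slurm allocation; the key names mirror the existing config, but whether SmartFlow already accepts these values is an assumption:

    smartsim:
      n_dbs: 1
      network_interface: "lo"
      run_command: "srun"
      launcher: "slurm"

srun can then bind one GPU to each task itself (e.g., via Slurm's --gpus-per-task=1), which mpirun with a plain rank file cannot express.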

I am currently working on option 2, and I shall keep you posted in this issue. @soaringxmc
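
Since the rank file has no GPU field, one way to realize option 2 is to pair the mpirun launch with a small per-rank wrapper that pins each local rank to a single GPU before the CFD environment initializes CUDA. The sketch below is only an illustration of that idea, not the actual SmartFlow change; the file name gpu_bind.py and the modulo mapping are assumptions:

    # gpu_bind.py - hypothetical per-rank GPU pinning wrapper (sketch for option 2)
    import os
    import sys

    N_GPUS = 4  # GPUs on the node (assumption: matches the 4-GPU node above)

    def pin_gpu() -> None:
        """Restrict this rank to one GPU before any CUDA context is created."""
        # OMPI_COMM_WORLD_LOCAL_RANK is set by OpenMPI's mpirun for each local rank.
        local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
        os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank % N_GPUS)

    if __name__ == "__main__":
        pin_gpu()
        # Hand off to the real CFD executable passed on the command line, e.g.
        #   mpirun -rf .env000.txt python gpu_bind.py <cfd_executable> <args>
        os.execvp(sys.argv[1], sys.argv[1:])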
