| 1 | +Multinode Training |
| 2 | +================== |
| 3 | + |
| 4 | +.. _wuxibin89: https://github.com/wuxibin89 |
| 5 | + |
| 6 | +Author: `Xibin Wu <https://github.com/wuxibin89>`_ |
| 7 | + |
| 8 | +Manual |
| 9 | +------ |
| 10 | + |
| 11 | +Set up a multinode Ray cluster |
| 12 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 13 | +1. Start the head node with ``ray start --head --dashboard-host=0.0.0.0``. Note two addresses in its output: |
| 14 | + |
| 15 | +- GCS address: the ``<address>`` in ``ray start --address=<address>``, which worker nodes connect to. |
| 16 | +- Dashboard address: ``<address>:8265``, which you use to submit jobs to the cluster. |
| 17 | + |
| 18 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/head.png?raw=true |
| 19 | + |
| 20 | +2. Start each worker node with ``ray start --address=<address>``, using the GCS address from step 1. |
| 21 | + |
| 22 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/worker.png?raw=true |
| 23 | + |
| 24 | +3. Run ``ray status``; the cluster should now report 2 nodes. |
| 25 | + |
| 26 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/status.png?raw=true |
| 27 | + |
| 28 | +4. Optionally, open the dashboard in a browser at the dashboard address from step 1. |
| 29 | + |
| 30 | +*You may need to configure firewall rules to access the dashboard. If you run into trouble, contact your network administrator.* |
| 31 | + |
| 32 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/overview.png?raw=true |
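The two addresses from step 1 share the head node's IP and differ only in port. As a minimal sketch, the helper below derives both from that IP. ``cluster_addresses`` is a hypothetical name, and port 6379 is an assumption (Ray's default GCS port, which also appears in the debugger example further down). Prefer the addresses that ``ray start`` actually prints if they differ.

```python
def cluster_addresses(head_ip, gcs_port=6379, dashboard_port=8265):
    """Derive the addresses used in this guide from the head node's IP.

    gcs_port=6379 is an assumption (Ray's default GCS port);
    8265 is the dashboard port named in step 1.
    """
    gcs_address = f"{head_ip}:{gcs_port}"
    return {
        # Address that worker nodes pass to `ray start --address=...`.
        "gcs": gcs_address,
        # Address that jobs are submitted to with `ray job submit --address=...`.
        "dashboard": f"http://{head_ip}:{dashboard_port}",
        # Command to run on each worker node.
        "worker_cmd": f"ray start --address={gcs_address}",
    }


if __name__ == "__main__":
    for name, value in cluster_addresses("10.124.46.192").items():
        print(f"{name}: {value}")
```

Keeping the ports as parameters lets the same helper cover clusters started with non-default ports.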
| 33 | + |
| 34 | +Submit a job to the Ray cluster |
| 35 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 36 | +1. Submit a Ray job to the cluster using the dashboard address from the previous section. |
| 37 | + |
| 38 | +.. code-block:: bash |
| 39 | + |
| 40 | +    ray job submit --address="http://127.0.0.1:8265" \ |
| 41 | +        --runtime-env=verl/trainer/runtime_env.yaml \ |
| 42 | +        --no-wait \ |
| 43 | +        -- \ |
| 44 | +        python3 -m verl.trainer.main_ppo \ |
| 45 | +        trainer.n_gpus_per_node=8 \ |
| 46 | +        trainer.nnodes=2 \ |
| 47 | +        ... |
| 48 | + |
| 49 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/submit.png?raw=true |
| 50 | + |
| 51 | +2. Then you can check the job status with the following commands: |
| 52 | + |
| 53 | +- ``ray job list``: list all jobs submitted to the cluster. |
| 54 | +- ``ray job logs <Submission ID>``: query the logs of the job. |
| 55 | +- ``ray job status <Submission ID>``: query the status of the job. |
| 56 | +- ``ray job stop <Submission ID>``: request the job to be stopped. |
| 57 | + |
| 58 | +3. Driver, task, and actor logs are also available under ``/tmp/ray/session_latest/logs/``; the driver log is ``job-driver-raysubmit_<Submission ID>.log``. |
| 59 | + |
| 60 | +4. For multinode training, we strongly recommend viewing job details in the dashboard, because it presents the job information in a more structured way. |
| 61 | + |
| 62 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job.png?raw=true |
| 63 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job_detail.png?raw=true |
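The submit command from step 1 can also be assembled programmatically, for example from a launcher script. The following is a minimal sketch and not part of verl: ``build_submit_cmd`` is a hypothetical helper, it uses only the flags shown above, and the entrypoint arguments elided with ``...`` in the original command are left out here as well.

```python
import shlex


def build_submit_cmd(dashboard_address, entrypoint, runtime_env=None, no_wait=True):
    """Build the argv list for `ray job submit` using the flags shown above."""
    cmd = ["ray", "job", "submit", f"--address={dashboard_address}"]
    if runtime_env is not None:
        cmd.append(f"--runtime-env={runtime_env}")
    if no_wait:
        cmd.append("--no-wait")
    # Everything after `--` is the entrypoint that runs on the cluster.
    cmd.append("--")
    cmd.extend(entrypoint)
    return cmd


if __name__ == "__main__":
    cmd = build_submit_cmd(
        "http://127.0.0.1:8265",
        [
            "python3", "-m", "verl.trainer.main_ppo",
            "trainer.n_gpus_per_node=8",
            "trainer.nnodes=2",
        ],
        runtime_env="verl/trainer/runtime_env.yaml",
    )
    # Print a copy-pasteable command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when entrypoint arguments contain spaces or special characters.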
| 64 | + |
| 65 | + |
| 66 | +Slurm |
| 67 | +----- |
| 68 | +TBD |
| 69 | + |
| 70 | +How to debug? |
| 71 | +--------------------- |
| 72 | + |
| 73 | +Legacy Ray Debugger |
| 74 | +~~~~~~~~~~~~~~~~~~~ |
| 75 | +1. Ray has a built-in legacy `debugger <https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html>`_ for debugging distributed applications. To enable it, start the Ray cluster with ``RAY_DEBUG=legacy`` and ``--ray-debugger-external``. |
| 76 | + |
| 77 | +.. code-block:: bash |
| 78 | + |
| 79 | +    # start head node |
| 80 | +    RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external |
| 81 | +    # start worker node |
| 82 | +    RAY_DEBUG=legacy ray start --address='10.124.46.192:6379' --ray-debugger-external |
| 83 | + |
| 84 | +2. Set a breakpoint in your code and submit the job to the cluster. Then run ``ray debug`` and wait for the breakpoint to be hit: |
| 85 | + |
| 86 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/legacy.png?raw=true |
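Step 2 refers to a breakpoint in code that runs on the cluster. In Python this is the built-in ``breakpoint()`` call, placed inside the function you want to inspect. The sketch below uses a hypothetical function, ``scale_rewards``, which is not part of verl; in a real job the call would sit inside a Ray task or actor method.

```python
def scale_rewards(rewards, factor):
    """Hypothetical function standing in for code that runs in a Ray task or actor."""
    # Python's built-in breakpoint() dispatches to sys.breakpointhook.
    # Inside a Ray worker started as described above, execution is expected
    # to pause here until a debugger session attaches.
    breakpoint()
    return [r * factor for r in rewards]
```

Run outside Ray, the same call drops into the standard ``pdb`` prompt, which makes it easy to try the function locally before submitting the job.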
| 87 | + |
| 88 | +Ray Distributed Debugger VSCode Extension |
| 89 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 90 | + |
| 91 | +1. Starting with Ray 2.39, Anyscale introduced a new `Ray Distributed Debugger <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`_ VSCode extension. Follow the linked instructions to install the extension, then add your cluster using the dashboard address obtained during cluster setup. |
| 92 | + |
| 93 | +*NOTE: Remember to remove RAY_DEBUG=legacy and --ray-debugger-external from your ray start commands.* |
| 94 | + |
| 95 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/debugger.png?raw=true |
| 96 | + |
| 97 | +2. Set a breakpoint in your code and submit the job to the cluster. The extension then shows the breakpoint information. |
| 98 | + |
| 99 | +.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/breakpoint.png?raw=true |