Commit 9ae01af

doc: add multinode training and debug tutorial (verl-project#585)
verl-project#354
1 parent cae8d2f commit 9ae01af

3 files changed

+101
-1
lines changed

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ verl is fast with:
3232

3333
start/install
3434
start/quickstart
35+
start/multinode
3536

3637
.. toctree::
3738
:maxdepth: 4

docs/start/multinode.rst

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
1+
Multinode Training
2+
==================
3+
4+
.. _wuxibin89: https://github.com/wuxibin89
5+
6+
Author: `Xibin Wu <https://github.com/wuxibin89>`_
7+
8+
Manual
9+
------
10+
11+
Set up multinode ray cluster
12+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13+
1. Start the head node with ``ray start --head --dashboard-host=0.0.0.0``. There are two addresses you should note:
14+
15+
- GCS address: the ``<address>`` in ``ray start --address=<address>``, which worker nodes should connect to.
16+
- Dashboard address: ``<address>:8265``, which you should submit jobs to.
17+
18+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/head.png?raw=true
19+
20+
2. Start each worker node with ``ray start --address=<address>``, using the GCS address you got above.
21+
22+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/worker.png?raw=true
23+
24+
3. Run ``ray status``; the cluster should now show 2 nodes.
25+
26+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/status.png?raw=true
27+
28+
4. Additionally, you can open the dashboard in a browser at the dashboard address you got above.
29+
30+
*Firewall rules may need to be configured to allow access to the dashboard. If you have any trouble, please contact your network administrator.*
31+
32+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/overview.png?raw=true
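
The two addresses from step 1 can be derived from the head node IP. The sketch below is only an illustration: ``cluster_addresses`` is a hypothetical helper, not part of verl or Ray, and it assumes the ports shown in this tutorial (``6379`` for GCS, ``8265`` for the dashboard) have not been overridden in ``ray start``.

.. code-block:: python

    def cluster_addresses(head_ip, gcs_port=6379, dashboard_port=8265):
        """Return the GCS and dashboard addresses for a head node IP.

        Hypothetical helper: the ports are assumed defaults; adjust them
        if your ``ray start`` command overrides them.
        """
        return {
            # Worker nodes pass this to ``ray start --address=...``.
            "gcs": f"{head_ip}:{gcs_port}",
            # Jobs are submitted to this address.
            "dashboard": f"http://{head_ip}:{dashboard_port}",
        }


    # Example IP only; replace it with your own head node IP.
    addrs = cluster_addresses("10.124.46.192")
    print("worker nodes connect with: ray start --address=" + addrs["gcs"])
    print("submit jobs to: " + addrs["dashboard"])
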
33+
34+
Submit job to ray cluster
35+
~~~~~~~~~~~~~~~~~~~~~~~~~
36+
1. Submit a Ray job to the cluster using the dashboard address you got above.
37+
38+
.. code-block:: bash
39+
40+
ray job submit --address="http://127.0.0.1:8265" \
41+
--runtime-env=verl/trainer/runtime_env.yaml \
42+
--no-wait \
43+
-- \
44+
python3 -m verl.trainer.main_ppo \
45+
trainer.n_gpus_per_node=8 \
46+
trainer.nnodes=2 \
47+
...
48+
49+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/submit.png?raw=true
50+
51+
2. Then you can check the job status with the following commands:
52+
53+
- ``ray job list``: list all jobs submitted to the cluster.
54+
- ``ray job logs <Submission ID>``: query the logs of the job.
55+
- ``ray job status <Submission ID>``: query the status of the job.
56+
- ``ray job stop <Submission ID>``: request the job to be stopped.
57+
58+
3. You can also find driver/task/actor logs in ``/tmp/ray/session_latest/logs/``; the driver log is ``job-driver-raysubmit_<Submission ID>.log``.
59+
60+
4. For multinode training, we strongly recommend viewing job details in the dashboard, because it provides a more structured view of the job information.
61+
62+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job.png?raw=true
63+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job_detail.png?raw=true
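
If you submit jobs from a script, the command in step 1 can be assembled programmatically. This is a minimal sketch, not a verl API: ``build_submit_command`` is a hypothetical helper, and only the two trainer overrides shown above are included, since the remaining ones are elided in this tutorial.

.. code-block:: python

    import shlex


    def build_submit_command(dashboard_address, runtime_env, entrypoint, overrides):
        """Assemble the ``ray job submit`` argument list from step 1.

        Hypothetical helper for illustration only.
        """
        cmd = [
            "ray", "job", "submit",
            f"--address={dashboard_address}",
            f"--runtime-env={runtime_env}",
            "--no-wait",
            "--",  # everything after this is the job entrypoint
        ]
        cmd += list(entrypoint)
        # Trainer overrides are passed as key=value arguments.
        cmd += [f"{key}={value}" for key, value in overrides.items()]
        return cmd


    cmd = build_submit_command(
        "http://127.0.0.1:8265",
        "verl/trainer/runtime_env.yaml",
        ["python3", "-m", "verl.trainer.main_ppo"],
        {"trainer.n_gpus_per_node": 8, "trainer.nnodes": 2},
    )
    print(shlex.join(cmd))

Passing the list to ``subprocess.run(cmd, check=True)`` would execute it without going through a shell.
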
64+
65+
66+
Slurm
67+
-----
68+
TBD
69+
70+
How to debug?
71+
---------------------
72+
73+
Legacy Ray Debugger
74+
~~~~~~~~~~~~~~~~~~~
75+
1. Ray has a built-in legacy `debugger <https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html>`_ that allows you to debug your distributed applications. To enable the debugger, start the Ray cluster with ``RAY_DEBUG=legacy`` and ``--ray-debugger-external``.
76+
77+
.. code-block:: bash
78+
79+
# start head node
80+
RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external
81+
# start worker node
82+
RAY_DEBUG=legacy ray start --address='10.124.46.192:6379' --ray-debugger-external
83+
84+
2. Set a breakpoint in your code and submit the job to the cluster. Then run ``ray debug`` to wait for the breakpoint to be hit:
85+
86+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/legacy.png?raw=true
87+
88+
Ray Distributed Debugger VSCode Extension
89+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
90+
91+
1. Starting with Ray 2.39, Anyscale introduced a new `Ray Distributed Debugger <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`_ VSCode extension. Please follow the instructions to install the extension, and then add the cluster using the dashboard address you got above.
92+
93+
*NOTE: Don't forget to remove RAY_DEBUG=legacy and --ray-debugger-external from the ray start command.*
94+
95+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/debugger.png?raw=true
96+
97+
2. Set a breakpoint in your code and submit the job to the cluster. The extension will then show the breakpoint information.
98+
99+
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/breakpoint.png?raw=true
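
The two debuggers need different ``ray start`` invocations: the legacy debugger requires ``RAY_DEBUG=legacy`` and ``--ray-debugger-external``, while the VSCode extension requires both to be absent. The sketch below makes that difference explicit. ``ray_start_command`` is a hypothetical helper, not part of Ray or verl, and the IP address is only the example value used in this tutorial.

.. code-block:: python

    def ray_start_command(head, gcs_address=None, legacy_debugger=False):
        """Build a ``ray start`` command line for a head or worker node.

        Hypothetical helper: with ``legacy_debugger=True`` it adds the
        environment variable and flag needed by the legacy debugger; leave
        it ``False`` when using the VSCode extension.
        """
        parts = []
        if legacy_debugger:
            parts.append("RAY_DEBUG=legacy")
        parts += ["ray", "start"]
        if head:
            parts += ["--head", "--dashboard-host=0.0.0.0"]
        else:
            # Worker nodes connect to the head node's GCS address.
            parts.append(f"--address='{gcs_address}'")
        if legacy_debugger:
            parts.append("--ray-debugger-external")
        return " ".join(parts)


    # Legacy debugger: head node, then a worker node.
    print(ray_start_command(head=True, legacy_debugger=True))
    print(ray_start_command(head=False, gcs_address="10.124.46.192:6379", legacy_debugger=True))
    # VSCode extension: same commands without the legacy settings.
    print(ray_start_command(head=True))
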

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ peft
1313
pyarrow>=15.0.0
1414
pybind11
1515
pylatexenc
16-
ray[data,train,tune,serve]
16+
ray[default]
1717
tensordict<0.6
1818
torchdata
1919
transformers
