
Commit 2fd6591

[megatron,ci] chore: update instructions and scripts for LoRA (#4533)
### What does this PR do?

Now that several important fixes have been merged into Megatron-Bridge, it's better to update the instructions so that everything can really work correctly.

Related to:
- NVIDIA-NeMo/Megatron-Bridge#1564
- NVIDIA-NeMo/Megatron-Bridge#1603
- NVIDIA-NeMo/Megatron-Bridge#1627
- NVIDIA-NeMo/Megatron-Bridge#1628

Fix: #4303

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

<sub>✨ Presented to you with <a href="https://macaron.im/mindlab">Mind Lab</a> - A Lab for Experiential Intelligence.</sub>

Signed-off-by: Hollow Man <hollowman@opensuse.org>
1 parent ebec85d commit 2fd6591

File tree

4 files changed (+37, -11 lines changed)

.github/workflows/e2e_ppo_trainer_megatron_vllm.yml

Lines changed: 2 additions & 2 deletions
@@ -143,8 +143,8 @@ jobs:
       - name: clean up and install Megatron-Bridge
         run: |
           rm -rf checkpoints
-          pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@af21db0 --no-deps --no-build-isolation
-          pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@3cbe5c6 --no-deps --no-build-isolation
+          pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@a489bed --no-deps --no-build-isolation
+          pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@2d398b4 --no-deps --no-build-isolation
           pip3 install "nvidia-modelopt[torch]>=0.37.0" transformers==4.57.1
       - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron, use Megatron-Bridge LoRA e2e to pre-load and save (Deepseek)
         run: |

.github/workflows/e2e_ppo_trainer_megatron_vllm_2.yml

Lines changed: 2 additions & 2 deletions
@@ -122,8 +122,8 @@ jobs:
       - name: Install the current repository
         run: |
           pip3 install --no-deps -e .[test]
-          pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@af21db0 --no-deps --no-build-isolation
-          pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@3cbe5c6 --no-deps --no-build-isolation
+          pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@a489bed --no-deps --no-build-isolation
+          pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@2d398b4 --no-deps --no-build-isolation
           pip3 install "nvidia-modelopt[torch]>=0.37.0" transformers==4.57.1
       - name: Prepare GSM8K dataset
         run: |

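Both workflow updates above pin the same dependency commits. As a reference only, a local environment can be brought in line with CI by running the same commands outside the workflow; this minimal sketch is copied from the updated steps (the editable `.[test]` install comes from the second workflow and assumes you are in the verl repo root):

```bash
# Sketch: reproduce the CI dependency pins from this PR locally.
pip3 install --no-deps -e .[test]   # install verl itself, from the repo root
pip3 install git+https://github.com/NVIDIA-NeMo/Megatron-Bridge.git@a489bed --no-deps --no-build-isolation
pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@2d398b4 --no-deps --no-build-isolation
pip3 install "nvidia-modelopt[torch]>=0.37.0" transformers==4.57.1
```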
examples/grpo_trainer/run_qwen2-7b_math_megatron_lora.sh

Lines changed: 18 additions & 2 deletions
@@ -1,6 +1,11 @@
 #!/usr/bin/env bash
 set -xeuo pipefail
 
+# Need to install Megatron-Bridge
+# NOTE: Make sure you use Megatron-Bridge later than 0.2.0
+# (Recommend https://github.com/NVIDIA-NeMo/Megatron-Bridge/commit/a489bed3a2410ed9b000ec13a3c90176fec7d99c or later)
+# for proper MoE LoRA support.
+
 # For Megatron communication/computation overlapping
 export CUDA_DEVICE_MAX_CONNECTIONS=1
 
@@ -41,8 +46,16 @@ DATA=(
 
 MODEL=(
     actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct
-    actor_rollout_ref.model.lora.rank=16
-    actor_rollout_ref.model.lora.alpha=32
+    actor_rollout_ref.model.lora.rank=256
+    actor_rollout_ref.model.lora.alpha=512
+    actor_rollout_ref.model.lora.lora_A_init_method=kaiming
+    # # Optional: Use canonical LoRA
+    # actor_rollout_ref.model.lora.type="canonical_lora"
+    # actor_rollout_ref.model.lora.target_modules='["linear_q","linear_k","linear_v","linear_proj","linear_fc1_up","linear_fc1_gate","linear_fc2"]'
+
+    # # Optional: Add dropout to LoRA layers
+    # actor_rollout_ref.model.lora.dropout=0.05
+    # actor_rollout_ref.model.lora.dropout_position=pre
 )
 
 ACTOR=(
@@ -58,6 +71,9 @@ ACTOR=(
     actor_rollout_ref.actor.kl_loss_coef=0.001
     actor_rollout_ref.actor.kl_loss_type=low_var_kl
    actor_rollout_ref.actor.entropy_coeff=0
+    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
 )
 
 ROLLOUT=(

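For context, the new `+actor_rollout_ref.actor.megatron.override_transformer_config.*` entries use Hydra's `+key=value` syntax to append keys that are not defined in the base config, and the bash arrays are later expanded into the trainer command line. The following is a minimal sketch of how these overrides are consumed, assuming the script's usual `python3 -m verl.trainer.main_ppo` launch pattern (the launch line itself is not part of this diff, and the real script passes more argument groups):

```bash
#!/usr/bin/env bash
# Minimal sketch: only the pieces touched by this diff, not the full script.
MODEL=(
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct
    actor_rollout_ref.model.lora.rank=256
    actor_rollout_ref.model.lora.alpha=512
    actor_rollout_ref.model.lora.lora_A_init_method=kaiming
)
ACTOR=(
    # The leading '+' is Hydra syntax for adding a key the base config does not define.
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
)
# Assumed entry point; the real script also passes DATA, ROLLOUT, etc.
python3 -m verl.trainer.main_ppo "${MODEL[@]}" "${ACTOR[@]}"
```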
examples/grpo_trainer/run_qwen3moe-30b_megatron_lora.sh

Lines changed: 15 additions & 5 deletions
@@ -3,9 +3,11 @@ set -xeuo pipefail
 
 # Need to install Megatron-Bridge
 # NOTE: Make sure you use Megatron-Bridge later than 0.2.0
-# (after https://github.com/NVIDIA-NeMo/Megatron-Bridge/commit/36302b7ca1305f0690e17cf4e4019ac822746872)
-# for MoE LoRA When you want to set ETP and ETP!=TP.
-# https://github.com/NVIDIA-NeMo/Megatron-Bridge/issues/1363
+# (Recommend https://github.com/NVIDIA-NeMo/Megatron-Bridge/commit/a489bed3a2410ed9b000ec13a3c90176fec7d99c or later)
+# for proper MoE LoRA support.
+
+# For Megatron communication/computation overlapping
+export CUDA_DEVICE_MAX_CONNECTIONS=1
 
 ########################### Quick Config ###########################
 
@@ -41,9 +43,17 @@ DATA=(
 
 MODEL=(
     actor_rollout_ref.model.path=Qwen/Qwen3-30B-A3B-Instruct-2507
-    actor_rollout_ref.model.lora.rank=16
-    actor_rollout_ref.model.lora.alpha=32
     actor_rollout_ref.model.use_fused_kernels=True
+    actor_rollout_ref.model.lora.rank=32
+    actor_rollout_ref.model.lora.alpha=64
+    actor_rollout_ref.model.lora.lora_A_init_method=kaiming
+    # # Optional: Use canonical LoRA
+    # actor_rollout_ref.model.lora.type="canonical_lora"
+    # actor_rollout_ref.model.lora.target_modules='["linear_q","linear_k","linear_v","linear_proj","linear_fc1_up","linear_fc1_gate","linear_fc2"]'
+
+    # # Optional: Add dropout to LoRA layers
+    # actor_rollout_ref.model.lora.dropout=0.05
+    # actor_rollout_ref.model.lora.dropout_position=pre
 )
 
 ACTOR=(

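If the optional canonical-LoRA variant is wanted in either script, the commented lines are simply uncommented. A sketch of the resulting `MODEL` block for the Qwen3-MoE script, with all values taken verbatim from the diff above:

```bash
MODEL=(
    actor_rollout_ref.model.path=Qwen/Qwen3-30B-A3B-Instruct-2507
    actor_rollout_ref.model.use_fused_kernels=True
    actor_rollout_ref.model.lora.rank=32
    actor_rollout_ref.model.lora.alpha=64
    actor_rollout_ref.model.lora.lora_A_init_method=kaiming
    # Canonical LoRA with explicit target modules (the previously commented-out options):
    actor_rollout_ref.model.lora.type="canonical_lora"
    actor_rollout_ref.model.lora.target_modules='["linear_q","linear_k","linear_v","linear_proj","linear_fc1_up","linear_fc1_gate","linear_fc2"]'
    # Optional LoRA dropout, also from the commented lines above:
    actor_rollout_ref.model.lora.dropout=0.05
    actor_rollout_ref.model.lora.dropout_position=pre
)
```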