add npu megatron multi-node example #7321
`node1.sh` (new file, 33 lines):

```shell
# Atlas A2 * 2 nodes * 8 cards per node

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=127.0.0.1 \
MASTER_PORT=29500 \
NPROC_PER_NODE=8 \
```

Suggested change (adds a comment documenting the `xxx` placeholder):

```diff
 NPROC_PER_NODE=8 \
+# Replace 'xxx' with your actual network interface name (e.g., eth0, ens33).
```
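To fill in that placeholder, the interface names available on a host can be listed as below (a generic Linux sketch, not part of the PR; the network-interface variable should name the interface that carries inter-node traffic):

```shell
# Each entry under /sys/class/net is a kernel network interface name
# (e.g., lo, eth0, ens33); pick the one the nodes use to reach each other.
ls /sys/class/net
```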
The scripts `node1.sh` and `node2.sh` are nearly identical, which introduces code duplication and can make maintenance difficult. Consider merging them into a single script that accepts `NODE_RANK` and `MASTER_ADDR` as command-line arguments. This would make the example cleaner, more robust, and easier for users to adapt. For example, a single `run.sh` could be used as `bash run.sh <NODE_RANK> <MASTER_ADDR>`.
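A minimal sketch of the merged script the reviewer proposes (hypothetical; `run.sh` is not part of this PR). Only the node rank and master address differ between the two nodes, so those become positional arguments and everything else is shared. The defaults exist only so the sketch runs standalone; real use passes both arguments, and the actual `megatron sft` launch (same flags as in the example) would replace the final `echo`:

```shell
#!/bin/bash
# Hypothetical run.sh combining node1.sh and node2.sh (reviewer's suggestion).
# Usage: bash run.sh <NODE_RANK> <MASTER_ADDR>
set -eu

NODE_RANK=${1:-0}            # 0 on the master node, 1 on the second node
MASTER_ADDR=${2:-127.0.0.1}  # IP of the master node, reachable from all nodes

# Settings shared by both nodes; only rank and master address differ.
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export MASTER_PORT=29500
export NPROC_PER_NODE=8
export NODE_RANK MASTER_ADDR

echo "launching rank ${NODE_RANK} against ${MASTER_ADDR}:${MASTER_PORT}"
# The real script would invoke the training command here, e.g.:
# megatron sft --model 'Qwen/Qwen3-8B' ... --num_workers 4
```

Node 1 would then run `bash run.sh 0 <master-ip>` and node 2 `bash run.sh 1 <master-ip>`.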
`node2.sh` (new file, 33 lines):

```shell
# Atlas A2 * 2 nodes * 8 cards per node

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=2 \
NODE_RANK=1 \
MASTER_ADDR=xxx.xxx.xxx.xxx \
MASTER_PORT=29500 \
NPROC_PER_NODE=8 \
HCCL_SOCKET_IFNAME=xxx \
```
Contributor comment on lines +6 to +9:
The environment variables above continue into the training command itself:

```shell
megatron sft \
    --model 'Qwen/Qwen3-8B' \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#1000' \
    --save './SAVE' \
    --train_type 'lora' \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules 'all-linear' \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 64 \
    --recompute_granularity selective \
    --recompute_modules core_attn \
    --cross_entropy_loss_fusion true \
    --no_gradient_accumulation_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --max_epochs 1 \
    --log_interval 5 \
    --num_workers 4
```
`MASTER_ADDR` is set to `127.0.0.1`, which is incorrect for a multi-node setup, as it refers to the local machine. For this example to work across multiple nodes, this should be the IP address of the master node, which must be reachable from all other nodes. Please use a placeholder like in `node2.sh`.