Merged
33 changes: 33 additions & 0 deletions examples/ascend/multi-node/megatron/node1.sh
@@ -0,0 +1,33 @@
# Atlas A2 * 2 nodes * 8 cards per node

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=127.0.0.1 \
Contributor

critical

MASTER_ADDR is set to 127.0.0.1, which is incorrect for a multi-node setup as it refers to the local machine. For this example to work across multiple nodes, this should be the IP address of the master node, which must be reachable from all other nodes. Please use a placeholder like in node2.sh.

Suggested change
- MASTER_ADDR=127.0.0.1 \
+ MASTER_ADDR=xxx.xxx.xxx.xxx \

MASTER_PORT=29500 \
NPROC_PER_NODE=8 \
Copilot AI Jan 8, 2026

The placeholder value 'xxx' for HCCL_SOCKET_IFNAME needs to be replaced with the actual network interface name. Consider adding a comment explaining that users must replace this placeholder with their actual network interface name (e.g., eth0, ens33).

Suggested change
- NPROC_PER_NODE=8 \
+ NPROC_PER_NODE=8 \
+ # Replace 'xxx' with your actual network interface name (e.g., eth0, ens33).

HCCL_SOCKET_IFNAME=xxx \
Contributor

medium

The value xxx for HCCL_SOCKET_IFNAME is a placeholder. It would be helpful to add a comment above this line explaining that this needs to be replaced with the actual network interface name used for communication between nodes (e.g., eth0).
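One caveat when adding such a comment: in a backslash-continued command, a `#` comment line placed between the continued lines ends the command at that point, so the settings before it no longer reach the launched process. The sketch below illustrates this. It uses `sh -c` as a stand-in for `megatron sft`, and the address `10.0.0.1` and interface `eth0` are illustrative values, not taken from the PR.

```shell
# Start from a clean slate so the demonstration is deterministic.
unset MASTER_ADDR HCCL_SOCKET_IFNAME

# Stand-in for megatron: reports which settings reached the child process.
probe='echo "addr=${MASTER_ADDR:-unset} if=${HCCL_SOCKET_IFNAME:-unset}"'

# Safe: explanatory comments sit above the whole command.
# MASTER_ADDR: IP of the master node, reachable from every node.
# HCCL_SOCKET_IFNAME: interface used for inter-node traffic (e.g. eth0).
safe=$(
    MASTER_ADDR=10.0.0.1 \
    HCCL_SOCKET_IFNAME=eth0 \
    sh -c "$probe"
)
echo "$safe"      # prints: addr=10.0.0.1 if=eth0

# Broken: the comment follows a trailing backslash, so the first line
# becomes a plain shell assignment and MASTER_ADDR never reaches the child.
broken=$(
    MASTER_ADDR=10.0.0.1 \
    # Replace eth0 with the real interface name.
    HCCL_SOCKET_IFNAME=eth0 \
    sh -c "$probe"
)
echo "$broken"    # prints: addr=unset if=eth0
```

Placing the explanatory comment above the first environment variable, next to the existing `# Atlas A2 ...` header, avoids the problem.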

megatron sft \
--model 'Qwen/Qwen3-8B' \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#1000' \
--save './SAVE' \
--train_type 'lora' \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules 'all-linear' \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--context_parallel_size 1 \
--sequence_parallel true \
--micro_batch_size 1 \
--global_batch_size 64 \
--recompute_granularity selective \
--recompute_modules core_attn \
--cross_entropy_loss_fusion true \
--no_gradient_accumulation_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
--log_interval 5 \
--num_workers 4
Comment on lines +3 to +33
Contributor

medium

The scripts node1.sh and node2.sh are nearly identical, which introduces code duplication and can make maintenance difficult. Consider merging them into a single script that accepts NODE_RANK and MASTER_ADDR as command-line arguments. This would make the example cleaner, more robust, and easier for users to adapt. For example, a single run.sh could be used as bash run.sh <NODE_RANK> <MASTER_ADDR>.
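The merge suggested here could look roughly like the following. This is a minimal sketch, not code from the PR: the file name `run.sh`, the `launch_node` function, the optional third argument for `HCCL_SOCKET_IFNAME`, and the `DRY_RUN` switch are all assumptions introduced for illustration. The environment variables and `megatron sft` flags are copied from node1.sh.

```shell
#!/usr/bin/env bash
# Hypothetical run.sh (illustrative sketch, not part of this PR).
# Atlas A2 * 2 nodes * 8 cards per node
# Usage: bash run.sh <NODE_RANK> <MASTER_ADDR> [HCCL_SOCKET_IFNAME]

launch_node() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run.sh <NODE_RANK> <MASTER_ADDR> [HCCL_SOCKET_IFNAME]" >&2
        return 1
    fi
    node_rank=$1
    master_addr=$2
    # Assumption: the interface comes from the third argument or, failing
    # that, from an already exported HCCL_SOCKET_IFNAME.
    ifname=${3:-${HCCL_SOCKET_IFNAME:-}}
    if [ -z "$ifname" ]; then
        echo "error: pass the network interface name (e.g. eth0)" >&2
        return 1
    fi

    # Assumption: DRY_RUN=1 prints the command instead of running it.
    runner=
    if [ "${DRY_RUN:-0}" = "1" ]; then
        runner=echo
    fi

    # env applies the settings to megatron only, like the VAR=value
    # prefixes in node1.sh and node2.sh.
    $runner env \
        ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
        NNODES=2 \
        NODE_RANK="$node_rank" \
        MASTER_ADDR="$master_addr" \
        MASTER_PORT=29500 \
        NPROC_PER_NODE=8 \
        HCCL_SOCKET_IFNAME="$ifname" \
        megatron sft \
        --model 'Qwen/Qwen3-8B' \
        --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#1000' \
        --save './SAVE' \
        --train_type 'lora' \
        --lora_rank 8 \
        --lora_alpha 32 \
        --target_modules 'all-linear' \
        --tensor_model_parallel_size 2 \
        --pipeline_model_parallel_size 1 \
        --context_parallel_size 1 \
        --sequence_parallel true \
        --micro_batch_size 1 \
        --global_batch_size 64 \
        --recompute_granularity selective \
        --recompute_modules core_attn \
        --cross_entropy_loss_fusion true \
        --no_gradient_accumulation_fusion true \
        --lr 1e-4 \
        --lr_warmup_fraction 0.05 \
        --min_lr 1e-5 \
        --max_epochs 1 \
        --log_interval 5 \
        --num_workers 4
}

# Run only when arguments were supplied, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    launch_node "$@"
fi
```

With this layout the master node would run `bash run.sh 0 <MASTER_ADDR> <IFNAME>` and the second node `bash run.sh 1 <MASTER_ADDR> <IFNAME>`, passing the same master address on both. Running with `DRY_RUN=1` first shows the exact command without starting training.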

33 changes: 33 additions & 0 deletions examples/ascend/multi-node/megatron/node2.sh
@@ -0,0 +1,33 @@
# Atlas A2 * 2 nodes * 8 cards per node

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=2 \
NODE_RANK=1 \
MASTER_ADDR=xxx.xxx.xxx.xxx \
MASTER_PORT=29500 \
NPROC_PER_NODE=8 \
HCCL_SOCKET_IFNAME=xxx \
Contributor

medium

The value xxx for HCCL_SOCKET_IFNAME is a placeholder. It would be helpful to add a comment above this line explaining that this needs to be replaced with the actual network interface name used for communication between nodes (e.g., eth0).

Comment on lines +6 to +9
Copilot AI Jan 8, 2026

The placeholder values 'xxx.xxx.xxx.xxx' for MASTER_ADDR and 'xxx' for HCCL_SOCKET_IFNAME need to be replaced with actual values. Consider adding a comment explaining that users must replace these placeholders with their actual master node IP address and network interface name (e.g., eth0, ens33).

megatron sft \
--model 'Qwen/Qwen3-8B' \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#1000' \
--save './SAVE' \
--train_type 'lora' \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules 'all-linear' \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--context_parallel_size 1 \
--sequence_parallel true \
--micro_batch_size 1 \
--global_batch_size 64 \
--recompute_granularity selective \
--recompute_modules core_attn \
--cross_entropy_loss_fusion true \
--no_gradient_accumulation_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
--log_interval 5 \
--num_workers 4
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,7 +1,7 @@
NPROC_PER_NODE=2 \
ASCEND_RT_VISIBLE_DEVICES=0,1 \
megatron sft \
- --model Qwen/Qwen2.5-7B-Instruct \
+ --model Qwen/Qwen3-8B \
--load_safetensors true \
--save_safetensors true \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
@@ -24,7 +24,7 @@ megatron sft \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
- --save megatron_output/Qwen2.5-7B-Instruct \
+ --save megatron_output/Qwen3-8B \
--save_interval 100 \
--max_length 2048 \
--system 'You are a helpful assistant.' \