updated README.md

sugunav14 · sugunav14 · commit 91a95052e4db · 2025-10-14T18:31:37.000Z
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
diff --git a/examples/llm_ptq/README.md b/examples/llm_ptq/README.md
@@ -235,6 +235,38 @@ with init_quantized_weights(mtq.NVFP4_DEFAULT_CFG):
 mtq.calibrate(model, algorithm="max", forward_loop=calibrate_loop)
 ```
 
+## Multi-Node Post-Training Quantization with FSDP2
+
+ModelOpt supports quantizing LLMs across multiple GPU nodes in a range of quantization formats. It uses Hugging Face's Accelerate library and FSDP2 to shard the model and run calibration in a distributed fashion.
+
+### Usage
+
+For distributed execution across multiple nodes, use the `accelerate` library. A template configuration file (`fsdp2.yaml`) is provided and can be customized based on your specific requirements.
+
+On each node, run the following command:
+
+```bash
+accelerate launch --config_file fsdp2.yaml \
+    --num_machines=<num_nodes> \
+    --machine_rank=<current_node_rank> \
+    --main_process_ip=<node0_ip_addr> \
+    --main_process_port=<port> \
+    --fsdp_transformer_layer_cls_to_wrap=<decoder_layer_name> \
+    multinode-ptq.py \
+    --pyt_ckpt_path <path_to_model> \
+    --qformat <fp8/nvfp4/nvfp4_awq/int4_awq/int8_sq> \
+    --kv_cache_quant <fp8/nvfp4/nvfp4_affine/none> \
+    --batch_size <calib_batch_size> \
+    --calib-size <no_calib_samples> \
+    --dataset <dataset> \
+    --export_path <export_path> \
+    --trust_remote_code
+```
+
+The exported checkpoint can be deployed with TensorRT-LLM, vLLM, or SGLang. For more details, refer to the [deployment section](#deployment) of this document.
+
+> *Performance Note: FSDP2 is designed for training workloads and may result in longer calibration and export times. For faster calibration, maximize the batch size based on available GPU memory.*
+
 ## Framework Scripts
 
 ### Hugging Face Example [Script](./scripts/huggingface_example.sh)