
Conversation

@Bruce-x-1997 commented Sep 15, 2025

What does this PR do?

When we run PTQ, we see a warning saying that a call to dist.destroy_process_group() is missing, so I added it.

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Ensures distributed PTQ runs cleanly shut down on completion, preventing hanging processes.
    • Frees process groups across workers to avoid lingering GPU/CPU resource usage.
    • Invokes cleanup when PTQ finishes to improve reliability for multi-GPU workflows and repeated runs.
    • Reduces risk of deadlocks and timeouts when exiting distributed sessions.

@Bruce-x-1997 requested a review from a team as a code owner September 15, 2025 09:44

copy-pr-bot bot commented Sep 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Sep 15, 2025

Walkthrough

Adds a new helper function end_process() to examples/deepseek/ptq.py that reads WORLD_SIZE, RANK, and LOCAL_RANK from the environment and conditionally calls torch.distributed.destroy_process_group(); the function is defined twice in the file and invoked after saving PTQ results.
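
For reference, a minimal sketch of the helper as the walkthrough describes it (reconstructed from this summary, not copied from the diff):

import os

import torch.distributed as dist

def end_process():
    # Read launcher-provided environment variables (torchrun sets these).
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    rank = int(os.getenv("RANK", "0"))  # currently unused; see review below
    local_rank = int(os.getenv("LOCAL_RANK", "0"))  # currently unused; see review below
    # Only tear down when running multi-process and a group exists.
    if world_size > 1 and dist.is_initialized():
        dist.destroy_process_group()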

Changes

Cohort / File(s) Summary
PTQ distributed cleanup updates
examples/deepseek/ptq.py
Added end_process() which reads WORLD_SIZE/RANK/LOCAL_RANK and, when world_size > 1 and torch.distributed.is_initialized(), calls torch.distributed.destroy_process_group(); called after saving quantization results. A duplicate end_process() definition was also introduced.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant PTQ as "PTQ Script\n(examples/deepseek/ptq.py)"
    participant Worker
    participant Dist as "torch.distributed"

    User->>PTQ: run PTQ flow
    PTQ->>Worker: perform calibration / quantization
    Worker-->>PTQ: save quantization results

    rect rgba(224,240,255,0.3)
    note right of PTQ: call end_process()
    PTQ->>PTQ: read WORLD_SIZE/RANK/LOCAL_RANK
    alt world_size > 1 and dist.is_initialized()
        PTQ->>Dist: destroy_process_group()
        Dist-->>PTQ: resources cleaned
    else single-process or not initialized
        note over PTQ: no-op
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

a rabbit taps the code with care,
"read the ranks, then end the fare" 🐇
two twins named end_process appear,
one call to tidy, all workers clear.
hops away with a quantized cheer. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title "add end_process in deepseek ptq" succinctly and accurately summarizes the primary change: adding an end_process call to the DeepSeek PTQ flow (examples/deepseek/ptq.py) to clean up distributed processes. It matches the PR objectives and is concise, specific, and free of extraneous detail.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/deepseek/ptq.py (1)

376-376: Ensure cleanup runs on exceptions: call end_process() from finally

If any step before the current call raises, the process group will leak.

Apply:

-    args = parser.parse_args()
-    model = load_deepseek_model(args.config, args.model_path, args.batch_size)
-    tokenizer = AutoTokenizer.from_pretrained(
-        args.model_path, trust_remote_code=args.trust_remote_code
-    )
-    model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
-    save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
-    end_process()
+    args = parser.parse_args()
+    try:
+        model = load_deepseek_model(args.config, args.model_path, args.batch_size)
+        tokenizer = AutoTokenizer.from_pretrained(
+            args.model_path, trust_remote_code=args.trust_remote_code
+        )
+        model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
+        save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
+    finally:
+        end_process()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76e8ce2 and 1538644.

📒 Files selected for processing (1)
  • examples/deepseek/ptq.py (2 hunks)

Signed-off-by: bruce.xu <[email protected]>

@coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
examples/deepseek/ptq.py (3)

293-295: KeyError risk for LOCAL_RANK; use getenv fallback.

In non‑distributed runs LOCAL_RANK may be unset.

-    if int(os.environ["LOCAL_RANK"]) == 0:
+    if int(os.getenv("LOCAL_RANK", "0")) == 0:

270-272: Guard unconditional dist.barrier() calls — will raise if the default process group isn't initialized

Unconditional dist.barrier() calls found; wrap each call with a guard: if dist.is_available() and dist.is_initialized(): ... (preserve any group argument, e.g. dist.barrier(group) inside the guard).

Occurrences found (update all): modelopt/torch/export/postprocess.py:567,575; modelopt/torch/export/distribute.py:149,173,258,295; examples/cnn_qat/torchvision_qat.py:176; examples/deepseek/ptq.py:271,307

-    dist.barrier()
+    if dist.is_available() and dist.is_initialized():
+        dist.barrier()
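
One way to avoid repeating the guard at every call site is a tiny wrapper, sketched below (safe_barrier is a hypothetical name, not something in the repo; it preserves the optional group argument):

import torch.distributed as dist

def safe_barrier(group=None):
    # No-op when torch.distributed is unavailable or uninitialized,
    # so single-process runs pass straight through.
    if dist.is_available() and dist.is_initialized():
        dist.barrier(group=group)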

299-327: Guard torch.distributed collectives with dist.is_initialized().

dist.barrier(), dist.get_world_size(), and dist.all_gather_object() in examples/deepseek/ptq.py (around lines 299–327) are unguarded and will raise when torch.distributed isn't initialized. Wrap them in if dist.is_available() and dist.is_initialized(): ...; otherwise skip the collectives and have rank 0 write outputs directly (e.g., world_size = 1 and all_quant_configs = [quant_config]), as in the sketch below.
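
A sketch of that guarded shape (gather_and_save, quant_config, and output_path are illustrative names, not the example's real identifiers):

import json

import torch.distributed as dist

def gather_and_save(quant_config, output_path):
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        world_size = dist.get_world_size()
        all_quant_configs = [None] * world_size
        dist.all_gather_object(all_quant_configs, quant_config)
        rank = dist.get_rank()
    else:
        # Single-process fallback: no collectives; rank 0 semantics apply.
        world_size, rank = 1, 0
        all_quant_configs = [quant_config]
    if rank == 0:
        # Only rank 0 writes the merged result (assumes JSON-serializable configs).
        with open(output_path, "w") as f:
            json.dump(all_quant_configs, f)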

♻️ Duplicate comments (1)
examples/deepseek/ptq.py (1)

232-239: Tighten distributed teardown: guard + remove unused vars.

Early-return when not distributed and guard with dist.is_available(); drop unused rank/local_rank.

 def end_process():
-    world_size = int(os.getenv("WORLD_SIZE", "1"))
-    rank = int(os.getenv("RANK", "0"))
-    local_rank = int(os.getenv("LOCAL_RANK", "0"))
-    if world_size > 1:
-        if dist.is_initialized():
-            dist.destroy_process_group()
+    world_size = int(os.getenv("WORLD_SIZE", "1"))
+    if world_size <= 1:
+        return
+    if dist.is_available() and dist.is_initialized():
+        dist.destroy_process_group()
🧹 Nitpick comments (1)
examples/deepseek/ptq.py (1)

369-377: Ensure cleanup runs on exceptions (wrap in try/finally).

This guarantees end_process() executes even if PTQ or saving raises.

     args = parser.parse_args()
-    model = load_deepseek_model(args.config, args.model_path, args.batch_size)
-    tokenizer = AutoTokenizer.from_pretrained(
-        args.model_path, trust_remote_code=args.trust_remote_code
-    )
-    model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
-    save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
-    end_process()
+    try:
+        model = load_deepseek_model(args.config, args.model_path, args.batch_size)
+        tokenizer = AutoTokenizer.from_pretrained(
+            args.model_path, trust_remote_code=args.trust_remote_code
+        )
+        model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
+        save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
+    finally:
+        end_process()
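
Putting the two suggestions together, the tail of the script would look roughly like this (a sketch; run_ptq is a stand-in for the real load/calibrate/save steps shown above):

import os

import torch.distributed as dist

def end_process():
    # Early-return in single-process runs; guard before teardown.
    if int(os.getenv("WORLD_SIZE", "1")) <= 1:
        return
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()

def run_ptq():
    # Stand-in for the real pipeline: load model, calibrate, quantize, save.
    pass

if __name__ == "__main__":
    try:
        run_ptq()
    finally:
        end_process()  # runs even if calibration or saving raises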
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1538644 and ee2a1c7.

📒 Files selected for processing (1)
  • examples/deepseek/ptq.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/deepseek/ptq.py (1)
modelopt/torch/utils/distributed.py (1)
  • world_size (204-206)

@Bruce-x-1997 (Author) commented

@sugunav14 Hello, please help review this PR.
