
Conversation

@Bruce-x-1997 commented Sep 15, 2025

What does this PR do?

When we run PTQ, we see a warning saying that a call to dist.destroy_process_group() is missing, so I added it.

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Ensures distributed PTQ runs cleanly shut down on completion, preventing hanging processes.
    • Frees process groups across workers to avoid lingering GPU/CPU resource usage.
    • Invokes cleanup when PTQ finishes to improve reliability for multi-GPU workflows and repeated runs.
    • Reduces risk of deadlocks and timeouts when exiting distributed sessions.

@Bruce-x-1997 requested a review from a team as a code owner September 15, 2025 09:44

copy-pr-bot bot commented Sep 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Sep 15, 2025

Walkthrough

Adds a new helper function end_process() to examples/deepseek/ptq.py that reads WORLD_SIZE, RANK, and LOCAL_RANK from the environment and conditionally calls torch.distributed.destroy_process_group(); the function is defined twice in the file and invoked after saving PTQ results.
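
For reference, a minimal sketch of the helper as the walkthrough describes it (reconstructed from this summary, not copied from the diff):

import os

import torch.distributed as dist

def end_process():
    # Read launcher-provided environment variables (torchrun sets these).
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    rank = int(os.getenv("RANK", "0"))  # currently unused; see review below
    local_rank = int(os.getenv("LOCAL_RANK", "0"))  # currently unused; see review below
    # Only tear down when running multi-process and a group exists.
    if world_size > 1 and dist.is_initialized():
        dist.destroy_process_group()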

Changes

Cohort / File(s) Summary
PTQ distributed cleanup updates
examples/deepseek/ptq.py
Added end_process() which reads WORLD_SIZE/RANK/LOCAL_RANK and, when world_size > 1 and torch.distributed.is_initialized(), calls torch.distributed.destroy_process_group(); called after saving quantization results. A duplicate end_process() definition was also introduced.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant PTQ as "PTQ Script\n(examples/deepseek/ptq.py)"
    participant Worker
    participant Dist as "torch.distributed"

    User->>PTQ: run PTQ flow
    PTQ->>Worker: perform calibration / quantization
    Worker-->>PTQ: save quantization results

    rect rgba(224,240,255,0.3)
    note right of PTQ: call end_process()
    PTQ->>PTQ: read WORLD_SIZE/RANK/LOCAL_RANK
    alt world_size > 1 and dist.is_initialized()
        PTQ->>Dist: destroy_process_group()
        Dist-->>PTQ: resources cleaned
    else single-process or not initialized
        note over PTQ: no-op
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

a rabbit taps the code with care,
"read the ranks, then end the fare" 🐇
two twins named end_process appear,
one call to tidy, all workers clear.
hops away with a quantized cheer. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title "add end_process in deepseek ptq" succinctly and accurately summarizes the primary change: adding an end_process call to the DeepSeek PTQ flow (examples/deepseek/ptq.py) to clean up distributed processes. It matches the PR objectives and is concise, specific, and free of extraneous detail.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/deepseek/ptq.py (1)

376-376: Ensure cleanup runs on exceptions: call end_process() from finally

If any step before the current call raises, the process group will leak.

Apply:

-    args = parser.parse_args()
-    model = load_deepseek_model(args.config, args.model_path, args.batch_size)
-    tokenizer = AutoTokenizer.from_pretrained(
-        args.model_path, trust_remote_code=args.trust_remote_code
-    )
-    model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
-    save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
-    end_process()
+    args = parser.parse_args()
+    try:
+        model = load_deepseek_model(args.config, args.model_path, args.batch_size)
+        tokenizer = AutoTokenizer.from_pretrained(
+            args.model_path, trust_remote_code=args.trust_remote_code
+        )
+        model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
+        save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
+    finally:
+        end_process()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76e8ce2 and 1538644.

📒 Files selected for processing (1)
  • examples/deepseek/ptq.py (2 hunks)

Signed-off-by: bruce.xu <[email protected]>

@coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
examples/deepseek/ptq.py (3)

293-295: KeyError risk for LOCAL_RANK; use getenv fallback.

In non‑distributed runs LOCAL_RANK may be unset.

-    if int(os.environ["LOCAL_RANK"]) == 0:
+    if int(os.getenv("LOCAL_RANK", "0")) == 0:

270-272: Guard unconditional dist.barrier() calls — will raise if the default process group isn't initialized

Unconditional dist.barrier() calls found; wrap each call with a guard: if dist.is_available() and dist.is_initialized(): ... (preserve any group argument, e.g. dist.barrier(group) inside the guard).

Occurrences found (update all): modelopt/torch/export/postprocess.py:567,575; modelopt/torch/export/distribute.py:149,173,258,295; examples/cnn_qat/torchvision_qat.py:176; examples/deepseek/ptq.py:271,307

-    dist.barrier()
+    if dist.is_available() and dist.is_initialized():
+        dist.barrier()
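
One way to avoid repeating the guard at every call site is a tiny wrapper, sketched below (safe_barrier is a hypothetical name, not something in the repo; it preserves the optional group argument):

import torch.distributed as dist

def safe_barrier(group=None):
    # No-op when torch.distributed is unavailable or uninitialized,
    # so single-process runs pass straight through.
    if dist.is_available() and dist.is_initialized():
        dist.barrier(group=group)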

299-327: Guard torch.distributed collectives with dist.is_initialized().

dist.barrier(), dist.get_world_size(), and dist.all_gather_object() in examples/deepseek/ptq.py (around lines 299–327) are unguarded and will raise when torch.distributed isn't initialized. Wrap them in if dist.is_available() and dist.is_initialized(): ...; otherwise skip the collectives and have rank 0 write outputs directly (e.g., world_size = 1 and all_quant_configs = [quant_config]), as in the sketch below.
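
A sketch of that guarded shape (gather_and_save, quant_config, and output_path are illustrative names, not the example's real identifiers):

import json

import torch.distributed as dist

def gather_and_save(quant_config, output_path):
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        world_size = dist.get_world_size()
        all_quant_configs = [None] * world_size
        dist.all_gather_object(all_quant_configs, quant_config)
        rank = dist.get_rank()
    else:
        # Single-process fallback: no collectives; rank 0 semantics apply.
        world_size, rank = 1, 0
        all_quant_configs = [quant_config]
    if rank == 0:
        # Only rank 0 writes the merged result (assumes JSON-serializable configs).
        with open(output_path, "w") as f:
            json.dump(all_quant_configs, f)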

♻️ Duplicate comments (1)
examples/deepseek/ptq.py (1)

232-239: Tighten distributed teardown: guard + remove unused vars.

Early-return when not distributed and guard with dist.is_available(); drop unused rank/local_rank.

 def end_process():
-    world_size = int(os.getenv("WORLD_SIZE", "1"))
-    rank = int(os.getenv("RANK", "0"))
-    local_rank = int(os.getenv("LOCAL_RANK", "0"))
-    if world_size > 1:
-        if dist.is_initialized():
-            dist.destroy_process_group()
+    world_size = int(os.getenv("WORLD_SIZE", "1"))
+    if world_size <= 1:
+        return
+    if dist.is_available() and dist.is_initialized():
+        dist.destroy_process_group()
🧹 Nitpick comments (1)
examples/deepseek/ptq.py (1)

369-377: Ensure cleanup runs on exceptions (wrap in try/finally).

This guarantees end_process() executes even if PTQ or saving raises.

     args = parser.parse_args()
-    model = load_deepseek_model(args.config, args.model_path, args.batch_size)
-    tokenizer = AutoTokenizer.from_pretrained(
-        args.model_path, trust_remote_code=args.trust_remote_code
-    )
-    model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
-    save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
-    end_process()
+    try:
+        model = load_deepseek_model(args.config, args.model_path, args.batch_size)
+        tokenizer = AutoTokenizer.from_pretrained(
+            args.model_path, trust_remote_code=args.trust_remote_code
+        )
+        model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
+        save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
+    finally:
+        end_process()
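
Putting the two suggestions together, the tail of the script would look roughly like this (a sketch; run_ptq is a stand-in for the real load/calibrate/save steps shown above):

import os

import torch.distributed as dist

def end_process():
    # Early-return in single-process runs; guard before teardown.
    if int(os.getenv("WORLD_SIZE", "1")) <= 1:
        return
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()

def run_ptq():
    # Stand-in for the real pipeline: load model, calibrate, quantize, save.
    pass

if __name__ == "__main__":
    try:
        run_ptq()
    finally:
        end_process()  # runs even if calibration or saving raises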
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1538644 and ee2a1c7.

📒 Files selected for processing (1)
  • examples/deepseek/ptq.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/deepseek/ptq.py (1)
modelopt/torch/utils/distributed.py (1)
  • world_size (204-206)

@Bruce-x-1997 (Author) commented

@sugunav14 Hello, please help review this PR.
