
DDP PoC #19

Closed
yiliu30 wants to merge 22 commits into main from ddp

Conversation

@yiliu30
Owner

@yiliu30 yiliu30 commented Feb 5, 2026

SUMMARY:
"please provide a brief summary"

TEST PLAN:
"please outline how the changes were tested"

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@github-actions

github-actions bot commented Feb 5, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@yiliu30 yiliu30 changed the title from Ddp to DDP PoC on Feb 5, 2026
@gemini-code-assist

Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces Distributed Data Parallel (DDP) capabilities to the AutoRound quantization process within the llmcompressor library. This enhancement allows for more efficient quantization of large language models by distributing the computational workload across multiple GPUs. The changes include a new example script showcasing this functionality, modifications to the data loading utilities for DDP compatibility, and the addition of debugging and device management tools to support distributed operations.

Highlights

  • Distributed Data Parallel (DDP) Integration: Implemented DDP support for the AutoRound quantization modifier, enabling multi-GPU quantization of large language models.
  • New DDP Example Script: Added experimental/ddp/ddp_qwen3_example.py to demonstrate how to perform AutoRound quantization using DDP for Qwen3 models, including DDP setup, model loading, and quantization execution.
  • Distributed Data Sampling: Modified the dataset utility in src/llmcompressor/datasets/utils.py to automatically use torch.utils.data.distributed.DistributedSampler when DDP is active, ensuring calibration datasets are correctly distributed across processes.
  • DDP Debugging and Device Management Utilities: Introduced ForkedPdb for debugging multiprocessing children and helper functions like rank_log and check_device for DDP-aware logging. The apply_autoround method was updated to explicitly move the model to the correct GPU device for each DDP process.
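The DDP wiring described in these highlights can be sketched as a minimal per-process setup/teardown pair for a torchrun-launched script. This is an illustrative sketch, not the PR's code: the names setup_ddp and cleanup_ddp are assumptions.

```python
# Minimal sketch of per-process DDP setup/teardown for a torchrun-launched
# script; function names (setup_ddp, cleanup_ddp) are illustrative.
import os

import torch
import torch.distributed as dist


def setup_ddp() -> int:
    """Join the default process group and pin this process to one GPU."""
    rank = int(os.environ["RANK"])              # injected by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)    # reads MASTER_ADDR/PORT from env
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)       # one GPU per process
    return rank


def cleanup_ddp() -> None:
    """Tear down the process group at the end of the run."""
    dist.destroy_process_group()
```

A launch such as `torchrun --nproc_per_node=8 script.py` would spawn one such process per GPU, with RANK, LOCAL_RANK, and WORLD_SIZE set by the launcher.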
Changelog
  • examples/quantization_w4a4_fp4/llama3_example.py
    • Temporarily updated MODEL_ID to a local Qwen3 model path for testing purposes.
  • experimental/ddp/README.md
    • Added a new README file documenting the DDP + AutoRound Proof of Concept, its current status, and outlining future development steps.
  • experimental/ddp/ddp_qwen3_example.py
    • Added a new example script that demonstrates DDP-enabled AutoRound quantization for Qwen3 models.
    • Includes functions for DDP setup and cleanup, deterministic configuration, model and tokenizer loading, calibration dataset preparation, and applying the AutoRound modifier.
    • Handles model saving and sample generation specifically on rank 0, while other ranks perform quantization without saving.
  • src/llmcompressor/datasets/utils.py
    • Added an is_ddp function to check if a Distributed Data Parallel environment is active.
    • Modified the _make_sampler function to utilize torch.utils.data.distributed.DistributedSampler when DDP is detected, ensuring proper data distribution for calibration.
  • src/llmcompressor/modifiers/autoround/base.py
    • Introduced ForkedPdb, a pdb subclass for debugging in forked multiprocessing environments.
    • Added rank_log and check_device utility functions for logging messages and model device information in a DDP-aware manner.
    • Updated the apply_autoround method to explicitly move the wrapped_model to the current GPU rank, ensuring correct device placement during multi-GPU quantization.
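The sampler selection described above for src/llmcompressor/datasets/utils.py can be sketched as follows. This is a reconstruction under the assumption that the helper simply branches on an active process group; the PR's actual is_ddp and _make_sampler may differ in signature and detail.

```python
# Sketch of DDP-aware sampler selection: shard the calibration set with
# DistributedSampler when a process group is active, otherwise fall back
# to an ordinary sampler. Illustrative only; not the PR's exact code.
import torch.distributed as dist
from torch.utils.data import Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler


def is_ddp() -> bool:
    """True when torch.distributed has an initialized process group."""
    return dist.is_available() and dist.is_initialized()


def make_sampler(dataset: Dataset, shuffle: bool = True):
    if is_ddp():
        # Each rank iterates a disjoint 1/world_size shard of the dataset,
        # so no calibration sample is processed twice across processes.
        return DistributedSampler(dataset, shuffle=shuffle)
    return RandomSampler(dataset) if shuffle else SequentialSampler(dataset)
```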
Activity
  • The author yiliu30 created this pull request with a minimal title 'Ddp' and placeholder descriptions for the summary and test plan.
  • The changes introduce significant new functionality for Distributed Data Parallel (DDP) support in AutoRound quantization.
  • The experimental/ddp/README.md indicates that a Proof of Concept (PoC) implementation is complete and functional, with clear next steps for further development and verification.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a proof-of-concept for Distributed Data Parallel (DDP) support, primarily for AutoRound quantization. It adds a new DDP example, modifies dataset utilities to support distributed sampling, and updates the AutoRound modifier to be DDP-aware.

My review focuses on cleaning up the proof-of-concept code. I've identified several instances of dead or commented-out code in the new example files that should be removed for clarity. I also suggest replacing a print statement with a proper logger call in the AutoRoundModifier and removing an unused debugging utility class. These changes will improve the code's readability and maintainability as it moves from a PoC to a more permanent feature.

Comment on lines +8 to +9
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

MODEL_ID = "/storage/yiliu7/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8"


medium

The MODEL_ID variable is defined on line 8 and then immediately overwritten on line 9. The first assignment is now dead code and can be confusing. Please remove the unused assignment.

Suggested change:
- MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
- MODEL_ID = "/storage/yiliu7/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8"
+ MODEL_ID = "/storage/yiliu7/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8"

Comment on lines +14 to +15
model_id = "Qwen/Qwen3-235B-A22B"
model_id = "Qwen/Qwen3-8B"


medium

The model_id variable is defined on line 14 and then immediately overwritten on line 15. The first assignment is dead code. Please remove it to improve clarity.

Suggested change:
- model_id = "Qwen/Qwen3-235B-A22B"
- model_id = "Qwen/Qwen3-8B"
+ model_id = "Qwen/Qwen3-8B"

Comment on lines +150 to +155
# except Exception as e:
# logger.info(f"[Rank {rank}] Error during quantization: {e}")
# raise

# finally:
# # Cleanup DDP


medium

This commented-out try-except-finally block appears to be leftover from debugging. It should be removed to clean up the code.

Comment on lines +210 to +218
# Parse scheme from string if needed
from auto_round import schemes as ar_schemes

scheme_map = {
    "FP8_STATIC": ar_schemes.FP8_STATIC,
    "MXFP8": ar_schemes.MXFP8,
    "MXFP4": ar_schemes.MXFP4,
}
# scheme = scheme_map.get(args.scheme, args.scheme)


medium

This block for parsing the scheme seems to be unused as the line that would use scheme_map is commented out. This makes the import and the map itself dead code. Please remove this block if it's not needed.

Comment on lines +220 to +223
# # Check if running with torchrun
# if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
#     logger.info("Detected torchrun environment")
#     main_torchrun(model_name, scheme, args.iters, args.nsamples)


medium

This commented-out block for handling torchrun is unused. It should be removed to improve code clarity.

Comment on lines +37 to +54
import pdb
import sys


class ForkedPdb(pdb.Pdb):
    """A Pdb subclass that may be used
    from a forked multiprocessing child
    """

    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            sys.stdin = open("/dev/stdin")
            pdb.Pdb.interaction(self, *args, **kwargs)
        finally:
            sys.stdin = _stdin



medium

The ForkedPdb class and its imports (pdb, sys) are defined but not used within this file, making it dead code. This appears to be a debugging utility and should be removed from the production codebase.

kwargs["device_map"] = (
    f"cuda:{rank}" if torch.cuda.is_available() else "cpu"
)
print(f"AutoRoundModifier: moving wrapped_model to {kwargs['device_map']}", flush=True)


medium

A print statement is used for logging. It's better to use the existing logging framework for consistency and better control over log levels and output. Please replace this with a call to rank_log.

Suggested change:
- print(f"AutoRoundModifier: moving wrapped_model to {kwargs['device_map']}", flush=True)
+ rank_log(f"AutoRoundModifier: moving wrapped_model to {kwargs['device_map']}")
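For reference, a rank_log-style helper of the kind the reviewer suggests might look like the following. The signature is an assumption, since the PR's rank_log implementation isn't shown in this excerpt, and the standard logging module stands in for whatever logger llm-compressor actually uses.

```python
# Hedged sketch of a rank-aware logging helper; rank_log's real signature
# in the PR may differ. Falls back to rank 0 when no process group exists.
import logging

import torch.distributed as dist

logger = logging.getLogger(__name__)


def rank_log(msg: str) -> None:
    """Log msg with a [Rank N] prefix so interleaved DDP output is readable."""
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    logger.info("[Rank %d] %s", rank, msg)
```

Routing through a logger keeps rank-tagged messages subject to normal level filtering and handler configuration, unlike a bare print.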

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30 yiliu30 closed this Feb 5, 2026
