[Distributed] [model_free_ptq] Eliminate reindexing step via fine-grained parallelized partial reads#2498
Conversation
Code Review
This pull request successfully eliminates the reindex_fused_weights preprocessing step for microscale schemes by making model_free_ptq fusion-aware. The changes are well-structured, introducing new helper functions for file grouping and processing, which improves modularity. The logic for identifying and processing cross-shard fused weights seems correct. I've included a couple of suggestions to reduce code duplication, which would enhance maintainability. Additionally, the bundled changes to AWQModifier are beneficial, improving its flexibility and robustness.
Test Results: 10 passed in 7.62s
/gemini review
Code Review
This pull request introduces a significant improvement by making the model_free_ptq process fusion-aware, which eliminates the need for a manual reindex_fused_weights preprocessing step for microscale schemes. The implementation is well-designed, utilizing a union-find algorithm for efficient file grouping and a new processing function to handle these groups. The code is well-structured and includes comprehensive new tests. My review includes a couple of suggestions to enhance code clarity and maintainability.
Hi @dzhengAP, can you describe at a high level how your changes avoid the reindexing step? It seems like this doesn't handle the case where two processes end up writing to the same save file at the same time, or end up overwriting each other's changes. You also need to make sure that parallel processes don't attempt to read while others are writing.
@kylesayrs Great question! The key here is that group jobs are never parallelized against each other. High-level flow:
So the invariant is: one job = one group = one worker = no concurrent access to the same shard. Does that address your concern, or are there edge cases you're thinking of that I'm missing?
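The grouping invariant above can be sketched with a small union-find; the function names and shard labels below are illustrative, not the PR's actual code.

```python
# Illustrative sketch (not the PR's actual code): a union-find merges shard
# files that share a fused weight set, so each resulting group is handled by
# exactly one worker and no two workers ever touch the same shard.

def find(parent, x):
    # Find the root representative for x, with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_files(fused_sets):
    """fused_sets: lists of file names whose tensors must stay together."""
    parent = {}
    for files in fused_sets:
        for f in files:
            parent.setdefault(f, f)
        for f in files[1:]:
            # Union each file in the set with the first one.
            parent[find(parent, files[0])] = find(parent, f)
    groups = {}
    for f in parent:
        groups.setdefault(find(parent, f), set()).add(f)
    return [sorted(g) for g in groups.values()]

# q/k/v split across shard-1 and shard-2; gate/up across shard-2 and shard-3.
groups = group_files([
    ["shard-1", "shard-2"],
    ["shard-2", "shard-3"],
    ["shard-4"],
])
```

With this grouping, shard-1 through shard-3 collapse into one job while shard-4 stays independent, which is also why pathological fusion patterns can reduce parallelism, as discussed later in the thread.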
@kylesayrs, I validated the concurrency safety claims from my previous comment using both determinism checks and high-concurrency stress testing, to address concerns around race conditions and file access conflicts. I've also committed the test to the repo as a regression safeguard for future concurrency-related changes (e3b7d16).

Experiments

Experiment 1: Determinism Test
Experiment 2: High-Concurrency Stress Test
Test Setup
Results Summary
Conclusion

These results support the invariant:
The fusion-aware file grouping logic properly isolates file access across parallel workers. No race conditions were detected.

Test Script

The validation test script has been added to this branch:
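A minimal sketch of the determinism check described above (illustrative, not the committed test script): hash every output shard with SHA256 and require byte-identical results across runs with different worker counts.

```python
# Illustrative determinism check: if the fusion-aware grouping really
# prevents races, runs with 1, 8, or 16 workers must produce shards whose
# SHA256 digests match exactly.
import hashlib
from pathlib import Path

def digest_outputs(out_dir):
    """Return {file_name: sha256_hex} for every .safetensors file in out_dir."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(out_dir).glob("*.safetensors"))
    }

def runs_are_deterministic(run_dirs):
    """True iff all runs produced byte-identical output shards."""
    baseline, *rest = (digest_outputs(d) for d in run_dirs)
    return all(d == baseline for d in rest)
```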
- Validates fusion-aware file grouping prevents race conditions
- Tests determinism across 1, 8, and 16 workers
- Verifies SHA256 hash consistency under high concurrency
- Supports the 'one job = one group = one worker' invariant

Tested on: 4x CUDA GPUs
Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Scheme: w8a16 (weight-only)
Author: David Zheng (dqzheng1996@gmail.com)

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
The quality checks have failed. Please run `make style && make quality` locally and commit the results.
Hi @kylesayrs, you're right: in theory, union-find could collapse all shards into one group in pathological cases. I did consider parallel reads/writes at the beginning. Here's why I think the current union-find approach is still the right trade-off in practice:
If we're concerned about pathological cases, we could add a hybrid fallback: process large groups with parallel partitions while keeping union-find for the common case. Happy to explore that if you think it's necessary. Does this address your concern, or do you see specific scenarios where the current approach would cause real performance issues?

Let me know what you think!
Hi @dzhengAP, I spoke to @kylesayrs about this. I need to handle reindexing in another flow in cases where a model is quantized to fp8 block but the weight and weight_scale tensors are split across files (my PR code here). I agree re-indexing is a pain and it would be good to eliminate it. I spoke with @kylesayrs about an implementation based on his diagram posted above. If you'd like, I can look into adding it to my PR in a way that works with microscale and fp8 block, or I can rebase against your PR if you'd like to work on this. Just lmk how you'd like to proceed.
Love the sketch! I agree that finer-grained parallelism offers real advantages: more parallelism, better speedup, and lower per-worker memory. I ran some simulations to benchmark and visualize the tradeoffs, and two concerns came up worth discussing (the same ones I was worried about previously):

1. Scale consistency (NVFP4/MXFP4)
2. Redundant tensor reads

Proposed next steps:
if max_group_size < 5 and estimated_redundancy < 100:
    return "fine_grained"  # maximize parallelism
else:
    return "grouping"  # predictable I/O

Please check the simulation results below. What's your take on the scale question?
Thanks @dzhengAP! Happy to review when ready. Regarding your points:
brian-dellabetta left a comment
Thanks @dzhengAP, one request on the process and validate job signatures, to keep them consistent for standard vs. microscale. No other comments beyond that; hopefully we can get this in soon. Thanks for all the work on this!
Updated per review from @kylesayrs and @brian-dellabetta:

microscale.py:
process.py:
__init__.py:
helpers.py:
save_utils.py / __init__.py:
tests:
brian-dellabetta left a comment
I committed some changes to clean up the top level functions, so we have a single function to build jobs, and validate/process jobs have the same file signature regardless of whether they are standard or microscale jobs. @dzhengAP reviewed and tests are passing. This should be good to merge in now, and I will tackle the TODOs in my follow-up PR #2491
…job signatures

Per review from @kylesayrs and @brian-dellabetta:

microscale.py:
- Refactor DEFAULT_FUSED_MAPPINGS to {primary_pattern: [partner_templates]} so only the primary-owning shard fetches its partners, preventing double reads for cross-shard fused weight sets
- build_inverse_weights_map uses re.match with named group substitution to construct partner names exactly as Kyle suggested

process.py:
- Unified signature for validate_file, process_file, process_file_microscale_scheme: (inverse_weights_map, save_path, scheme, ignore, device, converter)
- All functions use safe_open for true partial reads
- Remove assert on unmatched fused sets — non-primary shards legitimately have incomplete sets (k/v without q)

__init__.py:
- Single _get_weights_map() helper handles both single-file and multi-file models
- Single _build_quantization_jobs() with identical tuple structure for all jobs
- Fix import path for compressed-tensors dev version

helpers.py / validate.py:
- Remove build_inverse_weights_map (moved to microscale.py)
- Update validate.py to reflect inverse_weights_map approach

tests:
- Update test signatures for new inverse_weights_map interface

Closes vllm-project#2497

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
kylesayrs left a comment
Nice job, thanks for being open to suggestions!
SUMMARY: Follow-up to #2498 and pre-cursor to landing #2491. This PR cleans up a few things:
- [x] Use the same function signature for building standard jobs, microscale jobs, and validation jobs. These will be needed in #2491.
- [x] Renamed microscale-specific `build_inverse_weights_map` -> `build_microscale_inverse_weights_map` because other reindexing logic will need different functionality when determining fused tensors.
- [x] Prunes unused `_get_all_tensor_names`
- [x] Breaks out loading logic for inverse_weights_map to a helper that can be moved to CT in follow-up #2491

TEST PLAN: No net new functionality; if all tests pass, should be good to go

---------

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: David Zheng <dqzheng1996@gmail.com>
Co-authored-by: David Zheng <153074367+dzhengAP@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ld_inverse_weights_map (#2546)

## Purpose
Follow-up to #2498 addressing review comments from @kylesayrs.

## Changes
- `partner_shard is None`: log warning instead of silent skip — indicates unexpected model architecture where the expected partner tensor doesn't exist
- `partner_resolved is None`: raise ValueError instead of silent skip — indicates a corrupt or incomplete checkpoint and should surface as an error

## Notes
These cases were flagged as defensive guards that either shouldn't exist or should error loudly. The warning approach for partner_shard=None handles the edge case of non-standard model architectures gracefully while still surfacing the issue.

## Test
Unit tests passed 35/35

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
Signed-off-by: root <root@bolt-6jxv69gfv8-tqazfcpmhh.bolt-pods.turi-bolt.svc.kube.us-east-1d.k8s.cloud.apple.com>
Co-authored-by: root <root@bolt-6jxv69gfv8-tqazfcpmhh.bolt-pods.turi-bolt.svc.kube.us-east-1d.k8s.cloud.apple.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>






Purpose

Eliminates the `reindex_fused_weights` preprocessing step for microscale schemes (NVFP4, MXFP4) by enabling each shard to be processed independently with full parallelism, even when fused weight sets (q/k/v, gate/up) span multiple shards.

Approach

Instead of grouping shards together (which reduces parallelism), each shard process fetches only the specific fused partner tensors it needs from other shards via targeted partial safetensors reads, computes the fused global scale locally, and writes only its own output shard. No cross-process coordination or file locking is required.
Changes

helpers.py

Added `build_tensor_file_index()` — reads `index.json` once at startup and builds a flat mapping of `tensor_name → resolved_file_path`. This gives each worker process an O(1) lookup to find which file contains any fused partner tensor, without re-scanning headers at runtime.
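As a rough sketch of what such an index builder could look like (the body is an assumption; only the standard HF `model.safetensors.index.json` layout with its `weight_map` field is relied on):

```python
# Rough sketch of a tensor -> file index in the spirit of
# build_tensor_file_index(). The HF index file maps tensor names to shard
# file names under "weight_map"; resolving them against the model directory
# gives every worker an O(1) lookup for any fused partner.
import json
from pathlib import Path

def build_tensor_file_index(model_dir):
    """Map tensor_name -> resolved shard path, read once at startup."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    return {name: model_dir / fname for name, fname in index["weight_map"].items()}
```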
process.py

Updated `process_file_microscale_scheme()` with an optional `tensor_file_index` parameter. When provided:
- `_fetch_fused_partners()` is called to identify any fused set members missing from the current shard, then fetches only those specific tensors via partial safetensors reads (headers + target tensors only, not full files)
- `_belongs_to_shard()` ensures only native tensors are written to the output shard — fetched partner tensors are used for scale computation only and never written to the wrong shard
__init__.py

Simplified back to one job per shard — full parallelism restored. For microscale schemes, builds the `tensor_file_index` once from `index.json` and passes it to each job. No union-find, no grouping logic needed.
validate.py

Removed the `NotImplementedError` for cross-shard fused weights — the case is now handled natively. Replaced with a `logger.debug` noting that partner tensors will be resolved via partial reads.
Latest Updates: Eliminate reindexing step via inverse_weights_map with unified job signatures

Approach

Each shard job receives a precomputed `inverse_weights_map` specifying exactly which tensors to load from which files. For cross-shard fused weights, only the shard owning the primary tensor (q_proj, gate_proj) fetches its partners — preventing double reads. All jobs share a unified signature for both standard and microscale schemes.
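For illustration, an `inverse_weights_map` with this behavior might look like the following (the exact structure is an assumption inferred from the description above, with hypothetical shard and tensor names):

```python
# Illustrative shape of an inverse_weights_map for one shard job. Keys are
# the files to open, values are the tensors to load from each: the job's own
# shard plus the targeted cross-shard partner reads the primary owner needs.
inverse_weights_map = {
    "model-00001-of-00002.safetensors": [
        "model.layers.0.self_attn.q_proj.weight",  # primary, native to this job
    ],
    "model-00002-of-00002.safetensors": [
        "model.layers.0.self_attn.k_proj.weight",  # partner, partial read
        "model.layers.0.self_attn.v_proj.weight",  # partner, partial read
    ],
}
```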
Changes

microscale.py
- `DEFAULT_FUSED_MAPPINGS` refactored from a list of lists to `{primary_pattern: [partner_templates]}` — only the primary-owning shard fetches its partners, preventing double reads for cross-shard fused weights
- `build_inverse_weights_map()` lives here — uses a regex match on primary patterns to construct partner names and locate them in other shards
process.py
- Unified signature for `validate_file`, `process_file`, and `process_file_microscale_scheme`: `(inverse_weights_map, save_path, scheme, ignore, device, converter)`
- All functions use `safe_open` + `f.get_tensor()` for true partial reads
- Updates the safetensors index to reflect new tensor locations
__init__.py
- Single `_get_weights_map()` helper handles both single-file and multi-file models (reads `safetensors.index.json` or scans file headers via `safe_open`)
- Single `_build_quantization_jobs()` replaces separate standard/microscale builders — one job per shard with identical tuple structure for both
- Jobs unpacked via `*job[1:]` for future-proofing

helpers.py
- Removed `build_weights_map` and `build_inverse_weights_map` (moved to `microscale.py`)

validate.py
- Removed `NotImplementedError` for cross-shard fused weights — handled natively by the `inverse_weights_map`-based approach

Testing
- `pytest tests/llmcompressor/entrypoints/model_free/` — all passing locally
- `make style && make quality` — all checks pass

Signed-off-by: David Zheng dqzheng1996@gmail.com
Closes #2497
Related to #2448