
[Distributed] [model_free_ptq] Eliminate reindexing step via fine-grained parallelized partial reads#2498

Merged
dsikka merged 1 commit into vllm-project:main from dzhengAP:model-free-ptq-runtime-optimization on Mar 30, 2026

Conversation

@dzhengAP
Contributor

@dzhengAP dzhengAP commented Mar 20, 2026

Purpose

Eliminates the reindex_fused_weights preprocessing step for microscale
schemes (NVFP4, MXFP4) by enabling each shard to be processed independently
with full parallelism, even when fused weight sets (q/k/v, gate/up) span
multiple shards.

Approach

Instead of grouping shards together (which reduces parallelism), each shard
process fetches only the specific fused partner tensors it needs from other
shards via targeted partial safetensors reads, computes the fused global
scale locally, and writes only its own output shard. No cross-process
coordination or file locking required.

Changes

helpers.py

Added build_tensor_file_index() — reads index.json once at startup and
builds a flat mapping of tensor_name → resolved_file_path. This gives each
worker process an O(1) lookup to find which file contains any fused partner
tensor, without re-scanning headers at runtime.
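A minimal sketch of what such an index builder could look like, assuming the standard Hugging Face `model.safetensors.index.json` layout (the function body is illustrative, not the PR's exact code):

```python
import json
from pathlib import Path

def build_tensor_file_index(model_dir: str) -> dict:
    """Build a flat tensor_name -> resolved shard path mapping.

    Assumes the standard Hugging Face sharded layout, where
    model.safetensors.index.json carries a "weight_map" of
    tensor_name -> relative file name. Reading it once up front
    gives each worker an O(1) lookup at runtime.
    """
    root = Path(model_dir)
    with open(root / "model.safetensors.index.json") as f:
        index = json.load(f)
    return {name: root / fname for name, fname in index["weight_map"].items()}
```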

process.py

Updated process_file_microscale_scheme() with an optional
tensor_file_index parameter. When provided:

  • _fetch_fused_partners() is called to identify any fused set members
    missing from the current shard, then fetches only those specific tensors
    via partial safetensors reads (headers + target tensors only, not full files)
  • Fused global scale is computed locally using all members of the fused set
  • _belongs_to_shard() ensures only native tensors are written to the output
    shard — fetched partner tensors are used for scale computation only and
    never written to the wrong shard
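The partner-identification step in the bullets above can be sketched in plain Python. The PR only names `_fetch_fused_partners`, so the fused sets and helper name below are illustrative assumptions:

```python
# Illustrative fused weight sets (q/k/v, gate/up), not the PR's exact config.
FUSED_SETS = [
    ["q_proj", "k_proj", "v_proj"],
    ["gate_proj", "up_proj"],
]

def missing_fused_partners(local_tensors, tensor_file_index):
    """Return {tensor_name: file_path} for fused-set members that live
    in other shards and must be fetched via targeted partial reads.
    Hypothetical sketch of the partner-identification logic only."""
    local = set(local_tensors)
    needed = {}
    for name in local:
        for fused in FUSED_SETS:
            for member in fused:
                if member in name:
                    i = name.index(member)
                    prefix, suffix = name[:i], name[i + len(member):]
                    for partner in fused:
                        partner_name = prefix + partner + suffix
                        if partner_name not in local and partner_name in tensor_file_index:
                            needed[partner_name] = tensor_file_index[partner_name]
    return needed
```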

__init__.py

Simplified back to one job per shard — full parallelism restored. For
microscale schemes, builds the tensor_file_index once from index.json
and passes it to each job. No union-find, no grouping logic needed.

validate.py

Removed NotImplementedError for cross-shard fused weights — the case is
now handled natively. Replaced with logger.debug noting that partner
tensors will be resolved via partial reads.

Latest Updates: Eliminate reindexing step via inverse_weights_map with unified job signatures

Approach

Each shard job receives a precomputed inverse_weights_map specifying exactly
which tensors to load from which files. For cross-shard fused weights, only the
shard owning the primary tensor (q_proj, gate_proj) fetches its partners —
preventing double reads. All jobs share a unified signature for both standard
and microscale schemes.

Changes

microscale.py

  • Refactor DEFAULT_FUSED_MAPPINGS from a list of lists to
    {primary_pattern: [partner_templates]} — only the primary-owning shard
    fetches its partners, preventing double reads for cross-shard fused weights
  • Move build_inverse_weights_map() here — uses regex match on primary
    patterns to construct partner names and locate them in other shards
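A sketch of how the refactored mapping and the regex-based partner construction could look (patterns, templates, and helper name are illustrative assumptions, not the PR's exact code):

```python
import re

# Hypothetical shape of the refactored mapping: only the shard owning the
# "primary" tensor (q_proj, gate_proj) fetches its partners, so each fused
# set is fetched exactly once.
DEFAULT_FUSED_MAPPINGS = {
    r"^(?P<prefix>.*)\.q_proj\.(?P<suffix>.*)$": [
        "{prefix}.k_proj.{suffix}",
        "{prefix}.v_proj.{suffix}",
    ],
    r"^(?P<prefix>.*)\.gate_proj\.(?P<suffix>.*)$": [
        "{prefix}.up_proj.{suffix}",
    ],
}

def partner_names(tensor_name):
    """Match a primary tensor name against the primary patterns and build
    its fused partner names via named-group substitution. Returns [] for
    non-primary tensors (k/v, up), which never trigger fetches."""
    for pattern, templates in DEFAULT_FUSED_MAPPINGS.items():
        m = re.match(pattern, tensor_name)
        if m:
            return [t.format(**m.groupdict()) for t in templates]
    return []
```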

process.py

  • Unified signature for validate_file, process_file, and
    process_file_microscale_scheme:
    (inverse_weights_map, save_path, scheme, ignore, device, converter)
  • All functions use safe_open + f.get_tensor() for true partial reads
  • Partner tensors re-saved into requesting shard's output; caller updates
    safetensors index to reflect new locations
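The real code uses safetensors' `safe_open` + `get_tensor` for partial reads. To show what that buys under the hood, here is a dependency-free sketch against the safetensors file format (an 8-byte little-endian header length, a JSON header with per-tensor `data_offsets` relative to the end of the header): only the header plus the requested tensors' byte ranges are read, never the full shard. `fetch_tensors` is an illustrative name:

```python
import json
import struct
import numpy as np

def fetch_tensors(file_path, names):
    """Targeted partial read of a .safetensors shard.

    Parses the JSON header, then seeks to and reads only the requested
    tensors' byte ranges. Illustrative stand-in for
    safetensors.safe_open(...) + f.get_tensor(name).
    """
    dtypes = {"F32": np.float32, "F16": np.float16, "I64": np.int64}
    out = {}
    with open(file_path, "rb") as f:
        (hlen,) = struct.unpack("<Q", f.read(8))   # 8-byte LE header length
        header = json.loads(f.read(hlen))          # JSON table of tensors
        data_start = 8 + hlen
        for name in names:
            info = header[name]
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)             # jump straight to the tensor
            buf = f.read(end - begin)
            out[name] = np.frombuffer(buf, dtype=dtypes[info["dtype"]]).reshape(info["shape"])
    return out
```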

__init__.py

  • Single _get_weights_map() helper handles both single-file and multi-file
    models (reads safetensors.index.json or scans file headers via safe_open)
  • Single _build_quantization_jobs() replaces separate standard/microscale
    builders — one job per shard with identical tuple structure for both
  • Validate jobs use *job[1:] so validate and process jobs share the same
    tuple structure

helpers.py

  • Removed build_weights_map and build_inverse_weights_map (moved to
    microscale.py)

validate.py

  • Removed NotImplementedError for cross-shard fused weights — handled natively
  • Updated to reflect inverse_weights_map-based approach

Testing

  • pytest tests/llmcompressor/entrypoints/model_free/ — all passing locally
  • make style && make quality — all checks pass

Signed-off-by: David Zheng dqzheng1996@gmail.com

Closes #2497
Related to #2448

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the model_free_ptq quantization process for microscale schemes by integrating fusion-aware logic directly into job scheduling. This enhancement removes the previous requirement for a separate reindex_fused_weights preprocessing step, streamlining the workflow for models with fused weight sets split across multiple safetensors shards. The changes enable more efficient and accurate quantization by ensuring that all components of a fused weight set are processed together, even when distributed across different files. Additionally, the AWQ modifier has been updated to allow for configurable search observers, providing more flexibility and control over the quantization process.

Highlights

  • Fusion-Aware Job Scheduling: Introduced fusion-aware job scheduling in model_free_ptq for microscale quantization schemes (NVFP4, MXFP4), enabling joint processing of fused weight sets even when split across shards.
  • Elimination of Reindexing Step: Eliminated the need for the reindex_fused_weights preprocessing step by natively handling cross-shard fused weights within the model_free_ptq function.
  • New File Grouping Utility: Added a group_files_by_fused_weights utility function using a union-find algorithm to cluster related safetensors files that share fused weight sets.
  • Group Processing for Microscale Schemes: Implemented process_file_group_microscale_scheme to handle the joint processing of multiple safetensors files that contain cross-shard fused weights, ensuring correct global scale fusion.
  • Softened Validation Error: Softened the NotImplementedError in validate.py related to cross-shard fused weights to a debug log, as these cases are now handled automatically.
  • AWQ Modifier Enhancement: Enhanced the AWQModifier with a search_observer parameter, allowing configuration of the observer used during grid search for improved scale alignment.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully eliminates the reindex_fused_weights preprocessing step for microscale schemes by making model_free_ptq fusion-aware. The changes are well-structured, introducing new helper functions for file grouping and processing, which improves modularity. The logic for identifying and processing cross-shard fused weights seems correct. I've included a couple of suggestions to reduce code duplication, which would enhance maintainability. Additionally, the bundled changes to AWQModifier are beneficial, improving its flexibility and robustness.

@dzhengAP
Contributor Author

dzhengAP commented Mar 20, 2026

Test Results

10 passed in 7.62s
Test Group Files By Fused Weights (7/7 passed)
Test Process File Group Microscale Scheme (3/3 passed)
@kylesayrs

@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from 0f959c2 to 787057c on March 20, 2026 at 20:58
@mergify mergify bot added the `documentation` label on Mar 20, 2026
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch 3 times, most recently from 11240f1 to cd0d7e1 on March 20, 2026 at 21:23
@dzhengAP
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement by making the model_free_ptq process fusion-aware, which eliminates the need for a manual reindex_fused_weights preprocessing step for microscale schemes. The implementation is well-designed, utilizing a union-find algorithm for efficient file grouping and a new processing function to handle these groups. The code is well-structured and includes comprehensive new tests. My review includes a couple of suggestions to enhance code clarity and maintainability.

@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from 72ca14f to 7230bb1 on March 20, 2026 at 21:35
@kylesayrs
Collaborator

kylesayrs commented Mar 20, 2026

Hi @dzhengAP, can you describe at a high level how your changes avoid the reindexing step? It seems like this doesn't handle the case where two processes end up writing to the same save file at the same time, or end up overwriting each other's changes.

You also need to make sure that parallel processes don't attempt to read while the others are writing.

@dzhengAP
Contributor Author

> Hi @dzhengAP, can you describe at a high level how your changes avoid the reindexing step? It seems like this doesn't handle the case where two processes end up writing to the same save file at the same time, or end up overwriting each other's changes.
>
> You also need to make sure that parallel processes don't attempt to read while the others are writing.

@kylesayrs Great question! The key here is that group jobs are never parallelized against each other. I added group_files_by_fused_weights, which uses union-find to cluster all shards that share fused weights into a single job, so a group of N shards is processed by exactly one process, sequentially. There is no concurrent read/write between processes on the same files.

The high-level flow:

  1. group_files_by_fused_weights reads index.json and unions any shards that share fused weight sets (q/k/v, gate/up) into one group
  2. Each group becomes a single job dispatched to one worker — process_file_group_microscale_scheme loads all shards in the group, processes them together in memory, then writes each tensor back to its original shard
  3. Groups are independent by construction (no shared tensors between groups), so parallel workers never touch the same files

So the invariant is: one job = one group = one worker = no concurrent access to the same shard. Does that address your concern, or are there edge cases you're thinking of that I'm missing?
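The grouping step described above can be sketched with a standard union-find (names and the input shape are illustrative; this is the grouping design discussed at this point in the thread, later superseded by fine-grained partial reads):

```python
def group_files_by_fused_sets(shard_to_fused_sets):
    """Cluster shards that share any fused weight set.

    shard_to_fused_sets: {shard_file: set of fused-set identifiers it
    touches}. Returns a list of shard groups; each group becomes one
    sequential job, so no two workers touch the same file.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    owner = {}  # fused-set id -> first shard seen carrying it
    for shard, fused_ids in shard_to_fused_sets.items():
        find(shard)
        for fid in fused_ids:
            if fid in owner:
                union(shard, owner[fid])
            else:
                owner[fid] = shard

    groups = {}
    for shard in shard_to_fused_sets:
        groups.setdefault(find(shard), []).append(shard)
    return list(groups.values())
```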

@dzhengAP
Contributor Author

dzhengAP commented Mar 21, 2026

@kylesayrs, I validated the concurrency-safety claims from my previous comment using both determinism checks and high-concurrency stress testing, addressing concerns around race conditions and file-access conflicts. I've also committed the test to the repo as a regression safeguard for future concurrency-related changes (e3b7d16).

Experiments

Experiment 1: Determinism Test

  • Ran identical quantization jobs with max_workers=1 vs max_workers=8
  • Result: SHA256 hashes of all output .safetensors files are bitwise identical
  • Implication: No concurrent write corruption or read-write races affecting output integrity

Experiment 2: High-Concurrency Stress Test

  • Forced max_workers=16 on 4 GPUs (high contention scenario)
  • Result: Completed successfully with no PermissionError, file lock errors, or crashes

Test Setup

  • GPU: 4× CUDA devices
  • Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Scheme: w8a16 (weight-only, no calibration)
  • Branch: model-free-ptq-runtime-optimization (commit 7230bb18)

Results Summary

  • Worker 1 (baseline): PASS, completed successfully
  • Worker 8 (parallel): PASS, completed successfully
  • Hash match (determinism): PASS, 444eee3d5e6e113a... identical across runs
  • Stress 16 (high contention): PASS, no errors or file lock conflicts

Conclusion

These results support the invariant:

one job = one group = one worker = no concurrent access to the same shard

The fusion-aware file grouping logic properly isolates file access across parallel workers. No race conditions detected.


Test Script

The validation test script has been added to this branch:

  • File: tests/test_concurrency_safety.py
  • Usage: python tests/test_concurrency_safety.py
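The determinism check in Experiment 1 reduces to comparing per-file hashes across runs; a small sketch (the helper name is illustrative, not necessarily the committed test's API):

```python
import hashlib
from pathlib import Path

def sha256_of_outputs(out_dir):
    """Hash every .safetensors file in a run's output directory.

    Two runs are bitwise-deterministic iff the returned dicts are
    equal, which is the comparison Experiment 1 relies on.
    """
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(out_dir).glob("*.safetensors"))
    }
```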

dzhengAP added a commit to dzhengAP/llm-compressor that referenced this pull request Mar 21, 2026
- Validates fusion-aware file grouping prevents race conditions
- Tests determinism across 1, 8, and 16 workers
- Verifies SHA256 hash consistency under high concurrency
- Supports the 'one job = one group = one worker' invariant
dzhengAP added a commit to dzhengAP/llm-compressor that referenced this pull request Mar 21, 2026
- Validates fusion-aware file grouping prevents race conditions
- Tests determinism across 1, 8, and 16 workers
- Verifies SHA256 hash consistency under high concurrency
- Supports the 'one job = one group = one worker' invariant

Tested on: 4x CUDA GPUs
Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Scheme: w8a16 (weight-only)

Author: David Zheng (dqzheng1996@gmail.com)

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from accc40e to e3b7d16 on March 21, 2026 at 10:00
@kylesayrs
Collaborator

> I added group_files_by_fused_weights to use union-find to cluster all shards that share fused weights into a single job

I see! However, in the worst case, all shards are placed into the same group, right? Consider the case:

file0: (A.up_proj)
file1: (A.gate_proj, B.up_proj)
file2: (B.gate_proj, C.up_proj)
file3: (C.gate_proj, D.up_proj)
...

This solution inherently introduces some level of sequential processing, which introduces a lower bound on overall latency and reduces parallelism. Have you considered using a design where each thread reads the partitions that it needs, then writes the results independently?

[Screenshot: diagram of the proposed design, with each thread reading the partitions it needs and independently writing its own shard]

@mergify
Contributor

mergify bot commented Mar 21, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@dzhengAP
Contributor Author

dzhengAP commented Mar 22, 2026

> I added group_files_by_fused_weights to use union-find to cluster all shards that share fused weights into a single job

> I see! However, in the worst case, all shards are placed into the same group, right? Consider the case:
>
> file0: (A.up_proj)
> file1: (A.gate_proj, B.up_proj)
> file2: (B.gate_proj, C.up_proj)
> file3: (C.gate_proj, D.up_proj)
> ...
>
> This solution inherently introduces some level of sequential processing, which introduces a lower bound on overall latency and reduces parallelism. Have you considered using a design where each thread reads the partitions that it needs, then writes the results independently?

Hi @kylesayrs, you're right: in pathological cases, union-find could in theory collapse all shards into one group. I did consider parallel reads/writes at the beginning. Here's why I think the current union-find approach is still the right trade-off in practice:

  1. Real-world fused weights are localized. Even in worst-case chaining, groups are bounded by layer count (32-96 for LLMs), not model-wide. I am running benchmarks on Llama-3-8B and Mixtral-8x7B, in addition to TinyLlama, to measure the maximum group sizes.

  2. The alternative (parallel reads/writes) introduces non-trivial complexity:

    • Cross-thread coordination for global scale computation (microscale schemes need scales across ALL tensors in a fused set)
    • Race conditions when multiple processes write to the same shard (would require file locking, killing performance)
    • Memory/coordination overhead for partial reads and two-pass processing
  3. Union-find guarantees single-writer per shard – No locks, no races, roughly identical throughput, simple to test and reason about, as I validated in the committed test. The bounded serialization penalty is minimal compared to quantization computation itself.

If we're concerned about pathological cases, we could add a hybrid fallback: process large groups with parallel partitions while keeping union-find for the common case. Happy to explore that if you think it's necessary. Does this address your concern, or do you see specific scenarios where the current approach would cause real performance issues?

@kylesayrs
Collaborator

kylesayrs commented Mar 22, 2026

@dzhengAP

  1. The models that this entrypoint targets are often much larger, such as Kimi-K2 or Mistral Large 3. I don't think you need to test against these models, but they should give you a sense of the scope of the problem. We want to support both small models and large models.
  2. I think there may be some misunderstanding about the algorithm I'm proposing, and I'm not sure I agree that it is more complex:
    • There is no cross-thread coordination. Each thread independently reads the tensors that it needs from the source files.
    • There is no write race condition. As shown in the diagram, each thread independently writes its own shard.
    • There is no memory coordination required for partial reads, and there is no two-pass processing. There is a runtime cost from redundant reads of safetensors headers, but this is assumed to be minimal and could be eliminated via a CPU cache.
  3. I don't see a reason not to maximize parallelism and avoid additional time and space complexity in this case. I'll also note that the union method not only introduces runtime costs via sequential execution, but also peak memory costs in the typical case (some processes will load 3 files at once). By contrast, in the typical case, the algorithm I'm proposing loads only its main shard plus 0-3 extra weights. We can even eliminate the redundant reads by excluding these weights from other processes, so that each weight is loaded exactly once.

Let me know what you think

@brian-dellabetta
Collaborator

Hi @dzhengAP, I spoke to @kylesayrs about this. I need to handle reindexing in another flow, in cases where a model is quantized to fp8 block but the weight and weight_scale tensors are split across files (my PR code here).

I agree re-indexing is a pain and it would be good to eliminate it. I spoke with @kylesayrs about an implementation based on his diagram posted above. If you'd like, I can look into adding it to my PR in a way that works with both microscale and fp8 block, or I can rebase against your PR if you'd like to work on this. Just let me know how you'd like to proceed.

@dzhengAP
Contributor Author

dzhengAP commented Mar 23, 2026

@kylesayrs

Love the sketch! I agree that finer-grained parallelism offers real advantages: more parallelism, better speedup, and lower per-worker memory. I ran some simulations to benchmark and visualize the trade-offs, and two concerns came up that are worth discussing; they are also what I was worried about previously:

1. Scale consistency (NVFP4/MXFP4)
Fused Q/K/V projections require a single shared quantization scale. With grouping this is guaranteed; with fine-grained, per-thread scale computation risks scale_0 ≠ scale_1 due to FP rounding. Is the plan to pre-compute and broadcast the scale, or is some divergence acceptable?

2. Redundant tensor reads
In worst-case chain/star patterns at high shard counts (e.g., 50 shards), redundant reads can reach 98% — 150 GB of I/O for 3 GB of unique data. OS cache mitigates but doesn't solve it. My benchmarks show fine-grained wins runtime in 61% of configs and memory in 94%, but 17% of configs hit >500 redundant reads.
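The shared-scale constraint in point 1 can be made concrete: every member of a fused set must quantize against one global scale, so whichever process computes it must see all members. A sketch under simplifying assumptions (an amax-based per-tensor scale; the 448.0 divisor is the FP8 E4M3 max, used for illustration, not the PR's exact microscale formula):

```python
import numpy as np

def fused_global_scale(tensors, quant_max=448.0):
    """One scale shared by every member of a fused set.

    If each shard computed a scale from only its local members,
    scale_0 != scale_1 could result; computing from the full set
    makes the result identical regardless of which worker runs it.
    """
    amax = max(float(np.abs(t).max()) for t in tensors)
    return amax / quant_max
```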

Proposed next steps:

  • I'll open the fine-grained PR I have ready for review — it's the better default and other PRs may depend on it @kylesayrs @brian-dellabetta
  • We can follow up with a hybrid fallback:
if max_group_size < 5 and estimated_redundancy < 100:
    return "fine_grained"   # maximize parallelism
else:
    return "grouping"       # predictable I/O

Please check the simulation results below. What's your take on the scale question?
[Simulation result plots: runtime, memory, and redundant-read comparisons across configurations]

@brian-dellabetta
Collaborator

brian-dellabetta commented Mar 23, 2026

Thanks @dzhengAP ! Happy to review when ready. regarding your points

  1. For scale consistency, wouldn't that be handled outside of any re-indexing? I think the logic here could be completely agnostic to whether they're fused or not. The weights still appear in the checkpoint; we just stick to whatever convention already exists. If a user then runs model_free_ptq with NVFP4 format, we handle the weight fusing there.
  2. For redundant reads, I think this is fine as long as we're only reading the tensor names from the index file or the safetensors header, rather than redundantly reading the entire tensor. So while there will be significantly more reads, the total GB read from disk should be minimal in what you're calling the "fine-grained" approach.
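Listing tensor names requires parsing only the safetensors header (an 8-byte little-endian length prefix followed by a JSON table), which is why the redundant reads stay cheap. A minimal dependency-free sketch (the helper name is illustrative; the real code would use safetensors' safe_open):

```python
import json
import struct

def tensor_names(file_path):
    """List tensor names from a .safetensors shard by reading only the
    header: the 8-byte little-endian length, then that many bytes of
    JSON. No tensor payload is touched."""
    with open(file_path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]
```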

@mergify
Contributor

mergify bot commented Mar 26, 2026

The quality checks have failed. See the first mergify notice above for how to fix the lint failures.

Collaborator

@brian-dellabetta brian-dellabetta left a comment


Thanks @dzhengAP, one request on the process and validate job signatures, to keep them consistent for standard vs. microscale. No other comments beyond that; hopefully we can get this in soon. Thanks for all the work on this!

@mergify mergify bot removed the quality-failed label Mar 26, 2026
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from ab15889 to 21931c9 on March 27, 2026 at 05:02
@mergify
Contributor

mergify bot commented Mar 27, 2026

The quality checks have failed. See the first mergify notice above for how to fix the lint failures.

@dzhengAP
Contributor Author

Updated per review from @kylesayrs and @brian-dellabetta:

microscale.py:

  • Refactor DEFAULT_FUSED_MAPPINGS to {primary_pattern: [partner_templates]}
    so only the primary-owning shard fetches its partners, preventing double
    reads for cross-shard fused weight sets
  • Move build_inverse_weights_map() here from helpers.py
  • build_inverse_weights_map() uses regex match on primary patterns to find
    partners, ensuring each fused set is fetched exactly once

process.py:

  • Unified signature for validate_file, process_file, process_file_microscale_scheme:
    (inverse_weights_map, save_path, scheme, ignore, device, converter)
  • All functions use safe_open for true partial reads
  • No backward compatibility code (internal functions)

__init__.py:

  • Single _get_weights_map() helper handles both single-file and multi-file models
  • Single _build_quantization_jobs() replaces separate standard/microscale builders
  • validate jobs use *job[1:] for consistent signature with process jobs
  • Remove unused _get_all_tensor_names() and job_fn=None argument

helpers.py:

  • Remove build_inverse_weights_map (moved to microscale.py)
  • Remove build_weights_map (no longer needed)

save_utils.py / __init__.py:

  • Fix import paths for compressed-tensors dev version

tests:

  • Update test_reindexing_elimination.py for new inverse_weights_map interface
  • Update test_model_free_validation.py for new validate_file signature

@mergify mergify bot removed the quality-failed label Mar 27, 2026
@mergify
Contributor

mergify bot commented Mar 27, 2026

The quality checks have failed. See the first mergify notice above for how to fix the lint failures.

@brian-dellabetta brian-dellabetta added the `ready` label on Mar 27, 2026
@mergify
Contributor

mergify bot commented Mar 27, 2026

The quality checks have failed. See the first mergify notice above for how to fix the lint failures.

Collaborator

@brian-dellabetta brian-dellabetta left a comment


I committed some changes to clean up the top level functions, so we have a single function to build jobs, and validate/process jobs have the same file signature regardless of whether they are standard or microscale jobs. @dzhengAP reviewed and tests are passing. This should be good to merge in now, and I will tackle the TODOs in my follow-up PR #2491

@mergify mergify bot removed the quality-failed label Mar 27, 2026
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from b92af83 to 728d84a on March 28, 2026 at 00:52
…job signatures

Per review from @kylesayrs and @brian-dellabetta:

microscale.py:
- Refactor DEFAULT_FUSED_MAPPINGS to {primary_pattern: [partner_templates]}
  so only the primary-owning shard fetches its partners, preventing double
  reads for cross-shard fused weight sets
- build_inverse_weights_map uses re.match with named group substitution
  to construct partner names exactly as Kyle suggested

process.py:
- Unified signature for validate_file, process_file, process_file_microscale_scheme:
  (inverse_weights_map, save_path, scheme, ignore, device, converter)
- All functions use safe_open for true partial reads
- Remove assert on unmatched fused sets — non-primary shards legitimately
  have incomplete sets (k/v without q)

__init__.py:
- Single _get_weights_map() helper handles both single-file and multi-file models
- Single _build_quantization_jobs() with identical tuple structure for all jobs
- Fix import path for compressed-tensors dev version

helpers.py / validate.py:
- Remove build_inverse_weights_map (moved to microscale.py)
- Update validate.py to reflect inverse_weights_map approach

tests:
- Update test signatures for new inverse_weights_map interface

Closes vllm-project#2497

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from 728d84a to c045c63 on March 28, 2026 at 00:56
@dzhengAP
Contributor Author

@kylesayrs

  • DEFAULT_FUSED_MAPPINGS refactored to {primary_pattern: [partner_templates]}
    so only the primary-owning shard fetches its partners, preventing double
    reads for cross-shard fused weight sets
  • build_inverse_weights_map uses re.match on primary patterns with named
    group substitution to construct partner names exactly as Kyle suggested
  • process_file_microscale_scheme: remove assert on unmatched fused sets —
    non-primary shards legitimately have k/v without q since only the primary
    shard fetches partners

@brian-dellabetta brian-dellabetta added the `model_free_ptq` label on Mar 30, 2026
@dzhengAP dzhengAP changed the title [model_free_ptq] Eliminate reindexing step via fine-grained parallelized partial reads [Distributed] [model_free_ptq] Eliminate reindexing step via fine-grained parallelized partial reads Mar 30, 2026
Collaborator

@kylesayrs kylesayrs left a comment


Nice job, thanks for being open to suggestions!

@dsikka dsikka merged commit 3544a0e into vllm-project:main Mar 30, 2026
14 of 17 checks passed
brian-dellabetta added a commit that referenced this pull request Mar 31, 2026
SUMMARY:
Follow-up to #2498 and precursor to landing #2491.

This PR cleans up a few things:

- [x] Use the same function signature for building standard jobs,
microscale jobs, and validation jobs. These will be needed in #2491.
- [x] Renamed microscale-specific `build_inverse_weights_map` ->
`build_microscale_inverse_weights_map` because other reindexing logic
will need different functionality when determining fused tensors.
- [x] Prunes unused `_get_all_tensor_names`
- [x] Breaks out loading logic for inverse_weights_map to a helper that
can be moved to CT in follow-up #2491


TEST PLAN:
No net new functionality, if all tests pass should be good to go

---------

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: David Zheng <dqzheng1996@gmail.com>
Co-authored-by: David Zheng <153074367+dzhengAP@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
brian-dellabetta added a commit that referenced this pull request Mar 31, 2026
…ld_inverse_weights_map (#2546)

## Purpose
Follow-up to #2498 addressing review comments from @kylesayrs.

## Changes
- `partner_shard is None`: log a warning instead of silently skipping; this
indicates an unexpected model architecture where the expected partner
tensor doesn't exist
- `partner_resolved is None`: raise a ValueError instead of silently
skipping; this indicates a corrupt or incomplete checkpoint and should
surface as an error

## Notes
These cases were flagged as defensive guards that either shouldn't exist
or should error loudly. The warning approach for partner_shard=None
handles
the edge case of non-standard model architectures gracefully while still
surfacing the issue.

## Test
Unit tests passed 35/35

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
Signed-off-by: root <root@bolt-6jxv69gfv8-tqazfcpmhh.bolt-pods.turi-bolt.svc.kube.us-east-1d.k8s.cloud.apple.com>
Co-authored-by: root <root@bolt-6jxv69gfv8-tqazfcpmhh.bolt-pods.turi-bolt.svc.kube.us-east-1d.k8s.cloud.apple.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>

Labels

  • documentation: Improvements or additions to documentation
  • model_free_ptq: For any PR/issue related to the `model_free_ptq` pathway
  • ready: When a PR is ready for review


Development

Successfully merging this pull request may close these issues.

[model_free_ptq] Runtime optimization: meta device shape validation, multi-GPU compression, reindexing elimination

4 participants