
[Distributed] [model_free_ptq] Eliminate reindexing step via fine-grained parallelized partial reads#2498

Merged
dsikka merged 1 commit into vllm-project:main from dzhengAP:model-free-ptq-runtime-optimization on Mar 30, 2026

Conversation

@dzhengAP
Contributor

@dzhengAP dzhengAP commented Mar 20, 2026

Purpose

Eliminates the reindex_fused_weights preprocessing step for microscale
schemes (NVFP4, MXFP4) by enabling each shard to be processed independently
with full parallelism, even when fused weight sets (q/k/v, gate/up) span
multiple shards.

Approach

Instead of grouping shards together (which reduces parallelism), each shard
process fetches only the specific fused partner tensors it needs from other
shards via targeted partial safetensors reads, computes the fused global
scale locally, and writes only its own output shard. No cross-process
coordination or file locking required.

Changes

helpers.py

Added build_tensor_file_index() — reads index.json once at startup and
builds a flat mapping of tensor_name → resolved_file_path. This gives each
worker process an O(1) lookup to find which file contains any fused partner
tensor, without re-scanning headers at runtime.
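A minimal sketch of what such an index builder could look like, assuming the standard Hugging Face `model.safetensors.index.json` layout (the function body is illustrative, not the PR's exact code):

```python
import json
from pathlib import Path

def build_tensor_file_index(model_dir: str) -> dict:
    """Build a flat tensor_name -> resolved shard path mapping.

    Assumes the standard Hugging Face sharded layout, where
    model.safetensors.index.json carries a "weight_map" of
    tensor_name -> relative file name. Reading it once up front
    gives each worker an O(1) lookup at runtime.
    """
    root = Path(model_dir)
    with open(root / "model.safetensors.index.json") as f:
        index = json.load(f)
    return {name: root / fname for name, fname in index["weight_map"].items()}
```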

process.py

Updated process_file_microscale_scheme() with an optional
tensor_file_index parameter. When provided:

  • _fetch_fused_partners() is called to identify any fused set members
    missing from the current shard, then fetches only those specific tensors
    via partial safetensors reads (headers + target tensors only, not full files)
  • Fused global scale is computed locally using all members of the fused set
  • _belongs_to_shard() ensures only native tensors are written to the output
    shard — fetched partner tensors are used for scale computation only and
    never written to the wrong shard
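The partner-identification step in the bullets above can be sketched in plain Python. The PR only names `_fetch_fused_partners`, so the fused sets and helper name below are illustrative assumptions:

```python
# Illustrative fused weight sets (q/k/v, gate/up), not the PR's exact config.
FUSED_SETS = [
    ["q_proj", "k_proj", "v_proj"],
    ["gate_proj", "up_proj"],
]

def missing_fused_partners(local_tensors, tensor_file_index):
    """Return {tensor_name: file_path} for fused-set members that live
    in other shards and must be fetched via targeted partial reads.
    Hypothetical sketch of the partner-identification logic only."""
    local = set(local_tensors)
    needed = {}
    for name in local:
        for fused in FUSED_SETS:
            for member in fused:
                if member in name:
                    i = name.index(member)
                    prefix, suffix = name[:i], name[i + len(member):]
                    for partner in fused:
                        partner_name = prefix + partner + suffix
                        if partner_name not in local and partner_name in tensor_file_index:
                            needed[partner_name] = tensor_file_index[partner_name]
    return needed
```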

__init__.py

Simplified back to one job per shard — full parallelism restored. For
microscale schemes, builds the tensor_file_index once from index.json
and passes it to each job. No union-find, no grouping logic needed.

validate.py

Removed NotImplementedError for cross-shard fused weights — the case is
now handled natively. Replaced with logger.debug noting that partner
tensors will be resolved via partial reads.

Latest Updates: Eliminate reindexing step via inverse_weights_map with unified job signatures

Approach

Each shard job receives a precomputed inverse_weights_map specifying exactly
which tensors to load from which files. For cross-shard fused weights, only the
shard owning the primary tensor (q_proj, gate_proj) fetches its partners —
preventing double reads. All jobs share a unified signature for both standard
and microscale schemes.

Changes

microscale.py

  • Refactor DEFAULT_FUSED_MAPPINGS from a list of lists to
    {primary_pattern: [partner_templates]} — only the primary-owning shard
    fetches its partners, preventing double reads for cross-shard fused weights
  • Move build_inverse_weights_map() here — uses regex match on primary
    patterns to construct partner names and locate them in other shards
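A sketch of how the refactored mapping and the regex-based partner construction could look (patterns, templates, and helper name are illustrative assumptions, not the PR's exact code):

```python
import re

# Hypothetical shape of the refactored mapping: only the shard owning the
# "primary" tensor (q_proj, gate_proj) fetches its partners, so each fused
# set is fetched exactly once.
DEFAULT_FUSED_MAPPINGS = {
    r"^(?P<prefix>.*)\.q_proj\.(?P<suffix>.*)$": [
        "{prefix}.k_proj.{suffix}",
        "{prefix}.v_proj.{suffix}",
    ],
    r"^(?P<prefix>.*)\.gate_proj\.(?P<suffix>.*)$": [
        "{prefix}.up_proj.{suffix}",
    ],
}

def partner_names(tensor_name):
    """Match a primary tensor name against the primary patterns and build
    its fused partner names via named-group substitution. Returns [] for
    non-primary tensors (k/v, up), which never trigger fetches."""
    for pattern, templates in DEFAULT_FUSED_MAPPINGS.items():
        m = re.match(pattern, tensor_name)
        if m:
            return [t.format(**m.groupdict()) for t in templates]
    return []
```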

process.py

  • Unified signature for validate_file, process_file, and
    process_file_microscale_scheme:
    (inverse_weights_map, save_path, scheme, ignore, device, converter)
  • All functions use safe_open + f.get_tensor() for true partial reads
  • Partner tensors re-saved into requesting shard's output; caller updates
    safetensors index to reflect new locations
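The real code uses safetensors' `safe_open` + `get_tensor` for partial reads. To show what that buys under the hood, here is a dependency-free sketch against the safetensors file format (an 8-byte little-endian header length, a JSON header with per-tensor `data_offsets` relative to the end of the header): only the header plus the requested tensors' byte ranges are read, never the full shard. `fetch_tensors` is an illustrative name:

```python
import json
import struct
import numpy as np

def fetch_tensors(file_path, names):
    """Targeted partial read of a .safetensors shard.

    Parses the JSON header, then seeks to and reads only the requested
    tensors' byte ranges. Illustrative stand-in for
    safetensors.safe_open(...) + f.get_tensor(name).
    """
    dtypes = {"F32": np.float32, "F16": np.float16, "I64": np.int64}
    out = {}
    with open(file_path, "rb") as f:
        (hlen,) = struct.unpack("<Q", f.read(8))   # 8-byte LE header length
        header = json.loads(f.read(hlen))          # JSON table of tensors
        data_start = 8 + hlen
        for name in names:
            info = header[name]
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)             # jump straight to the tensor
            buf = f.read(end - begin)
            out[name] = np.frombuffer(buf, dtype=dtypes[info["dtype"]]).reshape(info["shape"])
    return out
```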

__init__.py

  • Single _get_weights_map() helper handles both single-file and multi-file
    models (reads safetensors.index.json or scans file headers via safe_open)
  • Single _build_quantization_jobs() replaces separate standard/microscale
    builders — one job per shard with identical tuple structure for both
  • Validate jobs use *job[1:] so validate and process jobs share the same
    tuple structure

helpers.py

  • Removed build_weights_map and build_inverse_weights_map (moved to
    microscale.py)

validate.py

  • Removed NotImplementedError for cross-shard fused weights — handled natively
  • Updated to reflect inverse_weights_map-based approach

Testing

  • pytest tests/llmcompressor/entrypoints/model_free/ — all passing locally
  • make style && make quality — all checks pass

Signed-off-by: David Zheng dqzheng1996@gmail.com

Closes #2497
Related to #2448

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the model_free_ptq quantization process for microscale schemes by integrating fusion-aware logic directly into job scheduling. This enhancement removes the previous requirement for a separate reindex_fused_weights preprocessing step, streamlining the workflow for models with fused weight sets split across multiple safetensors shards. The changes enable more efficient and accurate quantization by ensuring that all components of a fused weight set are processed together, even when distributed across different files. Additionally, the AWQ modifier has been updated to allow for configurable search observers, providing more flexibility and control over the quantization process.

Highlights

  • Fusion-Aware Job Scheduling: Introduced fusion-aware job scheduling in model_free_ptq for microscale quantization schemes (NVFP4, MXFP4), enabling joint processing of fused weight sets even when split across shards.
  • Elimination of Reindexing Step: Eliminated the need for the reindex_fused_weights preprocessing step by natively handling cross-shard fused weights within the model_free_ptq function.
  • New File Grouping Utility: Added a group_files_by_fused_weights utility function using a union-find algorithm to cluster related safetensors files that share fused weight sets.
  • Group Processing for Microscale Schemes: Implemented process_file_group_microscale_scheme to handle the joint processing of multiple safetensors files that contain cross-shard fused weights, ensuring correct global scale fusion.
  • Softened Validation Error: Softened the NotImplementedError in validate.py related to cross-shard fused weights to a debug log, as these cases are now handled automatically.
  • AWQ Modifier Enhancement: Enhanced the AWQModifier with a search_observer parameter, allowing configuration of the observer used during grid search for improved scale alignment.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully eliminates the reindex_fused_weights preprocessing step for microscale schemes by making model_free_ptq fusion-aware. The changes are well-structured, introducing new helper functions for file grouping and processing, which improves modularity. The logic for identifying and processing cross-shard fused weights seems correct. I've included a couple of suggestions to reduce code duplication, which would enhance maintainability. Additionally, the bundled changes to AWQModifier are beneficial, improving its flexibility and robustness.

@dzhengAP
Contributor Author

dzhengAP commented Mar 20, 2026

Test Results

10 passed in 7.62s
Test Group Files By Fused Weights (7/7 passed)
Test Process File Group Microscale Scheme (3/3 passed)
@kylesayrs

@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from 0f959c2 to 787057c on March 20, 2026 at 20:58
@mergify mergify bot added the `documentation` label on Mar 20, 2026
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch 3 times, most recently from 11240f1 to cd0d7e1 on March 20, 2026 at 21:23
@dzhengAP
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement by making the model_free_ptq process fusion-aware, which eliminates the need for a manual reindex_fused_weights preprocessing step for microscale schemes. The implementation is well-designed, utilizing a union-find algorithm for efficient file grouping and a new processing function to handle these groups. The code is well-structured and includes comprehensive new tests. My review includes a couple of suggestions to enhance code clarity and maintainability.

@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from 72ca14f to 7230bb1 on March 20, 2026 at 21:35
@kylesayrs
Collaborator

kylesayrs commented Mar 20, 2026

Hi @dzhengAP, can you describe at a high level how your changes avoid the reindexing step? It seems like this doesn't handle the case where two processes end up writing to the same save file at the same time, or end up overwriting each other's changes.

You also need to make sure that parallel processes don't attempt to read while the others are writing.

@dzhengAP
Contributor Author

> Hi @dzhengAP, can you describe at a high level how your changes avoid the reindexing step? It seems like this doesn't handle the case where two processes end up writing to the same save file at the same time, or end up overwriting each other's changes.
>
> You also need to make sure that parallel processes don't attempt to read while the others are writing.

@kylesayrs Great question! The key here is that group jobs are never parallelized against each other. I added group_files_by_fused_weights, which uses union-find to cluster all shards that share fused weights into a single job, so a group of N shards is processed by exactly one process, sequentially. There is no concurrent read/write between processes on the same files.

The high-level flow:

  1. group_files_by_fused_weights reads index.json and unions any shards that share fused weight sets (q/k/v, gate/up) into one group
  2. Each group becomes a single job dispatched to one worker — process_file_group_microscale_scheme loads all shards in the group, processes them together in memory, then writes each tensor back to its original shard
  3. Groups are independent by construction (no shared tensors between groups), so parallel workers never touch the same files

So the invariant is: one job = one group = one worker = no concurrent access to the same shard. Does that address your concern, or are there edge cases you're thinking of that I'm missing?
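The grouping step described above can be sketched with a standard union-find (names and the input shape are illustrative; this is the grouping design discussed at this point in the thread, later superseded by fine-grained partial reads):

```python
def group_files_by_fused_sets(shard_to_fused_sets):
    """Cluster shards that share any fused weight set.

    shard_to_fused_sets: {shard_file: set of fused-set identifiers it
    touches}. Returns a list of shard groups; each group becomes one
    sequential job, so no two workers touch the same file.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    owner = {}  # fused-set id -> first shard seen carrying it
    for shard, fused_ids in shard_to_fused_sets.items():
        find(shard)
        for fid in fused_ids:
            if fid in owner:
                union(shard, owner[fid])
            else:
                owner[fid] = shard

    groups = {}
    for shard in shard_to_fused_sets:
        groups.setdefault(find(shard), []).append(shard)
    return list(groups.values())
```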

@dzhengAP
Contributor Author

dzhengAP commented Mar 21, 2026

@kylesayrs, I validated the concurrency-safety claims from my previous comment using both determinism checks and high-concurrency stress testing, addressing concerns around race conditions and file-access conflicts. I've also committed the test to the repo as a regression safeguard for future concurrency-related changes (e3b7d16).

Experiments

Experiment 1: Determinism Test

  • Ran identical quantization jobs with max_workers=1 vs max_workers=8
  • Result: SHA256 hashes of all output .safetensors files are bitwise identical
  • Implication: No concurrent write corruption or read-write races affecting output integrity

Experiment 2: High-Concurrency Stress Test

  • Forced max_workers=16 on 4 GPUs (high contention scenario)
  • Result: Completed successfully with no PermissionError, file lock errors, or crashes

Test Setup

  • GPU: 4× CUDA devices
  • Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Scheme: w8a16 (weight-only, no calibration)
  • Branch: model-free-ptq-runtime-optimization (commit 7230bb18)

Results Summary

  • Worker 1 (baseline): PASS, completed successfully
  • Worker 8 (parallel): PASS, completed successfully
  • Hash match (determinism): PASS, 444eee3d5e6e113a... identical across runs
  • Stress 16 (high contention): PASS, no errors or file lock conflicts

Conclusion

These results support the invariant:

one job = one group = one worker = no concurrent access to the same shard

The fusion-aware file grouping logic properly isolates file access across parallel workers. No race conditions detected.


Test Script

The validation test script has been added to this branch:

  • File: tests/test_concurrency_safety.py
  • Usage: python tests/test_concurrency_safety.py
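The determinism check in Experiment 1 reduces to comparing per-file hashes across runs; a small sketch (the helper name is illustrative, not necessarily the committed test's API):

```python
import hashlib
from pathlib import Path

def sha256_of_outputs(out_dir):
    """Hash every .safetensors file in a run's output directory.

    Two runs are bitwise-deterministic iff the returned dicts are
    equal, which is the comparison Experiment 1 relies on.
    """
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(out_dir).glob("*.safetensors"))
    }
```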

dzhengAP added a commit to dzhengAP/llm-compressor that referenced this pull request Mar 21, 2026
- Validates fusion-aware file grouping prevents race conditions
- Tests determinism across 1, 8, and 16 workers
- Verifies SHA256 hash consistency under high concurrency
- Supports the 'one job = one group = one worker' invariant
dzhengAP added a commit to dzhengAP/llm-compressor that referenced this pull request Mar 21, 2026
- Validates fusion-aware file grouping prevents race conditions
- Tests determinism across 1, 8, and 16 workers
- Verifies SHA256 hash consistency under high concurrency
- Supports the 'one job = one group = one worker' invariant

Tested on: 4x CUDA GPUs
Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Scheme: w8a16 (weight-only)

Author: David Zheng (dqzheng1996@gmail.com)

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from accc40e to e3b7d16 on March 21, 2026 at 10:00
@kylesayrs
Collaborator

> I added group_files_by_fused_weights to use union-find to cluster all shards that share fused weights into a single job

I see! However, in the worst case, all shards are placed into the same group, right? Consider the case:

file0: (A.up_proj)
file1: (A.gate_proj, B.up_proj)
file2: (B.gate_proj, C.up_proj)
file3: (C.gate_proj, D.up_proj)
...

This solution inherently introduces some level of sequential processing, which introduces a lower bound on overall latency and reduces parallelism. Have you considered using a design where each thread reads the partitions that it needs, then writes the results independently?

[Screenshot: diagram of the proposed design, with each thread reading the partitions it needs and independently writing its own shard]

@mergify
Contributor

mergify bot commented Mar 21, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@dzhengAP
Contributor Author

dzhengAP commented Mar 22, 2026

> I added group_files_by_fused_weights to use union-find to cluster all shards that share fused weights into a single job

> I see! However, in the worst case, all shards are placed into the same group, right? Consider the case:
>
> file0: (A.up_proj)
> file1: (A.gate_proj, B.up_proj)
> file2: (B.gate_proj, C.up_proj)
> file3: (C.gate_proj, D.up_proj)
> ...
>
> This solution inherently introduces some level of sequential processing, which introduces a lower bound on overall latency and reduces parallelism. Have you considered using a design where each thread reads the partitions that it needs, then writes the results independently?

Hi @kylesayrs, you're right: in pathological cases, union-find could in theory collapse all shards into one group. I did consider parallel reads/writes at the beginning. Here's why I think the current union-find approach is still the right trade-off in practice:

  1. Real-world fused weights are localized. Even in worst-case chaining, groups are bounded by layer count (32-96 for LLMs), not model-wide. I am running benchmarks on Llama-3-8B and Mixtral-8x7B, in addition to TinyLlama, to measure the maximum group sizes.

  2. The alternative (parallel reads/writes) introduces non-trivial complexity:

    • Cross-thread coordination for global scale computation (microscale schemes need scales across ALL tensors in a fused set)
    • Race conditions when multiple processes write to the same shard (would require file locking, killing performance)
    • Memory/coordination overhead for partial reads and two-pass processing
  3. Union-find guarantees single-writer per shard – No locks, no races, roughly identical throughput, simple to test and reason about, as I validated in the committed test. The bounded serialization penalty is minimal compared to quantization computation itself.

If we're concerned about pathological cases, we could add a hybrid fallback: process large groups with parallel partitions while keeping union-find for the common case. Happy to explore that if you think it's necessary. Does this address your concern, or do you see specific scenarios where the current approach would cause real performance issues?

@kylesayrs
Collaborator

kylesayrs commented Mar 22, 2026

@dzhengAP

  1. The models that this entrypoint targets are often much larger, such as Kimi-K2 or Mistral Large 3. I don't think you need to test against these models, but they should give you a sense of the scope of the problem. We want to support both small models and large models.
  2. I think there may be some misunderstanding about the algorithm I'm proposing, and I'm not sure I agree that it is more complex:
    • There is no cross-thread coordination. Each thread independently reads the tensors that it needs from the source files.
    • There is no write race condition. As shown in the diagram, each thread independently writes its own shard.
    • There is no memory coordination required for partial reads, and there is no two-pass processing. There is a runtime cost from redundant reads of safetensors headers, but this is assumed to be minimal and could be eliminated via a CPU cache.
  3. I don't see a reason not to maximize parallelism and avoid additional time and space complexity in this case. I'll also note that the union method not only introduces runtime costs via sequential execution, but also peak memory costs in the typical case (some processes will load 3 files at once). By contrast, in the typical case, the algorithm I'm proposing loads only its main shard plus 0-3 extra weights. We can even eliminate the redundant reads by excluding these weights from other processes, so that each weight is loaded exactly once.

Let me know what you think

@brian-dellabetta
Collaborator

Hi @dzhengAP, I spoke to @kylesayrs about this. I need to handle reindexing in another flow, in cases where a model is quantized to fp8 block but the weight and weight_scale tensors are split across files (my PR code here).

I agree re-indexing is a pain and it would be good to eliminate it. I spoke with @kylesayrs about an implementation based on his diagram posted above. If you'd like, I can look into adding it to my PR in a way that works with both microscale and fp8 block, or I can rebase against your PR if you'd like to work on this. Just let me know how you'd like to proceed.

@dzhengAP
Contributor Author

dzhengAP commented Mar 23, 2026

@kylesayrs

Love the sketch! I agree that finer-grained parallelism offers real advantages: more parallelism, better speedup, and lower per-worker memory. I ran some simulations to benchmark and visualize the trade-offs, and two concerns came up that are worth discussing; they are also what I was worried about previously:

1. Scale consistency (NVFP4/MXFP4)
Fused Q/K/V projections require a single shared quantization scale. With grouping this is guaranteed; with fine-grained, per-thread scale computation risks scale_0 ≠ scale_1 due to FP rounding. Is the plan to pre-compute and broadcast the scale, or is some divergence acceptable?

2. Redundant tensor reads
In worst-case chain/star patterns at high shard counts (e.g., 50 shards), redundant reads can reach 98% — 150 GB of I/O for 3 GB of unique data. OS cache mitigates but doesn't solve it. My benchmarks show fine-grained wins runtime in 61% of configs and memory in 94%, but 17% of configs hit >500 redundant reads.
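The shared-scale constraint in point 1 can be made concrete: every member of a fused set must quantize against one global scale, so whichever process computes it must see all members. A sketch under simplifying assumptions (an amax-based per-tensor scale; the 448.0 divisor is the FP8 E4M3 max, used for illustration, not the PR's exact microscale formula):

```python
import numpy as np

def fused_global_scale(tensors, quant_max=448.0):
    """One scale shared by every member of a fused set.

    If each shard computed a scale from only its local members,
    scale_0 != scale_1 could result; computing from the full set
    makes the result identical regardless of which worker runs it.
    """
    amax = max(float(np.abs(t).max()) for t in tensors)
    return amax / quant_max
```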

Proposed next steps:

  • I'll open the fine-grained PR I have ready for review — it's the better default and other PRs may depend on it @kylesayrs @brian-dellabetta
  • We can follow up with a hybrid fallback:
if max_group_size < 5 and estimated_redundancy < 100:
    return "fine_grained"   # maximize parallelism
else:
    return "grouping"       # predictable I/O

Please check the simulation results below. What's your take on the scale question?
[Simulation result plots: runtime, memory, and redundant-read comparisons across configurations]

@brian-dellabetta
Collaborator

brian-dellabetta commented Mar 23, 2026

Thanks @dzhengAP ! Happy to review when ready. regarding your points

  1. For scale consistency, wouldn't that be handled outside of any re-indexing? I think the logic here could be completely agnostic to whether they're fused or not. The weights still appear in the checkpoint; we just stick to whatever convention already exists. If a user then runs model_free_ptq with NVFP4 format, we handle the weight fusing there.
  2. For redundant reads, I think this is fine as long as we're only reading the tensor names from the index file or the safetensors header, rather than redundantly reading the entire tensor. So while there will be significantly more reads, the total GB read from disk should be minimal in what you're calling the "fine-grained" approach.
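Listing tensor names requires parsing only the safetensors header (an 8-byte little-endian length prefix followed by a JSON table), which is why the redundant reads stay cheap. A minimal dependency-free sketch (the helper name is illustrative; the real code would use safetensors' safe_open):

```python
import json
import struct

def tensor_names(file_path):
    """List tensor names from a .safetensors shard by reading only the
    header: the 8-byte little-endian length, then that many bytes of
    JSON. No tensor payload is touched."""
    with open(file_path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]
```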

@mergify
Contributor

mergify bot commented Mar 26, 2026

The quality checks have failed. See the first mergify notice above for how to fix the lint failures.

Collaborator

@brian-dellabetta brian-dellabetta left a comment


Thanks @dzhengAP, one request on the process and validate job signatures, to keep them consistent for standard vs. microscale. No other comments beyond that; hopefully we can get this in soon. Thanks for all the work on this!

@mergify mergify bot removed the quality-failed label Mar 26, 2026
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from ab15889 to 21931c9 on March 27, 2026 at 05:02
@mergify
Contributor

mergify bot commented Mar 27, 2026

The quality checks have failed. See the first mergify notice above for how to fix the lint failures.

@dzhengAP
Contributor Author

Updated per review from @kylesayrs and @brian-dellabetta:

microscale.py:

  • Refactor DEFAULT_FUSED_MAPPINGS to {primary_pattern: [partner_templates]}
    so only the primary-owning shard fetches its partners, preventing double
    reads for cross-shard fused weight sets
  • Move build_inverse_weights_map() here from helpers.py
  • build_inverse_weights_map() uses regex match on primary patterns to find
    partners, ensuring each fused set is fetched exactly once

process.py:

  • Unified signature for validate_file, process_file, process_file_microscale_scheme:
    (inverse_weights_map, save_path, scheme, ignore, device, converter)
  • All functions use safe_open for true partial reads
  • No backward compatibility code (internal functions)

__init__.py:

  • Single _get_weights_map() helper handles both single-file and multi-file models
  • Single _build_quantization_jobs() replaces separate standard/microscale builders
  • validate jobs use *job[1:] for consistent signature with process jobs
  • Remove unused _get_all_tensor_names() and job_fn=None argument

helpers.py:

  • Remove build_inverse_weights_map (moved to microscale.py)
  • Remove build_weights_map (no longer needed)

save_utils.py / __init__.py:

  • Fix import paths for compressed-tensors dev version

tests:

  • Update test_reindexing_elimination.py for new inverse_weights_map interface
  • Update test_model_free_validation.py for new validate_file signature

@mergify mergify bot removed the quality-failed label Mar 27, 2026
@mergify
Contributor

mergify bot commented Mar 27, 2026

The quality checks have failed. See the first mergify notice above for how to fix the lint failures.

@brian-dellabetta brian-dellabetta added the `ready` label on Mar 27, 2026
@mergify
Contributor

mergify bot commented Mar 27, 2026

The quality checks have failed. See the first mergify notice above for how to fix the lint failures.

Collaborator

@brian-dellabetta brian-dellabetta left a comment


I committed some changes to clean up the top level functions, so we have a single function to build jobs, and validate/process jobs have the same file signature regardless of whether they are standard or microscale jobs. @dzhengAP reviewed and tests are passing. This should be good to merge in now, and I will tackle the TODOs in my follow-up PR #2491

@mergify mergify bot removed the quality-failed label Mar 27, 2026
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from b92af83 to 728d84a on March 28, 2026 at 00:52
…job signatures

Per review from @kylesayrs and @brian-dellabetta:

microscale.py:
- Refactor DEFAULT_FUSED_MAPPINGS to {primary_pattern: [partner_templates]}
  so only the primary-owning shard fetches its partners, preventing double
  reads for cross-shard fused weight sets
- build_inverse_weights_map uses re.match with named group substitution
  to construct partner names exactly as Kyle suggested

process.py:
- Unified signature for validate_file, process_file, process_file_microscale_scheme:
  (inverse_weights_map, save_path, scheme, ignore, device, converter)
- All functions use safe_open for true partial reads
- Remove assert on unmatched fused sets — non-primary shards legitimately
  have incomplete sets (k/v without q)

__init__.py:
- Single _get_weights_map() helper handles both single-file and multi-file models
- Single _build_quantization_jobs() with identical tuple structure for all jobs
- Fix import path for compressed-tensors dev version

helpers.py / validate.py:
- Remove build_inverse_weights_map (moved to microscale.py)
- Update validate.py to reflect inverse_weights_map approach

tests:
- Update test signatures for new inverse_weights_map interface

Closes vllm-project#2497

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
@dzhengAP dzhengAP force-pushed the model-free-ptq-runtime-optimization branch from 728d84a to c045c63 on March 28, 2026 at 00:56
@dzhengAP
Contributor Author

@kylesayrs

  • DEFAULT_FUSED_MAPPINGS refactored to {primary_pattern: [partner_templates]}
    so only the primary-owning shard fetches its partners, preventing double
    reads for cross-shard fused weight sets
  • build_inverse_weights_map uses re.match on primary patterns with named
    group substitution to construct partner names exactly as Kyle suggested
  • process_file_microscale_scheme: remove assert on unmatched fused sets —
    non-primary shards legitimately have k/v without q since only the primary
    shard fetches partners

@brian-dellabetta brian-dellabetta added the `model_free_ptq` label on Mar 30, 2026
@dzhengAP dzhengAP changed the title [model_free_ptq] Eliminate reindexing step via fine-grained parallelized partial reads [Distributed] [model_free_ptq] Eliminate reindexing step via fine-grained parallelized partial reads Mar 30, 2026
Collaborator

@kylesayrs kylesayrs left a comment


Nice job, thanks for being open to suggestions!

@dsikka dsikka merged commit 3544a0e into vllm-project:main Mar 30, 2026
14 of 17 checks passed
brian-dellabetta added a commit that referenced this pull request Mar 31, 2026
SUMMARY:
Follow-up to #2498 and precursor to landing #2491.

This PR cleans up a few things:

- [x] Use the same function signature for building standard jobs,
microscale jobs, and validation jobs. These will be needed in #2491.
- [x] Renamed microscale-specific `build_inverse_weights_map` ->
`build_microscale_inverse_weights_map` because other reindexing logic
will need different functionality when determining fused tensors.
- [x] Prunes unused `_get_all_tensor_names`
- [x] Breaks out loading logic for inverse_weights_map to a helper that
can be moved to CT in follow-up #2491


TEST PLAN:
No net new functionality, if all tests pass should be good to go

---------

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: David Zheng <dqzheng1996@gmail.com>
Co-authored-by: David Zheng <153074367+dzhengAP@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
brian-dellabetta added a commit that referenced this pull request Mar 31, 2026
…ld_inverse_weights_map (#2546)

## Purpose
Follow-up to #2498 addressing review comments from @kylesayrs.

## Changes
- `partner_shard is None`: log a warning instead of silently skipping; this
indicates an unexpected model architecture where the expected partner
tensor doesn't exist
- `partner_resolved is None`: raise a ValueError instead of silently
skipping; this indicates a corrupt or incomplete checkpoint and should
surface as an error

## Notes
These cases were flagged as defensive guards that either shouldn't exist
or should error loudly. The warning approach for partner_shard=None
handles
the edge case of non-standard model architectures gracefully while still
surfacing the issue.

## Test
Unit tests passed 35/35

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
Signed-off-by: root <root@bolt-6jxv69gfv8-tqazfcpmhh.bolt-pods.turi-bolt.svc.kube.us-east-1d.k8s.cloud.apple.com>
Co-authored-by: root <root@bolt-6jxv69gfv8-tqazfcpmhh.bolt-pods.turi-bolt.svc.kube.us-east-1d.k8s.cloud.apple.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>

Labels

  • documentation: Improvements or additions to documentation
  • model_free_ptq: For any PR/issue related to the `model_free_ptq` pathway
  • ready: When a PR is ready for review


Development

Successfully merging this pull request may close these issues.

[model_free_ptq] Runtime optimization: meta device shape validation, multi-GPU compression, reindexing elimination

4 participants