[Quantization] Support more than one quant-compressor #415
base: main
Conversation
```
@@ -164,7 +164,7 @@ def from_pretrained_model(
    cls,
    model: Module,
    sparsity_config: Union[SparsityCompressionConfig, str, None] = None,
    quantization_format: Optional[str] = None,
```
Afaict this is the only entrypoint for this function. Why not just adjust the upstream function `infer_quantization_format` to infer the mixed value, rather than supporting an extra data type (`List[str]`) which ideally should never actually appear?
I agree with @kylesayrs on this. Also, if a list of quantization formats is passed in, do we override them to the mixed-precision format and then infer them again downstream?
I disagree; this is separation of concerns. `infer_quantization_format` is responsible for inferring the formats present in the model, but what gets written to the config should be determined by the `ModelCompressor` class, which is ultimately responsible for writing the quantization config.
We don't infer again; we use the per-module format attached to each scheme to compress each module.
See the updated llm-compressor functionality: vllm-project/llm-compressor#1713
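For readers skimming the thread, here is a minimal sketch of the per-module dispatch being described, using placeholder names and strings standing in for compressor objects; it is not the PR's actual code, only an illustration of "the scheme's format picks the compressor".

```python
from typing import Dict, Optional

def compressor_for_module(
    scheme_format: Optional[str],
    compressors: Dict[str, str],
) -> str:
    # The per-module format recorded on the scheme selects the compressor directly;
    # nothing is re-inferred at compression time. "dense" is the assumed fallback
    # for modules that carry no format.
    return compressors[scheme_format if scheme_format is not None else "dense"]

# Two modules with different per-module formats resolve to two different compressors.
compressors = {
    "pack-quantized": "int4 pack compressor",
    "float-quantized": "fp8 compressor",
    "dense": "no-op compressor",
}
assert compressor_for_module("pack-quantized", compressors) == "int4 pack compressor"
assert compressor_for_module(None, compressors) == "no-op compressor"
```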
Afaict the only reason we would need to infer the list of quantization formats used in a model is to write them to the config. Since `model_compressor` is responsible for writing to the config, I would argue that the "infer a global quantization tag for the purposes of writing to config" logic should live in model compressor.
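A sketch of the "global tag for the config" logic being argued about, under the assumption that the tag collapses to `mixed-precision` whenever more than one distinct per-module format is present; the function name is illustrative, not the PR's.

```python
from typing import Iterable

MIXED_PRECISION = "mixed-precision"  # the new global tag introduced by this PR
DENSE = "dense"

def global_format_for_config(per_module_formats: Iterable[str]) -> str:
    # Collapse per-module formats into the single value written to config.json.
    # Whether "dense" modules count toward mixedness is a detail of the real
    # implementation; here they are ignored.
    unique = {fmt for fmt in per_module_formats if fmt and fmt != DENSE}
    if not unique:
        return DENSE
    if len(unique) == 1:
        return unique.pop()
    return MIXED_PRECISION

assert global_format_for_config(["pack-quantized", "pack-quantized"]) == "pack-quantized"
assert global_format_for_config(["pack-quantized", "float-quantized"]) == MIXED_PRECISION
```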
If we are going to pass all available formats, why are we then re-inferring afterwards via `_fetch_unique_quantization_formats`? This seems like a potential conflict in the source of truth. Ideally, `scheme.format` should be the source of truth for formats.
🎉 LGTM!
Nice feature. I agree with @kylesayrs's recommendation, plus updating the docstrings and adding a test specifically for mixed-precision compression/decompression.
```diff
 self.quantization_compressor: Optional[
-    Union[BaseQuantizationCompressor, DenseCompressor]
+    Dict[str, Union[BaseQuantizationCompressor, DenseCompressor]]
 ] = None
```
Should we rename this to `self.quantization_compressors` to indicate that it is now a dict? Or is there some reason we can't, e.g. because it's serialized?
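Whatever the attribute ends up being called, here is a sketch of how the dict shape might be populated, one compressor per unique format; the placeholder class and helper name stand in for the real compressor instances and are not the library's API.

```python
from typing import Dict, List

class _PlaceholderCompressor:
    """Stand-in for BaseQuantizationCompressor / DenseCompressor instances."""

    def __init__(self, fmt: str) -> None:
        self.format = fmt

def build_quantization_compressors(formats: List[str]) -> Dict[str, _PlaceholderCompressor]:
    # One entry per unique format, keyed by the format string. Keeping this a dict
    # is what lets a single ModelCompressor serve a mixed-precision model; a plural
    # attribute name would make that shape obvious at call sites.
    return {fmt: _PlaceholderCompressor(fmt) for fmt in dict.fromkeys(formats)}

compressors = build_quantization_compressors(
    ["pack-quantized", "float-quantized", "pack-quantized"]
)
assert set(compressors) == {"pack-quantized", "float-quantized"}
```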
```python
# Note - compress only supports one compression format atm
quant_compressor = next(iter(self.quantization_compressor))
state_dict = quant_compressor.compress(
```
How will we get around this constraint of `compress` only supporting one format?
We will have to expand its functionality. This pathway is no longer used by llm-compressor, so there is no immediate requirement.
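One possible direction for that expansion, sketched with placeholder types rather than the library's actual interfaces: bucket parameters by the format of the module that owns them, let each compressor compress its own slice of the state dict, then merge the results.

```python
from collections import defaultdict
from typing import Dict, Protocol

import torch

class SupportsCompress(Protocol):
    # Stand-in for the relevant compressor interface; not the real signature.
    def compress(self, state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: ...

def compress_state_dict_mixed(
    state_dict: Dict[str, torch.Tensor],
    param_formats: Dict[str, str],             # parameter name -> owning module's format
    compressors: Dict[str, SupportsCompress],  # format -> compressor
) -> Dict[str, torch.Tensor]:
    # Bucket parameters by format.
    buckets: Dict[str, Dict[str, torch.Tensor]] = defaultdict(dict)
    for name, tensor in state_dict.items():
        buckets[param_formats.get(name, "dense")][name] = tensor

    # Compress each bucket with its own compressor and merge.
    compressed: Dict[str, torch.Tensor] = {}
    for fmt, partial in buckets.items():
        compressor = compressors.get(fmt)
        # Formats without a registered compressor (e.g. "dense") pass through unchanged.
        compressed.update(compressor.compress(partial) if compressor else partial)
    return compressed
```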
(Resolved thread on src/compressed_tensors/compressors/model_compressors/model_compressor.py)
```diff
@@ -48,6 +49,7 @@ class QuantizationScheme(BaseModel):
     weights: Optional[QuantizationArgs] = None
     input_activations: Optional[QuantizationArgs] = None
     output_activations: Optional[QuantizationArgs] = None
+    format: Optional[str] = None
```
Missing docstring
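A suggestion for the wording; the fallback behaviour described in the comment below is an assumption drawn from this thread, not a statement of the actual implementation.

```python
from typing import Optional

from pydantic import BaseModel

class QuantizationSchemeSketch(BaseModel):
    """Illustrative subset of QuantizationScheme showing a documented format field."""

    # Compression format to apply to modules matching this scheme, e.g.
    # "pack-quantized" or "float-quantized". When None, the format is presumably
    # inferred for the model as a whole (assumption based on this discussion).
    format: Optional[str] = None
```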
To be clear, we should support users overriding `format`, right? This is useful if, for example, we want to run a model which has been fake-quantized for debugging purposes, and also to support different packing formats.
We can still override using a user-provided quant format; that has not changed.
We don't currently support users providing a per-module format override; that would be a new feature.
Force-pushed from cd324dd to 8b5d4c9.
Seems like there are 3 sources of truth for the quantization format. It'd be nice if something like this were the single source of truth:

```python
def get_model_compression_format(model: torch.nn.Module) -> Set[CompressionFormat]:
    return set(
        getattr_chain(module, "quantization_scheme.format", CompressionFormat.dense)
        for module in model.modules()
    )
```
We still support the global compression format being overwritten, but this is not a common pathway, which is why it was not part of this PR's change for the per-module case. Ideally, we can also update our preset schemes to include the compression formats as well. But again, that is not what this PR is targeting, as it is not our typical user pathway. I agree we can remove
Summary
- Updates `ModelCompressor.quantization_compressor` to now be a dictionary, such that more than one quantization compressor can be supported
- Adds `mixed-precision` as a new `CompressionFormat`; if more than one format is found within the model, `mixed-precision` is set as the model's global format in its `config.json`
- Adds `format` to the `QuantizationScheme` and leverages this per-module format field in order to fetch the appropriate compressor to compress the model
- `ModelCompressor.compress` and `ModelCompressor.decompress` are not updated; only `compress_model` and `decompress_model` currently support this functionality, as compress/decompress essentially only support global formats

Testing:
Next Steps:
Example Updates
New config:
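An illustrative sketch of what a mixed-precision `config.json` could look like under this change; the targets, bit widths, and group layout below are hypothetical and not taken from the PR. The point is the global `"format": "mixed-precision"` plus a per-group `format` mirroring the new `QuantizationScheme.format` field.

```json
{
  "quantization_config": {
    "format": "mixed-precision",
    "config_groups": {
      "group_0": {
        "targets": ["re:.*self_attn.*"],
        "format": "pack-quantized",
        "weights": {
          "num_bits": 4,
          "type": "int",
          "symmetric": true,
          "strategy": "group",
          "group_size": 128
        }
      },
      "group_1": {
        "targets": ["re:.*mlp.*"],
        "format": "float-quantized",
        "weights": {
          "num_bits": 8,
          "type": "float",
          "symmetric": true,
          "strategy": "channel"
        }
      }
    }
  }
}
```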