Update quantization overview and contributor guide doc #2723
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2723
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 538e24f with merge base 6cfa477.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Content looks great! I think we just need to make sure all the tables and code blocks render properly, e.g.
- https://docs-preview.pytorch.org/pytorch/ao/2723/quantization_overview.html
- https://docs-preview.pytorch.org/pytorch/ao/2723/contributor_guide.html
Also, one thing I think is missing is a section or paragraph on the status of AffineQuantizedTensor. It still powers most of our existing quantization configs, but I think we want to move away from using it for new configs, is that right? Maybe we should clarify this distinction, otherwise users may be confused about which tensor subclass to use.
:toctree: generated/
:nosignatures:

TorchAOBaseTensor
not related to this PR but should this be in torchao.core instead? That's where we have AOBaseConfig today
yeah probably, we can move this I think
Layout/TensorImpl
~~~~~~~~~~~~~~~~~

KernelPreference
Need to document this in one of the api_refs as well
Int4Tensor             scaled int4    plain (pack 2 adjacent int4 to a single int8 value)     int4 weight only quantization
Int4PreshuffledTensor  scaled int4    preshuffled (special format to optimize for loading)    float8 act + int4 weight dynamic quantization
                                                                                              int4 weight only quantization
====================== ============== ====================================================== ===============================================
this table isn't rendering
OK metamate lied to me
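(Side note for readers: a minimal sketch of how the int4 weight tensors in the table above are typically produced through the config API, assuming `Int4WeightOnlyConfig` as exported from `torchao.quantization`; the exact arguments are illustrative.)

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Toy model; int4 weight-only quantization generally expects bf16 weights on GPU.
model = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).to("cuda")

# After this call, each nn.Linear weight is replaced by an int4 weight tensor
# subclass (e.g. a plain or preshuffled packing, as in the table above).
quantize_(model, Int4WeightOnlyConfig(group_size=128))

x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
y = model(x)
```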
.. note::
   We also don't use "dynamic activation" in the name: since these names describe the weight tensor object, including information about the activation in the tensor subclass name would be confusing. We do, however, implement both weight only and dynamic activation quantization in the same linear function implementation, without relying on additional abstractions; this keeps the relevant quantization operations close.
What does this mean for to_linear_activation_quantized? New features should not use that anymore, right?
yeah, we don't use this anymore, to reduce the number of abstractions, it only requires a few lines of additional code in each tensor subclass
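(For anyone following along, a purely illustrative sketch of the idea, with hypothetical names like `act_quant_kwargs` and `_dynamic_act_quant`; the point is that the activation quantization step lives inside the weight subclass's linear implementation rather than in a separate `to_linear_activation_quantized` wrapper.)

```python
import torch
import torch.nn.functional as F

def _dynamic_act_quant(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical per-tensor dynamic float8 "fake quant" of the activation,
    # just to show where the step happens.
    scale = x.abs().amax() / torch.finfo(torch.float8_e4m3fn).max
    return (x / scale).to(torch.float8_e4m3fn).to(x.dtype) * scale

def quantized_linear(x, weight_tensor, bias=None):
    # weight_tensor stands in for a torchao weight tensor subclass. If it carries
    # activation quantization settings, quantize the activation here (dynamic
    # activation + weight quantization); otherwise this is weight-only quantization.
    if getattr(weight_tensor, "act_quant_kwargs", None) is not None:
        x = _dynamic_act_quant(x)
    w = weight_tensor.dequantize()  # real kernels would dispatch to a fused op instead
    return F.linear(x, w, bias)
```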
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To connect everything together, here is a more detailed walkthrough of float8 dynamic activation and float8 weight quantization in torchao (DEFAULT kernel preference, on H100, when the fbgemm_gpu_genai library is installed):

Quantization Flow: quantize_(model, Float8DynamicActivationFloat8WeightConfig())
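As a concrete sketch of that flow, assuming an H100-class GPU and the config/API names quoted above:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig

# Toy model in bf16 on GPU; float8 kernels need recent hardware (e.g. H100).
model = nn.Sequential(nn.Linear(2048, 2048)).to(torch.bfloat16).to("cuda")

# Swaps each nn.Linear weight for a float8 weight tensor subclass; activations
# are quantized to float8 dynamically at runtime inside the linear implementation.
quantize_(model, Float8DynamicActivationFloat8WeightConfig())

x = torch.randn(16, 2048, dtype=torch.bfloat16, device="cuda")
y = model(x)
```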
not super related but I notice we have these two configs:
Float8DynamicActivationFloat8WeightConfig
Float8ActivationInt4WeightConfig
Should we drop "Dynamic" from the first one to be more consistent?
oh, we should add Dynamic to the second one I feel
there is static activation right now actually:
ao/torchao/quantization/quant_api.py (line 1865 in 510e1b4):
class Float8StaticActivationFloat8WeightConfig(AOBaseConfig):
yeah just saw this, either way looks good to me as long as we are consistent
* ``torch.uint1`` to ``torch.uint7`` available in pytorch 2.3 and later
* ``torch.int1`` to ``torch.int7`` available in pytorch 2.6 and later
* ``torch.float4_e2m1fn_x2``, ``torch.float8_e4m3fn``, ``torch.float8_e4m3fnuz``, ``torch.float8_e5m2``, ``torch.float8_e5m2fnuz``, ``torch.float8_e8m0fnu``
should we also mention the MX dtypes here and mark them as prototype? (and in the ascii art)
MX is using torch.float8_e8m0fnu for the scale and then torch.float4_e2m1fn_x2 and torch.float8_e4m3fn (or some other fp8 dtypes) for the data, I think. cc @drisspg to confirm
ah I see, maybe just mention the high-level dtypes in the ascii art then (e.g. mxfp4, mxfp6, mxfp8, nvfp4)?
these will be defined in tensor subclasses directly, we are listing the pytorch dtypes here, I can add a note about this though
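A tiny illustration of how the building-block dtypes listed above relate to the MX formats discussed in this thread (purely illustrative; the actual MX/NVFP4 support lives in prototype tensor subclasses, and this assumes PyTorch 2.7+ for `torch.float8_e8m0fnu`):

```python
import torch

# An MX-style block pairs low-precision data elements with a shared
# power-of-two (e8m0) scale per block of 32 elements.
data = torch.randn(32, dtype=torch.bfloat16).to(torch.float8_e4m3fn)  # quantized data elements
scale = torch.empty(1, dtype=torch.float8_e8m0fnu)                    # per-block scale storage
print(data.dtype, scale.dtype)
```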
Summary:
We have recently updated our design for structuring tensor subclasses in torchao to remove unnecessary abstractions, reduce indirection, and provide a structure that aligns better with people's intuitive understanding of the different quantization use cases. Examples using the new design: #2463, #2687

Test Plan:
check generated doc