Commit 661cf8f

Improve superscript notes readability
Signed-off-by: Thara Palanivel <[email protected]>
Parent: e77c693


examples/QAT_INT8/README.md

Lines changed: 10 additions & 6 deletions
@@ -96,12 +96,10 @@ Checkout [Example Test Results](#example-test-results) to compare against your r
 
 ## Example Test Results
 
-For comparison purposes, here are some of the results we found during testing when tested with PyTorch 2.3.1:
+For comparison purposes, here are some of the results we found during testing when tested with `PyTorch 2.3.1`:
 
-- Accuracy could vary ~ +-0.2 from run to run.
-- `INT8` matmuls are ~2x faster than `FP16` matmuls, However, `INT8` models will have additional overhead compared to `FP16` models. For example, converting FP tensors to INT before INT matmul.
-- Each of these additional quantization operations is relatively 'cheap', but the overhead of launching each job is not negligible. Using `torch.compile` can fuse the Ops and reduce the total number of jobs launching.
-- `CUDAGRAPH` is the most effective way to minimize job launching overheads and can achieve ~2X end-to-end speed-up in this case. However, there seems to be bugs associated with this option at the moment. Further investigation is still on-going.
+> [!NOTE]
+> Accuracy could vary ~ +-0.2 from run to run.
 
 |model|batchsize|torch.compile|accuracy(F1)|inference speed (msec)|
 |----|--:|---------:|----:|------------:|
@@ -110,7 +108,13 @@ For comparison purposes, here are some of the results we found during testing wh
 | |128|CUDAGRAPH | |71.13|
 |INT8|128|eager |88.33|329.45 <sup>1</sup>|
 | |128|Inductor |88.42|67.87 <sup>2</sup>|
-| |128|CUDAGRAPH |-- |*** <sup>3</sup>|
+| |128|CUDAGRAPH |-- |-- <sup>3</sup>|
+
+<sup>1</sup> `INT8` matmuls are ~2x faster than `FP16` matmuls. However, `INT8` models will have additional overhead compared to `FP16` models. For example, converting FP tensors to INT before INT matmul.
+
+<sup>2</sup> Each of these additional quantization operations is relatively 'cheap', but the overhead of launching each job is not negligible. Using `torch.compile` can fuse the Ops and reduce the total number of jobs being launched.
+
+<sup>3</sup> `CUDAGRAPH` is the most effective way to minimize job launching overheads and can achieve ~2X end-to-end speed-up in this case. However, there seem to be bugs associated with this option at the moment. Further investigation is still on-going.
 
 ## Code Walkthrough
 
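The superscript notes in this diff refer to the per-tensor FP-to-INT8 conversions that run around each `INT8` matmul. As a rough illustration only, here is a minimal sketch assuming plain symmetric per-tensor quantization; the QAT_INT8 example defines its own quantization scheme, so the function names and shapes below are placeholders:

```python
import torch

# Hypothetical per-tensor symmetric INT8 quantization, for illustration only;
# the QAT_INT8 example defines its own quantization scheme.
def quantize_int8(x: torch.Tensor):
    # Each step below (amax, divide, round, clamp, cast) is a cheap op,
    # but each one is a separate kernel launch in eager mode.
    scale = x.abs().amax() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Converting back to floating point after the INT8 matmul adds
    # yet another small launch.
    return q.to(torch.float32) * scale

x = torch.randn(128, 768)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
print(q.dtype, x_hat.dtype, (x - x_hat).abs().max())
```

In eager mode these small launches surround every quantized matmul, which is why the eager `INT8` row is slower end-to-end despite the faster matmuls; `torch.compile` (Inductor) can fuse them into far fewer kernels.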
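For the Inductor and CUDAGRAPH rows, here is a minimal timing sketch, assuming a CUDA device and assuming that CUDAGRAPH corresponds to `torch.compile`'s `mode="reduce-overhead"` (which captures CUDA Graphs); the model, shapes, and loop are placeholders, not the example's actual benchmark harness:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the quantized model in the table above.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).cuda().eval()
x = torch.randn(128, 768, device="cuda")

# "Inductor": fuses the small quantization ops into fewer kernels.
compiled = torch.compile(model)
# "CUDAGRAPH" (assumed mapping): additionally replays the step as a CUDA Graph,
# removing most per-kernel launch overhead.
graphed = torch.compile(model, mode="reduce-overhead")

with torch.inference_mode():
    for label, fn in (("Inductor (default)", compiled), ("reduce-overhead", graphed)):
        for _ in range(3):  # warm-up triggers compilation / graph capture
            fn(x)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(100):
            fn(x)
        end.record()
        torch.cuda.synchronize()
        print(f"{label}: {start.elapsed_time(end) / 100:.3f} ms/iter")
```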