examples/QAT_INT8/README.md
Check out [Example Test Results](#example-test-results) to compare against your results.
## Example Test Results
For comparison purposes, here are some of the results we observed when testing with `PyTorch 2.3.1`:
- Accuracy can vary by roughly ±0.2 from run to run.

||128|CUDAGRAPH ||71.13|
|INT8|128|eager |88.33|329.45 <sup>1</sup>|
||128|Inductor |88.42|67.87 <sup>2</sup>|
||128|CUDAGRAPH |-- |-- <sup>3</sup>|
<sup>1</sup> `INT8` matmuls are ~2x faster than `FP16` matmuls. However, `INT8` models incur additional overhead that `FP16` models do not, such as converting FP tensors to INT8 before each INT8 matmul.
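
To make that overhead concrete, here is a minimal sketch of a per-tensor symmetric INT8 matmul in plain PyTorch. The function name is ours, and the portable int32 matmul stands in for the fused INT8 CUDA kernels used in practice; every step before the matmul is an extra operation that an `FP16` path does not need.

```python
import torch

def int8_matmul_sketch(a_fp: torch.Tensor, b_fp: torch.Tensor) -> torch.Tensor:
    # Per-tensor symmetric scales: two extra reductions per matmul.
    a_scale = a_fp.abs().max().clamp(min=1e-8) / 127.0
    b_scale = b_fp.abs().max().clamp(min=1e-8) / 127.0

    # Round, clamp, and cast to INT8: several more small elementwise kernels.
    a_i8 = torch.clamp(torch.round(a_fp / a_scale), -128, 127).to(torch.int8)
    b_i8 = torch.clamp(torch.round(b_fp / b_scale), -128, 127).to(torch.int8)

    # Integer matmul accumulating in int32 (stand-in for the real INT8 kernel).
    acc = a_i8.to(torch.int32) @ b_i8.to(torch.int32)

    # Dequantize the int32 accumulator back to floating point.
    return acc.to(torch.float32) * (a_scale * b_scale)

out = int8_matmul_sketch(torch.randn(64, 128), torch.randn(128, 256))
print(out.shape)  # torch.Size([64, 256])
```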
<sup>2</sup> Each of these additional quantization operations is relatively cheap on its own, but the overhead of launching each kernel is not negligible. Using `torch.compile` can fuse the ops and reduce the total number of kernels launched.
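
As a minimal sketch of how that fusion is enabled, using a hypothetical stand-in for this example's quantized model, `torch.compile` with its default Inductor backend fuses chains of small elementwise ops into the surrounding kernels:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the quantized model used in this example.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU()).eval()

# The default Inductor backend fuses chains of small elementwise ops
# (scale, round, clamp, cast), so each forward pass launches far fewer kernels.
compiled_model = torch.compile(model)

with torch.no_grad():
    out = compiled_model(torch.randn(8, 128))
```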
<sup>3</sup> `CUDAGRAPH` is the most effective way to minimize kernel launch overhead and can achieve a ~2X end-to-end speed-up in this case. However, there appear to be bugs associated with this option at the moment, and further investigation is ongoing.
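
For reference, one way to try CUDA graphs is `torch.compile`'s `reduce-overhead` mode, which records the compiled kernels into a CUDA graph and replays them with a single launch. This sketch assumes a CUDA device and the same hypothetical stand-in model as above; whether it hits the same bugs as this example's `CUDAGRAPH` option has not been verified.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU()).eval().cuda()

# "reduce-overhead" asks Inductor to capture the compiled forward pass into
# CUDA graphs; replaying a recorded graph avoids per-kernel launch cost.
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 128, device="cuda")
with torch.no_grad():
    for _ in range(3):  # a few warm-up iterations before graphs are captured
        out = compiled_model(x)
```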