This adds the pyspelling spell-check automation tool. pyspelling is a wrapper around the CLI of Aspell or Hunspell, which are spell checkers. The PR pins pyspelling to Aspell, because the two spell checkers can differ in their output; specifying Aspell keeps the results consistent.

Closes #31
Signed-off-by: Martin Hickey <[email protected]>
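For reference, a pyspelling task can be pinned to Aspell roughly as in the sketch below. The file name, word list path, compiled dictionary output, and source glob are illustrative assumptions and are not taken from this PR.

```yaml
# Hypothetical .pyspelling.yml, a minimal sketch only; not the configuration added by this PR.
matrix:
- name: markdown
  # Pin the backend so results do not depend on whether Aspell or Hunspell is used.
  spellchecker: aspell
  aspell:
    lang: en
  dictionary:
    wordlists:
    - .spellcheck-en.txt      # assumed project word list of accepted technical terms
    output: build/wordlist.dic  # assumed location for the compiled dictionary
  pipeline:
  - pyspelling.filters.markdown:
  sources:
  - '**/*.md'
```

With a configuration along these lines, running `pyspelling` from the repository root would check the matched Markdown files and report any words Aspell does not recognize.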
docs/fms_mo_design.md (3 additions, 3 deletions)

@@ -37,9 +37,9 @@ The quantization process can be illustrated in the following plots:
 ### Quantization-aware training (QAT)

-In order to accommodate the quantization errors, one straightfoward technique is to take quantization/dequantization into account during the training process, hence the name quantization-aware training [(QAT)](https://arxiv.org/pdf/1712.05877), as illustrated by Step 1 of the following figure. The training optimizer will then adjust the parameters of the model, e.g. weights, accordingly so that the resulting accuracy will be comparable to the original FP32 model.
+In order to accommodate the quantization errors, one straightforward technique is to take quantization/dequantization into account during the training process, hence the name quantization-aware training [(QAT)](https://arxiv.org/pdf/1712.05877), as illustrated by Step 1 of the following figure. The training optimizer will then adjust the parameters of the model, e.g. weights, accordingly so that the resulting accuracy will be comparable to the original FP32 model.

-There are many other techniques, such as post-training quantization ([PTQ](https://arxiv.org/abs/2102.05426)), that can achieve similar outcome. Users will need to pick the proper method for their specific task based on model size, dataset size, resource available, and other consideraions.
+There are many other techniques, such as post-training quantization ([PTQ](https://arxiv.org/abs/2102.05426)), that can achieve similar outcome. Users will need to pick the proper method for their specific task based on model size, dataset size, resource available, and other considerations.
@@ -91,7 +91,7 @@ For generative LLMs, very often the bottleneck of inference is no longer the com
 The key architectural components are:
 1. **`model_analyzer`**, which traces the model and identifies the layers/operations to be quantized or to be skipped. It will try to recognize several well-known structures and configure based on best practice. However, users could also choose to bypass the tracing and manually specify the desired configuration with full flexibility.
-2. **A set of `wrappers`**. As shown in the figure above, the preparation for QAT and deployment can be viewed as a "layer swapping" process. One could identify a desired `torch.nn.Linear` layer to be quantized, e.g. Linear1 in the plot, and replace it with a `QLinear` wrapper, which contains a set of `quantizers` that can quantize/dequantize the inputs and weights before the Linear operation. Similarly, the `QLinear` wrapper for deployment stage will quantize the inputs, perform INT matmul, then dequantize the outcome. It is mathmatically equivalanet to the wrapper used in QAT, but it can utilize the INT compute engine.
+2. **A set of `wrappers`**. As shown in the figure above, the preparation for QAT and deployment can be viewed as a "layer swapping" process. One could identify a desired `torch.nn.Linear` layer to be quantized, e.g. Linear1 in the plot, and replace it with a `QLinear` wrapper, which contains a set of `quantizers` that can quantize/dequantize the inputs and weights before the Linear operation. Similarly, the `QLinear` wrapper for deployment stage will quantize the inputs, perform INT matmul, then dequantize the outcome. It is mathematically equivalent to the wrapper used in QAT, but it can utilize the INT compute engine.
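As an aside on the wrapper concept described in the hunk above: the quantize/dequantize ("fake quantization") idea can be sketched in plain PyTorch as below. This is a self-contained illustration, not fms_mo's actual `QLinear` implementation; the class name, the simple symmetric per-tensor quantizer, and the swap-by-index step are all assumptions made for the example.

```python
# Minimal sketch of a fake-quantized Linear wrapper (illustrative only, not fms_mo's QLinear).
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quant(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Quantize then immediately dequantize a tensor (symmetric, per-tensor).

    A real QAT quantizer would also use a straight-through estimator so that
    gradients can flow through the rounding step.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale


class FakeQuantLinear(nn.Module):
    """Wraps an existing nn.Linear and fake-quantizes its inputs and weights."""

    def __init__(self, linear: nn.Linear, n_bits: int = 8):
        super().__init__()
        self.linear = linear
        self.n_bits = n_bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_q = fake_quant(x, self.n_bits)
        w_q = fake_quant(self.linear.weight, self.n_bits)
        return F.linear(x_q, w_q, self.linear.bias)


# "Layer swapping": replace a chosen Linear layer with its wrapped version.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
model[0] = FakeQuantLinear(model[0])
out = model(torch.randn(2, 16))
```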
examples/FP8_QUANT/README.md (2 additions, 2 deletions)

@@ -24,7 +24,7 @@ This is an example of mature FP8, which under the hood leverages some functional
 > [!CAUTION]
 > `vllm` may require a specific PyTorch version that is different from what is installed in your current environment and it may force install without asking. Make sure it's compatible with your settings or create a new environment if needed.

-## Quickstart
+## QuickStart

 This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms with FP8 being the focus of this example. The steps involved are:

 1. **FP8 quantization through CLI**. Other arguments could be found here [FP8Args](../../fms_mo/training_args.py#L84).
@@ -88,7 +88,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 |||none | 5|perplexity|↓ |3.8915|± |0.3727|

-## Code Walkthrough
+## Code Walk-through

 1. The non-quantized pre-trained model is loaded using model wrapper from `llm-compressor`. The corresponding tokenizer is constructed as well.
examples/GPTQ/README.md (2 additions, 2 deletions)

@@ -13,7 +13,7 @@ For generative LLMs, very often the bottleneck of inference is no longer the com
-## Quickstart
+## QuickStart

 This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms with GPTQ being the focus of this example. The steps involved are:

 1. **Convert the dataset into its tokenized form.** An example of tokenization using `LLAMA-3-8B`'s tokenizer is below.
@@ -109,7 +109,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 > There is some randomness in generating the model and data, the resulting accuracy may vary ~$\pm$ 0.05.

-## Code Walkthrough
+## Code Walk-through

 1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)
examples/PTQ_INT8/README.md (2 additions, 2 deletions)

@@ -15,7 +15,7 @@ This is an example of [block sequential PTQ](https://arxiv.org/abs/2102.05426).
 - `PyTorch 2.3.1` (as newer version will cause issue for the custom CUDA kernel)

-## Quickstart
+## QuickStart

 > [!NOTE]
 > This example is based on the HuggingFace [Transformers Question answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering). Unlike our [QAT example](../QAT_INT8/README.md), which utilizes the training loop of the original code, our PTQ function will control the loop and the program will end before entering the original loop. Make sure the model doesn't get "tuned" twice!
@@ -106,7 +106,7 @@ The table below shows results obtained for the conditions listed:
 `Nouterloop` and `ptq_nbatch` are PTQ specific hyper-parameter.
 Above experiments were run on v100 machine.

-## Code Walkthrough
+## Code Walk-through

 In this section, we will deep dive into what happens during the example steps.
examples/QAT_INT8/README.md (3 additions, 3 deletions)

@@ -23,7 +23,7 @@ In the following example, we will first create a fine-tuned FP16 model, and then
 - `PyTorch 2.3.1` (as newer version will cause issue for the custom CUDA kernel)

-## Quickstart
+## QuickStart

 > [!NOTE]
 > This example is based on the HuggingFace [Transformers Question answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering).
@@ -101,7 +101,7 @@ For comparison purposes, here are some of the results we found during testing wh
@@ -116,7 +116,7 @@ For comparison purposes, here are some of the results we found during testing wh
 <sup>3</sup> `CUDAGRAPH` is the most effective way to minimize job launching overheads and can achieve ~2X end-to-end speed-up in this case. However, there seem to be bugs associated with this option at the moment. Further investigation is still on-going.

-## Code Walkthrough
+## Code Walk-through

 In this section, we will deep dive into what happens during the example steps.