Commit 9172b2b

Merge branch 'main' into dependabot/pip/torch-gte-2.2.0-and-lt-2.8
Signed-off-by: chichun-charlie-liu <[email protected]>
2 parents 044fb1f + c920911 commit 9172b2b


63 files changed (+6547 / -1037 lines); only a subset of the changed files is shown below.

.github/pull_request_template.md

Lines changed: 17 additions & 5 deletions
@@ -4,16 +4,28 @@

 <!-- Please summarize the changes -->

-### Related issue number
+### Related issues or PRs

-<!-- For example: "Closes #1234" -->
+<!-- For example: "Closes #1234" or "Fixes bug introduced in #5678" -->

 ### How to verify the PR

-<!-- Please provide instruction or screenshots on how to verify the PR.-->
+<!-- Please provide instruction or screenshots on how to verify the PR if unit tests do not provide coverage.-->

 ### Was the PR tested

 <!-- Describe how PR was tested -->
-- [ ] I have added >=1 unit test(s) for every new method I have added.
-- [ ] I have ensured all unit tests pass
+- [ ] I have added >=1 unit test(s) for every new method I have added (if that coverage is difficult, please briefly explain the reason)
+- [ ] I have ensured all unit tests pass
+
+### Checklist for passing CI/CD:
+
+<!-- Mark completed tasks with "- [x]" -->
+- [ ] All commits are signed showing "Signed-off-by: Name \<[email protected]\>" with `git commit --signoff` or equivalent
+- [ ] PR title and commit messages adhere to [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/)
+- [ ] Contribution is formatted with `tox -e fix`
+- [ ] Contribution passes linting with `tox -e lint`
+- [ ] Contribution passes spellcheck with `tox -e spellcheck`
+- [ ] Contribution passes all unit tests with `tox -e unit`
+
+Note: CI/CD performs unit tests on multiple versions of Python from a fresh install. There may be differences with your local environment and the test environment.

.gitignore

Lines changed: 7 additions & 2 deletions
@@ -14,8 +14,8 @@ htmlcov/
 durations/*
 coverage*.xml
 qcfg.json
-models
 configs
+pytest.out

 # IDEs
 .vscode/
@@ -45,4 +45,9 @@ fms_mo.log
 data*_train/
 data*_test/
 act_scales/
-examples/
+examples/**/*.json
+examples/**/*.safetensors
+examples/**/*.log
+examples/**/*.sh
+examples/**/*.pt
+examples/**/*.arrow

.pylintrc

Lines changed: 4 additions & 4 deletions
@@ -63,9 +63,9 @@ ignore-patterns=^\.#
 # (useful for modules/projects where namespaces are manipulated during runtime
 # and thus existing member attributes cannot be deduced by static analysis). It
 # supports qualified module names, as well as Unix pattern matching.
-ignored-modules=auto_gptq,
-exllama_kernels,
-exllamav2_kernels,
+ignored-modules=gptqmodel,
+gptqmodel_exllama_kernels,
+gptqmodel_exllamav2_kernels,
 llmcompressor,
 cutlass_mm,
 pygraphviz,
@@ -94,7 +94,7 @@ persistent=yes

 # Minimum Python version to use for version dependent checks. Will default to
 # the version used to run pylint.
-py-version=3.9
+py-version=3.10

 # Discover python modules and packages in the file system subtree.
 recursive=no

.spellcheck-en-custom.txt

Lines changed: 22 additions & 3 deletions
@@ -1,8 +1,11 @@
 activations
 acc
 ADR
+aiu
+AIU
+Spyre
+spyre
 Args
-AutoGPTQ
 autoregressive
 backpropagation
 bmm
@@ -23,17 +26,20 @@ dequantization
 dq
 DQ
 dev
+dtype
 eval
 fms
+fmsmo
 fp
 FP
 FP8Arguments
 frac
 gptq
 GPTQ
 GPTQArguments
+GPTQModel
+gptqmodel
 graphviz
-GPTQ
 hyperparameters
 Inductor
 inferenced
@@ -91,8 +97,11 @@ quantizes
 Quantizing
 QW
 rceil
+recomputation
 repo
 representable
+roberta
+RoBERTa
 runtime
 Runtime
 SAWB
@@ -112,9 +121,19 @@ Tokenizer
 toml
 triton
 Unquantized
+utils
 vals
 venv
 vllm
 xs
 zp
-
+microxcaling
+Microscaling
+microscaling
+MX
+mx
+MXINT
+mxint
+MXFP
+mxfp
+OCP

README.md

Lines changed: 24 additions & 1 deletion
@@ -42,7 +42,7 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
 *Optional packages based on optimization functionality required:*

 - **GPTQ** is a popular compression method for LLMs:
-  - [auto_gptq](https://pypi.org/project/auto-gptq/) or build from [source](https://github.com/AutoGPTQ/AutoGPTQ)
+  - [gptqmodel](https://pypi.org/project/gptqmodel/) or build from [source](https://github.com/ModelCloud/GPTQModel)
 - If you want to experiment with **INT8** deployment in [QAT](./examples/QAT_INT8/) and [PTQ](./examples/PTQ_INT8/) examples:
   - Nvidia GPU with compute capability > 8.0 (A100 family or higher)
   - Option 1:
@@ -98,6 +98,29 @@ cd fms-model-optimizer
 pip install -e .
 ```

+#### Optional Dependencies
+The following optional dependencies are available:
+- `fp8`: `llmcompressor` package for fp8 quantization
+- `gptq`: `GPTQModel` package for W4A16 quantization
+- `mx`: `microxcaling` package for MX quantization
+- `opt`: Shortcut for `fp8`, `gptq`, and `mx` installs
+- `aiu`: `ibm-fms` package for AIU model deployment
+- `torchvision`: `torch` package for image recognition training and inference
+- `triton`: `triton` package for matrix multiplication kernels
+- `examples`: Dependencies needed for examples
+- `visualize`: Dependencies for visualizing models and performance data
+- `test`: Dependencies needed for unit testing
+- `dev`: Dependencies needed for development
+
+To install an optional dependency, modify the `pip install` commands above with a list of these names enclosed in brackets. The example below installs `llm-compressor` and `torchvision` with FMS Model Optimizer:
+
+```shell
+pip install fms-model-optimizer[fp8,torchvision]
+
+pip install -e .[fp8,torchvision]
+```
+If you have already installed FMS Model Optimizer, then only the optional packages will be installed.
+
 ### Try It Out!

 To help you get up and running as quickly as possible with the FMS Model Optimizer framework, check out the following resources which demonstrate how to use the framework with different quantization techniques:

docs/fms_mo_design.md

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:

 ### GPTQ (weight-only compression, or sometimes referred to as W4A16)

-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `auto_gptq` package. See this [example](../examples/GPTQ/)
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `gptqmodel` package. See this [example](../examples/GPTQ/)

 ## Specification
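
For context, a minimal W4A16 weight-only flow with the `gptqmodel` package might look like the sketch below. It is illustrative only and not taken from this commit: the model id and calibration texts are placeholders, and the `GPTQModel.load` / `QuantizeConfig` names follow recent GPTQModel releases, so treat the exact signatures as assumptions.

```python
# Illustrative sketch of GPTQ W4A16 compression with gptqmodel.
# Model id, calibration texts, and exact API signatures are assumptions.
from gptqmodel import GPTQModel, QuantizeConfig

calibration_texts = [
    "FMS Model Optimizer reduces model size for faster serving.",
    "GPTQ compresses weights to 4 bits while activations stay in FP16.",
]

quant_config = QuantizeConfig(bits=4, group_size=128)  # common W4A16 setting

model = GPTQModel.load("facebook/opt-125m", quant_config)  # placeholder model
model.quantize(calibration_texts)   # run GPTQ calibration and pack weights
model.save("opt-125m-w4a16")        # compressed checkpoint for serving
```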

examples/AIU_CONVERSION/README.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@ (new file; all lines added)

# Train and prepare INT8 checkpoint for the AIU using Direct Quantization
This example builds on the [Direct Quantization (DQ) example](../DQ_SQ/README.md). We assume the user is already familiar with the DQ quantization process and would like to generate an INT8-quantized checkpoint that is made compliant with the requirements of the AIU/Spyre accelerator.

Once created, this checkpoint can be run on the AIU by using an inference script from [aiu-fms-testing-utils](https://github.com/foundation-model-stack/aiu-fms-testing-utils).

For more information on the AIU/Spyre accelerator, see the following blogs:
- [Introducing the IBM Spyre AI Accelerator chip](https://research.ibm.com/blog/spyre-for-z)
- [IBM Power modernizes infrastructure and accelerates innovation with AI in the year ahead](https://newsroom.ibm.com/blog-ibm-power-modernizes-infrastructure-and-accelerates-innovation-with-ai-in-the-year-ahead)

## Requirements
- [FMS Model Optimizer requirements](../../README.md#requirements)

## QuickStart

**1. Prepare Data** as per the DQ quantization process ([link](../DQ_SQ/README.md)). In this example, we assume the user wants to quantize the RoBERTa-base model and has thus prepared the DQ data for it, stored under the folders `data_train` and `data_test`, by adapting the DQ example accordingly.

**2. Apply DQ with conversion** by providing the desired quantization parameters, as well as the flags `--save_ckpt_for_aiu` and `--recompute_narrow_weights`.

```bash
python -m fms_mo.run_quant \
    --model_name_or_path "roberta-base" \
    --training_data_path data_train \
    --test_data_path data_test \
    --torch_dtype "float16" \
    --quant_method dq \
    --nbits_w 8 \
    --nbits_a 8 \
    --nbits_kvcache 32 \
    --qa_mode "pertokenmax" \
    --qw_mode "maxperCh" \
    --qmodel_calibration_new 1 \
    --output_dir "dq_test" \
    --save_ckpt_for_aiu \
    --recompute_narrow_weights
```
> [!TIP]
> - In this example, we are not evaluating the perplexity of the quantized model, but, if so desired, the user can add the `--eval_ppl` flag.
> - We set a single calibration example because the quantizers in use do not need calibration: weights remain static during DQ, so a single example will initialize the quantizer correctly, and the activation quantizer `pertokenmax` will dynamically recompute the quantization range at inference time, when running on the AIU.

**3. Reload checkpoint for testing** and validate its content (optional).

```python
import torch

sd = torch.load("dq_test/qmodel_for_aiu.pt", weights_only=True)
```

Check that all quantized layers have been converted to `torch.int8`, while the rest are `torch.float16`.

```python
# select quantized layers by name
roberta_qlayers = ["attention.self.query", "attention.self.key", "attention.self.value", "attention.output.dense", "intermediate.dense", "output.dense"]
# assert all quantized weights are int8
assert all(v.dtype == torch.int8 for k,v in sd.items() if any(n in k for n in roberta_qlayers) and k.endswith(".weight"))
# assert all other parameters are fp16
assert all(v.dtype == torch.float16 for k,v in sd.items() if all(n not in k for n in roberta_qlayers) or not k.endswith(".weight"))
```

> [!TIP]
> - We have trained the model with a symmetric quantizer for activations (`qa_mode`). If an asymmetric quantizer is used, then the checkpoint will also carry a `zero_shift` parameter, which is torch.float32, so this validation step should be modified accordingly.

Because we have used the `narrow_weight_recomputation` option along with a `maxperCh` (max per-channel) quantizer for weights, the distributions of the INT weight matrices have been widened. Most per-channel standard deviation values should surpass the empirical threshold of 20.

```python
[f"{v.to(torch.float32).std(dim=-1).mean():.4f}" for k,v in sd.items() if k.endswith(".weight") and any(n in k for n in roberta_qlayers)]
```

> [!TIP]
> - We cast the torch.int8 weights to torch.float32 to be able to apply the torch.std function.
> - For per-channel weights, the recomputation is applied per channel. Here we print a mean across channels to aid visualization.
> - There is no guarantee that the recomputed weights will exceed the empirical threshold after recomputation, but it is the case for several common models of the BERT, RoBERTa, Llama, and Granite families.
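
As a small extension of that check (a sketch, not part of this commit; it reuses the `dq_test/qmodel_for_aiu.pt` path and the layer names from this example), one could report which channels fall below the empirical threshold of 20:

```python
# Sketch: per-channel std of quantized weights vs. the empirical threshold of 20.
# Checkpoint path and layer-name patterns are taken from the example above.
import torch

THRESHOLD = 20.0
roberta_qlayers = ["attention.self.query", "attention.self.key", "attention.self.value",
                   "attention.output.dense", "intermediate.dense", "output.dense"]

sd = torch.load("dq_test/qmodel_for_aiu.pt", weights_only=True)
for name, w in sd.items():
    if name.endswith(".weight") and any(n in name for n in roberta_qlayers):
        per_ch_std = w.to(torch.float32).std(dim=-1)  # one std per output channel
        below = int((per_ch_std < THRESHOLD).sum())
        print(f"{name}: mean std {per_ch_std.mean():.2f}, {below} channel(s) below {THRESHOLD}")
```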

examples/FP8_QUANT/README.md

Lines changed: 2 additions & 1 deletion
@@ -92,7 +92,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m

 ```python
 from llmcompressor.modifiers.quantization import QuantizationModifier
-from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+from llmcompressor.transformers import SparseAutoModelForCausalLM
+from llmcompressor import oneshot

 model = SparseAutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, torch_dtype=model_args.torch_dtype)
 tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
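
For orientation, the two imports above are typically combined roughly as in the sketch below (not part of this commit; the model id, the `FP8_DYNAMIC` scheme string, and the save path are assumptions based on common llm-compressor usage):

```python
# Illustrative FP8 one-shot quantization with llm-compressor; model id,
# scheme, and output path are assumptions, not taken from this commit.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder for model_args.model_name_or_path
model = SparseAutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize Linear layers to FP8, leaving the output head in higher precision
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the recipe in one shot; dynamic FP8 needs no calibration data
oneshot(model=model, recipe=recipe)

model.save_pretrained("opt-125m-fp8-dynamic")
tokenizer.save_pretrained("opt-125m-fp8-dynamic")
```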
