
Commit 46be7f7

Merge branch 'main' into ci/add-mypy-fixes
Signed-off-by: chichun-charlie-liu <[email protected]>
2 parents: b0ea939 + 50b6ce3


46 files changed: +1723 additions, -360 deletions

.github/workflows/labelpr.yaml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+name: Label PRs
+
+on:
+  pull_request_target:
+    types: [opened, edited, synchronize, reopened]
+
+jobs:
+  label_pr:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/github-script@v7
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          script: |
+            // https://github.com/commitizen/conventional-commit-types
+            const valid_pr_types = ['feat', 'fix', 'docs', 'style', 'refactor', 'perf', 'test', 'build', 'ci', 'chore', 'revert', 'dependencies'];
+
+
+            const title = context.payload.pull_request.title;
+            const results = /^(\w+)(\(\w+\))?!?:/.exec(title);
+            if (results === null) return core.setFailed(`The title does not follow conventional commits spec: https://www.conventionalcommits.org/en/v1.0.0/#summary Title: ${title}`);
+
+            const pr_type = results[1];
+            core.info(`pr_type: ${pr_type}`);
+
+            if (!valid_pr_types.includes(pr_type)) return core.setFailed(`Unknown pull request type: ${pr_type}`);
+
+            const labels = context.payload.pull_request.labels;
+            const new_labels = labels.filter(label => !valid_pr_types.includes(label.name)); // keep all labels that are not in valid_pr_types
+            new_labels.push({name: pr_type});
+            await github.rest.issues.update({ ...context.repo, issue_number: context.payload.number, labels: new_labels });
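For reference, a minimal Python sketch of the title check this workflow performs (the helper name `classify_pr_title` is illustrative and not part of the repository; the regex is assumed to behave like the JavaScript version above):

```python
import re

# Mirrors the valid_pr_types list in the workflow above.
VALID_PR_TYPES = ["feat", "fix", "docs", "style", "refactor", "perf",
                  "test", "build", "ci", "chore", "revert", "dependencies"]

def classify_pr_title(title: str) -> str:
    """Return the conventional-commit type of a PR title, or raise ValueError."""
    # Same pattern as the workflow: type, optional (scope), optional '!', then ':'
    match = re.match(r"^(\w+)(\(\w+\))?!?:", title)
    if match is None:
        raise ValueError(f"Title does not follow conventional commits: {title}")
    pr_type = match.group(1)
    if pr_type not in VALID_PR_TYPES:
        raise ValueError(f"Unknown pull request type: {pr_type}")
    return pr_type

# Both examples resolve to the 'fix' label.
print(classify_pr_title("fix: handle empty tensors"))          # fix
print(classify_pr_title("fix(quantizer)!: new scale format"))  # fix
```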

.github/workflows/pypi.yml

Lines changed: 3 additions & 3 deletions
@@ -44,7 +44,7 @@ jobs:
           # for setuptools-scm
           fetch-depth: 0
 
-      - uses: hynek/build-and-inspect-python-package@f01e4d047aadcc0c054c95ec9900da3ec3fc7a0f # v2.10.0
+      - uses: hynek/build-and-inspect-python-package@b5076c307dc91924a82ad150cdd1533b444d3310 # v2.12.0
 
   # push to Test PyPI on
   # - a new GitHub release is published
@@ -77,7 +77,7 @@ jobs:
           path: dist
 
       - name: Upload to Test PyPI
-        uses: pypa/gh-action-pypi-publish@15c56dba361d8335944d31a2ecd17d700fc7bcbc # v1.12.2
+        uses: pypa/gh-action-pypi-publish@76f52bc884231f62b9a034ebfe128415bbaabdfc # v1.12.4
         with:
          repository-url: https://test.pypi.org/legacy/
 
@@ -122,4 +122,4 @@ jobs:
        run: rm ./dist/*.sigstore.json
 
      - name: Upload to PyPI
-       uses: pypa/gh-action-pypi-publish@15c56dba361d8335944d31a2ecd17d700fc7bcbc # v1.12.2
+       uses: pypa/gh-action-pypi-publish@76f52bc884231f62b9a034ebfe128415bbaabdfc # v1.12.4

.github/workflows/test.yml

Lines changed: 1 addition & 1 deletion
@@ -40,9 +40,9 @@ jobs:
     strategy:
       matrix:
         python:
-          - "3.9"
           - "3.10"
           - "3.11"
+          - "3.12"
         platform:
           - "ubuntu-latest"
 
.gitignore

Lines changed: 3 additions & 2 deletions
@@ -42,6 +42,7 @@ error.log
 
 # Files generated from running examples
 fms_mo.log
-data_train/
-data_test/
+data*_train/
+data*_test/
 act_scales/
+examples/

.spellcheck-en-custom.txt

Lines changed: 6 additions & 0 deletions
@@ -1,4 +1,5 @@
 activations
+acc
 ADR
 Args
 AutoGPTQ
@@ -38,6 +39,7 @@ Inductor
 inferenced
 inferencing
 isort
+JIT
 Jupyter
 Kubernetes
 KV
@@ -66,6 +68,7 @@ NLP
 Nouterloop
 Nvidia
 Nvidia's
+openai
 orchestrator
 param
 pre
@@ -98,13 +101,16 @@ SmoothQuant
 socio
 sparsification
 SQuAD
+stderr
+Stderr
 straightforward
 tokenization
 tokenized
 Tokenized
 tokenizer
 Tokenizer
 toml
+triton
 Unquantized
 vals
 venv

README.md

Lines changed: 7 additions & 6 deletions
@@ -36,9 +36,7 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
 ### Requirements
 
 1. **🐧 Linux system with Nvidia GPU (V100/A100/H100)**
-2. Python 3.9 to Python 3.11
-
-   📋 Python 3.12 is currently not supported due to PyTorch Dynamo constraint
+2. Python 3.10 to Python 3.12
 3. CUDA >=12
 
 *Optional packages based on optimization functionality required:*
@@ -47,9 +45,12 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
 - [auto_gptq](https://pypi.org/project/auto-gptq/) or build from [source](https://github.com/AutoGPTQ/AutoGPTQ)
 - If you want to experiment with **INT8** deployment in [QAT](./examples/QAT_INT8/) and [PTQ](./examples/PTQ_INT8/) examples:
   - Nvidia GPU with compute capability > 8.0 (A100 family or higher)
-  - [Ninja](https://ninja-build.org/)
-  - Clone the [CUTLASS](https://github.com/NVIDIA/cutlass) repository
-  - `PyTorch 2.3.1` (as newer version will cause issue for the custom CUDA kernel used in these examples)
+  - Option 1:
+    - [Ninja](https://ninja-build.org/)
+    - Clone the [CUTLASS](https://github.com/NVIDIA/cutlass) repository
+    - `PyTorch 2.3.1` (newer versions cause issues with the custom CUDA kernel used in these examples)
+  - Option 2:
+    - Use the included Triton kernel. Note that this kernel is currently not faster than FP16.
 - **FP8** is a reduced precision format like **INT8**:
   - Nvidia A100 family or higher
   - [llm-compressor](https://github.com/vllm-project/llm-compressor)
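As a quick sanity check for the compute-capability requirement above, the GPU can be queried with PyTorch. This is a minimal sketch, not part of the repository:

```python
import torch

# INT8/FP8 paths in the examples assume compute capability >= 8.0 (A100 family or newer).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
    if (major, minor) < (8, 0):
        print("Warning: the INT8/FP8 examples expect an A100-class GPU or newer.")
else:
    print("No CUDA device found; the INT8/FP8 examples require an Nvidia GPU.")
```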

examples/FP8_QUANT/README.md

Lines changed: 10 additions & 12 deletions
@@ -73,20 +73,18 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 
 ## Example Test Results
 - BF16 (not quantized) LLAMA3-8B model.
-``` bash
-| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
-|lambada_openai| 1|none | 5|acc ||0.7120|± |0.0287|
-| | |none | 5|perplexity||3.8683|± |0.3716|
-```
+
+| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
+|lambada_openai| 1|none | 5|acc ||0.7120|± |0.0287|
+| | |none | 5|perplexity||3.8683|± |0.3716|
 
 - FP8 quantized LLAMA3-8B model.
-``` bash
-| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
-|lambada_openai| 1|none | 5|acc ||0.7160|± |0.0286|
-| | |none | 5|perplexity||3.8915|± |0.3727|
-```
+
+| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
+|lambada_openai| 1|none | 5|acc ||0.7160|± |0.0286|
+| | |none | 5|perplexity||3.8915|± |0.3727|
 
 ## Code Walk-through
 
examples/QAT_INT8/README.md

Lines changed: 9 additions & 4 deletions
@@ -87,16 +87,16 @@ python run_qa_no_trainer_qat.py \
   --max_seq_length 384 \
   --doc_stride 128 \
   --attn_impl eager \
-  --do_lowering
+  --do_lowering <cutlass or triton>
 ```
 
-This script uses an "external kernel" instead of the `torch.matmul` kernel to perform real `INT8` matmuls. This kernel is written for Nvidia's CUDA/CUTLASS library and is compiled once just ahead of the run. The compiled artifacts are usually stored in `~/.cache/torch_extensions/`. Remove this folder if a fresh recompile of the kernel is needed.
+This script uses an "external kernel" instead of the `torch.matmul` kernel to perform real `INT8` matmuls. There are two options for the INT8 kernel: one is written with Nvidia's CUDA/CUTLASS library and the other in Triton. Both are compiled once just ahead of the run (i.e., just-in-time, JIT, compilation). The compiled artifacts are usually stored in `~/.cache/torch_extensions/`. Remove this folder if a fresh recompile of the kernel is needed.
 
 Checkout [Example Test Results](#example-test-results) to compare against your results.
 
 ## Example Test Results
 
-For comparison purposes, here are some of the results we found during testing when tested with `PyTorch 2.3.1`:
+For comparison purposes, here are some of the results from an A100. CUTLASS results were obtained with `PyTorch 2.3.1`, while Triton results were obtained with `PyTorch 2.4.1`:
 
 > [!NOTE]
 > Accuracy could vary ~ +-0.2 from run to run.
@@ -106,16 +106,21 @@ For comparison purposes, here are some of the results we found during testing wh
 |fp16|128|eager |88.21 (as fine-tuned) |126.38|
 | |128|Inductor | |71.59|
 | |128|CUDAGRAPH | |71.13|
-|INT8|128|eager |88.33|329.45 <sup>1</sup>|
+|INT8 CUTLASS|128|eager |88.33|329.45 <sup>1</sup>|
 | |128|Inductor |88.42|67.87 <sup>2</sup>|
 | |128|CUDAGRAPH |-- |-- <sup>3</sup>|
+|INT8 Triton|128|eager |88.10|358.51|
+| |128|Inductor |88.13|99.91 <sup>4</sup>|
+| |128|CUDAGRAPH |88.13|100.21 <sup>4</sup>|
 
 <sup>1</sup> `INT8` matmuls are ~2x faster than `FP16` matmuls. However, `INT8` models will have additional overhead compared to `FP16` models. For example, converting FP tensors to INT before INT matmul.
 
 <sup>2</sup> Each of these additional quantization operations is relatively 'cheap', but the overhead of launching each job is not negligible. Using `torch.compile` can fuse the Ops and reduce the total number of jobs being launched.
 
 <sup>3</sup> `CUDAGRAPH` is the most effective way to minimize job launching overheads and can achieve ~2X end-to-end speed-up in this case. However, there seem to be bugs associated with this option at the moment. Further investigation is still on-going.
 
+<sup>4</sup> Unlike our CUTLASS `INT8` kernel, which is ~2x faster than `FP16` matmul, our Triton `INT8` kernel is not as optimized and performs only comparably to `FP16` on mid-to-large tensor sizes.
+
 ## Code Walk-through
 
 In this section, we will deep dive into what happens during the example steps.
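The footnotes in the README diff above refer to three execution modes (eager, Inductor, CUDAGRAPH). As a rough sketch of what those modes look like in user code, using a stand-in `nn.Linear` rather than the actual SQuAD model from the example, and assuming `mode="reduce-overhead"` (PyTorch's CUDA-graph-enabled compile mode) corresponds to the CUDAGRAPH rows:

```python
import torch

# Stand-in module; the benchmark above uses the quantized BERT-style SQuAD model instead.
model = torch.nn.Linear(1024, 1024).half().cuda()
x = torch.randn(128, 1024, dtype=torch.float16, device="cuda")

y_eager = model(x)                                              # eager: one kernel launch per op
model_inductor = torch.compile(model)                           # Inductor: fuses ops, fewer launches
model_cudagraph = torch.compile(model, mode="reduce-overhead")  # additionally captures CUDA graphs

print(torch.allclose(y_eager, model_inductor(x), atol=1e-2))
print(torch.allclose(y_eager, model_cudagraph(x), atol=1e-2))
```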

examples/QAT_INT8/run_qa_no_trainer_qat.py

Lines changed: 13 additions & 6 deletions
@@ -388,8 +388,10 @@ def parse_args():
     )
     parser.add_argument(
         "--do_lowering",
-        action="store_true",
-        help="convert QAT model to utilize real INT8 GPU kernel",
+        choices=["cutlass", "triton"],
+        type=str,
+        default=None,
+        help="convert QAT model to utilize real INT8 GPU kernel, 'cutlass' or 'triton'",
     )
 
     args = parser.parse_args()
@@ -1086,7 +1088,7 @@ def squad_eval(model, keep_model_in_eval_mode=True):
         qmodel_prep(model, exam_inp, qcfg, optimizer, use_dynamo=True)
 
     # ---- [fms_mo] the following code are performing speed tests ----
-    elif args.do_lowering:
+    elif args.do_lowering in ["cutlass", "triton"]:
         # Standard
         from copy import deepcopy
         import time
@@ -1134,7 +1136,7 @@ def speedtest(model, exam_inp, Ntest=100):
         logger.info(
             f"\n {label} {'with' if comp_mode else 'without'} torch.compile"
         )
-        model_copy = deepcopy(model)
+        model_copy = deepcopy(model).half()
 
         if label == "int8":
             qcfg = qconfig_init(recipe="qat_int8", args=args)
@@ -1158,7 +1160,11 @@ def speedtest(model, exam_inp, Ntest=100):
                 parent_mod = model_copy.get_submodule(parent_name)
                 qmod = getattr(parent_mod, module_name)
                 setattr(
-                    parent_mod, module_name, QLinearINT8Deploy.from_fms_mo(qmod)
+                    parent_mod,
+                    module_name,
+                    QLinearINT8Deploy.from_fms_mo(
+                        qmod, use_int_kernel=args.do_lowering
+                    ),
                 )
 
         if comp_mode is not False:
@@ -1172,7 +1178,7 @@ def speedtest(model, exam_inp, Ntest=100):
 
         # Median runtime using fixed input (in msec)
         med_runtime = speedtest(model_copy, exam_inp)
-        metrics = squad_eval(model_copy) if label == "int8" else {"f1": None}
+        metrics = squad_eval(model_copy)  # if label == "int8" else {"f1": None}
 
         summary["precision"].append(label)
         summary["compile mode"].append(comp_mode)
@@ -1385,6 +1391,7 @@ def speedtest(model, exam_inp, Ntest=100):
     )
     logger.info(f"Predict metrics: {predict_metric}")
 
+    log = {}
     if args.with_tracking:
         log = {
             "squad_v2" if args.version_2_with_negative else "squad": eval_metric,

fms_mo/aiu_addons/__init__.py

Whitespace-only changes.
