
Commit 1980b35

Merge branch 'main' into qbmm_fix_amend
Signed-off-by: chichun-charlie-liu <[email protected]>
2 parents: 79851eb + 1e7856e

File tree: 18 files changed (+626 / -91 lines)

.spellcheck-en-custom.txt

Lines changed: 2 additions & 1 deletion
@@ -26,10 +26,11 @@ eval
 fms
 fp
 FP
+FP8Arguments
 frac
 gptq
 GPTQ
-GPTQArgs
+GPTQArguments
 graphviz
 GPTQ
 hyperparameters
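
The new wordlist entries track the argument class names that the updated READMEs now link to in `fms_mo/training_args.py`. As a rough illustration only, those dataclasses might look like the sketch below; the field names and defaults are invented for illustration and are not taken from the actual file.

```python
# Illustrative sketch of the renamed argument dataclasses referenced by the new
# spellcheck entries. Field names and defaults below are hypothetical.
from dataclasses import dataclass, field


@dataclass
class FP8Arguments:
    """FP8 quantization arguments (the name the updated READMEs link to)."""

    scheme: str = field(default="FP8_DYNAMIC", metadata={"help": "FP8 scheme to apply."})


@dataclass
class GPTQArguments:
    """GPTQ quantization arguments (replacing the old GPTQArgs spelling)."""

    bits: int = field(default=4, metadata={"help": "Weight precision in bits."})
    group_size: int = field(default=128, metadata={"help": "Quantization group size."})
```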

examples/FP8_QUANT/README.md

Lines changed: 2 additions & 2 deletions
@@ -27,7 +27,7 @@ This is an example of mature FP8, which under the hood leverages some functional
 ## QuickStart
 This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms with FP8 being the focus of this example. The steps involved are:
 
-1. **FP8 quantization through CLI**. Other arguments could be found here [FP8Args](../../fms_mo/training_args.py#L84).
+1. **FP8 quantization through CLI**. Other arguments could be found here [FP8Arguments](../../fms_mo/training_args.py#L84).
 
 ```bash
 python -m fms_mo.run_quant \
@@ -100,7 +100,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
 ```
 
-2. Quantization setting is provided using `QuantizationModifier`, additional settings can be found in [FP8Args](../../fms_mo/training_args.py#L84).
+2. Quantization setting is provided using `QuantizationModifier`, additional settings can be found in [FP8Arguments](../../fms_mo/training_args.py#L84).
 
 ```python
 recipe = QuantizationModifier(
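
The `QuantizationModifier` recipe that step 2 builds is truncated in this hunk. Below is a minimal, hedged sketch of how such a recipe is typically constructed, assuming the `llmcompressor` API that this class name comes from; the exact targets and scheme used in `examples/FP8_QUANT` are not visible in this diff and are assumptions.

```python
# Minimal sketch of an FP8 recipe built with QuantizationModifier.
# Assumes the llm-compressor API; the arguments below are illustrative and may
# differ from those used in examples/FP8_QUANT.
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize Linear layers
    scheme="FP8_DYNAMIC",  # static FP8 weights, dynamic FP8 activations
    ignore=["lm_head"],    # keep the output head in higher precision
)
```

In llm-compressor-style workflows, a recipe like this is then handed to a one-shot compression call together with the model and calibration data.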

examples/GPTQ/README.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 > - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
 > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
 
-2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `auto_gptq` under the hood.) Additional acceptable arguments can be found here in [GPTQArgs](../../fms_mo/training_args.py#L127).
+2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `auto_gptq` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).
 
 ```bash
 python -m fms_mo.run_quant \
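
The README notes that `fms_mo.run_quant` invokes `auto_gptq` under the hood. A rough sketch of that call path written directly against the `auto_gptq` API is shown below; it is illustrative rather than the actual fms_mo implementation, and the model name, calibration data, and settings are placeholders (the model id is the stand-in suggested in the README's note above).

```python
# Rough, illustrative sketch of the auto_gptq call path that fms_mo.run_quant
# wraps. Not the actual fms_mo code; data and quantization settings are placeholders.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "ibm-granite/granite-8b-code"  # stand-in model from the README's note
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration examples: tokenized samples with input_ids / attention_mask tensors.
examples = [tokenizer("GPTQ calibrates on a small set of samples.", return_tensors="pt")]

model.quantize(examples)  # run the GPTQ algorithm layer by layer
model.save_quantized("granite-8b-gptq", use_safetensors=True)
```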

fms_mo/dq.py

Lines changed: 9 additions & 6 deletions
@@ -51,7 +51,7 @@
 logger = logging.getLogger(__name__)
 
 
-def run_dq(model_args, data_args, fms_mo_args, output_dir):
+def run_dq(model_args, data_args, opt_args, fms_mo_args):
     """
     For direct quantization LLMs without optimization:
     Models are directly quantized into INT8 or FP8 precisions using
@@ -63,12 +63,15 @@ def run_dq(model_args, data_args, fms_mo_args, output_dir):
             the model
         data_args (fms_mo.training_args.DataArguments): Data arguments to be used when loading the
             tokenized dataset
+        opt_args (fms_mo.training_args.OptArguments): Generic optimization arguments to be used
+            during DQ
         fms_mo_args (fms_mo.training_args.FMSMOArguments): Parameters to use for DQ quantization
-        output_dir (str) Output directory to write to
+
     NOTE:
         use dynamo tracing instead of torchscript by default. if torchscript is needed, change
         1) config_kwarks and 2) use_dynamo in qmodel_prep()
-    """
+
+    """
     # for attention or kv-cache quantization, need to use eager attention
     attn_bits = [
         fms_mo_args.nbits_bmm1,
@@ -225,9 +228,9 @@ def run_dq(model_args, data_args, fms_mo_args, output_dir):
         with patch_torch_bmm(qcfg):
             model(**data_mb)
 
-    logger.info(f"Saving quantized model and tokenizer to {output_dir}")
-    model.save_pretrained(output_dir, use_safetensors=True)
-    tokenizer.save_pretrained(output_dir)
+    logger.info(f"Saving quantized model and tokenizer to {opt_args.output_dir}")
+    model.save_pretrained(opt_args.output_dir, use_safetensors=True)
+    tokenizer.save_pretrained(opt_args.output_dir)
 
     if fms_mo_args.eval_ppl:
         path_test = Path(data_args.test_data_path)
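
With this change, `output_dir` travels inside the generic `OptArguments` object rather than being a separate positional parameter of `run_dq`. A hedged sketch of how a caller such as the `run_quant` entry point might wire this up is shown below; the `HfArgumentParser` setup and the `ModelArguments` class name are assumptions, not code from this commit.

```python
# Hedged sketch of calling the updated run_dq(). The HfArgumentParser wiring and
# the ModelArguments class name are assumptions; OptArguments, DataArguments, and
# FMSMOArguments are the types named in the docstring of the diff above.
from transformers import HfArgumentParser

from fms_mo.dq import run_dq
from fms_mo.training_args import (
    DataArguments,
    FMSMOArguments,
    ModelArguments,  # assumed name for the model_args dataclass
    OptArguments,
)

parser = HfArgumentParser((ModelArguments, DataArguments, OptArguments, FMSMOArguments))
model_args, data_args, opt_args, fms_mo_args = parser.parse_args_into_dataclasses()

# output_dir is now read from opt_args.output_dir inside run_dq rather than
# being passed as a separate argument.
run_dq(model_args, data_args, opt_args, fms_mo_args)
```

Grouping `output_dir` with the other run-level options presumably keeps the `run_dq` signature stable as more optimization settings are added.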
