Commit 9172b2b

Merge branch 'main' into dependabot/pip/torch-gte-2.2.0-and-lt-2.8
Signed-off-by: chichun-charlie-liu <[email protected]>
2 parents 044fb1f + c920911 commit 9172b2b


63 files changed (+6547 / -1037 lines); only a subset of the changed files is shown below.

.github/pull_request_template.md

Lines changed: 17 additions & 5 deletions
@@ -4,16 +4,28 @@

 <!-- Please summarize the changes -->

-### Related issue number
+### Related issues or PRs

-<!-- For example: "Closes #1234" -->
+<!-- For example: "Closes #1234" or "Fixes bug introduced in #5678" -->

 ### How to verify the PR

-<!-- Please provide instruction or screenshots on how to verify the PR.-->
+<!-- Please provide instruction or screenshots on how to verify the PR if unit tests do not provide coverage.-->

 ### Was the PR tested

 <!-- Describe how PR was tested -->
-- [ ] I have added >=1 unit test(s) for every new method I have added.
-- [ ] I have ensured all unit tests pass
+- [ ] I have added >=1 unit test(s) for every new method I have added (if that coverage is difficult, please briefly explain the reason)
+- [ ] I have ensured all unit tests pass
+
+### Checklist for passing CI/CD:
+
+<!-- Mark completed tasks with "- [x]" -->
+- [ ] All commits are signed showing "Signed-off-by: Name \<[email protected]\>" with `git commit --signoff` or equivalent
+- [ ] PR title and commit messages adhere to [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/)
+- [ ] Contribution is formatted with `tox -e fix`
+- [ ] Contribution passes linting with `tox -e lint`
+- [ ] Contribution passes spellcheck with `tox -e spellcheck`
+- [ ] Contribution passes all unit tests with `tox -e unit`
+
+Note: CI/CD performs unit tests on multiple versions of Python from a fresh install. There may be differences with your local environment and the test environment.

.gitignore

Lines changed: 7 additions & 2 deletions
@@ -14,8 +14,8 @@ htmlcov/
 durations/*
 coverage*.xml
 qcfg.json
-models
 configs
+pytest.out

 # IDEs
 .vscode/
@@ -45,4 +45,9 @@ fms_mo.log
 data*_train/
 data*_test/
 act_scales/
-examples/
+examples/**/*.json
+examples/**/*.safetensors
+examples/**/*.log
+examples/**/*.sh
+examples/**/*.pt
+examples/**/*.arrow

.pylintrc

Lines changed: 4 additions & 4 deletions
@@ -63,9 +63,9 @@ ignore-patterns=^\.#
 # (useful for modules/projects where namespaces are manipulated during runtime
 # and thus existing member attributes cannot be deduced by static analysis). It
 # supports qualified module names, as well as Unix pattern matching.
-ignored-modules=auto_gptq,
-exllama_kernels,
-exllamav2_kernels,
+ignored-modules=gptqmodel,
+gptqmodel_exllama_kernels,
+gptqmodel_exllamav2_kernels,
 llmcompressor,
 cutlass_mm,
 pygraphviz,
@@ -94,7 +94,7 @@ persistent=yes

 # Minimum Python version to use for version dependent checks. Will default to
 # the version used to run pylint.
-py-version=3.9
+py-version=3.10

 # Discover python modules and packages in the file system subtree.
 recursive=no

.spellcheck-en-custom.txt

Lines changed: 22 additions & 3 deletions
@@ -1,8 +1,11 @@
 activations
 acc
 ADR
+aiu
+AIU
+Spyre
+spyre
 Args
-AutoGPTQ
 autoregressive
 backpropagation
 bmm
@@ -23,17 +26,20 @@ dequantization
 dq
 DQ
 dev
+dtype
 eval
 fms
+fmsmo
 fp
 FP
 FP8Arguments
 frac
 gptq
 GPTQ
 GPTQArguments
+GPTQModel
+gptqmodel
 graphviz
-GPTQ
 hyperparameters
 Inductor
 inferenced
@@ -91,8 +97,11 @@ quantizes
 Quantizing
 QW
 rceil
+recomputation
 repo
 representable
+roberta
+RoBERTa
 runtime
 Runtime
 SAWB
@@ -112,9 +121,19 @@ Tokenizer
 toml
 triton
 Unquantized
+utils
 vals
 venv
 vllm
 xs
 zp
-
+microxcaling
+Microscaling
+microscaling
+MX
+mx
+MXINT
+mxint
+MXFP
+mxfp
+OCP

README.md

Lines changed: 24 additions & 1 deletion
@@ -42,7 +42,7 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
 *Optional packages based on optimization functionality required:*

 - **GPTQ** is a popular compression method for LLMs:
-  - [auto_gptq](https://pypi.org/project/auto-gptq/) or build from [source](https://github.com/AutoGPTQ/AutoGPTQ)
+  - [gptqmodel](https://pypi.org/project/gptqmodel/) or build from [source](https://github.com/ModelCloud/GPTQModel)
 - If you want to experiment with **INT8** deployment in [QAT](./examples/QAT_INT8/) and [PTQ](./examples/PTQ_INT8/) examples:
   - Nvidia GPU with compute capability > 8.0 (A100 family or higher)
   - Option 1:
@@ -98,6 +98,29 @@ cd fms-model-optimizer
 pip install -e .
 ```

+#### Optional Dependencies
+The following optional dependencies are available:
+- `fp8`: `llmcompressor` package for fp8 quantization
+- `gptq`: `GPTQModel` package for W4A16 quantization
+- `mx`: `microxcaling` package for MX quantization
+- `opt`: Shortcut for `fp8`, `gptq`, and `mx` installs
+- `aiu`: `ibm-fms` package for AIU model deployment
+- `torchvision`: `torch` package for image recognition training and inference
+- `triton`: `triton` package for matrix multiplication kernels
+- `examples`: Dependencies needed for examples
+- `visualize`: Dependencies for visualizing models and performance data
+- `test`: Dependencies needed for unit testing
+- `dev`: Dependencies needed for development
+
+To install an optional dependency, modify the `pip install` commands above with a list of these names enclosed in brackets. The example below installs `llm-compressor` and `torchvision` with FMS Model Optimizer:
+
+```shell
+pip install fms-model-optimizer[fp8,torchvision]
+
+pip install -e .[fp8,torchvision]
+```
+If you have already installed FMS Model Optimizer, then only the optional packages will be installed.
+
 ### Try It Out!

 To help you get up and running as quickly as possible with the FMS Model Optimizer framework, check out the following resources which demonstrate how to use the framework with different quantization techniques:

docs/fms_mo_design.md

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:

 ### GPTQ (weight-only compression, or sometimes referred to as W4A16)

-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `auto_gptq` package. See this [example](../examples/GPTQ/)
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `gptqmodel` package. See this [example](../examples/GPTQ/)

 ## Specification
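
For context, a minimal W4A16 weight-only flow with the `gptqmodel` package might look like the sketch below. It is illustrative only and not taken from this commit: the model id and calibration texts are placeholders, and the `GPTQModel.load` / `QuantizeConfig` names follow recent GPTQModel releases, so treat the exact signatures as assumptions.

```python
# Illustrative sketch of GPTQ W4A16 compression with gptqmodel.
# Model id, calibration texts, and exact API signatures are assumptions.
from gptqmodel import GPTQModel, QuantizeConfig

calibration_texts = [
    "FMS Model Optimizer reduces model size for faster serving.",
    "GPTQ compresses weights to 4 bits while activations stay in FP16.",
]

quant_config = QuantizeConfig(bits=4, group_size=128)  # common W4A16 setting

model = GPTQModel.load("facebook/opt-125m", quant_config)  # placeholder model
model.quantize(calibration_texts)   # run GPTQ calibration and pack weights
model.save("opt-125m-w4a16")        # compressed checkpoint for serving
```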

examples/AIU_CONVERSION/README.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@ (new file; all lines added)

# Train and prepare INT8 checkpoint for the AIU using Direct Quantization
This example builds on the [Direct Quantization (DQ) example](../DQ_SQ/README.md). We assume the user is already familiar with the DQ quantization process and would like to generate an INT8-quantized checkpoint that is made compliant with the requirements of the AIU/Spyre accelerator.

Once created, this checkpoint can be run on the AIU by using an inference script from [aiu-fms-testing-utils](https://github.com/foundation-model-stack/aiu-fms-testing-utils).

For more information on the AIU/Spyre accelerator, see the following blogs:
- [Introducing the IBM Spyre AI Accelerator chip](https://research.ibm.com/blog/spyre-for-z)
- [IBM Power modernizes infrastructure and accelerates innovation with AI in the year ahead](https://newsroom.ibm.com/blog-ibm-power-modernizes-infrastructure-and-accelerates-innovation-with-ai-in-the-year-ahead)

## Requirements
- [FMS Model Optimizer requirements](../../README.md#requirements)

## QuickStart

**1. Prepare Data** as per the DQ quantization process ([link](../DQ_SQ/README.md)). In this example, we assume the user wants to quantize the RoBERTa-base model and has thus prepared the DQ data for it, stored under the folders `data_train` and `data_test`, by adapting the DQ example accordingly.

**2. Apply DQ with conversion** by providing the desired quantization parameters, as well as the flags `--save_ckpt_for_aiu` and `--recompute_narrow_weights`.

```bash
python -m fms_mo.run_quant \
    --model_name_or_path "roberta-base" \
    --training_data_path data_train \
    --test_data_path data_test \
    --torch_dtype "float16" \
    --quant_method dq \
    --nbits_w 8 \
    --nbits_a 8 \
    --nbits_kvcache 32 \
    --qa_mode "pertokenmax" \
    --qw_mode "maxperCh" \
    --qmodel_calibration_new 1 \
    --output_dir "dq_test" \
    --save_ckpt_for_aiu \
    --recompute_narrow_weights
```
> [!TIP]
> - In this example, we are not evaluating the perplexity of the quantized model, but, if so desired, the user can add the `--eval_ppl` flag.
> - We set a single calibration example because the quantizers in use do not need calibration: weights remain static during DQ, so a single example will initialize the quantizer correctly, and the activation quantizer `pertokenmax` will dynamically recompute the quantization range at inference time, when running on the AIU.

**3. Reload checkpoint for testing** and validate its content (optional).

```python
import torch

sd = torch.load("dq_test/qmodel_for_aiu.pt", weights_only=True)
```

Check that all quantized layers have been converted to `torch.int8`, while the rest are `torch.float16`.

```python
# select quantized layers by name
roberta_qlayers = ["attention.self.query", "attention.self.key", "attention.self.value", "attention.output.dense", "intermediate.dense", "output.dense"]
# assert all quantized weights are int8
assert all(v.dtype == torch.int8 for k,v in sd.items() if any(n in k for n in roberta_qlayers) and k.endswith(".weight"))
# assert all other parameters are fp16
assert all(v.dtype == torch.float16 for k,v in sd.items() if all(n not in k for n in roberta_qlayers) or not k.endswith(".weight"))
```

> [!TIP]
> - We have trained the model with a symmetric quantizer for activations (`qa_mode`). If an asymmetric quantizer is used, then the checkpoint will also carry a `zero_shift` parameter, which is torch.float32, so this validation step should be modified accordingly.

Because we have used the `narrow_weight_recomputation` option along with a `maxperCh` (max per-channel) quantizer for weights, the distributions of the INT weight matrices have been widened. Most per-channel standard deviation values should surpass the empirical threshold of 20.

```python
[f"{v.to(torch.float32).std(dim=-1).mean():.4f}" for k,v in sd.items() if k.endswith(".weight") and any(n in k for n in roberta_qlayers)]
```

> [!TIP]
> - We cast the torch.int8 weights to torch.float32 to be able to apply the torch.std function.
> - For per-channel weights, the recomputation is applied per channel. Here we print a mean across channels to aid visualization.
> - There is no guarantee that the recomputed weights will exceed the empirical threshold after recomputation, but it is the case for several common models of the BERT, RoBERTa, Llama, and Granite families.
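
As a small extension of that check (a sketch, not part of this commit; it reuses the `dq_test/qmodel_for_aiu.pt` path and the layer names from this example), one could report which channels fall below the empirical threshold of 20:

```python
# Sketch: per-channel std of quantized weights vs. the empirical threshold of 20.
# Checkpoint path and layer-name patterns are taken from the example above.
import torch

THRESHOLD = 20.0
roberta_qlayers = ["attention.self.query", "attention.self.key", "attention.self.value",
                   "attention.output.dense", "intermediate.dense", "output.dense"]

sd = torch.load("dq_test/qmodel_for_aiu.pt", weights_only=True)
for name, w in sd.items():
    if name.endswith(".weight") and any(n in name for n in roberta_qlayers):
        per_ch_std = w.to(torch.float32).std(dim=-1)  # one std per output channel
        below = int((per_ch_std < THRESHOLD).sum())
        print(f"{name}: mean std {per_ch_std.mean():.2f}, {below} channel(s) below {THRESHOLD}")
```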

examples/FP8_QUANT/README.md

Lines changed: 2 additions & 1 deletion
@@ -92,7 +92,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m

 ```python
 from llmcompressor.modifiers.quantization import QuantizationModifier
-from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+from llmcompressor.transformers import SparseAutoModelForCausalLM
+from llmcompressor import oneshot

 model = SparseAutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, torch_dtype=model_args.torch_dtype)
 tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
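
For orientation, the two imports above are typically combined roughly as in the sketch below (not part of this commit; the model id, the `FP8_DYNAMIC` scheme string, and the save path are assumptions based on common llm-compressor usage):

```python
# Illustrative FP8 one-shot quantization with llm-compressor; model id,
# scheme, and output path are assumptions, not taken from this commit.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder for model_args.model_name_or_path
model = SparseAutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize Linear layers to FP8, leaving the output head in higher precision
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the recipe in one shot; dynamic FP8 needs no calibration data
oneshot(model=model, recipe=recipe)

model.save_pretrained("opt-125m-fp8-dynamic")
tokenizer.save_pretrained("opt-125m-fp8-dynamic")
```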
