Skip to content

Commit df8b94f

Browse files
kgreenewaldKristjan Greenewald Kristjan.H.Greenewald@ibm.comkmehantKristjan Greenewald Kristjan.H.Greenewald@ibm.com
authored
feat: Add ALoRA support (#513)
* Update peft_config.py Signed-off-by: Greenewald <greenewk@umich.edu> Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * Update pyproject.toml Signed-off-by: Greenewald <greenewk@umich.edu> Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * Update pyproject.toml Signed-off-by: Greenewald <greenewk@umich.edu> Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * Update sft_trainer.py add alora Signed-off-by: Greenewald <greenewk@umich.edu> Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * alora support Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * Remove error.log Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * Update tuning/sft_trainer.py Co-authored-by: Mehant Kammakomati <kmehant@gmail.com> Signed-off-by: Greenewald <greenewk@umich.edu> * Update peft_config.py Getting rid of alora config definition in this repo Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update pyproject.toml Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Optional alora Signed-off-by: Greenewald <greenewk@umich.edu> * Optional alora package Signed-off-by: Greenewald <greenewk@umich.edu> * Optional alora package Signed-off-by: Greenewald <greenewk@umich.edu> * invocation error message fix Signed-off-by: Greenewald <greenewk@umich.edu> * Update pyproject.toml Signed-off-by: Greenewald <greenewk@umich.edu> * alora inference Signed-off-by: Greenewald <greenewk@umich.edu> * alora test draft Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py 
Signed-off-by: Greenewald <greenewk@umich.edu> * Update config_utils.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update config_utils.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update pyproject.toml Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update config_utils.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update config_utils.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update config_utils.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * alora saving Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update config_utils.py Signed-off-by: 
Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update run_inference.py Signed-off-by: Greenewald <greenewk@umich.edu> * pr fixes Signed-off-by: Greenewald <greenewk@umich.edu> * pr fixes Signed-off-by: Greenewald <greenewk@umich.edu> * checking for alora Signed-off-by: Greenewald <greenewk@umich.edu> * run test only if alora installed Signed-off-by: Greenewald <greenewk@umich.edu> * Documentation Signed-off-by: Greenewald <greenewk@umich.edu> * Update README.md Signed-off-by: Greenewald <greenewk@umich.edu> * Update README.md Signed-off-by: Greenewald <greenewk@umich.edu> * Update README.md Signed-off-by: Greenewald <greenewk@umich.edu> * pip install alora Signed-off-by: Greenewald <greenewk@umich.edu> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update config_utils.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update setup_dataprocessor.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update setup_dataprocessor.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update setup_dataprocessor.py Signed-off-by: Greenewald <greenewk@umich.edu> * lint fixes Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * lint fixes Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * Update sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update config_utils.py Signed-off-by: Greenewald <greenewk@umich.edu> * Update test_sft_trainer.py Signed-off-by: Greenewald <greenewk@umich.edu> * pylint Signed-off-by: Greenewald <greenewk@umich.edu> * pylint Signed-off-by: Greenewald <greenewk@umich.edu> * Update run_inference.py Signed-off-by: Greenewald <greenewk@umich.edu> * fmt fixes Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * lint fixes Signed-off-by: Kristjan Greenewald 
<kristjan.h.greenewald@ibm.com> * Delete mykey.asc Signed-off-by: Greenewald <greenewk@umich.edu> * Delete mypubkey.asc Signed-off-by: Greenewald <greenewk@umich.edu> * requested changes Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * restructure inference Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * typo Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> * another typo Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> --------- Signed-off-by: Greenewald <greenewk@umich.edu> Signed-off-by: Kristjan Greenewald <kristjan.h.greenewald@ibm.com> Co-authored-by: Kristjan Greenewald Kristjan.H.Greenewald@ibm.com <kgreenewald@login2.bluevela.rmf.ibm.com> Co-authored-by: Mehant Kammakomati <kmehant@gmail.com> Co-authored-by: Kristjan Greenewald Kristjan.H.Greenewald@ibm.com <kgreenewald@p2-r09-n2.bluevela.rmf.ibm.com>
1 parent 81177ce commit df8b94f

File tree

8 files changed

+428
-32
lines changed

8 files changed

+428
-32
lines changed

README.md

Lines changed: 129 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99
- [Tips on Parameters to Set](#tips-on-parameters-to-set)
1010
- [Tuning Techniques](#tuning-techniques)
1111
- [LoRA Tuning Example](#lora-tuning-example)
12+
- [Activated LoRA Tuning Example](#activated-lora-tuning-example)
1213
- [GPTQ-LoRA with AutoGPTQ Tuning Example](#gptq-lora-with-autogptq-tuning-example)
1314
- [Fine Tuning](#fine-tuning)
1415
- [FMS Acceleration](#fms-acceleration)
@@ -454,7 +455,7 @@ To summarize you can pick either python for single-GPU jobs or use accelerate la
454455

455456
### Tips on Parameters to Set
456457

457-
#### Saving checkpoints while training
458+
#### Saving checkpoints while training (does not apply to Activated LoRA)
458459

459460
By default, [`save_strategy`](tuning/config/configs.py) is set to `"epoch"` in the TrainingArguments. This means that checkpoints will be saved on each epoch. This can also be set to `"steps"` to save on every `"save_steps"` or `"no"` to not save any checkpoints.
460461

@@ -700,6 +701,132 @@ post_process_vLLM_adapters_new_tokens(
700701

701702
_________________________
702703

704+
### Activated LoRA Tuning Example
705+
706+
Activated LoRA (aLoRA) is a new low rank adapter architecture that allows for reusing existing base model KV cache for more efficient inference. This approach is best suited for inference pipelines which rely on the base model for most tasks/generations, but use aLoRA adapter(s) to perform specialized task(s) within the chain. For example, checking or rewriting generated outputs of the base model.
707+
708+
[Paper](https://arxiv.org/abs/2504.12397)
709+
710+
[IBM Research Blogpost](https://research.ibm.com/blog/inference-friendly-aloras)
711+
712+
[Github](https://github.com/IBM/activated-lora)
713+
714+
**Usage** Usage is very similar to standard LoRA, with the key difference that an invocation_string must be specified so that the model knows when to turn on, i.e. "activate", the adapter weights. The model will scan any input strings (during training or at test time) for this invocation_string, and activate the adapter weights 1 token after the start of the sequence. If there are multiple instances of the invocation_string in the same input, it will activate at the last such instance.
715+
716+
**Note** Often (not always) aLoRA requires higher rank (r) than LoRA. r=32 can be a good starting point for challenging tasks.
717+
718+
**Installation** The Activated LoRA requirements are an optional install in pyproject.toml (activated-lora)
719+
720+
Set `peft_method` to `"alora"`.
721+
722+
You *must* pass in an invocation_string argument. This invocation_string *must be present* in both training data inputs and the input at test time. A good solution is to set invocation_string = response_template; this will ensure that every training input will have the invocation_string present. We keep these separate arguments for flexibility. It is most robust if the invocation_string begins and ends with special tokens.
723+
724+
You can additionally pass any arguments from [aLoraConfig](https://github.com/IBM/activated-lora/blob/fms-hf-tuning/alora/config.py#L35), see the LoRA section for examples.
725+
726+
Example command to run, here using the ([Granite Instruct response template](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188)) as the invocation sequence:
727+
728+
```bash
729+
python tuning/sft_trainer.py \
730+
--model_name_or_path $MODEL_PATH \
731+
--tokenizer_name_or_path $MODEL_PATH \ # This field is optional and if not specified, tokenizer from model_name_or_path will be used
732+
--training_data_path $TRAIN_DATA_PATH \
733+
--output_dir $OUTPUT_PATH \
734+
--num_train_epochs 40 \
735+
--per_device_train_batch_size 4 \
736+
--learning_rate 1e-4 \
737+
--response_template "<|start_of_role|>assistant<|end_of_role|>" \ #this example uses special tokens in the Granite tokenizer, adjust for other models
738+
--invocation_string "<|start_of_role|>assistant<|end_of_role|>" \
739+
--dataset_text_field "output" \
740+
--peft_method "alora" \
741+
--r 32 \
742+
--lora_dropout 0.05 \
743+
--lora_alpha 16 \
744+
--target_modules q_proj k_proj v_proj
745+
```
746+
747+
Equally you can pass in a JSON configuration for running tuning. See [build doc](./build/README.md) for more details. The above can also be passed in as JSON:
748+
```json
749+
{
750+
"model_name_or_path": $MODEL_PATH,
751+
"training_data_path": $TRAIN_DATA_PATH,
752+
"output_dir": $OUTPUT_PATH,
753+
"num_train_epochs": 40.0,
754+
"per_device_train_batch_size": 4,
755+
"learning_rate": 1e-4,
756+
"response_template": "<|start_of_role|>assistant<|end_of_role|>",
757+
"invocation_string": "<|start_of_role|>assistant<|end_of_role|>",
758+
"dataset_text_field": "output",
759+
"peft_method": "alora",
760+
"r": 32,
761+
"lora_dropout": 0.05,
762+
"lora_alpha": 16,
763+
"target_modules": ["q_proj", "k_proj", "v_proj"]
764+
}
765+
```
766+
767+
Notice the `target_modules` are the names of the modules to apply the adapter to.
768+
- If this is specified, only the modules with the specified names will be replaced. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings. If this is specified as `all-linear`, then all linear/Conv1D modules are chosen, excluding the output layer.
769+
- If this is not specified, modules will be chosen according to the model architecture. If the architecture is not known, an error will be raised — in this case, you should specify the target modules manually. See [HuggingFace docs](https://huggingface.co/docs/peft/en/package_reference/lora#peft.LoraConfig) for more details.
770+
771+
772+
#### How to get list of aLoRA target_modules of a model
773+
See [How to get list of LoRA target_modules of a model](#how-to-get-list-of-lora-target_modules-of-a-model).
774+
775+
#### Recommended target modules per model architecture
776+
As per [aLoRA paper](https://arxiv.org/abs/2504.12397), by using the key, query and value projection matrices, we can achieve good quality with efficient GPU utilization. Hence, while thinking about what aLoRA adapters to specify, we recommend starting with key, query and value matrices.
777+
778+
#### Intermediate checkpoint saving
779+
Note that `sft_trainer.py` will always save the final trained model for you. If you want to save intermediate checkpoints from within the training process, the below applies.
780+
781+
For now, `save_strategy` is not supported (it is always reset to `none`). You can either save the model once training is complete, or pass in a custom callback in `additional_callbacks` directly to `tuning.sft_trainer.train` to perform saving. For example the following (from [alora github](https://github.com/IBM/activated-lora/blob/fms-hf-tuning/train_scripts/finetune_example_callback.py)) saves and updates the best performing model so far, checking whenever eval is called according to `eval_strategy`:
782+
```py
783+
class SaveBestModelCallback(TrainerCallback):
784+
def __init__(self):
785+
self.best_eval_loss = float("inf") # Track best loss
786+
787+
def on_evaluate(self, args, state, control, **kwargs):
788+
"""Save the best model manually during evaluation."""
789+
790+
model = kwargs["model"]
791+
metrics = kwargs["metrics"]
792+
793+
eval_loss = metrics.get("eval_loss")
794+
if eval_loss is not None and eval_loss < self.best_eval_loss:
795+
self.best_eval_loss = eval_loss # Update best loss
796+
797+
# Manually save best model
798+
model.save_pretrained(args.output_dir)
799+
```
800+
#### Inference with aLoRA models
801+
*Important* Inference with aLoRA models requires ensuring that the invocation string is present in the input (usually at the end).
802+
803+
Example inference:
804+
```py
805+
# Load the model
806+
loaded_model = TunedCausalLM.load(ALORA_MODEL, BASE_MODEL_NAME, use_alora=True)
807+
808+
# Retrieve the invocation string from the model config
809+
invocation_string = loaded_model.peft_model.peft_config[
810+
loaded_model.peft_model.active_adapter
811+
].invocation_string
812+
813+
# In this case, we have the invocation string at the end of the input
814+
input_string = "Simply put, the theory of relativity states that \n" + invocation_string
815+
816+
# Run inference on the text
817+
output_inference = loaded_model.run(
818+
input_string,
819+
max_new_tokens=50,
820+
)
821+
```
822+
823+
#### Running aLoRA models on VLLM
824+
825+
Coming soon! For now, there is inference support in this package, or see [aLoRA github](https://github.com/IBM/activated-lora/experiments/inference_example.py) for example code demonstrating KV cache reuse from prior base model calls.
826+
827+
__________
828+
829+
703830

704831
### GPTQ-LoRA with AutoGPTQ Tuning Example
705832

@@ -1037,4 +1164,4 @@ Further details on enabling and using the trackers mentioned above can be found
10371164

10381165
## More Examples
10391166

1040-
A good simple example can be found [here](examples/kfto-kueue-sft-trainer.yaml) which launches a Kubernetes-native `PyTorchJob` using the [Kubeflow Training Operator](https://github.com/kubeflow/training-operator/) with [Kueue](https://github.com/kubernetes-sigs/kueue) for the queue management of tuning jobs.
1167+
A good simple example can be found [here](examples/kfto-kueue-sft-trainer.yaml) which launches a Kubernetes-native `PyTorchJob` using the [Kubeflow Training Operator](https://github.com/kubeflow/training-operator/) with [Kueue](https://github.com/kubernetes-sigs/kueue) for the queue management of tuning jobs.

pyproject.toml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@ dependencies = [
3535
"tokenizers>=0.13.3,<1.0",
3636
"tqdm>=4.66.2,<5.0",
3737
"trl>=0.13,<0.18",
38-
"peft>=0.8.0,<0.14",
38+
"peft>=0.8.0,<=0.14",
3939
"protobuf>=5.28.0,<6.0.0",
4040
"datasets>=3.5.0,<4.0",
4141
"simpleeval>=0.9.13,<2.0",
@@ -51,6 +51,7 @@ fms-accel = ["fms-acceleration>=0.6"]
5151
gptq-dev = ["auto_gptq>0.4.2", "optimum>=1.15.0"]
5252
mamba = ["mamba_ssm[causal-conv1d]>=2.0.0,<3.0.0"]
5353
scanner-dev = ["HFResourceScanner>=0.1.0"]
54+
activated-lora = ["alora>=0.1.0"]
5455

5556

5657
[tool.setuptools.packages.find]

scripts/run_inference.py

Lines changed: 75 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -138,17 +138,19 @@ def __exit__(self, exc_type, exc_value, exc_tb):
138138

139139
### Funcs for loading and running models
140140
class TunedCausalLM:
141-
def __init__(self, model, tokenizer, device):
141+
def __init__(self, model, tokenizer, device, use_alora=False):
142142
self.peft_model = model
143143
self.tokenizer = tokenizer
144144
self.device = device
145+
self.use_alora = use_alora
145146

146147
@classmethod
147148
def load(
148149
cls,
149150
checkpoint_path: str,
150151
base_model_name_or_path: str = None,
151152
use_flash_attn: bool = False,
153+
use_alora: bool = False,
152154
) -> "TunedCausalLM":
153155
"""Loads an instance of this model.
154156
@@ -222,14 +224,36 @@ def load(
222224
tokenizer_and_embedding_resize(
223225
{}, tokenizer=tokenizer, model=base_model
224226
)
225-
model = PeftModel.from_pretrained(
226-
base_model,
227-
checkpoint_path,
228-
attn_implementation="flash_attention_2"
229-
if use_flash_attn
230-
else None,
231-
torch_dtype=torch.bfloat16 if use_flash_attn else None,
232-
)
227+
if use_alora:
228+
# Third Party
229+
try:
230+
# Third Party
231+
from alora.peft_model_alora import ( # pylint: disable=import-outside-toplevel
232+
aLoRAPeftModelForCausalLM,
233+
)
234+
235+
model = aLoRAPeftModelForCausalLM.from_pretrained(
236+
base_model,
237+
checkpoint_path,
238+
attn_implementation="flash_attention_2"
239+
if use_flash_attn
240+
else None,
241+
torch_dtype=torch.bfloat16 if use_flash_attn else None,
242+
)
243+
except ImportError as exc:
244+
raise ImportError(
245+
"The alora package is required for this operation. "
246+
"Please install it with pip install alora."
247+
) from exc
248+
else:
249+
model = PeftModel.from_pretrained(
250+
base_model,
251+
checkpoint_path,
252+
attn_implementation="flash_attention_2"
253+
if use_flash_attn
254+
else None,
255+
torch_dtype=torch.bfloat16 if use_flash_attn else None,
256+
)
233257
except (OSError, ValueError) as e:
234258
print("Failed to initialize checkpoint model!")
235259
raise e
@@ -259,10 +283,14 @@ def load(
259283
)
260284

261285
model.to(device)
262-
return cls(model, tokenizer, device)
286+
return cls(model, tokenizer, device, use_alora)
263287

264288
def run(
265-
self, text: str, *, max_new_tokens: int, ret_gen_text_only: bool = False
289+
self,
290+
text: str,
291+
*,
292+
max_new_tokens: int,
293+
ret_gen_text_only: bool = False,
266294
) -> str:
267295
"""Runs inference on an instance of this model.
268296
@@ -279,12 +307,36 @@ def run(
279307
str
280308
Text generation result.
281309
"""
282-
tok_res = self.tokenizer(text, return_tensors="pt")
283-
input_ids = tok_res.input_ids.to(self.device)
284-
285-
peft_outputs = self.peft_model.generate(
286-
input_ids=input_ids, max_new_tokens=max_new_tokens
287-
)
310+
if not self.use_alora:
311+
tok_res = self.tokenizer(text, return_tensors="pt")
312+
input_ids = tok_res.input_ids.to(self.device)
313+
peft_outputs = self.peft_model.generate(
314+
input_ids=input_ids, max_new_tokens=max_new_tokens
315+
)
316+
else: # pass in alora_offsets needed for alora model
317+
# Retrieve invocation string
318+
invocation_string = self.peft_model.peft_config[
319+
self.peft_model.active_adapter
320+
].invocation_string
321+
# Find the invocation string in input
322+
if invocation_string in text:
323+
before, after = text.rsplit(invocation_string, 1)
324+
after = invocation_string + after
325+
else:
326+
raise ValueError(
327+
f"aLoRA invocation string '{invocation_string}' not found in input '{text}'."
328+
)
329+
# Tokenize separately to enforce correct token boundary
330+
before_ids = self.tokenizer(before, return_tensors="pt").input_ids
331+
after_ids = self.tokenizer(invocation_string, return_tensors="pt").input_ids
332+
alora_offsets = [after_ids.shape[1] - 1]
333+
input_ids = torch.cat([before_ids, after_ids], dim=1).to(self.device)
334+
335+
peft_outputs = self.peft_model.generate(
336+
input_ids=input_ids,
337+
max_new_tokens=max_new_tokens,
338+
alora_offsets=alora_offsets,
339+
)
288340
if ret_gen_text_only:
289341
tok_to_decode = peft_outputs[:, input_ids.shape[1] :]
290342
else:
@@ -308,6 +360,11 @@ def main():
308360
help="JSON file to write results to",
309361
default="inference_result.json",
310362
)
363+
parser.add_argument(
364+
"--use_alora",
365+
help="Whether to use alora",
366+
default=False,
367+
)
311368
parser.add_argument(
312369
"--base_model_name_or_path",
313370
help="Override for base model to be used for non-merged models \
@@ -341,6 +398,7 @@ def main():
341398
checkpoint_path=args.model,
342399
base_model_name_or_path=args.base_model_name_or_path,
343400
use_flash_attn=args.use_flash_attn,
401+
use_alora=args.use_alora,
344402
)
345403

346404
# Run inference on the text; if multiple were provided, process them all

0 commit comments

Comments
 (0)