# Offline Data Preprocessing

Our library provides a [script](../scripts/offline_data_processing.py) that lets users perform standalone data preprocessing, independent of tuning/training. The script processes raw datasets, applies basic or advanced data preprocessing, and saves the resulting train and validation datasets in Parquet format inside the specified `output_dir`. When the `--num_dataset_shards` argument is specified, the datasets are divided and saved as multiple shards.

Users can pass any data config to this script. Its goal is to take the provided data config and generate a dataset that can be used directly for training, without requiring any online processing. As an example, see the data config below:

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted
  seed: 66
datasets:
  - name: dataset_1
    data_paths:
      - tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl
    data_handlers:
      - name: tokenize_and_apply_input_masking
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            input_field_name: input
            output_field_name: output
```
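
The `tokenize_and_apply_input_masking` handler referenced above tokenizes each example and masks the prompt portion out of the training loss. As a rough illustration of that idea (this is not the library's implementation; the `gpt2` tokenizer and the exact masking logic below are placeholders), input masking typically looks like:

```python
# Illustrative sketch of input masking for instruction tuning.
# Not the library's handler; "gpt2" is only a stand-in tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_and_mask(example, max_length=4096):
    """Tokenize input + output together and exclude the input tokens from the loss."""
    prompt_ids = tokenizer(example["input"], add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(
        example["input"] + example["output"],
        add_special_tokens=False,
        truncation=True,
        max_length=max_length,
    )["input_ids"]
    labels = list(full_ids)
    # Prompt tokens get label -100 so the loss is computed only on the output tokens.
    for i in range(min(len(prompt_ids), len(labels))):
        labels[i] = -100
    return {"input_ids": full_ids, "attention_mask": [1] * len(full_ids), "labels": labels}

print(tokenize_and_mask({"input": "Tweet text: my flight got cancelled. Label: ", "output": "complaint"}))
```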

After preparing the data configuration YAML file, run the script with the following example command to perform offline data preprocessing:

```bash
python scripts/offline_data_processing.py \
--data_config_path /path/to/data_config.yaml \
--model_name_or_path "model_name" \
--max_seq_length 4096 \
--output_dir /path/to/output/directory \
--log_level info \
--num_dataset_shards 3
```

Once offline data processing is complete, users can leverage the shards stored in `output_dir` for tuning, either by passing the directory through the `--training_data_path` flag or via the `data_paths` argument in a data config YAML, provided they find the sharded datasets beneficial for training.
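
For example, a quick way to sanity-check the processed output before training is to load the Parquet shards back with the `datasets` library (the path and glob below are illustrative; adjust them to the actual layout of your `output_dir`):

```python
# Load the offline-processed Parquet shards for inspection.
# The directory and glob pattern are assumptions about the output layout.
from datasets import load_dataset

processed = load_dataset(
    "parquet",
    data_files={"train": "/path/to/output/directory/**/*.parquet"},
)["train"]

print(processed)      # number of rows and column names
print(processed[0])   # one processed record, e.g. input_ids / labels after tokenization
```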

## Example Usage
### Applying Chat Template

This is a sample use case in which the offline processing script is applied to a dataset with a chat template, and the offline-processed dataset is then used to train a model.

In this use case, the chat template is applied to the dataset using the `apply_tokenizer_chat_template` handler, followed by additional data transformation handlers.

**NOTE**: Streaming of the dataset is not supported when running the offline data preprocessing script. Therefore, in the data config, the `streaming` argument should either be set to `False` or left unassigned.
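
For intuition, applying a chat template turns a list of role/content turns into a single formatted training string. The sketch below shows that step through the generic Hugging Face `tokenizer.apply_chat_template` API rather than the library's handler; the model path and messages are placeholders:

```python
# Generic illustration of chat-template rendering with a Hugging Face tokenizer.
# Not the library's handler; model path and messages are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")

messages = [
    {"role": "user", "content": "What does offline preprocessing mean?"},
    {"role": "assistant", "content": "The dataset is fully processed before training starts."},
]

# Render the conversation into one string using the tokenizer's built-in template.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)  # Granite templates wrap each turn in <|start_of_role|>...<|end_of_role|> markers
```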

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted
  seed: 66
  streaming: False
  chat_template: |
    {%- for message in messages['messages'] %}
      {%- if message['role'] == 'system' %}
        {{ '<|start_of_role|>system<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'user' %}
        {{ '<|start_of_role|>user<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'assistant' %}
        {{ '<|start_of_role|>assistant<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'tools' %}
        {{ '<|start_of_role|>tools<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'tool' %}
        {{ '<|start_of_role|>tool<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'documents' %}
        {{ '<|start_of_role|>documents<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- else %}
        {{ '<|start_of_role|>unknown<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- endif %}
    {%- endfor %}
datasets:
  - name: dataset_1
    retain_columns:
      - "formatted_chat"
    data_paths:
      - "/app/arb30_100.jsonl"
    data_handlers:
      - name: apply_tokenizer_chat_template
        arguments:
          fn_kwargs:
            dataset_text_field: "formatted_chat"
      - name: tokenize
        arguments:
          batched: false
          fn_kwargs:
            dataset_text_field: "formatted_chat"
            truncation: False
            max_length: 4096
      - name: skip_large_text
        arguments:
          fn_kwargs:
            column_name: "input_ids"
            max_length: 4096
      - name: retain_columns
        arguments:
          columns:
            - "formatted_chat"
```
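
Since the custom `chat_template` above iterates over `messages['messages']`, each raw record in the input JSONL is expected to carry a `messages` list of role/content turns. The exact schema of `/app/arb30_100.jsonl` is not shown here, so the record below is only a hypothetical illustration of that shape:

```python
# Hypothetical input record matching the structure the chat template iterates over.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the meeting notes."},
        {"role": "assistant", "content": "The team agreed to ship on Friday."},
    ]
}

# One line of the input JSONL file would then look like:
print(json.dumps(record))
```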

Command to run the offline data processing script:

```bash
python scripts/offline_data_processing.py \
--data_config_path "data_config.yaml" \
--instruction_template "<|start_of_role|>user<|end_of_role|>" \
--max_seq_length "8192" \
--model_name_or_path "/test/models/granite-3.1-8b-instruct" \
--output_dir "/test/data/offline_processing_shards" \
--packing "False" \
--response_template "<|start_of_role|>assistant<|end_of_role|>" \
--split_batches "true" \
--use_flash_attn "true" \
--num_dataset_shards "10"
```

The resulting shards are saved in the directory `/test/data/offline_processing_shards`, as specified by the `--output_dir` argument. These shards can then be used for tuning the model by pointing the `training_data_path` argument to the directory where the shards are stored, in this example `/test/data/offline_processing_shards`.
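
As a quick check that the expected number of shards was written (the path and file layout below are assumptions carried over from the example command above), you can count the Parquet files in the output directory:

```python
# Count the Parquet shards produced by the offline processing run.
# Path and glob are illustrative; adjust to your actual output_dir layout.
import glob

shards = glob.glob("/test/data/offline_processing_shards/**/*.parquet", recursive=True)
print(f"Found {len(shards)} Parquet shard files")  # compare against --num_dataset_shards (10 here)
```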

Command to run the tuning:

```bash
accelerate launch \
  --num_processes=8 \
  --dynamo_backend="no" \
  --fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
  --fsdp_cpu_ram_efficient_loading="true" \
  --fsdp_forward_prefetch="false" \
  --fsdp_offload_params="false" \
  --fsdp_sharding_strategy="HYBRID_SHARD" \
  --fsdp_state_dict_type="FULL_STATE_DICT" \
  --fsdp_sync_module_states="true" \
  --machine_rank="${RANK}" \
  --main_process_ip="${MASTER_ADDR}" \
  --main_process_port="${MASTER_PORT}" \
  --mixed_precision="no" \
  --num_machines="${WORLD_SIZE}" \
  --rdzv_backend="static" \
  --same_network \
  --use_fsdp \
  -m tuning.sft_trainer \
  --training_data_path "/test/data/offline_processing_shards" \
  --adam_beta1="0.9" \
  --adam_beta2="0.98" \
  --adam_epsilon="1e-10" \
  --aim_repo="${AIMSTACK_DB}" \
  --dataloader_drop_last="true" \
  --dataset_text_field="random" \
  --evaluation_strategy="no" \
  --experiment="train-nb-g8b-r26-e0e88b40-dbd8-41ae-a744-c853959495f2" \
  --gradient_accumulation_steps="1" \
  --gradient_checkpointing="true" \
  --include_tokens_per_second="false" \
  --instruction_template="<|start_of_role|>user<|end_of_role|>" \
  --learning_rate="1e-06" \
  --logging_steps="1" \
  --logging_strategy="steps" \
  --lr_scheduler_type="cosine" \
  --max_seq_length="8192" \
  --max_steps="12400" \
  --model_name_or_path="/test/models/granite-3.1-8b-instruct" \
  --num_train_epochs="3" \
  --optim="adamw_torch" \
  --output_dir="/hfcache/data_mixing/data_mixing/wca_summ/run26_rb_mix" \
  --packing="False" \
  --per_device_train_batch_size="32" \
  --response_template="<|start_of_role|>assistant<|end_of_role|>" \
  --save_steps="100" \
  --save_strategy="steps" \
  --split_batches="true" \
  --torch_dtype="bfloat16" \
  --use_flash_attn="true" \
  --use_reentrant="true" \
  --warmup_ratio="0.1" \
  --warmup_steps="200" \
  --weight_decay="0.1"
```