
Commit 4cdc134

Merge branch 'main' into Ssukriti-patch-3
2 parents 42e0e57 + cc7a9e8 commit 4cdc134

12 files changed

+366
-107
lines changed


README.md

Lines changed: 25 additions & 2 deletions
@@ -46,6 +46,14 @@ pip install fms-hf-tuning[flash-attn]
4646
```
4747
[FlashAttention](https://github.com/Dao-AILab/flash-attention) requires the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) to be pre-installed.
4848

49+
*Debug recommendation:* If you encounter flash-attn errors such as `undefined symbol` during training, follow the steps below for a clean installation of the flash-attn binaries. These errors can occur when multiple environments share the pip cache directory or when the torch version has been updated.
50+
51+
```
52+
pip uninstall flash-attn
53+
pip cache purge
54+
pip install fms-hf-tuning[flash-attn]
55+
```
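After the clean reinstall, one way to confirm that the compiled extension loads is to import it in a fresh interpreter. The sketch below is only an illustration: `check_import` is a hypothetical helper, not part of fms-hf-tuning, and it assumes the package is importable as `flash_attn`. Linker problems such as `undefined symbol` surface as an `ImportError` when the extension is loaded, which is what the helper reports.

```python
import importlib


def check_import(module_name: str) -> tuple[bool, str]:
    """Try to import a module and report whether it loaded.

    Hypothetical helper for illustration. Binary problems such as
    `undefined symbol` are raised as ImportError at import time.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as err:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package and a broken binary are both reported here.
        return False, f"{type(err).__name__}: {err}"
    return True, "ok"


# Usage (assumes the import name `flash_attn`):
#   ok, detail = check_import("flash_attn")
#   print(ok, detail)
```

If the check still fails after the steps above, the error text in `detail` usually names the shared object that could not be loaded.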
56+
4957
### Using FMS-Acceleration
5058

5159
If you wish to use [fms-acceleration](https://github.com/foundation-model-stack/fms-acceleration), you need to install it.
@@ -215,6 +223,16 @@ python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokeniz
215223

216224
For advanced data preprocessing support including mixing and custom preprocessing of datasets please see [this document](./docs/advanced-data-preprocessing.md).
217225

226+
## Offline Data Preprocessing
227+
228+
We also provide a script for standalone data preprocessing, decoupled from tuning and training: [offline_data_processing.py](./scripts/offline_data_processing.py). This script is especially useful if:
229+
230+
1. The user is working with a large dataset and wants to perform the processing in one shot and then train the model directly on the processed dataset.
231+
232+
2. The user wants to test out the data preprocessing outcome before training.
233+
234+
Please refer to [this document](docs/offline-data-preprocessing.md) for details on how to use the offline data processing script.
235+
218236
## Supported Models
219237

220238
- For each tuning technique, we run testing on a single large model of each architecture type and claim support for the smaller models. For example, with QLoRA technique, we tested on granite-34b GPTBigCode and claim support for granite-20b-multilingual.
@@ -782,7 +800,7 @@ The list of configurations for various `fms_acceleration` plugins:
782800
- `--padding_free`: technique to process multiple examples in single batch without adding padding tokens that waste compute.
783801
- `--multipack`: technique for *multi-gpu training* to balance out number of tokens processed in each device, to minimize waiting time.
784802
- [fast_moe_config](./tuning/config/acceleration_configs/fast_moe.py) (experimental):
785-
- `--fast_moe`: trains MoE models in parallel, increasing throughput and decreasing memory usage.
803+
- `--fast_moe`: trains MoE models in parallel with [Scatter MoE kernels](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/accelerated-moe#fms-acceleration-for-mixture-of-experts), increasing throughput and decreasing memory usage.
786804

787805
Notes:
788806
* `quantized_lora_config` requires that it be used along with LoRA tuning technique. See [LoRA tuning section](https://github.com/foundation-model-stack/fms-hf-tuning/tree/main?tab=readme-ov-file#lora-tuning-example) on the LoRA parameters to pass.
@@ -802,8 +820,13 @@ Notes:
802820
- works only for *multi-gpu*.
803821
- currently only includes the version of *multipack* optimized for linear attention implementations like *flash-attn*.
804822
* Notes on Fast MoE
805-
- `--fast_moe` is an integer value that configures the amount of expert parallel sharding (ep_degree).
823+
- `--fast_moe` takes either an integer or boolean value.
824+
- When an integer `n` is passed, expert parallel sharding is enabled with an expert parallel degree of `n`, together with Scatter MoE kernels.
825+
- When a boolean is passed, the expert parallel degree defaults to 1 and the behaviour is as follows:
826+
- if True, Scatter MoE kernels are used with experts sharded according to the top-level sharding protocol (e.g. FSDP).
827+
- if False, Scatter MoE kernels are used with experts fully replicated across ranks.
806828
- `world_size` must be divisible by the `ep_degree`
829+
- `number of experts` in the MoE module must be divisible by the `ep_degree`
807830
- Running fast MoE modifies the model's state dict, so checkpoints must be post-processed. This happens automatically, and the converted checkpoint is written to the `hf_converted_checkpoint` folder inside every saved checkpoint directory. Alternatively, the same conversion can be performed manually with the [checkpoint utils](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py) script.
808831
- The typical use case for this script is to run:
809832
```

build/Dockerfile

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -88,7 +88,8 @@ ENV NV_CUDA_CUDART_DEV_VERSION=12.1.55-1 \
8888
NV_NVML_DEV_VERSION=12.1.55-1 \
8989
NV_LIBCUBLAS_DEV_VERSION=12.1.0.26-1 \
9090
NV_LIBNPP_DEV_VERSION=12.0.2.50-1 \
91-
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.18.3-1+cuda12.1
91+
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.18.3-1+cuda12.1 \
92+
NV_CUDNN9_CUDA_VERSION=9.6.0.74-1
9293

9394
RUN dnf config-manager \
9495
--add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
@@ -103,6 +104,15 @@ RUN dnf config-manager \
103104
libnccl-devel-${NV_LIBNCCL_DEV_PACKAGE_VERSION} \
104105
&& dnf clean all
105106

107+
# Install cuDNN and cuSPARSELt in a separate step; holding one connection open for too long caused timeouts
108+
RUN dnf config-manager \
109+
--add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
110+
&& dnf clean packages \
111+
&& dnf install -y \
112+
libcusparselt0 libcusparselt-devel \
113+
cudnn9-cuda-12-6-${NV_CUDNN9_CUDA_VERSION} \
114+
&& dnf clean all
115+
106116
ENV LIBRARY_PATH="$CUDA_HOME/lib64/stubs"
107117

108118
FROM cuda-devel AS python-installations
@@ -138,7 +148,8 @@ RUN if [[ -z "${WHEEL_VERSION}" ]]; \
138148
RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
139149
python -m pip install --user wheel && \
140150
python -m pip install --user "$(head bdist_name)" && \
141-
python -m pip install --user "$(head bdist_name)[flash-attn]"
151+
python -m pip install --user "$(head bdist_name)[flash-attn]" && \
152+
python -m pip install --user "$(head bdist_name)[mamba]"
142153

143154
# fms_acceleration_peft = PEFT-training, e.g., 4bit QLoRA
144155
# fms_acceleration_foak = Fused LoRA and triton kernels

docs/advanced-data-preprocessing.md

Lines changed: 0 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -279,42 +279,4 @@ If the dataset size is known to the user, `max_steps` can be calculated as the t
279279
280280
We provide some example data configs [here](../tests/artifacts/predefined_data_configs/)
281281
282-
## Offline Data preprocessing
283-
284-
[This script](../scripts/offline_data_processing.py) provides the capability for users to perform standalone data
285-
preprocessing, decoupled from the tuning/training part. It processes raw datasets, performs data preprocessing, and
286-
saves the train and validation datasets (in shards if `--num_dataset_shards` if passed) in parquet format inside the specified `output_dir`.
287-
A data config YAML file can be used to pass configuration to this script. Example command to run this script:
288-
289-
```
290-
python scripts/offline_data_processing.py \
291-
--data_config_path /path/to/data_config.yaml \
292-
--model_name_or_path "model_name" \
293-
--max_seq_length 4096 \
294-
--output_dir /path/to/output/directory \
295-
--log_level info \
296-
--num_dataset_shards 3
297-
```
298-
299-
Example data config file:
300-
301-
```
302-
dataprocessor:
303-
type: default
304-
sampling_stopping_strategy: first_exhausted
305-
seed: 66
306-
datasets:
307-
- name: dataset_1
308-
data_paths:
309-
- tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl
310-
data_handlers:
311-
- name: tokenize_and_apply_input_masking
312-
arguments:
313-
remove_columns: all
314-
batched: false
315-
fn_kwargs:
316-
input_field_name: input
317-
output_field_name: output
318-
```
319-
320282

docs/offline-data-preprocessing.md

Lines changed: 178 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,178 @@
1+
# Offline Data Preprocessing
2+
3+
Our library provides a [script](../scripts/offline_data_processing.py) that allows users to perform standalone data preprocessing, independent of tuning/training. This script enables users to process raw datasets, apply basic/advanced data preprocessing, and save the train and validation datasets in Parquet format inside the specified `output_dir`. When the `--num_dataset_shards` argument is specified, the datasets are divided and saved into multiple shards.
4+
5+
Users can pass any data config to this script. The goal of the script is to take the provided data config and generate a dataset that can be used directly for training, without requiring any online processing. As an example, see the data config below:
6+
7+
```yaml
8+
dataprocessor:
9+
    type: default
10+
    sampling_stopping_strategy: first_exhausted
11+
    seed: 66
12+
datasets:
13+
  - name: dataset_1
14+
    data_paths:
15+
      - tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl
16+
    data_handlers:
17+
      - name: tokenize_and_apply_input_masking
18+
        arguments:
19+
          remove_columns: all
20+
          batched: false
21+
          fn_kwargs:
22+
            input_field_name: input
23+
            output_field_name: output
24+
```
25+
26+
After preparing the data configuration YAML file, run the script with the following example command to perform offline data preprocessing:
27+
28+
```
29+
python scripts/offline_data_processing.py \
30+
--data_config_path /path/to/data_config.yaml \
31+
--model_name_or_path "model_name" \
32+
--max_seq_length 4096 \
33+
--output_dir /path/to/output/directory \
34+
--log_level info \
35+
--num_dataset_shards 3
36+
```
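When the preprocessing step is driven from a pipeline rather than typed by hand, the same command can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the example command above; `build_offline_cmd` is a hypothetical helper, and its defaults (`4096`, `info`) are copied from that example rather than taken from the script itself.

```python
import shlex
from typing import Optional


def build_offline_cmd(
    data_config_path: str,
    model_name_or_path: str,
    output_dir: str,
    max_seq_length: int = 4096,
    num_dataset_shards: Optional[int] = None,
    log_level: str = "info",
) -> list[str]:
    """Assemble the argv list for scripts/offline_data_processing.py.

    Hypothetical helper: only flags from the example command are used.
    """
    cmd = [
        "python", "scripts/offline_data_processing.py",
        "--data_config_path", data_config_path,
        "--model_name_or_path", model_name_or_path,
        "--max_seq_length", str(max_seq_length),
        "--output_dir", output_dir,
        "--log_level", log_level,
    ]
    # Sharding is optional: add the flag only when a shard count is given.
    if num_dataset_shards is not None:
        cmd += ["--num_dataset_shards", str(num_dataset_shards)]
    return cmd


cmd = build_offline_cmd(
    "/path/to/data_config.yaml",
    "model_name",
    "/path/to/output/directory",
    num_dataset_shards=3,
)
# Print a shell-quoted form of the command for logging or copy-paste.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root; building it as a list avoids shell-quoting problems with paths that contain spaces.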
37+
38+
Once offline data processing is complete, users who find sharded datasets useful for training can use the shards stored in `output_dir` for tuning, either by passing the directory through the `--training_data_path` flag or by listing it under the `data_paths` argument in the data config YAML.
39+
40+
## Example Usage
41+
### Applying Chat Template
42+
43+
This example applies the offline processing script to a dataset with a chat template and then trains a model on the processed dataset.
44+
45+
In this use case, the chat template is applied to a dataset using the `apply_tokenizer_chat_template` handler, followed by additional data transformation handlers.
46+
47+
**NOTE**: Streaming of the dataset is not supported when running the offline data preprocessing script. Therefore, in the data config, the `streaming` argument should either be set to `False` or left unassigned.
48+
49+
```yaml
50+
dataprocessor:
51+
    type: default
52+
    sampling_stopping_strategy: first_exhausted
53+
    seed: 66
54+
    streaming: False
55+
    chat_template: |
56+
      {%- for message in messages['messages'] %}
57+
      {%- if message['role'] == 'system' %}
58+
      {{ '<|start_of_role|>system<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
59+
      {%- elif message['role'] == 'user' %}
60+
      {{ '<|start_of_role|>user<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
61+
      {%- elif message['role'] == 'assistant' %}
62+
      {{ '<|start_of_role|>assistant<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
63+
      {%- elif message['role'] == 'tools' %}
64+
      {{ '<|start_of_role|>tools<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
65+
      {%- elif message['role'] == 'tool' %}
66+
      {{ '<|start_of_role|>tool<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
67+
      {%- elif message['role'] == 'documents' %}
68+
      {{ '<|start_of_role|>documents<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
69+
      {%- else %}
70+
      {{ '<|start_of_role|>unknown<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
71+
      {%- endif %}
72+
      {%- endfor %}
73+
datasets:
74+
  - name: dataset_1
75+
    retain_columns:
76+
      - "formatted_chat"
77+
    data_paths:
78+
      - "/app/arb30_100.jsonl"
79+
    data_handlers:
80+
      - name: apply_tokenizer_chat_template
81+
        arguments:
82+
          fn_kwargs:
83+
            dataset_text_field: "formatted_chat"
84+
      - name: tokenize
85+
        arguments:
86+
          batched: false
87+
          fn_kwargs:
88+
            dataset_text_field: "formatted_chat"
89+
            truncation: False
90+
            max_length: 4096
91+
      - name: skip_large_text
92+
        arguments:
93+
          fn_kwargs:
94+
            column_name: "input_ids"
95+
            max_length: 4096
96+
      - name: retain_columns
97+
        arguments:
98+
          columns:
99+
            - "formatted_chat"
100+
```
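The streaming restriction from the note above can be checked before launching the script. This is a minimal sketch, assuming the data config has already been parsed into a Python dict (for example with a YAML loader) and that `streaming` sits under the `dataprocessor` key as in the config above; `check_offline_config` is a hypothetical helper, not part of the library.

```python
def check_offline_config(config: dict) -> None:
    """Raise ValueError if the data config enables streaming.

    Hypothetical pre-flight check: offline preprocessing does not
    support streaming, so `streaming` must be False or left unset.
    """
    dataprocessor = config.get("dataprocessor") or {}
    # A missing `streaming` key is fine; only a truthy value is rejected.
    if dataprocessor.get("streaming", False):
        raise ValueError(
            "dataprocessor.streaming must be False or unset for "
            "offline data preprocessing"
        )


# A config shaped like the example above passes silently.
check_offline_config({"dataprocessor": {"type": "default", "streaming": False}})
```

Failing fast here is cheaper than discovering the problem after the tokenizer and dataset have been loaded.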
101+
102+
Command to run the offline data processing script:
103+
104+
```shell
105+
python scripts/offline_data_processing.py \
106+
--data_config_path "data_config.yaml" \
107+
--instruction_template "<|start_of_role|>user<|end_of_role|>" \
108+
--max_seq_length "8192" \
109+
--model_name_or_path "/test/models/granite-3.1-8b-instruct" \
110+
--output_dir "/test/data/offline_processing_shards" \
111+
--packing "False" \
112+
--response_template "<|start_of_role|>assistant<|end_of_role|>" \
113+
--split_batches "true" \
114+
--use_flash_attn "true" \
115+
--num_dataset_shards "10"
116+
```
117+
118+
The resulting shards are saved in the directory `/test/data/offline_processing_shards`, as specified by the `--output_dir` argument. These shards can then be used for tuning the model by pointing the `--training_data_path` argument to the directory where the shards are stored, in this example
119+
`/test/data/offline_processing_shards`.
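Before starting a long tuning run, it can help to confirm that the output directory actually contains the expected shards. The sketch below is an illustration only: `list_parquet_shards` is a hypothetical helper, and it assumes the shards carry a `.parquet` extension and may sit in subdirectories of `output_dir`; the exact file layout written by the script is not specified here.

```python
from pathlib import Path


def list_parquet_shards(output_dir: str) -> list[Path]:
    """Return all Parquet files under output_dir, sorted by path.

    Hypothetical helper; assumes shards end in `.parquet`.
    """
    # rglob searches output_dir and every subdirectory below it.
    return sorted(Path(output_dir).rglob("*.parquet"))


# Usage with the directory from this example:
#   shards = list_parquet_shards("/test/data/offline_processing_shards")
#   print(f"found {len(shards)} shard file(s)")
```

An empty result suggests the preprocessing run did not finish or wrote to a different `--output_dir`.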
120+
121+
Command to run the tuning:
122+
123+
```shell
124+
accelerate launch \
125+
--num_processes=8 \
126+
--dynamo_backend="no" \
127+
--fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
128+
--fsdp_cpu_ram_efficient_loading="true" \
129+
--fsdp_forward_prefetch="false" \
130+
--fsdp_offload_params="false" \
131+
--fsdp_sharding_strategy="HYBRID_SHARD" \
132+
--fsdp_state_dict_type="FULL_STATE_DICT" \
133+
--fsdp_sync_module_states="true" \
134+
--machine_rank="${RANK}" \
135+
--main_process_ip="${MASTER_ADDR}" \
136+
--main_process_port="${MASTER_PORT}" \
137+
--mixed_precision="no" \
138+
--num_machines="${WORLD_SIZE}" \
139+
--rdzv_backend="static" \
140+
--same_network \
141+
--use_fsdp \
142+
-m tuning.sft_trainer \
143+
--training_data_path "/test/data/offline_processing_shards" \
144+
--adam_beta1="0.9" \
145+
--adam_beta2="0.98" \
146+
--adam_epsilon="1e-10" \
147+
--aim_repo="${AIMSTACK_DB}" \
148+
--dataloader_drop_last="true" \
149+
--dataset_text_field="random" \
150+
--evaluation_strategy="no" \
151+
--experiment="train-nb-g8b-r26-e0e88b40-dbd8-41ae-a744-c853959495f2" \
152+
--gradient_accumulation_steps="1" \
153+
--gradient_checkpointing="true" \
154+
--include_tokens_per_second="false" \
155+
--instruction_template="<|start_of_role|>user<|end_of_role|>" \
156+
--learning_rate="1e-06" \
157+
--logging_steps="1" \
158+
--logging_strategy="steps" \
159+
--lr_scheduler_type="cosine" \
160+
--max_seq_length="8192" \
161+
--max_steps="12400" \
162+
--model_name_or_path="/test/models/granite-3.1-8b-instruct" \
163+
--num_train_epochs="3" \
164+
--optim="adamw_torch" \
165+
--output_dir="/hfcache/data_mixing/data_mixing/wca_summ/run26_rb_mix" \
166+
--packing="False" \
167+
--per_device_train_batch_size="32" \
168+
--response_template="<|start_of_role|>assistant<|end_of_role|>" \
169+
--save_steps="100" \
170+
--save_strategy="steps" \
171+
--split_batches="true" \
172+
--torch_dtype="bfloat16" \
173+
--use_flash_attn="true" \
174+
--use_reentrant="true" \
175+
--warmup_ratio="0.1" \
176+
--warmup_steps="200" \
177+
--weight_decay="0.1"
178+
```

pyproject.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,7 @@ aim = ["aim>=3.19.0,<4.0"]
4848
mlflow = ["mlflow"]
4949
fms-accel = ["fms-acceleration>=0.6"]
5050
gptq-dev = ["auto_gptq>0.4.2", "optimum>=1.15.0"]
51+
mamba = ["mamba_ssm[causal-conv1d]>=2.0.0,<3.0.0"]
5152
scanner-dev = ["HFResourceScanner>=0.1.0"]
5253

5354
