Commit 6f25eed

Merge pull request #504 from foundation-model-stack/v2.7.0-rc.4_branch
chore(release): merge set of changes for v2.7.0
2 parents 53f2bab + c963595 commit 6f25eed

33 files changed (+1837 −141 lines)
.github/actions/free-up-disk-space (new composite action)

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+name: "Free up disk space"
+description: "Removes non-essential tools, libraries and cached files from GitHub action runner node"
+
+runs:
+  using: "composite"
+  steps:
+    - name: "Remove non-essential tools and libraries"
+      shell: bash
+      run: |
+        # https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
+        echo "Disk usage before cleanup:"
+        df -h
+        echo "Removing non-essential tools and libraries ..."
+        sudo rm -rf /opt/ghc
+        sudo rm -rf /usr/local/.ghcup
+        sudo rm -rf /usr/share/dotnet
+        # sudo rm -rf /usr/local/share/boost
+        echo "Deleting libraries for Android (12G), CodeQL (5.3G), PowerShell (1.3G), Swift (1.7G) ..."
+        sudo rm -rf /usr/local/lib/android
+        sudo rm -rf "${AGENT_TOOLSDIRECTORY}/CodeQL"
+        sudo rm -rf /usr/local/share/powershell
+        sudo rm -rf /usr/share/swift
+        # ref: https://github.com/jlumbroso/free-disk-space/blob/main/action.yml
+        echo "Deleting some larger apt packages:"
+        sudo apt-get remove -y '^aspnetcore-.*' || echo "::warning::The command [sudo apt-get remove -y '^aspnetcore-.*'] failed to complete successfully. Proceeding..."
+        sudo apt-get remove -y '^dotnet-.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y '^dotnet-.*' --fix-missing] failed to complete successfully. Proceeding..."
+        sudo apt-get remove -y '^llvm-.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y '^llvm-.*' --fix-missing] failed to complete successfully. Proceeding..."
+        sudo apt-get remove -y 'php.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y 'php.*' --fix-missing] failed to complete successfully. Proceeding..."
+        sudo apt-get remove -y '^mongodb-.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y '^mongodb-.*' --fix-missing] failed to complete successfully. Proceeding..."
+        sudo apt-get remove -y '^mysql-.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y '^mysql-.*' --fix-missing] failed to complete successfully. Proceeding..."
+        sudo apt-get remove -y azure-cli google-chrome-stable firefox powershell mono-devel libgl1-mesa-dri --fix-missing || echo "::warning::The command [sudo apt-get remove -y azure-cli google-chrome-stable firefox powershell mono-devel libgl1-mesa-dri --fix-missing] failed to complete successfully. Proceeding..."
+        sudo apt-get remove -y google-cloud-sdk --fix-missing || echo "::debug::The command [sudo apt-get remove -y google-cloud-sdk --fix-missing] failed to complete successfully. Proceeding..."
+        sudo apt-get remove -y google-cloud-cli --fix-missing || echo "::debug::The command [sudo apt-get remove -y google-cloud-cli --fix-missing] failed to complete successfully. Proceeding..."
+        sudo apt-get autoremove -y || echo "::warning::The command [sudo apt-get autoremove -y] failed to complete successfully. Proceeding..."
+        sudo apt-get clean || echo "::warning::The command [sudo apt-get clean] failed to complete successfully. Proceeding..."
+        echo "Disk usage after cleanup:"
+        df -h
+
+    - name: "Prune docker images"
+      shell: bash
+      run: |
+        echo "Pruning docker images ..."
+        docker image prune -a -f
+        docker system df
+        echo "Disk usage after pruning docker images:"
+        df -h

.github/workflows/image.yaml

Lines changed: 2 additions & 7 deletions

@@ -10,13 +10,8 @@ jobs:
     runs-on: ubuntu-latest
     steps:
      - uses: actions/checkout@v4
-     - name: Free disk space
-       run: |
-         sudo swapoff -a
-         sudo rm -f /swapfile
-         sudo apt clean
-         if [ "$(docker image ls -q)" ]; then docker rmi $(docker image ls -aq); fi
-         df -h
+     - name: "Free up disk space"
+       uses: ./.github/actions/free-up-disk-space
      - name: Build image
        run: |
          docker build -t fms-hf-tuning:dev . -f build/Dockerfile

README.md

Lines changed: 14 additions & 1 deletion

@@ -187,6 +187,19 @@ Here are some scenarios addressed in the flow chart:
 3. There might be special tokens used in the chat template which the tokenizer is unaware of, for example `<|start_of_role|>`, which can cause issues during tokenization as it might not be treated as a single token
 
 
+#### Add Special Tokens
+Working with multi-turn chat data might require the tokenizer to use a few new control tokens (e.g. `<|assistant|>`, `[SYS]`) as described in the guidelines above. These special tokens might not be present in the tokenizer's vocabulary if the user is using a base model.
+
+Users can pass the `--add_special_tokens` argument, which adds the required tokens to the tokenizer's vocabulary.
+For example, the special tokens used in `--instruction_template`/`--response_template` can be passed as follows:
+
+```
+python -m tuning.sft_trainer \
+  ...
+  --add_special_tokens "<|start_of_role|>" "<|end_of_role|>" \
+  --instruction_template "<|start_of_role|>user<|end_of_role|>" \
+  --response_template "<|start_of_role|>assistant<|end_of_role|>"
+```
 
 ### 4. Pre tokenized datasets.
 
@@ -791,7 +804,7 @@ Notes:
 * Notes on Fast MoE
   - `--fast_moe` is an integer value that configures the amount of expert parallel sharding (ep_degree).
   - `world_size` must be divisible by the `ep_degree`
-  - Running fast moe modifies the state dict of the model, and must be post-processed using [checkpoint utils](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py) to run inference (HF, vLLM, etc.).
+  - Running fast moe modifies the state dict of the model, which must be post-processed before running inference (HF, vLLM, etc.). This conversion happens automatically, and the converted checkpoint can be found in the `hf_converted_checkpoint` folder within every saved checkpoint directory. Alternatively, the conversion can be performed manually with the [checkpoint utils](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py) script.
   - The typical usecase for this script is to run:
   ```
   python -m fms_acceleration_moe.utils.checkpoint_utils \
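As a companion to the `hf_converted_checkpoint` note in the diff above, a minimal sketch of loading such an automatically converted checkpoint for inference with the standard `transformers` API — the paths below are hypothetical and not taken from this PR:

```
# Hypothetical paths: adjust to your --output_dir and saved checkpoint step.
from transformers import AutoModelForCausalLM, AutoTokenizer

converted = "/path/to/output_dir/checkpoint-100/hf_converted_checkpoint"  # hypothetical
model = AutoModelForCausalLM.from_pretrained(converted)
# the tokenizer may live alongside the checkpoint or come from the base model
tokenizer = AutoTokenizer.from_pretrained("/path/to/output_dir/checkpoint-100")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```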

build/Dockerfile

Lines changed: 13 additions & 2 deletions

@@ -88,7 +88,8 @@ ENV NV_CUDA_CUDART_DEV_VERSION=12.1.55-1 \
     NV_NVML_DEV_VERSION=12.1.55-1 \
     NV_LIBCUBLAS_DEV_VERSION=12.1.0.26-1 \
     NV_LIBNPP_DEV_VERSION=12.0.2.50-1 \
-    NV_LIBNCCL_DEV_PACKAGE_VERSION=2.18.3-1+cuda12.1
+    NV_LIBNCCL_DEV_PACKAGE_VERSION=2.18.3-1+cuda12.1 \
+    NV_CUDNN9_CUDA_VERSION=9.6.0.74-1
 
 RUN dnf config-manager \
     --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
@@ -103,6 +104,15 @@ RUN dnf config-manager \
        libnccl-devel-${NV_LIBNCCL_DEV_PACKAGE_VERSION} \
     && dnf clean all
 
+# keeping the connection open for too long in one go was resulting in timeouts
+RUN dnf config-manager \
+       --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
+    && dnf clean packages \
+    && dnf install -y \
+       libcusparselt0 libcusparselt-devel \
+       cudnn9-cuda-12-6-${NV_CUDNN9_CUDA_VERSION} \
+    && dnf clean all
+
 ENV LIBRARY_PATH="$CUDA_HOME/lib64/stubs"
 
 FROM cuda-devel AS python-installations
@@ -138,7 +148,8 @@ RUN if [[ -z "${WHEEL_VERSION}" ]]; \
 RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
     python -m pip install --user wheel && \
    python -m pip install --user "$(head bdist_name)" && \
-    python -m pip install --user "$(head bdist_name)[flash-attn]"
+    python -m pip install --user "$(head bdist_name)[flash-attn]" && \
+    python -m pip install --user "$(head bdist_name)[mamba]"
 
 # fms_acceleration_peft = PEFT-training, e.g., 4bit QLoRA
 # fms_acceleration_foak = Fused LoRA and triton kernels

docs/advanced-data-preprocessing.md

Lines changed: 64 additions & 0 deletions

@@ -47,6 +47,8 @@ definitions:
         type: string
       seed:
         type: integer
+      chat_template:
+        type: string
     required:
       - type
     title: Dataprocessor
@@ -115,8 +117,10 @@ Users can create a data config file in any of YAML or JSON format they choose (w
 
 `datapreprocessor`:
 - `type` (optional, str): Type of data preprocessor, `default` is currently the only supported type.
+- `streaming` (optional, bool): Stream datasets using [IterableDatasets](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset).
 - `sampling_stopping_strategy` (optional, str): Dataset interleave stopping strategy when mixing multiple datasets by weight; supported values are [`all_exhausted` or `first_exhausted`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy), defaults to `all_exhausted`.
 - `sampling_seed` (optional, int): [Sampling seed](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to use for interleaving datasets; choose the same value for reproducibility, defaults to 42.
+- `chat_template` (optional, str): Chat template passed via the data config for multi-turn data; it replaces the existing default chat template.
 
 `datasets` (list):
 - `name` (optional, str): A unique identifier for the dataset.
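To make the new `chat_template` option above concrete, a minimal hypothetical data config sketch in the style of the streaming example later in this document; the template string is a placeholder, not taken from this PR:

```
dataprocessor:
  type: default
  # hypothetical Jinja-style template; it replaces the tokenizer's default chat template
  chat_template: "{% for message in messages %}<|start_of_role|>{{ message['role'] }}<|end_of_role|>{{ message['content'] }}{% endfor %}"
```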
@@ -229,6 +233,8 @@ This library currently supports the following [preexisting data handlers](https:
     Uses a tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.
 - `duplicate_columns`:
     Duplicates one column of the dataset to another column.
+- `tokenize`:
+    Tokenizes one column of the dataset, passed as the input `dataset_text_field`.
 
 These handlers can be requested by name, and users can look up the function arguments from [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py)
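A sketch of how the new `tokenize` handler might be requested in a data config, assuming it takes `dataset_text_field` through `fn_kwargs` in the same way as the other handler examples in this document; the dataset path and field name are placeholders:

```
datasets:
  - name: plain_text_dataset
    data_paths:
      - "<path-to-the-dataset>"         # placeholder
    data_handlers:
      - name: tokenize
        arguments:
          batched: false
          fn_kwargs:
            dataset_text_field: "text"  # assumed argument, per the handler description above
```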

@@ -251,6 +257,64 @@ We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2
 
 `Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset`
 
+### Data Streaming
+Dataset streaming lets users leverage iterable datasets to pass data in piece by piece, avoiding memory constraints with large datasets for use cases like extended pre-training.
+
+Users can enable streaming by setting `streaming` to `true` in the `datapreprocessor` config. This top-level variable must be set for all datasets in the config and cannot differ from dataset to dataset. When `streaming` is `true`, the dataset is loaded as an [`IterableDataset`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset) instead of a regular `Dataset`, which means it is loaded chunk by chunk and processed lazily rather than all at once. For more details on the differences, see the [HF Blog](https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable).
+
+In a data config this looks like (see the [ept document](./ept.md#large-non-tokenized-dataset) for a more in-depth example):
+```
+dataprocessor:
+  type: default
+  streaming: true
+```
+
+When using streaming, `split_batches` in the `TrainingArguments` is automatically set to `True`: the main process fetches a full batch and slices it into `num_processes` smaller batches, one per process. This means the batch size must be divisible by `num_processes`, and it replaces the global batch size.
+
+**When using streaming, the user must set `max_steps` in the `TrainingArguments` instead of `num_train_epochs`.** Since iterable datasets are loaded chunk by chunk, the data cannot be run through epochs in the typical fashion, as the **Trainer** cannot know the length of the dataset while it is being streamed. If both `max_steps` and `num_train_epochs` are given in a training config, `max_steps` overrides `num_train_epochs`, since `max_steps` directly specifies the total number of optimization steps, which is needed when the dataset length cannot be known.
+
+If the dataset size is known to the user, `max_steps` can be calculated as the total number of samples divided by the batch size.
+
 ### Example data configs.
 
 We provide some example data configs [here](../tests/artifacts/predefined_data_configs/)
+
+## Offline Data preprocessing
+
+[This script](../scripts/offline_data_processing.py) gives users the ability to perform standalone data
+preprocessing, decoupled from the tuning/training part. It processes raw datasets, performs data preprocessing, and
+saves the train and validation datasets (in shards, if `--num_dataset_shards` is passed) in parquet format inside the specified `output_dir`.
+A data config YAML file can be used to pass configuration to this script. Example command to run this script:
+
+```
+python scripts/offline_data_processing.py \
+    --data_config_path /path/to/data_config.yaml \
+    --model_name_or_path "model_name" \
+    --max_seq_length 4096 \
+    --output_dir /path/to/output/directory \
+    --log_level info \
+    --num_dataset_shards 3
+```
+
+Example data config file:
+
+```
+dataprocessor:
+  type: default
+  sampling_stopping_strategy: first_exhausted
+  seed: 66
+datasets:
+  - name: dataset_1
+    data_paths:
+      - tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl
+    data_handlers:
+      - name: tokenize_and_apply_input_masking
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            input_field_name: input
+            output_field_name: output
+```
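As a quick worked illustration of the `max_steps` guidance in the streaming section above — the numbers here are hypothetical, not from this PR:

```
# Hypothetical figures: 1,000,000 streamed samples with an effective batch size
# of 64 (per-device batch size x number of processes x gradient accumulation).
num_samples = 1_000_000
effective_batch_size = 64
max_steps = num_samples // effective_batch_size  # 15625 steps ~ one full pass over the data
```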

docs/ept.md

Lines changed: 29 additions & 0 deletions

@@ -107,6 +107,35 @@ Here also the command line arguments would be
 
 The code again would add `EOS_TOKEN` to the non-tokenized data before using it; also note that the `dataset_text_field` is assumed to be the same across all datasets for now.
 
+### Large Non-Tokenized Dataset
+If you have a large JSONL data file that cannot fit into memory at once and you want to perform EPT on it, you can use the streaming feature to load and process the data efficiently in chunks. To enable streaming, define a data config as follows.
+
+Sample data config for the above use case:
+```
+dataprocessor:
+  type: default
+  streaming: true
+datasets:
+  - name: non_tokenized_text_dataset
+    data_paths:
+      - "<path-to-the-jsonl-dataset>"
+    data_handlers:
+      - name: add_tokenizer_eos_token
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            dataset_text_field: "dataset_text_field"
+```
+
+The command-line arguments passed to the library should include the following:
+
+```
+--data_config <path to the data config> --packing=True --max_seq_len 8192 --max_steps <num training steps>
+```
+
+Please note that when using streaming, the user must pass `max_steps` instead of `num_train_epochs`. See the advanced data preprocessing [document](./advanced-data-preprocessing.md#data-streaming) for more info.
+
 ### Additional Information
 This feature is supported post [v2.3.1](https://github.com/foundation-model-stack/fms-hf-tuning/releases/tag/v2.3.1) of this library.
 Post Last Updated On: 12-02-2025

pyproject.toml

Lines changed: 4 additions & 3 deletions

@@ -29,16 +29,16 @@ classifiers=[
 dependencies = [
     "numpy>=1.26.4,<2.0",
     "accelerate>=0.20.3,!=0.34,<1.1",
-    "transformers>=4.46,<4.48.2",
+    "transformers>=4.49,<5.0",
     "torch>=2.2.0,<2.5",
     "sentencepiece>=0.1.99,<0.3",
     "tokenizers>=0.13.3,<1.0",
     "tqdm>=4.66.2,<5.0",
     "trl>=0.13,<0.15",
     "peft>=0.8.0,<0.14",
     "protobuf>=5.28.0,<6.0.0",
-    "datasets>=2.15.0,<3.0",
-    "simpleeval>=0.9.13,<1.0",
+    "datasets>=2.15.0,<4.0",
+    "simpleeval>=0.9.13,<2.0",
 ]
 
 [project.optional-dependencies]
@@ -48,6 +48,7 @@ aim = ["aim>=3.19.0,<4.0"]
 mlflow = ["mlflow"]
 fms-accel = ["fms-acceleration>=0.6"]
 gptq-dev = ["auto_gptq>0.4.2", "optimum>=1.15.0"]
+mamba = ["mamba_ssm[causal-conv1d] @ git+https://github.com/state-spaces/mamba.git"]
 scanner-dev = ["HFResourceScanner>=0.1.0"]
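To pick up the new optional `mamba` dependency group outside of the container build, a minimal sketch of installing from a source checkout — this assumes a local clone and a toolchain able to build `mamba_ssm`:

```
# from the root of a fms-hf-tuning checkout; pulls mamba_ssm (with causal-conv1d)
# from the state-spaces/mamba git repository, as declared in pyproject.toml
pip install ".[mamba]"
```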
