Commit 6f9bab2

Merge pull request #453 from foundation-model-stack/2_5_release_prep

chore: Merge in changes for v2.5.0 release

2 parents: 76bd76d + 8d5fbf0

File tree: 15 files changed, +296 / -20 lines changed

.github/workflows/build-and-publish.yaml

Lines changed: 2 additions & 2 deletions
@@ -24,8 +24,8 @@ jobs:
       strategy:
         matrix:
           python-version:
-            - setup: "3.11"
-              tox: "py311"
+            - setup: "3.12"
+              tox: "py312"

       environment:
         name: pypi

.github/workflows/coverage.yaml

Lines changed: 2 additions & 2 deletions
@@ -10,10 +10,10 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
-      - name: Set up Python 3.11
+      - name: Set up Python 3.12
         uses: actions/setup-python@v4
         with:
-          python-version: 3.11
+          python-version: 3.12
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip

.github/workflows/format.yml

Lines changed: 2 additions & 2 deletions
@@ -25,10 +25,10 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
-      - name: Set up Python 3.9
+      - name: Set up Python 3.12
         uses: actions/setup-python@v4
         with:
-          python-version: 3.9
+          python-version: 3.12
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip

.github/workflows/test.yaml

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ jobs:
             tox: "py310"
           - setup: "3.11"
             tox: "py311"
+          - setup: "3.12"
+            tox: "py312"
     steps:
       - uses: actions/checkout@v4
       - name: Install dependencies

README.md

Lines changed: 24 additions & 4 deletions
@@ -64,10 +64,11 @@ For more details on how to enable and use the trackers, Please see, [the experim
 ## Data Support
 Users can pass training data as either a single file or a Hugging Face dataset ID using the `--training_data_path` argument along with other arguments required for various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below). If user choose to pass a file, it can be in any of the [supported formats](#supported-data-formats). Alternatively, you can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.

-
 Below, we mention the list of supported data usecases via `--training_data_path` argument. For details of our advanced data preprocessing see more details in [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).

-## Supported Data Formats
+EOS tokens are added to all data formats listed below (an EOS token is appended to the end of each data point, such as a sentence or paragraph within the dataset), except for the pretokenized data format at this time. For more info, see [pretokenized](#4-pre-tokenized-datasets).
+
+## Supported Data File Formats
 We support the following file formats via `--training_data_path` argument

 Data Format | Tested Support
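As an aside on the EOS note added above, here is a minimal sketch of what appending an EOS token to each data point amounts to. It is illustrative only: the helper name and the `output` text field are assumptions, not the repository's actual preprocessing code.

```python
# Minimal sketch, not fms-hf-tuning's actual preprocessing code: append the
# tokenizer's EOS token to each data point, as the README describes for all
# non-pretokenized data formats.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Maykeye/TinyLLama-v0")

def append_eos(example: dict, text_field: str = "output") -> dict:
    # Mark where each data point ends so the model learns to stop.
    example[text_field] = example[text_field] + tokenizer.eos_token
    return example

# Usage with a Hugging Face dataset (assumed column name):
# dataset = dataset.map(append_eos)
```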
@@ -79,6 +80,11 @@ ARROW | ✅

 As iterated above, we also support passing a HF dataset ID directly via `--training_data_path` argument.

+**NOTE**: Due to the variety of supported data formats and file types, `--training_data_path` is handled as follows:
+- If `--training_data_path` ends in a valid file extension (e.g., .json, .csv), it is treated as a file.
+- If `--training_data_path` points to a valid folder, it is treated as a folder.
+- If neither of these is true, the data preprocessor tries to load `--training_data_path` as a Hugging Face (HF) dataset ID.
+
 ## Use cases supported with `training_data_path` argument

 ### 1. Data formats with a single sequence and a specified response_template to use for masking on completion.
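The NOTE in the hunk above describes a three-step resolution order for `--training_data_path`. A minimal sketch of that logic, with an assumed function name and an assumed extension set (the README only names .json and .csv as examples), not the library's actual implementation:

```python
# Sketch of the resolution order described in the README NOTE; the function
# name and extension set are illustrative, not the library's implementation.
import os

def resolve_training_data_path(path: str) -> tuple[str, str]:
    _, ext = os.path.splitext(path)
    if ext.lower() in {".json", ".jsonl", ".csv", ".parquet", ".arrow"}:
        return ("file", path)         # ends in a recognized file extension
    if os.path.isdir(path):
        return ("folder", path)       # points to an existing folder
    return ("hf_dataset_id", path)    # otherwise, try it as a HF dataset ID
```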
@@ -169,15 +175,29 @@ For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-

 The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of text ensuring model learns only on the `assistant` responses for both single and multi turn chat.

-### 3. Pre tokenized datasets.
+Depending on the scenario, users may need to decide how to use a chat template with their data, or which chat template to use for their use case.
+
+Our guidelines are summarized in the following flow chart:
+![guidelines for chat template](docs/images/chat_template_guide.jpg)
+
+Here are some scenarios addressed in the flow chart:
+1. Depending on the model, the tokenizer for the model may or may not have a chat template.
+2. If a template is available, the `json object schema` of the dataset might not match the chat template's `string format`.
+3. The chat template might use special tokens that the tokenizer is unaware of, for example `<|start_of_role|>`, which can cause issues during tokenization because it might not be treated as a single token.
+
+
+
+### 4. Pre tokenized datasets.

 Users can also pass a pretokenized dataset (containing `input_ids` and `labels` columns) as `--training_data_path` argument e.g.

+At this time, the data preprocessor does not add EOS tokens to pretokenized datasets; users must ensure EOS tokens are included in their pretokenized data if needed.
+
 ```
 python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
 ```

-### 4. Advanced data preprocessing.
+### Advanced data preprocessing.

 For advanced data preprocessing support including mixing and custom preprocessing of datasets please see [this document](./docs/advanced-data-preprocessing.md).
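The chat-template scenarios listed in the hunk above can be checked programmatically. A hedged sketch using the Hugging Face tokenizer API; the model ID is an assumption for illustration (the README references a granite instruct model), and none of this is repository code:

```python
# Illustrative checks for the three chat-template scenarios; not repository code.
from transformers import AutoTokenizer

# Model ID assumed for illustration.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-instruct")

# Scenario 1: a tokenizer may or may not ship a chat template.
print("has chat template:", tokenizer.chat_template is not None)

# Scenario 2: apply_chat_template renders the dataset's JSON message objects
# into the template's string format; the two schemas must line up.
messages = [{"role": "user", "content": "Hello!"}]
print(tokenizer.apply_chat_template(messages, tokenize=False))

# Scenario 3: a template-specific token the tokenizer does not know gets split
# into several pieces; registering it makes it a single special token.
token = "<|start_of_role|>"
if len(tokenizer.tokenize(token)) > 1:
    tokenizer.add_special_tokens({"additional_special_tokens": [token]})
    # The model's embedding table would then need resizing, e.g.:
    # model.resize_token_embeddings(len(tokenizer))
```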

build/Dockerfile

Lines changed: 10 additions & 4 deletions
@@ -17,12 +17,13 @@
 ARG BASE_UBI_IMAGE_TAG=latest
 ARG USER=tuning
 ARG USER_UID=1000
-ARG PYTHON_VERSION=3.11
+ARG PYTHON_VERSION=3.12
 ARG WHEEL_VERSION=""
 ## Enable Aimstack or MLflow if requested via ENABLE_AIM/MLFLOW set to "true"
 ARG ENABLE_AIM=false
 ARG ENABLE_MLFLOW=false
 ARG ENABLE_FMS_ACCELERATION=true
+ARG ENABLE_SCANNER=false

 ## Base Layer ##################################################################
 FROM registry.access.redhat.com/ubi9/ubi:${BASE_UBI_IMAGE_TAG} AS base

@@ -31,7 +32,7 @@ ARG PYTHON_VERSION
 ARG USER
 ARG USER_UID

-# Note this works for 3.9, 3.11, 3.12
+# Note: this is tested to work for versions 3.9, 3.11, and 3.12
 RUN dnf remove -y --disableplugin=subscription-manager \
     subscription-manager \
     && dnf install -y python${PYTHON_VERSION} procps g++ python${PYTHON_VERSION}-devel \

@@ -51,7 +52,7 @@ RUN useradd -u $USER_UID ${USER} -m -g 0 --system && \
 ## Used as base of the Release stage to removed unrelated the packages and CVEs
 FROM base AS release-base

-# Removes the python3.9 code to eliminate possible CVEs. Also removes dnf
+# Removes the python code to eliminate possible CVEs. Also removes dnf
 RUN rpm -e $(dnf repoquery python3-* -q --installed) dnf python3 yum crypto-policies-scripts

@@ -111,6 +112,7 @@ ARG USER
 ARG USER_UID
 ARG ENABLE_FMS_ACCELERATION
 ARG ENABLE_AIM
+ARG ENABLE_SCANNER

 RUN dnf install -y git && \
     # perl-Net-SSLeay.x86_64 and server_key.pem are installed with git as dependencies

@@ -154,7 +156,11 @@ RUN if [[ "${ENABLE_AIM}" == "true" ]]; then \

 RUN if [[ "${ENABLE_MLFLOW}" == "true" ]]; then \
     python -m pip install --user "$(head bdist_name)[mlflow]"; \
-fi
+    fi
+
+RUN if [[ "${ENABLE_SCANNER}" == "true" ]]; then \
+    python -m pip install --user "$(head bdist_name)[scanner-dev]"; \
+    fi

 # Clean up the wheel module. It's only needed by flash-attn install
 RUN python -m pip uninstall wheel build -y && \
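The new ENABLE_SCANNER argument follows the same pattern as ENABLE_AIM and ENABLE_MLFLOW, so the scanner extra can presumably be baked into the image at build time, e.g. `docker build --build-arg ENABLE_SCANNER=true -f build/Dockerfile .` (the build-arg name comes from this diff; the rest of the invocation is an assumed example).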
[Binary file, 995 KB; preview not rendered. Presumably docs/images/chat_template_guide.jpg, the chat template flow chart referenced from the README above.]

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@ classifiers=[
 "Programming Language :: Python :: 3.9",
 "Programming Language :: Python :: 3.10",
 "Programming Language :: Python :: 3.11",
+"Programming Language :: Python :: 3.12"
 ]
 dependencies = [
 "numpy>=1.26.4,<2.0",

@@ -47,6 +48,7 @@ aim = ["aim>=3.19.0,<4.0"]
 mlflow = ["mlflow"]
 fms-accel = ["fms-acceleration>=0.6"]
 gptq-dev = ["auto_gptq>0.4.2", "optimum>=1.15.0"]
+scanner-dev = ["HFResourceScanner>=0.1.0"]


 [tool.setuptools.packages.find]
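Assuming the project publishes under its repository name, the new extra would be installed like the existing ones, e.g. `pip install "fms-hf-tuning[scanner-dev]"`; the extra name and the HFResourceScanner pin come from this diff, while the package name is an assumption.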

tests/build/test_launch_script.py

Lines changed: 38 additions & 1 deletion
@@ -16,12 +16,14 @@
 """

 # Standard
+import json
 import os
 import tempfile
 import glob

 # Third Party
 import pytest
+from transformers.utils.import_utils import _is_package_available

 # First Party
 from build.accelerate_launch import main

@@ -31,7 +33,10 @@
     USER_ERROR_EXIT_CODE,
     INTERNAL_ERROR_EXIT_CODE,
 )
-from tuning.config.tracker_configs import FileLoggingTrackerConfig
+from tuning.config.tracker_configs import (
+    FileLoggingTrackerConfig,
+    HFResourceScannerConfig,
+)

 SCRIPT = "tuning/sft_trainer.py"
 MODEL_NAME = "Maykeye/TinyLLama-v0"

@@ -246,6 +251,38 @@ def test_lora_with_lora_post_process_for_vllm_set_to_true():
     assert os.path.exists(new_embeddings_file_path)


+@pytest.mark.skipif(
+    not _is_package_available("HFResourceScanner"),
+    reason="Only runs if HFResourceScanner is installed",
+)
+def test_launch_with_HFResourceScanner_enabled():
+    with tempfile.TemporaryDirectory() as tempdir:
+        setup_env(tempdir)
+        scanner_outfile = os.path.join(
+            tempdir, HFResourceScannerConfig.scanner_output_filename
+        )
+        TRAIN_KWARGS = {
+            **BASE_LORA_KWARGS,
+            **{
+                "output_dir": tempdir,
+                "save_model_dir": tempdir,
+                "lora_post_process_for_vllm": True,
+                "gradient_accumulation_steps": 1,
+                "trackers": ["hf_resource_scanner"],
+                "scanner_output_filename": scanner_outfile,
+            },
+        }
+        serialized_args = serialize_args(TRAIN_KWARGS)
+        os.environ["SFT_TRAINER_CONFIG_JSON_ENV_VAR"] = serialized_args
+
+        assert main() == 0
+        assert os.path.exists(scanner_outfile) is True
+        with open(scanner_outfile, "r", encoding="utf-8") as f:
+            scanner_res = json.load(f)
+        assert scanner_res["time_data"] is not None
+        assert scanner_res["mem_data"] is not None
+
+
 def test_bad_script_path():
     """Check for appropriate error for an invalid training script location"""
     with tempfile.TemporaryDirectory() as tempdir:
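Judging from TRAIN_KWARGS above, the scanner is switched on through the `trackers` list and directed with `scanner_output_filename`. A hedged sketch of an equivalent direct invocation, assuming these keys map one-to-one onto `sft_trainer.py` flags:

```
python tuning/sft_trainer.py ... --trackers hf_resource_scanner --scanner_output_filename scanner_output.json
```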

tests/test_sft_trainer.py

Lines changed: 23 additions & 3 deletions
@@ -363,6 +363,7 @@ def test_parse_arguments(job_config):
         _,
         _,
         _,
+        _,
     ) = sft_trainer.parse_arguments(parser, job_config_copy)
     assert str(model_args.torch_dtype) == "torch.bfloat16"
     assert data_args.dataset_text_field == "output"

@@ -390,6 +391,7 @@ def test_parse_arguments_defaults(job_config):
         _,
         _,
         _,
+        _,
     ) = sft_trainer.parse_arguments(parser, job_config_defaults)
     assert str(model_args.torch_dtype) == "torch.bfloat16"
     assert model_args.use_flash_attn is False

@@ -400,14 +402,14 @@ def test_parse_arguments_peft_method(job_config):
     parser = sft_trainer.get_parser()
     job_config_pt = copy.deepcopy(job_config)
     job_config_pt["peft_method"] = "pt"
-    _, _, _, _, tune_config, _, _, _, _, _, _, _, _ = sft_trainer.parse_arguments(
+    _, _, _, _, tune_config, _, _, _, _, _, _, _, _, _ = sft_trainer.parse_arguments(
         parser, job_config_pt
     )
     assert isinstance(tune_config, peft_config.PromptTuningConfig)

     job_config_lora = copy.deepcopy(job_config)
     job_config_lora["peft_method"] = "lora"
-    _, _, _, _, tune_config, _, _, _, _, _, _, _, _ = sft_trainer.parse_arguments(
+    _, _, _, _, tune_config, _, _, _, _, _, _, _, _, _ = sft_trainer.parse_arguments(
         parser, job_config_lora
     )
     assert isinstance(tune_config, peft_config.LoraConfig)

@@ -1053,12 +1055,18 @@ def _test_run_inference(checkpoint_path):


 def _validate_training(
-    tempdir, check_eval=False, train_logs_file="training_logs.jsonl"
+    tempdir,
+    check_eval=False,
+    train_logs_file="training_logs.jsonl",
+    check_scanner_file=False,
 ):
     assert any(x.startswith("checkpoint-") for x in os.listdir(tempdir))
     train_logs_file_path = "{}/{}".format(tempdir, train_logs_file)
     _validate_logfile(train_logs_file_path, check_eval)

+    if check_scanner_file:
+        _validate_hf_resource_scanner_file(tempdir)
+

 def _validate_logfile(log_file_path, check_eval=False):
     train_log_contents = ""

@@ -1073,6 +1081,18 @@ def _validate_logfile(log_file_path, check_eval=False):
     assert "validation_loss" in train_log_contents


+def _validate_hf_resource_scanner_file(tempdir):
+    scanner_file_path = os.path.join(tempdir, "scanner_output.json")
+    assert os.path.exists(scanner_file_path) is True
+    assert os.path.getsize(scanner_file_path) > 0
+
+    with open(scanner_file_path, "r", encoding="utf-8") as f:
+        scanner_contents = json.load(f)
+
+    assert scanner_contents["time_data"] is not None
+    assert scanner_contents["mem_data"] is not None
+
+
 def _get_checkpoint_path(dir_path):
     return os.path.join(dir_path, "checkpoint-5")

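For context on the test changes above: the extra `_` placeholders reflect `sft_trainer.parse_arguments` returning one additional value in this release (presumably a scanner-related config), and `check_scanner_file` gives tests a switch to validate the scanner output. A minimal sketch of a call site, assuming a completed training run with the `hf_resource_scanner` tracker enabled:

```python
# Hypothetical call site (the helper and flag come from the diff above; the
# surrounding test body is assumed): validate the scanner output file
# alongside the usual training artifacts.
_validate_training(tempdir, check_scanner_file=True)
```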