
Commit 3aca84e

Merge branch 'main' into integrate-main-qa
2 parents 9315f1f + 999f7b2 commit 3aca84e

File tree

10 files changed: +258 additions, -117 deletions

README.md

Lines changed: 39 additions & 11 deletions
@@ -13,38 +13,68 @@ LLM Finetuning toolkit is a config-based CLI tool for launching a series of LLM
 </p>
 
 ## Installation
+
 ### pipx (recommended)
+
 pipx installs the package and dependencies in a separate virtual environment
+
 ```shell
 pipx install llm-toolkit
 ```
 
 ### pip
+
 ```shell
 pip install llm-toolkit
 ```
 
-
 ## Quick Start
 
 This guide contains 3 stages that will enable you to get the most out of this toolkit!
 
 - **Basic**: Run your first LLM fine-tuning experiment
-- **Intermediate**: Run a custom experiment by changing the componenets of the YAML configuration file
+- **Intermediate**: Run a custom experiment by changing the components of the YAML configuration file
 - **Advanced**: Launch a series of fine-tuning experiments across different prompt templates, LLMs, and optimization techniques -- all through **one** YAML configuration file
 
 ### Basic
 
-```python
-llmtune --config-path ./config.yml
+```shell
+llmtune generate config
+llmtune run ./config.yml
 ```
 
-This command initiates the fine-tuning process using the settings specified in the default YAML configuration file `config.yaml`.
+The first command generates a helpful starter `config.yml` file and saves it in the current working directory, giving you a quick start and a base for further modification.
+
+The second command then initiates the fine-tuning process using the settings specified in that `config.yml` file.
 
 ### Intermediate
 
 The configuration file is the central piece that defines the behavior of the toolkit. It is written in YAML format and consists of several sections that control different aspects of the process, such as data ingestion, model definition, training, inference, and quality assurance. We highlight some of the critical sections.
 
+#### Flash Attention 2
+
+To enable Flash Attention 2 for [supported models](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2), first install `flash-attn`:
+
+**pipx**
+
+```shell
+pipx inject llm-toolkit flash-attn --pip-args=--no-build-isolation
+```
+
+**pip**
+
+```shell
+pip install flash-attn --no-build-isolation
+```
+
+Then add the following to the config file:
+
+```yaml
+model:
+  torch_dtype: "bfloat16" # or "float16" if using an older GPU
+  attn_implementation: "flash_attention_2"
+```
+
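For reference, these two keys mirror same-named arguments of the Hugging Face `transformers` loader. A minimal sketch of the equivalent direct call -- an illustration, not part of this commit, assuming `transformers` >= 4.36, a CUDA GPU, and `flash-attn` installed as above:

```python
# Minimal sketch (assumption): the `torch_dtype` and `attn_implementation`
# config keys above map onto the same-named `from_pretrained` kwargs.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",     # checkpoint used in llmtune/config.yml below
    torch_dtype=torch.bfloat16,               # or torch.float16 on older GPUs
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```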
 #### Data Ingestion
 
 An example of what the data ingestion may look like:
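The example itself falls outside this hunk. As a rough, hypothetical sketch of how such data settings feed the ingestion layer -- the key names `file_type` and `path` are placeholders, while `get_ingestor` and the ingestor classes are shown in `llmtune/data/ingestor.py` further down this page:

```python
# Hypothetical sketch only: route a data-ingestion setting to an ingestor.
# The dict keys below are illustrative placeholders, not confirmed config names.
from llmtune.data.ingestor import get_ingestor

data_cfg = {"file_type": "jsonl", "path": "train.jsonl"}  # placeholder keys/values
ingestor_cls = get_ingestor(data_cfg["file_type"])        # JsonlIngestor for "jsonl"
dataset = ingestor_cls(data_cfg["path"]).to_dataset()     # returns a datasets.Dataset
```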
@@ -247,6 +277,7 @@ NOTE: Be sure to merge the latest from "upstream" before making a pull request!
 # GPU
 docker run -it --gpus all llm-toolkit
 ```
+
 </details>
 
 <details>
@@ -257,6 +288,7 @@ See poetry documentation page for poetry [installation instructions](https://pyt
 ```shell
 poetry install
 ```
+
 </details>
 <details>
 <summary>pip</summary>
@@ -265,27 +297,23 @@ We recommend using a virtual environment like `venv` or `conda` for installation
 ```shell
 pip install -e .
 ```
+
 </details>
 </details>
 
-
-
 ### Checklist Before Pull Request (Optional)
 
 1. Use `ruff check --fix` to check and fix lint errors
 2. Use `ruff format` to apply formatting
 
 NOTE: Ruff linting and formatting checks are run via GitHub Actions when a PR is raised. Before raising a PR, it is good practice to check and fix lint errors, as well as apply formatting.
 
-
 ### Releasing
 
-
-To manually release a PyPI package, please run:
+To manually release a PyPI package, please run:
 
 ```shell
 make build-release
 ```
 
 Note: Make sure you have a PyPI token for this [PyPI repo](https://pypi.org/project/llm-toolkit/).
-

llmtune/cli/toolkit.py

Lines changed: 39 additions & 12 deletions
@@ -2,13 +2,19 @@
 import os
 from os import listdir
 from os.path import exists, join
+import shutil
+from pathlib import Path
+
 
 import torch
+import transformers
 import typer
 import yaml
 from pydantic import ValidationError
-from transformers import utils as hf_utils
+from typing_extensions import Annotated
 
+import llmtune
+from llmtune.constants.files import EXAMPLE_CONFIG_FNAME
 from llmtune.data.dataset_generator import DatasetGenerator
 from llmtune.finetune.lora import LoRAFinetune
 from llmtune.inference.lora import LoRAInference
@@ -19,14 +25,22 @@
 from llmtune.utils.save_utils import DirectoryHelper
 
 
-hf_utils.logging.set_verbosity_error()
+transformers.logging.set_verbosity(transformers.logging.CRITICAL)
 torch._logging.set_logs(all=logging.CRITICAL)
+logging.captureWarnings(True)
 
 
 app = typer.Typer()
+generate_app = typer.Typer()
+
+app.add_typer(
+    generate_app,
+    name="generate",
+    help="Generate various artefacts, such as config files",
+)
 
 
-def run_one_experiment(config: Config, config_path: str) -> None:
+def run_one_experiment(config: Config, config_path: Path) -> None:
     dir_helper = DirectoryHelper(config_path, config)
 
     # Loading Data -------------------------------
@@ -39,7 +53,7 @@ def run_one_experiment(config: Config, config_path: str) -> None:
     test_column = dataset_generator.test_column
 
     dataset_path = dir_helper.save_paths.dataset
-    if not exists(dataset_path):
+    if not dataset_path.exists():
         train, test = dataset_generator.get_dataset()
         dataset_generator.save_dataset(dataset_path)
     else:
@@ -55,7 +69,7 @@ def run_one_experiment(config: Config, config_path: str) -> None:
     weights_path = dir_helper.save_paths.weights
 
     # model_loader = ModelLoader(config, console, dir_helper)
-    if not exists(weights_path) or not listdir(weights_path):
+    if not weights_path.exists() or not any(weights_path.iterdir()):
         finetuner = LoRAFinetune(config, dir_helper)
         with RichUI.during_finetune():
             finetuner.finetune(train)
@@ -67,13 +81,13 @@ def run_one_experiment(config: Config, config_path: str) -> None:
     # Inference -------------------------------
     RichUI.before_inference()
     results_path = dir_helper.save_paths.results
-    results_file_path = join(dir_helper.save_paths.results, "results.csv")
-    if not exists(results_path) or exists(results_file_path):
+    results_file_path = dir_helper.save_paths.results_file
+    if not results_file_path.exists():
         inference_runner = LoRAInference(test, test_column, config, dir_helper)
         inference_runner.infer_all()
         RichUI.after_inference(results_path)
     else:
-        RichUI.inference_found(results_path)
+        RichUI.results_found(results_path)
 
     RichUI.before_qa()
     qa_path = dir_helper.save_paths.qa
@@ -85,10 +99,11 @@ def run_one_experiment(config: Config, config_path: str) -> None:
     test_suite.save_test_results(os.path.join(qa_path, "unit_test_results.csv"))
 
 
-@app.command()
-def run(config_path: str = "./config.yml") -> None:
+@app.command("run")
+def run(config_path: Annotated[str, typer.Argument(help="Path of the config yaml file")] = "./config.yml") -> None:
+    """Run the entire experiment pipeline"""
     # Load YAML config
-    with open(config_path, "r") as file:
+    with Path(config_path).open("r") as file:
         config = yaml.safe_load(file)
     configs = (
         generate_permutations(config, Config) if config.get("ablation", {}).get("use_ablate", False) else [config]
@@ -103,12 +118,24 @@ def run(config_path: str = "./config.yml") -> None:
         dir_helper = DirectoryHelper(config_path, config)
 
         # Reload config from saved config
-        with open(join(dir_helper.save_paths.config, "config.yml"), "r") as file:
+        with dir_helper.save_paths.config_file.open("r") as file:
            config = yaml.safe_load(file)
         config = Config(**config)
 
         run_one_experiment(config, config_path)
 
 
+@generate_app.command("config")
+def generate_config():
+    """
+    Generate an example `config.yml` file in the current directory
+    """
+    module_path = Path(llmtune.__file__)
+    example_config_path = module_path.parent / EXAMPLE_CONFIG_FNAME
+    destination = Path.cwd()
+    shutil.copy(example_config_path, destination)
+    RichUI.generate_config(EXAMPLE_CONFIG_FNAME)
+
+
 def cli():
     app()
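One way to exercise the new `generate` sub-app and the annotated `run` argument is Typer's test runner. This is an illustration rather than part of the commit, and it assumes the package is importable:

```python
# Illustrative check of the new Typer wiring (not part of this commit).
# Note: invoking "generate config" really does copy config.yml into the CWD.
from typer.testing import CliRunner

from llmtune.cli.toolkit import app

runner = CliRunner()

result = runner.invoke(app, ["generate", "config"])
print(result.exit_code)  # 0 on success

result = runner.invoke(app, ["run", "--help"])
print(result.output)     # shows the "Path of the config yaml file" argument help
```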

config.yml renamed to llmtune/config.yml

Lines changed: 11 additions & 8 deletions
@@ -17,13 +17,15 @@ data:
   prompt_stub:
     >- # Stub to add for training at the end of prompt, for test set or inference, this is omitted; make sure only one variable is present
     {output}
-  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
-  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
+  test_size: 25 # Proportion of test as % of total; if integer then # of samples
+  train_size: 500 # Proportion of train as % of total; if integer then # of samples
   train_test_split_seed: 42
 
 # Model Definition -------------------
 model:
-  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
+  hf_model_ckpt: "mistralai/Mistral-7B-Instruct-v0.2"
+  torch_dtype: "bfloat16"
+  #attn_implementation: "flash_attention_2"
   quantize: true
   bitsandbytes:
     load_in_4bit: true
@@ -34,6 +36,7 @@ model:
 lora:
   task_type: "CAUSAL_LM"
   r: 32
+  lora_alpha: 64
   lora_dropout: 0.1
   target_modules:
     - q_proj
@@ -47,12 +50,12 @@ lora:
 # Training -------------------
 training:
   training_args:
-    num_train_epochs: 5
+    num_train_epochs: 1
     per_device_train_batch_size: 4
     gradient_accumulation_steps: 4
     gradient_checkpointing: True
     optim: "paged_adamw_32bit"
-    logging_steps: 100
+    logging_steps: 1
     learning_rate: 2.0e-4
     bf16: true # Set to true for mixed precision training on Newer GPUs
     tf32: true
@@ -61,11 +64,11 @@ training:
     warmup_ratio: 0.03
     lr_scheduler_type: "constant"
   sft_args:
-    max_seq_length: 5000
+    max_seq_length: 1024
     # neftune_noise_alpha: None
 
 inference:
-  max_new_tokens: 1024
+  max_new_tokens: 256
   use_cache: True
   do_sample: True
   top_p: 0.9
@@ -80,4 +83,4 @@ qa:
     - verb_percent
     - adjective_percent
     - noun_percent
-    - summary_length
+    - summary_length
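The comments on `test_size` / `train_size` describe the usual fraction-vs-count convention. Assuming the toolkit delegates to the Hugging Face `datasets` split API (an assumption about internals, not shown in this diff), the two styles behave like this:

```python
# Illustration of fraction-vs-count splits (assumes Hugging Face `datasets`).
from datasets import Dataset

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(1000)]})

by_fraction = ds.train_test_split(test_size=0.1, train_size=0.9, seed=42)
by_count = ds.train_test_split(test_size=25, train_size=500, seed=42)

print(len(by_fraction["test"]), len(by_fraction["train"]))  # 100 900
print(len(by_count["test"]), len(by_count["train"]))        # 25 500
```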

llmtune/constants/files.py

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# Example config file
+EXAMPLE_CONFIG_FNAME = "config.yml"
+
+# DIRECTORY HELPER - HASH SETTING
+NUM_MD5_DIGITS_FOR_SQIDS = 2
+
+# DIRECTORY HELPER - DIRECTORY & FILE NAMES
+CONFIG_DIR_NAME = "config"
+CONFIG_FILE_NAME = "config.yml"
+
+DATASET_DIR_NAME = "dataset"
+
+WEIGHTS_DIR_NAME = "weights"
+
+RESULTS_DIR_NAME = "results"
+RESULTS_FILE_NAME = "results.csv"
+
+QA_DIR_NAME = "qa"
+QA_FILE_NAME = "qa_test_results.csv"
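These names are consumed by `DirectoryHelper` (referenced as `dir_helper.save_paths.results_file` in the CLI diff above). A small sketch of how they can compose into concrete paths, using a made-up experiment directory:

```python
# Illustration only: composing save paths from the new constants.
# The experiment directory below is hypothetical; the real path logic lives in
# llmtune/utils/save_utils.py (DirectoryHelper), which this diff does not show.
from pathlib import Path

from llmtune.constants.files import (
    CONFIG_DIR_NAME,
    CONFIG_FILE_NAME,
    RESULTS_DIR_NAME,
    RESULTS_FILE_NAME,
)

experiment_dir = Path("./experiment/ab12")  # hypothetical experiment folder
config_file = experiment_dir / CONFIG_DIR_NAME / CONFIG_FILE_NAME
results_file = experiment_dir / RESULTS_DIR_NAME / RESULTS_FILE_NAME
print(config_file)   # experiment/ab12/config/config.yml
print(results_file)  # experiment/ab12/results/results.csv
```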

llmtune/data/ingestor.py

Lines changed: 16 additions & 1 deletion
@@ -8,12 +8,14 @@
 def get_ingestor(data_type: str):
     if data_type == "json":
         return JsonIngestor
+    elif data_type == "jsonl":
+        return JsonlIngestor
     elif data_type == "csv":
         return CsvIngestor
     elif data_type == "huggingface":
         return HuggingfaceIngestor
     else:
-        raise ValueError(f"'type' must be one of 'json', 'csv', or 'huggingface', you have {data_type}")
+        raise ValueError(f"'type' must be one of 'json', 'jsonl', 'csv', or 'huggingface', you have {data_type}")
 
 
 class Ingestor(ABC):
@@ -35,6 +37,19 @@ def to_dataset(self) -> Dataset:
         return Dataset.from_generator(self._json_generator)
 
 
+class JsonlIngestor(Ingestor):
+    def __init__(self, path: str):
+        self.path = path
+
+    def _jsonl_generator(self):
+        with open(self.path, "rb") as f:
+            for item in ijson.items(f, "", multiple_values=True):
+                yield item
+
+    def to_dataset(self) -> Dataset:
+        return Dataset.from_generator(self._jsonl_generator)
+
+
 class CsvIngestor(Ingestor):
     def __init__(self, path: str):
         self.path = path
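The `JsonlIngestor` added above relies on `ijson.items(f, "", multiple_values=True)` to stream one JSON document per line. A self-contained sketch of that parsing pattern, as an illustration only, using a throwaway file:

```python
# Illustration of the JSONL parsing pattern used by JsonlIngestor above.
import ijson

with open("sample.jsonl", "wb") as f:  # throwaway example file
    f.write(b'{"instruction": "Summarize the text", "output": "..."}\n')
    f.write(b'{"instruction": "Translate to French", "output": "..."}\n')

with open("sample.jsonl", "rb") as f:
    # prefix "" + multiple_values=True yields each top-level JSON value in turn
    for item in ijson.items(f, "", multiple_values=True):
        print(item["instruction"])
```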
