
Commit 2a71af8

Merge remote-tracking branch 'upstream/main'
2 parents bcd5459 + db16f31 commit 2a71af8

File tree

6 files changed (+96, -10 lines)

- README.md
- docs/advanced-data-preprocessing.md
- tests/data/test_data_preprocessing.py
- tuning/config/acceleration_configs/attention_and_distributed_packing.py
- tuning/data/setup_dataprocessor.py
- tuning/sft_trainer.py


README.md

Lines changed: 9 additions & 8 deletions
```diff
@@ -885,17 +885,18 @@ Notes:
 * When using `fused_ops_and_kernels` together with `quantized_lora_config`,
   make sure to appropriately set `--fused_lora auto_gptq True` or `bitsandbytes True`; the `True` sets `fast_lora==True`.
 * `fused_ops_and_kernels` works for full-finetuning, LoRA, QLoRA and GPTQ-LORA,
-    - pass `--fast_kernels True True True` for full finetuning/LoRA
-    - pass `--fast_kernels True True True --auto_gptq triton_v2 --fused_lora auto_gptq True` for GPTQ-LoRA
-    - pass `--fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True` for QLoRA
+    - Pass `--fast_kernels True True True` for full finetuning/LoRA
+    - Pass `--fast_kernels True True True --auto_gptq triton_v2 --fused_lora auto_gptq True` for GPTQ-LoRA
+    - Pass `--fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True` for QLoRA
     - Note the list of supported models [here](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/fused-ops-and-kernels/README.md#supported-models).
 * Notes on Padding Free
-    - works for both *single* and *multi-gpu*.
-    - works on both *pretokenized* and *untokenized* datasets
-    - verified against the version found in HF main, merged in via PR https://github.com/huggingface/transformers/pull/31629.
+    - Works for both *single* and *multi-gpu*.
+    - Works on both *pretokenized* and *untokenized* datasets
+    - Verified against the version found in HF main, merged in via PR https://github.com/huggingface/transformers/pull/31629.
 * Notes on Multipack
-    - works only for *multi-gpu*.
-    - currently only includes the version of *multipack* optimized for linear attention implementations like *flash-attn*.
+    - Works only for *multi-gpu*.
+    - Currently only includes the version of *multipack* optimized for linear attention implementations like *flash-attn*.
+    - Streaming datasets or use of `IterableDatasets` is not compatible with the fms-acceleration multipack plugin because the multipack sampler has to run through the full dataset every epoch. Using multipack and streaming together will raise an error.
 * Notes on Fast MoE
     - `--fast_moe` takes either an integer or boolean value.
     - When an integer `n` is passed, it enables expert parallel sharding with the expert parallel degree as `n` along with Scatter MoE kernels enabled.
```
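The new Multipack note above boils down to a simple precondition: the multipack sampler needs a full pass over the dataset every epoch, which a streaming/iterable dataset cannot provide. The sketch below is illustrative only (plain Python, not the library's API; the function name is hypothetical) and just restates that rule:

```python
def check_multipack_compatibility(streaming: bool, multipack_enabled: bool) -> None:
    """Illustrative guard: multipack must iterate the full dataset each epoch,
    so it cannot be combined with a streaming/iterable dataset."""
    if streaming and multipack_enabled:
        raise ValueError(
            "Multipack is not compatible with streaming; disable streaming "
            "or the multipack sampler."
        )


check_multipack_compatibility(streaming=False, multipack_enabled=True)  # OK
# check_multipack_compatibility(streaming=True, multipack_enabled=True)  # raises ValueError
```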

docs/advanced-data-preprocessing.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -255,7 +255,7 @@ Needless to say the sampling ratio of a datasets is a float and all the sampling
 We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to randomize the interleaving of datasets and a [`stopping_strategy`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy) to describe when to stop sampling. Both values should remain the same for experiment reproducibility. Both these values are common for all datasets and should be supplied at top level in the `datapreprocessor` as shown [above](#how-the-user-can-write-data-configs). For a list of the supported values of these arguments see the corresponding HF API.
 
 
-`Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset`
+Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset
 
 ### Data Streaming
 Dataset streaming allows users to utilize the functionality of iterable datasets to pass in data piece by piece, avoiding memory constraints with large datasets for use-cases like extended pre-training.
@@ -271,6 +271,8 @@ dataprocessor:
 
 When using streaming, `split_batches` in the `TrainingArguments` will automatically be set to `True`, by doing so, the main process will fetch a full batch and slice it into `num_processes` batches for each process. This means that `num_processes` must be divisible by `batch_size`. This will replace the global batch size.
 
+Note: Streaming datasets or use of `IterableDatasets` is not compatible with the fms-acceleration multipack plugin because the multipack sampler has to run through the full dataset every epoch. Using multipack and streaming together will raise an error.
+
 **When using streaming, the user must set `max_steps` in the `TrainingArguments` instead of `num_train_epochs`.** Since iterable datasets are loaded chunk-by-chunk, data cannot run through epochs in a typical fashion as the **Trainer** can not know the length of the dataset as it is being passed through. If both `max_steps` and `num_train_epochs` are given in a training config, `max_steps` will overwrite `num_train_epochs` since `max_steps` directly specifies the total number of optimization steps, which is needed when dataset length cannot be known.
 
 If the dataset size is known to the user, `max_steps` can be calculated as the total number of samples divided by the batch size.
```
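To make the last point concrete, here is a minimal sketch of the `max_steps` arithmetic. It assumes the effective batch size is the per-device batch size times the number of processes times the gradient-accumulation steps; all numbers and variable names are hypothetical and only for illustration:

```python
import math

# Hypothetical numbers for illustration only.
num_samples = 50_000          # known total size of the dataset being streamed
per_device_batch_size = 4
num_processes = 2             # data-parallel workers
grad_accum_steps = 1
num_epochs = 3

effective_batch_size = per_device_batch_size * num_processes * grad_accum_steps
steps_per_epoch = math.ceil(num_samples / effective_batch_size)  # 6250
max_steps = steps_per_epoch * num_epochs                         # 18750, the value to pass as max_steps
print(max_steps)
```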

tests/data/test_data_preprocessing.py

Lines changed: 62 additions & 0 deletions
```diff
@@ -61,6 +61,7 @@
 
 # Local
 from tuning.config import configs
+from tuning.config.acceleration_configs import AttentionAndDistributedPackingConfig
 from tuning.data.data_config import DataPreProcessorConfig, DataSetConfig
 from tuning.data.data_preprocessing_utils import get_data_collator
 from tuning.data.data_processors import DataPreProcessor, get_datapreprocessor
@@ -832,6 +833,67 @@ def test_process_dataconfig_file_with_streaming_no_max_steps_errors(
     (train_set, _, _) = _process_dataconfig_file(data_args, TRAIN_ARGS, tokenizer)
 
 
+@pytest.mark.parametrize(
+    "data_config_path, data_path",
+    [
+        (
+            DATA_CONFIG_YAML_STREAMING_INPUT_OUTPUT,
+            TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSON,
+        ),
+    ],
+)
+def test_process_dataconfig_file_with_streaming_and_multipack_throws_error(
+    data_config_path, data_path
+):
+    """Ensure that if multipack is passed with streaming, an error is raised."""
+    with open(data_config_path, "r") as f:
+        yaml_content = yaml.safe_load(f)
+    yaml_content["datasets"][0]["data_paths"][0] = data_path
+    datasets_name = yaml_content["datasets"][0]["name"]
+
+    # Modify input_field_name and output_field_name according to dataset
+    if datasets_name == "text_dataset_input_output_masking":
+        yaml_content["datasets"][0]["data_handlers"][0]["arguments"]["fn_kwargs"] = {
+            "input_field_name": "input",
+            "output_field_name": "output",
+        }
+
+    # Modify dataset_text_field and template according to dataset
+    formatted_dataset_field = "formatted_data_field"
+    if datasets_name == "apply_custom_data_template":
+        template = "### Input: {{Tweet text}} \n\n ### Response: {{text_label}}"
+        yaml_content["datasets"][0]["data_handlers"][0]["arguments"]["fn_kwargs"] = {
+            "dataset_text_field": formatted_dataset_field,
+            "template": template,
+        }
+
+    with tempfile.NamedTemporaryFile(
+        "w", delete=False, suffix=".yaml"
+    ) as temp_yaml_file:
+        yaml.dump(yaml_content, temp_yaml_file)
+        temp_yaml_file_path = temp_yaml_file.name
+        data_args = configs.DataArguments(data_config_path=temp_yaml_file_path)
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+    TRAIN_ARGS = configs.TrainingArguments(
+        output_dir="tmp",  # Not needed but positional
+        max_steps=1,
+    )
+
+    attention_and_distributed_packing_config = AttentionAndDistributedPackingConfig(
+        None, None
+    )
+    attention_and_distributed_packing_config.multipack = 16
+
+    is_multipack = attention_and_distributed_packing_config.is_multipack
+
+    with pytest.raises(ValueError):
+        (train_set, _, _) = _process_dataconfig_file(
+            data_args, TRAIN_ARGS, tokenizer, is_multipack=is_multipack
+        )
+
+
 @pytest.mark.parametrize(
     "data_config_path, data_path",
     [
```

tuning/config/acceleration_configs/attention_and_distributed_packing.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -51,3 +51,7 @@ def __post_init__(self):
     @property
     def is_padding_free(self):
         return self.padding_free is not None
+
+    @property
+    def is_multipack(self):
+        return self.multipack is not None
```

tuning/data/setup_dataprocessor.py

Lines changed: 15 additions & 1 deletion
```diff
@@ -69,6 +69,7 @@ def _process_dataconfig_file(
     train_args: TrainingArguments,
     tokenizer: AutoTokenizer,
     additional_data_handlers: Dict[str, DataHandler] = None,
+    is_multipack: bool = False,
 ):
     data_config = load_and_validate_data_config(data_args.data_config_path)
     processor = get_datapreprocessor(
@@ -95,6 +96,16 @@
             raise ValueError(
                 "`--max_steps` must be set when streaming is set in data preprocessor config"
             )
+        if is_multipack:
+            logging.error(
+                "Multipack is not compatible with streaming=true, please set streaming=false "
+                "or disable the multipack sampler"
+            )
+
+            raise ValueError(
+                "Multipack is not compatible with streaming=true, please set streaming=false "
+                "or disable the multipack sampler"
+            )
     train_dataset = processor.process_dataset_configs(data_config.datasets)
 
     return (train_dataset, None, data_args.dataset_text_field)
@@ -333,6 +344,7 @@ def process_dataargs(
     train_args: TrainingArguments,
     additional_data_handlers: Dict[str, DataHandler] = None,
     is_padding_free: bool = False,
+    is_multipack: bool = False,
 ):
     """
     Args:
@@ -345,6 +357,8 @@
             which need to be registered with the data preprocessor
         is_padding_free: A bool representing if the Padding free plugin is enabled.
             Defaults to False.
+        is_multipack: A bool representing if the Multipack plugin is enabled.
+            Defaults to False.
     Returns:
         Tuple(Dataset, Dataset, str, DataCollator, int, Dict)
             tuple containing
@@ -371,7 +385,7 @@
 
     if data_args.data_config_path:
         train_dataset, eval_dataset, dataset_text_field = _process_dataconfig_file(
-            data_args, train_args, tokenizer, additional_data_handlers
+            data_args, train_args, tokenizer, additional_data_handlers, is_multipack
         )
     else:
         train_dataset, eval_dataset, dataset_text_field = _process_raw_data_args(
```

tuning/sft_trainer.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -290,8 +290,10 @@ def train(
     logger.info("Packing is set to %s ", train_args.packing)
 
     is_padding_free = False
+    is_multipack = False
     if attention_and_distributed_packing_config is not None:
         is_padding_free = attention_and_distributed_packing_config.is_padding_free
+        is_multipack = attention_and_distributed_packing_config.is_multipack
 
     data_preprocessing_time = time.time()
     (
@@ -307,6 +309,7 @@
         train_args,
         additional_data_handlers,
         is_padding_free=is_padding_free,
+        is_multipack=is_multipack,
     )
     additional_metrics["data_preprocessing_time"] = (
         time.time() - data_preprocessing_time
```
