Commit ea2eb7a

feat: add feature to split training dataset to train and validate via dataconfig (#560)
* feat: add train_test_split functionality via dataconfig
* docs: add documentation for dataset split support
* feat: add evaluation_strategy to TrainingArguments and minor fix

Signed-off-by: yashasvi <yashasvi@ibm.com>
1 parent c28a4ed commit ea2eb7a

File tree

10 files changed: +374 additions, -70 deletions


docs/advanced-data-preprocessing.md

Lines changed: 72 additions & 1 deletion
@@ -119,7 +119,7 @@ Users can create a data config file in any of YAML or JSON format they choose (w
 - `type` (optional, str): Type of data preprocessor, `default` is currently the only supported type.
 - `streaming` (optional, bool): Stream datasets using [IterableDatasets](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.IterableDataset).
 - `sampling_stopping_strategy` (optional, str): Dataset interleave stopping strategy in case of choosing to mix multiple datasets by weight, supported values are [`all_exhausted` or `first_exhausted`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.stopping_strategy), defaults to `all_exhausted`.
-- `sampling_seed` (optional, int): [Sampling seed](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to use for interleaving datasets, for reproducibility choose same value, defaults to 42.
+- `seed` (optional, int): [seed](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.interleave_datasets.seed) to use for interleaving datasets, for reproducibility choose the same value, defaults to 42.
 - `chat_template` (optional, str): pass `chat_template` via data_config for multi-turn data, replaces existing default chat template.

 `datasets` (list):
@@ -129,6 +129,7 @@ Users can create a data config file in any of YAML or JSON format they choose (w
 - `rename_columns` (optional, dict[str:str]): Specifies a dictionary of columns to rename like `{"old_name": "new_name"}` at dataset load time. *Applied before `retain_columns` if both are specified*.
 - `retain_columns` (optional, list[str]): Specifies a list of columns to retain, e.g. `["input_ids", "labels"]`; every other column will be dropped at dataset load time. *Applied strictly after `rename_columns` if both are specified*.
 - `sampling` (optional, float): The sampling ratio (0.0 to 1.0) with which to sample a dataset in case of interleaving.
+- `split` (optional, dict[str: float]): Defines how to split the dataset into training and validation sets. Requires both `train` and `validation` keys.
 - `data_handlers` (optional, list): A list of data handler configurations which preprocess the dataset.

 Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use [Hugging Face Map API](https://huggingface.co/docs/datasets/en/process#map) to apply these routines.
@@ -184,6 +185,76 @@ We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2

 Note: If a user specifies data sampling, they can expect the datasets to be mixed and individual samples in the dataset not to be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset.

+### Dataset Splitting
+
+In addition to [sampling and mixing](#data-mixing), our library supports **dataset splitting**, which allows users to split a dataset into training and validation splits using the `split` field in the dataset config.
+
+This is especially useful when users want to split a single dataset (or multiple datasets) internally instead of supplying separate files for training and validation.
+
+#### How to Use
+
+The `split` field in each dataset config allows users to internally divide a dataset into `train` and `validation` sets using fractional ratios.
+
+To use it, specify both `train` and `validation` ratio values under the `split` key for each dataset. Example:
+
+```yaml
+datasets:
+  - name: my_dataset
+    split:
+      train: 0.8
+      validation: 0.2
+    data_paths:
+      - "path/to/data.jsonl"
+```
+
+### Split Support for Streaming vs Non-Streaming Datasets
+
+**Non-Streaming Datasets (`Dataset`, `DatasetDict`)**:
+- Supports arbitrary train/validation splits.
+- Both `train` and `validation` keys must be present under `split`.
+- The sum of `train + validation` must be in `(0, 1]`; a sum less than 1.0 means only a subset of the data is used.
+- If no `split` is defined, the dataset is returned unchanged.
+
+**Streaming Datasets (`IterableDataset`, `IterableDatasetDict`)**:
+- Only supports full splits:
+  - Either `train: 1.0, validation: 0.0`
+  - Or `train: 0.0, validation: 1.0`
+- Partial splits like `train: 0.8, validation: 0.2` are not supported and will raise a `NotImplementedError`.
+- If no `split` is defined, the dataset is returned unchanged.
+- Streaming behavior must be explicitly enabled via `dataprocessor.streaming: true`.
+
+### Using Separate Files for Train and Validation Splits
+
+If you want to use **separate files for training and validation**, you can define them as **separate dataset entries** in the `datasets` section of your config.
+In each entry:
+
+- Specify the corresponding file in the `data_paths` field.
+- Set the `split` value to either `train: 1.0` or `validation: 1.0` as appropriate.
+
+This allows you to fully control which file is used for which purpose, without relying on automatic or in-place splitting.
+
+#### Example
+
+```yaml
+datasets:
+  - name: my_train_set
+    split:
+      train: 1.0
+    data_paths:
+      - "path/to/train.jsonl"
+  - name: my_val_set
+    split:
+      validation: 1.0
+    data_paths:
+      - "path/to/val.jsonl"
+```
+
+### **Note:**
+> - While passing a validation dataset via the command line is possible using the `validation_data_path` argument, **this argument is not compatible with `data_config`**. If you're using a `data_config`, define the validation set within it using a `split: validation: 1.0` entry instead, as shown [here](#using-separate-files-for-train-and-validation-splits).
+> - Dataset splitting is performed based on the `split` configuration, supporting only `"train"` and `"validation"` splits. Support for a `"test"` split is not yet available.
+> - **Only the `"train"` split is sampled**, and **sampling is done after splitting**. This ensures that validation remains consistent and unbiased, while allowing training to be performed on a controlled subset if desired.
+> - **⚠️ Users must explicitly set the [`eval_strategy`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.eval_strategy) in the Trainer's arguments to a valid value (e.g., `"steps"` or `"epoch"`) for evaluation to run. Splitting the dataset alone does not trigger evaluation and will likely result in an error if `eval_strategy` is left unset.**

 ### Data Streaming
 Dataset streaming allows users to utilize the functionality of iterable datasets to pass in data piece by piece, avoiding memory constraints with large datasets for use-cases like extended pre-training.
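For reference, the streaming constraint documented above (full splits only, with `dataprocessor.streaming: true`) can be satisfied with a config like the following. This is an illustrative sketch, not part of the commit; the dataset name, output file, and data path are placeholders.

```python
# Sketch: build a streaming-compatible data config programmatically and dump it to
# YAML. Streaming datasets only accept full splits (train: 1.0 / validation: 0.0 or
# the reverse); partial splits would raise NotImplementedError per the docs above.
import yaml

streaming_config = {
    "dataprocessor": {"type": "default", "streaming": True},
    "datasets": [
        {
            "name": "my_streaming_train_set",           # placeholder name
            "split": {"train": 1.0, "validation": 0.0},  # full split only
            "data_paths": ["path/to/large_train.jsonl"],  # placeholder path
        }
    ],
}

with open("streaming_data_config.yaml", "w") as f:
    yaml.safe_dump(streaming_config, f)
```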

tests/artifacts/predefined_data_configs/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -31,6 +31,9 @@
 DATA_CONFIG_MULTIPLE_DATASETS_SAMPLING_YAML = os.path.join(
     PREDEFINED_DATA_CONFIGS, "multiple_datasets_with_sampling.yaml"
 )
+DATA_CONFIG_MULTIPLE_DATASETS_SAMPLING_AND_SPLIT_YAML = os.path.join(
+    PREDEFINED_DATA_CONFIGS, "multiple_datasets_with_sampling_and_split.yaml"
+)
 DATA_CONFIG_MULTITURN_DATA_YAML = os.path.join(
     PREDEFINED_DATA_CONFIGS, "multi_turn_data_with_chat_template.yaml"
 )
tests/artifacts/predefined_data_configs/multiple_datasets_with_sampling_and_split.yaml

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+dataprocessor:
+  type: default
+  sampling_stopping_strategy: first_exhausted
+  seed: 66
+datasets:
+  - name: dataset_1
+    split:
+      train: 0.8
+      validation: 0.2
+    sampling: 0.3
+    data_paths:
+      - "FILE_PATH"
+    data_handlers:
+      - name: tokenize_and_apply_input_masking
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            input_column_name: input
+            output_column_name: output
+  - name: dataset_2
+    split:
+      train: 0.6
+      validation: 0.2
+    sampling: 0.4
+    data_paths:
+      - "FILE_PATH"
+    data_handlers:
+      - name: tokenize_and_apply_input_masking
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            input_column_name: input
+            output_column_name: output
+  - name: dataset_3
+    split:
+      train: 0.4
+      validation: 0.1
+    sampling: 0.3
+    data_paths:
+      - "FILE_PATH"
+    data_handlers:
+      - name: tokenize_and_apply_input_masking
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            input_column_name: input
+            output_column_name: output

tests/data/test_data_preprocessing.py

Lines changed: 65 additions & 2 deletions
@@ -36,6 +36,7 @@
 )
 from tests.artifacts.predefined_data_configs import (
     DATA_CONFIG_APPLY_CUSTOM_TEMPLATE_YAML,
+    DATA_CONFIG_MULTIPLE_DATASETS_SAMPLING_AND_SPLIT_YAML,
     DATA_CONFIG_MULTIPLE_DATASETS_SAMPLING_YAML,
     DATA_CONFIG_MULTITURN_DATA_YAML,
     DATA_CONFIG_PRETOKENIZE_DATA_YAML,
@@ -1459,6 +1460,68 @@ def test_process_dataconfig_multiple_datasets_datafiles_sampling(
     )


+@pytest.mark.parametrize(
+    "datafiles, datasetconfigname",
+    [
+        (
+            [
+                [
+                    TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_PARQUET,
+                    TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_PARQUET,
+                ],
+                [
+                    TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSON,
+                    TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSON,
+                ],
+                [
+                    TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSONL,
+                    TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSONL,
+                ],
+            ],
+            DATA_CONFIG_MULTIPLE_DATASETS_SAMPLING_AND_SPLIT_YAML,
+        ),
+    ],
+)
+def test_process_dataconfig_multiple_datasets_datafiles_sampling_and_split(
+    datafiles, datasetconfigname
+):
+    """Ensure that multiple datasets with multiple files are formatted and validated correctly."""
+    with open(datasetconfigname, "r") as f:
+        yaml_content = yaml.safe_load(f)
+    yaml_content["datasets"][0]["data_paths"] = datafiles[0]
+    yaml_content["datasets"][1]["data_paths"] = datafiles[1]
+    yaml_content["datasets"][2]["data_paths"] = datafiles[2]
+    with tempfile.NamedTemporaryFile(
+        "w", delete=False, suffix=".yaml"
+    ) as temp_yaml_file:
+        yaml.dump(yaml_content, temp_yaml_file)
+        temp_yaml_file_path = temp_yaml_file.name
+        data_args = configs.DataArguments(data_config_path=temp_yaml_file_path)
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+    TRAIN_ARGS = configs.TrainingArguments(
+        packing=False,
+        max_seq_length=1024,
+        output_dir="tmp",
+    )
+    (train_set, eval_set, _, _, _, _) = process_dataargs(
+        data_args=data_args, tokenizer=tokenizer, train_args=TRAIN_ARGS
+    )
+
+    assert isinstance(train_set, Dataset)
+    assert isinstance(eval_set, Dataset)
+    assert set(["input_ids", "attention_mask", "labels"]).issubset(
+        set(eval_set.column_names)
+    )
+    # training_data_path/validation_data_path args are not supported with data_config
+    with pytest.raises(ValueError):
+        data_args.training_data_path = "/tmp/some/path"
+        process_dataargs(
+            data_args=data_args, tokenizer=tokenizer, train_args=TRAIN_ARGS
+        )
+
+
 @pytest.mark.parametrize(
     "data_args, is_padding_free",
     [
@@ -1690,7 +1753,7 @@ def test_process_dataset_configs(datafile, column_names, datasetconfigname):
         tokenizer=tokenizer,
     )
     datasetconfig = [DataSetConfig(name=datasetconfigname, data_paths=[datafile])]
-    train_dataset = processor.process_dataset_configs(dataset_configs=datasetconfig)
+    train_dataset, _ = processor.process_dataset_configs(dataset_configs=datasetconfig)

     assert isinstance(train_dataset, Dataset)
     assert set(train_dataset.column_names) == column_names
@@ -1812,7 +1875,7 @@ def test_rename_and_select_dataset_columns(
             name=datasetconfigname, data_paths=data_paths, data_handlers=handlers
         )
     ]
-    train_dataset = processor.process_dataset_configs(dataset_configs=datasetconfig)
+    train_dataset, _ = processor.process_dataset_configs(dataset_configs=datasetconfig)

     assert isinstance(train_dataset, Dataset)
     assert set(train_dataset.column_names) == set(final)
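As the updated assertions above show, `process_dataset_configs` now returns a `(train_dataset, eval_dataset)` pair instead of a single dataset, and `DataSetConfig` carries the new `split` field. A minimal sketch of constructing such a config directly, assuming the `tuning` package (fms-hf-tuning) is installed; the dataset name and path are placeholders:

```python
# Sketch (not from the commit): a DataSetConfig with the new `split` field.
# Note that constructing the dataclass directly bypasses the validation in
# _validate_dataset_config, which is applied when loading a data_config file.
from tuning.data.data_config import DataSetConfig

cfg = DataSetConfig(
    name="my_dataset",                            # placeholder
    data_paths=["path/to/data.jsonl"],            # placeholder
    split={"train": 0.8, "validation": 0.2},      # both keys required
)
print(cfg.split)  # {'train': 0.8, 'validation': 0.2}
```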

tests/test_sft_trainer.py

Lines changed: 2 additions & 0 deletions
@@ -1328,6 +1328,7 @@ def test_run_chat_style_ft_using_dataconfig(datafiles, dataconfigfile):
     with tempfile.TemporaryDirectory() as tempdir:

         data_args = copy.deepcopy(DATA_ARGS)
+        data_args.training_data_path = None
         data_args.chat_template = "{% for message in messages['messages'] %}\
 {% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + eos_token }}\
 {% elif message['role'] == 'system' %}{{ '<|system|>\n' + message['content'] + eos_token }}\
@@ -1422,6 +1423,7 @@ def test_run_chat_style_ft_using_dataconfig_for_chat_template(
     with tempfile.TemporaryDirectory() as tempdir:

         data_args = copy.deepcopy(DATA_ARGS)
+        data_args.training_data_path = None
         if dataconfigfile == DATA_CONFIG_MULTITURN_GRANITE_3_1B_DATA_YAML:
             data_args.response_template = "<|start_of_role|>assistant<|end_of_role|>"
             data_args.instruction_template = "<|start_of_role|>user<|end_of_role|>"

tuning/config/configs.py

Lines changed: 11 additions & 0 deletions
@@ -226,6 +226,17 @@ class TrainingArguments(transformers.TrainingArguments):
             "for all PEFT runs by the library internally."
         },
     )
+    eval_strategy: str = field(
+        default="no",
+        metadata={
+            "help": "The evaluation strategy to adopt during training. "
+            "Possible values are 'no' (no evaluation during training), "
+            "'epoch' (evaluate at the end of each epoch), "
+            "'steps' (evaluate every `eval_steps`). "
+            "Note: Splitting the dataset does not automatically trigger evaluation; "
+            "you must explicitly set this value to enable evaluation."
+        },
+    )


 @dataclass
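Because splitting alone does not trigger evaluation, runs that rely on the new split support also need `eval_strategy` set to a valid value. A minimal sketch mirroring the test construction above, assuming the `tuning` package is importable; the values shown are placeholders:

```python
# Sketch (not from the commit): enabling evaluation via the new eval_strategy field.
from tuning.config import configs

train_args = configs.TrainingArguments(
    packing=False,
    max_seq_length=1024,
    output_dir="tmp",
    eval_strategy="epoch",  # "no" (default), "epoch", or "steps"
)
```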

tuning/data/data_config.py

Lines changed: 17 additions & 5 deletions
@@ -38,14 +38,15 @@ class DataSetConfig:
     builder: Optional[str] = None  # Referring to Hugging Face dataset builder
     sampling: Optional[float] = None
     data_handlers: Optional[List[DataHandlerConfig]] = None
+    split: Optional[Dict[str, float]] = None


 @dataclass
 class DataPreProcessorConfig:
     type: Optional[str] = "default"
     sampling_stopping_strategy: Optional[str] = "all_exhausted"
     # Default seed is not none to ensure reproducability
-    sampling_seed: Optional[float] = 42
+    seed: Optional[float] = 42
     streaming: Optional[bool] = False
     chat_template: Optional[str] = None

@@ -120,6 +121,17 @@ def _validate_dataset_config(dataset_config) -> DataSetConfig:
         c.data_handlers = []
         for handler in kwargs["data_handlers"]:
             c.data_handlers.append(_validate_data_handler_config(handler))
+    if "split" in kwargs and kwargs["split"] is not None:
+        split = kwargs["split"]
+        assert isinstance(
+            split, dict
+        ), "split must be a dictionary of split_name: ratio"
+        for key, value in split.items():
+            assert isinstance(key, str), f"split key '{key}' must be a string"
+            assert (
+                isinstance(value, (float, int)) and 0.0 <= value <= 1.0
+            ), f"split ratio for '{key}' must be a float in [0.0, 1.0], got {value}"
+        c.split = {k: float(v) for k, v in split.items()}
     return c


@@ -140,10 +152,10 @@ def _validate_dataprocessor_config(dataprocessor_config) -> DataPreProcessorConf
             "all_exhausted",
         ], "allowed sampling stopping strategies are all_exhausted(default) or first_exhausted"
         c.sampling_stopping_strategy = strategy
-    if "sampling_seed" in kwargs:
-        seed = kwargs["sampling_seed"]
-        assert isinstance(seed, int), "sampling seed should be int"
-        c.sampling_seed = seed
+    if "seed" in kwargs:
+        seed = kwargs["seed"]
+        assert isinstance(seed, int), "seed should be int"
+        c.seed = seed
     if "streaming" in kwargs:
         streaming = kwargs["streaming"]
         assert isinstance(streaming, bool), f"streaming: {streaming} should be a bool"
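As a quick reference, the split rules asserted above (a dict of `split_name: ratio` with each ratio in `[0.0, 1.0]`) can be exercised with a small standalone mirror. This helper is illustrative only and not part of the library API:

```python
# Illustrative mirror of the split validation added in _validate_dataset_config.
def check_split(split):
    assert isinstance(split, dict), "split must be a dictionary of split_name: ratio"
    for key, value in split.items():
        assert isinstance(key, str), f"split key '{key}' must be a string"
        assert (
            isinstance(value, (float, int)) and 0.0 <= value <= 1.0
        ), f"split ratio for '{key}' must be a float in [0.0, 1.0], got {value}"
    return {k: float(v) for k, v in split.items()}

print(check_split({"train": 0.8, "validation": 0.2}))  # accepted

try:
    check_split({"train": 1.5, "validation": 0.2})  # ratio out of range
except AssertionError as err:
    print(err)
```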

tuning/data/data_handlers.py

Lines changed: 1 addition & 2 deletions
@@ -116,13 +116,12 @@ def tokenize_and_apply_input_masking(
     # These are made available by the data preprocessor framework
     try:
         tokenizer = kwargs["tokenizer"]
-        column_names = kwargs["column_names"]
     except KeyError as e:
         raise RuntimeError(
             "Data processor failed to pass default args to data handlers"
         ) from e

-    if column_names and (input_column_name or output_column_name) not in column_names:
+    if (input_column_name or output_column_name) not in element:
         raise ValueError(
             f"Dataset should contain {input_column_name} \
                 and {output_column_name} field if \
